Continuously investigate production agent failures with Signal
June 5, 2026 New Agents Signal is an always-on AI worker that continuously reviews new production traces, identifies recurring failure patterns, and groups related traces into investigation reports. Each report includes a summary, root-cause analysis, estimated impact, and suggested next steps.
Orchestrate repo-aware managed agents across your engineering workflows
June 5, 2026 New Agents Arize AX now supports orchestrating long-running, repo-aware managed agents that inspect traces, access external systems, analyze code, create investigation artifacts, and propose changes as pull requests for human review. Teams can start with prebuilt workflows or build their own using configurable harnesses, sandboxes, repositories, skills, and automations. Repetitive investigation and analysis work, such as regression triage, debugging production behavior, dataset curation, eval generation, security reviews, and code remediation, can now be delegated to agents that gather evidence and propose actions, while engineers retain full control to review, approve, modify, or reject every proposal. Refer to the managed agents documentation to learn more.
Track managed agents in your organization from a single view
June 5, 2026 New Agents Agent Fleet Observability provides a centralized view of all managed agents across your organization, including status, activity, trajectories, token usage, and cost. As teams deploy more AI workers, visibility into what those agents are doing, how they are performing, and what resources they are consuming becomes essential. You can track long-running investigations, understand agent behavior over time, and manage your growing fleet without switching between surfaces. Check out the Agent Studio documentation to learn more.
Agent Experimentation surfaces behavioral diffs across your complete agent system
June 5, 2026 New Agents Agent Experimentation lets you run curated datasets through your entire agent system and compare outputs, traces, and evaluation scores across runs. Production agents are systems made of tools, retrieval, routing, memory, models, fallbacks, application code, and business logic. A change in any layer can improve one behavior while breaking another. Each run surfaces the behavioral diff: whether tool use improved, latency shifted, retrieval quality held, or a fix for one failure introduced a regression elsewhere. Visit the agent experiments documentation to learn more.
Adapt your evaluation criteria to emerging agent failures with Harness-as-a-Judge
June 5, 2026 New Agents Harness-as-a-Judge lets you describe good behavior, then uses an agentic judge to inspect traces, identify relevant spans, classify issues, and generate labels for future monitoring, evaluation, and experimentation. Traditional LLM judges work best when you already know what to measure. Agent failures are rarely that predictable. New failure modes emerge as agents interact with tools, users, and changing environments in production. Harness-as-a-Judge produces evaluation signal that adapts to what is actually breaking, not just what was anticipated. Refer to the Harness-as-a-Judge documentation to learn more.
Observe, replay, and evaluate voice agent conversations natively
June 5, 2026 New Agents Arize AX now provides native support for observing, searching, replaying, and evaluating voice agent conversations. You can inspect audio sessions alongside transcripts and traces, analyze interruptions and time-to-first-audio, replay conversations end-to-end, and run evaluations directly against audio interactions. Voice agents introduce complexity beyond text. Audio streams, interruptions, speech latency, transcription quality, and multimodal interactions all become part of the agent experience. Native voice support brings voice conversations into the same observability and evaluation workflow as text agents, so you can debug, monitor, and improve conversational AI systems with the same rigor you apply to the rest of your stack. Check out the tracing and evaluating audio documentation to learn more.
Design custom views for annotation queues with Alyx
June 5, 2026 New Alyx You can now give annotators exactly the view they need: describe the layout you want in plain language, and Alyx generates a custom React view for your annotation queue records with color coding, visualizations, or any presentation that fits your workflow. Views can surface specific attributes cleanly, and you can publish a view org-wide or push it to other queues. Custom Views are also available in the trace slideover for spans and traces.
Monitor org-wide traces, costs, and evals at a glance
June 5, 2026 Improvement Dashboards and Visualizations You can now monitor traces, errors, latency, cost, and eval scores across your entire organization from a single dashboard, with six new widgets, a logical two-column layout, and a customization panel that autosaves your changes.
- Trace Count Over Time: stacked bar chart of root span volume across spaces, with per-day top-project tooltip drilldowns
- Errors Over Time: stacked bar chart of spans with status_code=ERROR, broken down by space and top contributing projects
- Average Latency Over Time: line chart of average latency by space with per-project tooltip drilldowns
- Total Cost Over Time: stacked bar chart of daily cost aggregates by space
- Average Eval Score Over Time: line chart with an org-level eval picker across all LLM projects
- Distribution of Eval Labels: bar chart of eval label distributions with eval and scope selection