The improvement loop
Your agent produces a bad output. You find out through a support ticket, a user complaint, or by systematically scanning for problems. Arize AX gives you everything you need to get from that moment to a fix: a centralized place to observe what happened, understand why it went wrong, run controlled experiments, and ship with confidence. Each step below maps to a part of the platform, and the steps are designed to feed into each other.

Observe
Every run generates traces and spans capturing the full record of what happened: inputs, tool calls, model outputs, latency, token counts, and more. This is the starting point for everything else. You cannot improve what you cannot see. From here, the question becomes: which of these runs went wrong?
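A trace is just structured data. As a rough sketch of what a span carries (the field names here are illustrative, not the Arize AX schema), a single agent run might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step of an agent run -- illustrative fields, not the AX schema."""
    name: str                  # e.g. "agent_run" or "retrieve_docs"
    input: str                 # what went into this step
    output: str                # what came out
    latency_ms: float          # wall-clock time for the step
    tokens: int = 0            # token count, for LLM spans
    children: list = field(default_factory=list)  # nested tool calls

# A run is a tree of spans: a root agent call with a nested retrieval step.
run = Span(
    name="agent_run",
    input="What is our refund policy?",
    output="Refunds are available within 30 days.",
    latency_ms=1840.0,
    tokens=412,
    children=[Span(name="retrieve_docs", input="refund policy",
                   output="policy_v2.md", latency_ms=120.0)],
)
```

Every later step in the loop operates on records like this one: annotation labels them, trace analysis aggregates them, and experiments replay their inputs.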
Annotate and evaluate
Reviewing traces and flagging bad outputs builds a record of what failure looks like in your agent. Over time that record becomes a golden dataset and a starting point for figuring out which error patterns are common enough to automate using evals. See Human review and Labeling queues. Once you know which outputs are bad, you need to understand why.
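In code terms, a golden dataset is just the accumulated set of flagged runs. A minimal sketch, assuming a hypothetical record shape rather than the Arize AX API:

```python
# Illustrative: human-reviewed runs, each labeled good or bad.
runs = [
    {"input": "Cancel my order",  "output": "Order cancelled.",   "label": "good"},
    {"input": "Refund status?",   "output": "I cannot help.",     "label": "bad"},
    {"input": "Track my package", "output": "Here is the link.",  "label": "good"},
]

# The flagged failures become regression cases: inputs you rerun
# against every future variant of the agent.
golden_dataset = [r for r in runs if r["label"] == "bad"]
```

Each reviewed trace either confirms the agent's behavior or adds a case the next experiment must fix.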
Hypothesize
Trace analysis lets you look across many runs to find the pattern before you change anything. Is retrieval pulling the wrong chunks for a particular query type? Is a tool call failing upstream and corrupting everything downstream? A clear hypothesis is the difference between an experiment and a guess. Alyx and AI Search can help you explore patterns at scale. With a hypothesis in hand, you can test it properly.
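The mechanical core of trace analysis is aggregation: tag each failing run with a few attributes and count where failures cluster. A hedged sketch (the tags and record shape are invented for illustration):

```python
from collections import Counter

# Illustrative: each failing run tagged with its query type and failing step.
failures = [
    {"query_type": "multi_hop", "failed_step": "retrieval"},
    {"query_type": "multi_hop", "failed_step": "retrieval"},
    {"query_type": "lookup",    "failed_step": "tool_call"},
    {"query_type": "multi_hop", "failed_step": "retrieval"},
]

# Count failures per (query_type, step) to see where they cluster.
pattern = Counter((f["query_type"], f["failed_step"]) for f in failures)
top, count = pattern.most_common(1)[0]
# A cluster like ("multi_hop", "retrieval") appearing 3 times is a testable
# hypothesis: "retrieval pulls the wrong chunks for multi-hop queries."
```

The output of this step is a sentence you can falsify, not a hunch.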
Experiment on your agent
Change the prompt, swap the model, or adjust the retrieval config, then run the modified agent against a curated dataset. Arize AX tracks every variant so you can compare results directly. The experiment tells you what changed. The Measure step tells you whether that change was actually better.
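The shape of an experiment is simple: every variant runs over the same dataset, and every output is recorded for scoring. An illustrative harness, not the Arize AX SDK:

```python
# Stand-in for a real agent call (LLM + tools); here it just fills a template.
def run_agent(prompt_template: str, query: str) -> str:
    return prompt_template.format(query=query)

# Two prompt variants to compare -- the names and templates are made up.
variants = {
    "baseline":  "Answer briefly: {query}",
    "candidate": "Answer briefly, citing a source: {query}",
}
dataset = ["What is the refund window?", "How do I reset my password?"]

# Run every variant over the same curated dataset.
results = {
    name: [run_agent(template, q) for q in dataset]
    for name, template in variants.items()
}
# Every variant now has one output per dataset item, ready to be scored.
```

Holding the dataset fixed is what makes the comparison meaningful: any difference in scores is attributable to the variant, not the inputs.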
Measure
You define evals so every variant is scored against the same criteria, either using an LLM-as-a-judge or a code evaluator. Scores are surfaced in the platform so you can see which variant actually performed better, and by how much.
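A code evaluator is the simpler of the two options: a deterministic function applied identically to every variant's outputs. A minimal sketch (the criterion and data are invented; an LLM-as-a-judge would replace the function body with a model call):

```python
# Illustrative code evaluator: score 1.0 if the output cites a source.
def cites_source(output: str) -> float:
    return 1.0 if "source:" in output.lower() else 0.0

# Outputs from two hypothetical experiment variants.
outputs = {
    "baseline":  ["Refunds within 30 days.",
                  "Reset via the settings page."],
    "candidate": ["Refunds within 30 days. Source: policy.md",
                  "Reset via settings. Source: help.md"],
}

# Same criterion applied to every variant, so the comparison is like-for-like.
scores = {name: sum(map(cites_source, outs)) / len(outs)
          for name, outs in outputs.items()}
```

Because both variants are scored by the same function, the resulting numbers directly answer "which variant performed better, and by how much."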
Apply or iterate
Once your experiment results look good, you apply the change to your agent. If they do not, you go back to hypothesizing with what you have learned. Either way, the work does not stop at deployment. Online evals keep running against live production traffic after you ship, using the same scoring criteria you defined during development. When a new failure pattern emerges, it surfaces as a new trace and the loop starts again from the top.

Agents at every step
At each step, an agent can help execute the work. Every part of the loop is accessible via the UI, code, Alyx, or the AX CLI, so you can choose how much to automate and where.

Alyx
The AI engineering agent built into Arize AX:
- Ask questions in natural language: which tool calls are failing, what users are asking that the agent cannot handle, where response times are spiking
- Alyx executes against your data directly in the UI and tells you what to look at
AX CLI
The AX CLI makes AX features available directly inside your coding agent. Instead of switching to the browser, your agent can query your traces, spans, experiments, and datasets from within your editor and use that context to help you debug and improve your agent. Use it to do things like:
- Surface the most common failure patterns across recent traces
- Identify which tool calls are failing and why
- Pull experiment results and compare variants
- Query datasets to build context for a fix