Find, categorize, and fix error modes across many traces using Skills, Alyx, or manual annotation
You can explore individual traces to see what happened in a single request, but any non-trivial LLM app (agent, RAG, or chatbot) can fail in many different ways on the same input. When something goes wrong, the problem could be anywhere in the chain. This page shows three ways to do error analysis across many traces at once (Skills in your coding agent, Alyx in the product, or manual annotation) and how to validate a fix once you’ve found a pattern.
A single failing trace is just a data point; reading many together surfaces the patterns you actually want to fix. Three approaches get you from failures to labeled error modes. The manual approach walks through annotation and clustering step by step; Skills and Alyx use AI to produce categories directly.
Arize Skills
Alyx
Manually
Run the arize-trace skill in your AI coding agent (Claude Code, Cursor, Codex, Windsurf, and 40+ others) to export traces and analyze them locally.
Install skill
# Ask your AI coding agent:
"Find me traces that have long latency,
and summarize what they have in common."
Other prompts to try: “why are these traces failing”, “group last night’s errors by root cause”. The skill runs ax traces export under the hood; your agent reads the output and reports patterns. What you get back: a written summary of common patterns across the exported traces. Fast to run, and best for quickly spotting what is going wrong before you commit to deeper analysis.
Alyx is the AI assistant built into the product. Click the Alyx button in the top-right of any Traces view to open it, then ask for categories directly:
“Find me the common types of questions users are asking”
“What are the most common error types in the last 24 hours?”
“Group failing traces by root cause”
What you get back: a Categorized Spans widget with category names, counts, short descriptions, and example spans you can click into. You can also apply the categories back to your spans as annotations. Great for a fast, structured first pass across large volumes of traces.
This is the deepest approach. A small pipeline: traces in → human annotations → named error modes → an evaluator out. Skills and Alyx give you fast coverage; manual annotation gives you categories you can stake a fix on.
1
Open your traces
Filter to failures in Explore your traces, and plan to read 20–50 failing traces. Output: a filtered list of failing traces to work through.
2
Free-form annotate
For each trace, write a short free-form note about what went wrong. Don’t force it into buckets yet; just capture what’s weird. Examples: “not specific enough”, “made up info”, “too verbose”, “called wrong tool”. Use Human Annotations to capture these notes directly on the traces. Once patterns emerge, formalize them as labels in an annotation config and apply them to matching traces. Output: a set of human annotations attached to real traces, the raw material for clustering.
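The annotation pass can be as simple as keeping (trace_id, note) pairs until patterns emerge. A minimal sketch, with hypothetical trace IDs and notes standing in for real annotations:

```python
# Free-form notes attached to trace IDs. Nothing is bucketed yet --
# the raw notes are the input to the clustering step.
notes = [
    ("trace-001", "made up a hotel ID that doesn't exist"),
    ("trace-002", "response far too verbose"),
    ("trace-003", "called the booking tool instead of search"),
    ("trace-004", "hotel ID in answer not found in inventory"),
]

for trace_id, note in notes:
    print(f"{trace_id}: {note}")
```

The point of staying free-form at this stage is that the categories come out of the notes, not the other way around.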
3
Cluster into error modes
Group your labeled annotations into error modes. Several notes about bad hotel IDs might cluster into a named pattern such as invalid_hotel_id; others could be tool_call_hallucination, retrieval_miss, or agent_loop. Expect 3–5 such patterns in total. Track the rate of each error mode as a custom metric so you can see whether a fix actually lowers it. Output: 3–5 named error modes, each tracked as a custom metric so you can watch the rate over time.
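Clustering can start as simple keyword rules over the raw notes. This sketch is illustrative only: the notes, keywords, and the too_verbose mode name are assumptions, not product output:

```python
from collections import Counter

# Hypothetical free-form notes from the annotation pass.
notes = [
    "made up a hotel ID that doesn't exist",
    "hotel ID in answer not found in inventory",
    "called the booking tool instead of search",
    "response far too verbose",
]

# Keyword rules mapping raw notes to named error modes.
RULES = {
    "invalid_hotel_id": ["hotel id"],
    "tool_call_hallucination": ["tool"],
    "too_verbose": ["verbose"],
}

def error_mode(note: str) -> str:
    lowered = note.lower()
    for mode, keywords in RULES.items():
        if any(k in lowered for k in keywords):
            return mode
    return "uncategorized"

# The per-mode rate is what you'd track as a custom metric over time.
counts = Counter(error_mode(n) for n in notes)
total = len(notes)
for mode, count in counts.items():
    print(f"{mode}: {count}/{total} ({count / total:.0%})")
```

In practice you would cluster by reading, not by regex, but the output shape is the same: a handful of named modes, each with a rate you can watch.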
4
Design evals from the error modes
For each labeled error mode, write an eval that catches it automatically (LLM-as-judge or a code check). The labeled traces become your regression suite. See Evaluators. Output: one evaluator per error mode, with the labeled traces as its regression suite.
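A code-check evaluator for a mode like invalid_hotel_id could be a few lines. The inventory set and the ID pattern below are assumptions for illustration, not part of any real API:

```python
import re

# Hypothetical inventory of valid hotel IDs.
KNOWN_HOTEL_IDS = {"H-1001", "H-1002", "H-2042"}

def eval_invalid_hotel_id(response: str) -> dict:
    """Fail the trace if the response cites a hotel ID not in inventory."""
    cited = set(re.findall(r"H-\d{4}", response))
    unknown = cited - KNOWN_HOTEL_IDS
    return {"passed": not unknown, "unknown_ids": sorted(unknown)}

# H-9999 is not in inventory, so this trace fails the check.
print(eval_invalid_hotel_id("Try H-1001 or H-9999"))
```

Run a check like this over the labeled traces first: if it agrees with your human labels, it is ready to serve as the automated regression gate.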
Skills and Alyx are the fastest way to do a first pass across large volumes of traces. Manual annotation takes longer but typically produces sharper, more reliable categories. Use it on the error modes that matter most.
If you used Skills or Alyx, pick the most impactful pattern from the output and give it a name (e.g., invalid_hotel_id). Then validate a fix before shipping:
Save failing traces to a dataset: those traces become your regression suite for future prompt tweaks, retrieval updates, tool changes, or any other fix.
Run an experiment: compare the old vs new version against the dataset; ship the fix only if the experiment shows improvement without regressions on other examples.
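The old-vs-new comparison amounts to computing the error-mode rate on the regression dataset for each version. Everything in this sketch (the run_app stand-ins, the pass check) is a toy assumption, not a real experiment API:

```python
def failure_rate(run_app, dataset, check) -> float:
    """Fraction of dataset examples where the app's output fails the check."""
    failures = sum(1 for example in dataset if not check(run_app(example)))
    return failures / len(dataset)

# Toy stand-ins: the "old" app emits an unknown ID half the time,
# the "new" app never does.
dataset = ["query-1", "query-2", "query-3", "query-4"]
old_app = lambda q: "H-9999" if q in ("query-1", "query-3") else "H-1001"
new_app = lambda q: "H-1001"
check = lambda resp: resp == "H-1001"  # passes when the ID is known

old_rate = failure_rate(old_app, dataset, check)
new_rate = failure_rate(new_app, dataset, check)
print(f"old: {old_rate:.0%}, new: {new_rate:.0%}")
```

Ship only when the new rate is lower and the other error-mode checks show no regressions on the same dataset.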
After shipping, track the rate of this error mode as a custom metric and put a monitor on it so you’re alerted if it comes back.