Find, categorize, and fix error modes across many traces using Skills, Alyx, or manual annotation
You can explore individual traces to see what happened in a single request, but any non-trivial LLM app (agent, RAG, or chatbot) can fail in many different ways on the same input. When something goes wrong, the problem could be anywhere in the chain. This page shows three ways to do error analysis across many traces at once (Skills in your coding agent, Alyx in the product, or manual annotation) and how to validate a fix once you’ve found a pattern.
A single failing trace is just a data point; reading many together surfaces the patterns you actually want to fix. Three approaches get you from failures to labeled error modes. The manual approach walks through annotation and clustering step by step; Skills and Alyx use AI to produce categories directly.
Arize Skills
Alyx
Manually
Run the arize-trace skill in your AI coding agent (Claude Code, Cursor, Codex, Windsurf, and 40+ others) to export traces and analyze them locally.
Install skill
# Ask your AI coding agent:
"Find me traces that have long latency,
and summarize what they have in common."
Other prompts to try: “why are these traces failing”, “group last night’s errors by root cause”. The skill runs ax traces export under the hood; your agent reads the output and reports patterns. What you get back: a written summary of common patterns across the exported traces. Fast to run, and best for quickly spotting what is going wrong before you commit to deeper analysis.
Alyx is the AI assistant built into the product. Click the Alyx button in the top-right of any Traces view to open it, then ask for categories directly:
“Find me the common types of questions users are asking”
“What are the most common error types in the last 24 hours?”
“Group failing traces by root cause”
What you get back: a Categorized Spans widget with category names, counts, short descriptions, and example spans you can click into. You can also apply the categories back to your spans as annotations. Great for a fast, structured first pass across large volumes of traces.
This is the deepest approach. A small pipeline: traces in → human annotations → named error modes → an evaluator out. Skills and Alyx give you fast coverage; manual annotation gives you categories you can stake a fix on.
1
Open your traces
Filter to failures in Explore your traces, and plan to read 20–50 failing traces. Output: a filtered list of failing traces to work through.
2
Free-form annotate
For each trace, write a short free-form note about what went wrong. Don’t force it into buckets yet; just capture what’s weird. Examples: “not specific enough”, “made up info”, “too verbose”, “called wrong tool”. Use Human Annotations to capture these notes directly on the traces. Once patterns emerge, formalize them as labels in an annotation config and apply them to matching traces. Output: a set of human annotations attached to real traces, the raw material for clustering.
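The annotation pass can be as simple as keeping (trace_id, note) pairs until patterns emerge. A minimal sketch, with hypothetical trace IDs and notes standing in for real annotations:

```python
# Free-form notes attached to trace IDs. Nothing is bucketed yet --
# the raw notes are the input to the clustering step.
notes = [
    ("trace-001", "made up a hotel ID that doesn't exist"),
    ("trace-002", "response far too verbose"),
    ("trace-003", "called the booking tool instead of search"),
    ("trace-004", "hotel ID in answer not found in inventory"),
]

for trace_id, note in notes:
    print(f"{trace_id}: {note}")
```

The point of staying free-form at this stage is that the categories come out of the notes, not the other way around.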
3
Cluster into error modes
Group your labeled annotations into error modes. Several notes about bad hotel IDs might cluster into a named pattern such as invalid_hotel_id; others could be tool_call_hallucination, retrieval_miss, or agent_loop. Expect 3–5 such patterns in total. Track the rate of each error mode as a custom metric so you can see whether a fix actually lowers it. Output: 3–5 named error modes, each tracked as a custom metric so you can watch the rate over time.
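Clustering can start as simple keyword rules over the raw notes. This sketch is illustrative only: the notes, keywords, and the too_verbose mode name are assumptions, not product output:

```python
from collections import Counter

# Hypothetical free-form notes from the annotation pass.
notes = [
    "made up a hotel ID that doesn't exist",
    "hotel ID in answer not found in inventory",
    "called the booking tool instead of search",
    "response far too verbose",
]

# Keyword rules mapping raw notes to named error modes.
RULES = {
    "invalid_hotel_id": ["hotel id"],
    "tool_call_hallucination": ["tool"],
    "too_verbose": ["verbose"],
}

def error_mode(note: str) -> str:
    lowered = note.lower()
    for mode, keywords in RULES.items():
        if any(k in lowered for k in keywords):
            return mode
    return "uncategorized"

# The per-mode rate is what you'd track as a custom metric over time.
counts = Counter(error_mode(n) for n in notes)
total = len(notes)
for mode, count in counts.items():
    print(f"{mode}: {count}/{total} ({count / total:.0%})")
```

In practice you would cluster by reading, not by regex, but the output shape is the same: a handful of named modes, each with a rate you can watch.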
4
Design evals from the error modes
For each labeled error mode, write an eval that catches it automatically (LLM-as-judge or a code check). The labeled traces become your regression suite. See Evaluators. Output: one evaluator per error mode, with the labeled traces as its regression suite.
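A code-check evaluator for a mode like invalid_hotel_id could be a few lines. The inventory set and the ID pattern below are assumptions for illustration, not part of any real API:

```python
import re

# Hypothetical inventory of valid hotel IDs.
KNOWN_HOTEL_IDS = {"H-1001", "H-1002", "H-2042"}

def eval_invalid_hotel_id(response: str) -> dict:
    """Fail the trace if the response cites a hotel ID not in inventory."""
    cited = set(re.findall(r"H-\d{4}", response))
    unknown = cited - KNOWN_HOTEL_IDS
    return {"passed": not unknown, "unknown_ids": sorted(unknown)}

# H-9999 is not in inventory, so this trace fails the check.
print(eval_invalid_hotel_id("Try H-1001 or H-9999"))
```

Run a check like this over the labeled traces first: if it agrees with your human labels, it is ready to serve as the automated regression gate.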
Skills and Alyx are the fastest way to do a first pass across large volumes of traces. Manual annotation takes longer but typically produces sharper, more reliable categories. Use it on the error modes that matter most.
If you used Skills or Alyx, pick the most impactful pattern from the output and give it a name (e.g., invalid_hotel_id). Then validate a fix before shipping:
Save failing traces to a dataset: those traces become your regression suite for future prompt tweaks, retrieval updates, tool changes, or any other fix.
Run an experiment: compare the old vs new version against the dataset; ship the fix only if the experiment shows improvement without regressions on other examples.
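The old-vs-new comparison amounts to computing the error-mode rate on the regression dataset for each version. Everything in this sketch (the run_app stand-ins, the pass check) is a toy assumption, not a real experiment API:

```python
def failure_rate(run_app, dataset, check) -> float:
    """Fraction of dataset examples where the app's output fails the check."""
    failures = sum(1 for example in dataset if not check(run_app(example)))
    return failures / len(dataset)

# Toy stand-ins: the "old" app emits an unknown ID half the time,
# the "new" app never does.
dataset = ["query-1", "query-2", "query-3", "query-4"]
old_app = lambda q: "H-9999" if q in ("query-1", "query-3") else "H-1001"
new_app = lambda q: "H-1001"
check = lambda resp: resp == "H-1001"  # passes when the ID is known

old_rate = failure_rate(old_app, dataset, check)
new_rate = failure_rate(new_app, dataset, check)
print(f"old: {old_rate:.0%}, new: {new_rate:.0%}")
```

Ship only when the new rate is lower and the other error-mode checks show no regressions on the same dataset.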
After shipping, track the rate of this error mode as a custom metric and put a monitor on it so you’re alerted if it comes back.