You can explore individual traces and see what happened in a single request. But any non-trivial LLM app (agent, RAG, or chatbot) can fail in many different ways on the same input. When something goes wrong, the problem could be anywhere in the chain. This page shows three ways to do error analysis across many traces at once (Skills in your coding agent, Alyx in the product, or manual annotation) and how to validate a fix once you’ve found a pattern.

Error analysis

A single failing trace is a data point; a cluster of similar failures surfaces a pattern you actually want to fix. Three approaches get you from raw failures to labeled error modes. The manual approach walks through annotation and clustering step by step; Skills and Alyx use AI to produce categories directly.
Error analysis flow: production traces feed into a single annotate step (via Skills, Alyx, or UI), producing named failure patterns, which then feed into validating a fix and setting up a monitor
Run the arize-trace skill in your AI coding agent (Claude Code, Cursor, Codex, Windsurf, and 40+ others) to export traces and analyze them locally.
Install the skill
npx skills add Arize-ai/arize-skills --skill "arize-trace" --yes
Set up authentication
export ARIZE_API_KEY="YOUR_API_KEY"
export ARIZE_SPACE_ID="YOUR_SPACE_ID"
Ask your agent
# Ask your AI coding agent:
"Find me traces that have long latency,
and summarize what they have in common."
Other prompts to try: “why are these traces failing”, “group last night’s errors by root cause”. The skill runs ax traces export under the hood; your agent reads the output and reports patterns.
What you get back: a written summary of common patterns across the exported traces. It is fast to run and best for quickly spotting what is going wrong before you commit to deeper analysis.
Claude Code running the arize-trace skill to find traces with errors or latency over 5000ms in the skyserve-chatbot project
Skills and Alyx are the fastest way to do a first pass across large volumes of traces. Manual annotation takes longer but typically produces sharper, more reliable categories. Use it on the error modes that matter most.

Fix the error mode

If you used Skills or Alyx, pick the most impactful pattern from the output and give it a name (e.g., invalid_hotel_id). Then validate a fix before shipping:
  • Save failing traces to a dataset: those traces become your regression suite for future prompt tweaks, retrieval updates, tool changes, or any other fix.
  • Run an experiment: compare the old vs new version against the dataset; ship the fix only if the experiment shows improvement without regressions on other examples.
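The ship/no-ship decision in the experiment step can be sketched in plain Python. Everything here is illustrative: `version_fn`, the dataset shape, and the exact-match scoring are all assumptions standing in for however your app and evaluators are actually wired up.

```python
def evaluate(version_fn, dataset):
    """Fraction of dataset examples a given app version handles correctly.

    Hypothetical dataset shape: dicts with "input" and "expected" keys;
    real evaluators would likely score outputs less strictly than ==.
    """
    passed = sum(1 for ex in dataset if version_fn(ex["input"]) == ex["expected"])
    return passed / len(dataset)

def should_ship(old_fn, new_fn, failing_set, holdout_set):
    """Ship only if the fix improves on the saved failing traces
    without regressing on a holdout of other examples."""
    return (evaluate(new_fn, failing_set) > evaluate(old_fn, failing_set)
            and evaluate(new_fn, holdout_set) >= evaluate(old_fn, holdout_set))
```

The key design point is the second condition: a fix that wins on the regression suite but loses elsewhere is a trade, not an improvement.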
After shipping, track the rate of this error mode as a custom metric and put a monitor on it so you’re alerted if it comes back.
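The custom metric itself is just the rate of the named error mode over recent traces. A minimal sketch, assuming hypothetical trace dicts carrying an `error_mode` label from your annotation step:

```python
def error_mode_rate(traces, mode="invalid_hotel_id"):
    """Fraction of traces exhibiting the named error mode.

    Assumes hypothetical trace dicts with an "error_mode" field;
    the mode name matches whatever label you chose during annotation.
    """
    if not traces:
        return 0.0
    hits = sum(1 for t in traces if t.get("error_mode") == mode)
    return hits / len(traces)
```

A monitor on this rate alerts you if the error mode creeps back after the fix ships.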

Next step

You’ve found your error modes. Now turn their rates into metrics so you can track whether fixes actually move them:

Next: Set Up Custom Metrics