# See Agent Insights

> Find, categorize, and fix error modes across many traces using Skills, Alyx, or manual annotation

You can explore individual traces and see what happened in a single request. But any non-trivial LLM app (agent, RAG, or chatbot) can fail in many different ways on the same input. When something goes wrong, the problem could be anywhere in the chain.

For scheduled issue detection from production traces, see [Get started with Signal](/ax/agents/get-started-with-signal).

This page shows three ways to do error analysis across many traces at once (Skills in your coding agent, Alyx in the product, or manual annotation) and how to validate a fix once you've found a pattern.

## Error analysis

A single failing trace is a data point; a pattern across many failing traces is what you actually want to fix. Three approaches get you from failures to labeled error modes. The manual approach walks through annotation and clustering step by step; Skills and Alyx use AI to produce categories directly.

<Frame>
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/observe/agent-insights-overview.png" alt="Error analysis flow: production traces feed into a single annotate step (via Skills, Alyx, or UI), producing named failure patterns, which then feed into validating a fix and setting up a monitor" />
</Frame>

<Tabs>
  <Tab title="Arize Skills">
    Run the `arize-trace` skill in your AI coding agent (Claude Code, Cursor, Codex, Windsurf, and 40+ others) to export traces and analyze them locally.

    **Install skill**

    ```bash
    npx skills add Arize-ai/arize-skills --skill "arize-trace" --yes
    ```

    **Set up authentication**

    ```bash
    export ARIZE_API_KEY="YOUR_API_KEY"
    export ARIZE_SPACE_ID="YOUR_SPACE_ID"
    ```

    **Ask your agent**

    ```
    # Ask your AI coding agent:
    "Find me traces that have long latency,
    and summarize what they have in common."
    ```

    Other prompts to try: *"why are these traces failing"*, *"group last night's errors by root cause"*. The skill runs `ax traces export` under the hood; your agent reads the output and reports patterns.

    **What you get back:** a written summary of common patterns across the exported traces. Fast to run. Best for quickly spotting *what* is going wrong before you commit to deeper analysis.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/observe/skill_error_analysis.png" alt="Claude Code running the arize-trace skill to find traces with errors or latency over 5000ms in the skyserve-chatbot project" />
    </Frame>
  </Tab>

  <Tab title="Alyx">
    **Alyx** is the AI assistant built into the product. Click the Alyx button in the top-right of any Traces view to open it, then ask for categories directly:

    * "Find me the common types of questions users are asking"
    * "What are the most common error types in the last 24 hours?"
    * "Group failing traces by root cause"

    **What you get back:** a **Categorized Spans** widget with category names, counts, short descriptions, and example spans you can click into. You can also apply the categories back to your spans as annotations. Great for a fast, structured first pass across large volumes of traces.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/observe/alyx_agent_insight.png" alt="Alyx returning a Categorized Spans widget with category names, counts, descriptions, and example span IDs" />
    </Frame>
  </Tab>

  <Tab title="Manually">
    This is the deepest approach, and it works as a small pipeline: **traces in → human annotations → named error modes → an evaluator out**. Skills and Alyx give you fast coverage; manual annotation gives you categories you can stake a fix on.

    <Steps>
      <Step title="Open your traces">
        Filter to failures in [Explore your traces](/ax/observe/tracing/view-and-manage-traces). Plan to read 20–50 failing traces.

        **Output:** a filtered list of failing traces to work through.
      </Step>

      <Step title="Free-form annotate">
        For each trace, write a short free-form note about what went wrong. Don't force it into buckets yet; just capture what's weird. Examples: *"not specific enough"*, *"made up info"*, *"too verbose"*, *"called wrong tool"*.

        Use [Human Annotations](/ax/evaluate/human-annotations) to capture these notes directly on the traces. Once patterns emerge, formalize them as labels in an annotation config and apply them to matching traces.

        **Output:** a set of **human annotations** attached to real traces, the raw material for clustering.
      </Step>

      <Step title="Cluster into error modes">
        Group your labeled annotations into error modes. For example, several notes about bad hotel IDs might cluster into a named pattern such as `invalid_hotel_id`; other patterns could be `tool_call_hallucination`, `retrieval_miss`, or `agent_loop`. Expect 3–5 such patterns in total.

        Track the rate of each error mode as a [custom metric](/ax/observe/projects/custom-metrics-api) so you can see whether a fix actually lowers it.

        **Output:** 3–5 named error modes, each tracked as a **custom metric** so you can watch the rate over time.
      </Step>

      <Step title="Design evals from the error modes">
        For each labeled error mode, write an eval that catches it automatically (LLM-as-judge or a code check). The labeled traces become your regression suite. See [Evaluators](/ax/evaluate/evaluators).

        **Output:** one **evaluator per error mode**, with the labeled traces as its regression suite.
      </Step>
    </Steps>
  </Tab>
</Tabs>
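
The clustering step of the manual workflow ends with a rate per error mode. The snippet below is a minimal sketch of that arithmetic in plain Python, assuming you have copied your annotation labels into a list. The trace IDs, the sample data, and the `error_mode_rates` helper are hypothetical and are not part of any Arize API.

```python
from collections import Counter

# Hypothetical review results: (trace_id, error-mode label).
# None means the reviewer found no failure in that trace.
reviewed = [
    ("t1", "invalid_hotel_id"),
    ("t2", "retrieval_miss"),
    ("t3", "invalid_hotel_id"),
    ("t4", None),
    ("t5", "agent_loop"),
    ("t6", "invalid_hotel_id"),
]


def error_mode_rates(rows):
    """Return {error_mode: share of reviewed traces}, most frequent first."""
    total = len(rows)
    if total == 0:
        return {}
    counts = Counter(label for _, label in rows if label is not None)
    return {mode: n / total for mode, n in counts.most_common()}


rates = error_mode_rates(reviewed)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.0%}")
```

Sorting by frequency puts the most common error mode first, which is usually the one worth fixing first.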

<Info>
  Skills and Alyx are the fastest way to do a first pass across large volumes of traces. Manual annotation takes longer but typically produces sharper, more reliable categories. Use it on the error modes that matter most.
</Info>
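
For the code-check style of evaluator mentioned in the manual workflow, a check for `invalid_hotel_id` might look like the sketch below. It assumes hotel IDs follow an `H` plus six digits format and that you can supply a set of known IDs. Both are illustrative assumptions, as are the function name and the sample regression suite; none of this comes from the Arize SDK.

```python
import re

# Assumed ID format, for illustration only: "H" followed by six digits.
HOTEL_ID_PATTERN = re.compile(r"H\d{6}")


def invalid_hotel_id_check(tool_args, known_ids):
    """Code check: flag a tool call whose hotel_id is malformed or unknown."""
    hotel_id = tool_args.get("hotel_id")
    if not isinstance(hotel_id, str) or not HOTEL_ID_PATTERN.fullmatch(hotel_id):
        return {"label": "invalid_hotel_id", "explanation": f"malformed id: {hotel_id!r}"}
    if hotel_id not in known_ids:
        return {"label": "invalid_hotel_id", "explanation": f"unknown id: {hotel_id}"}
    return {"label": "ok", "explanation": "id is well formed and known"}


# Hypothetical regression suite: tool-call arguments plus the human label.
known_ids = {"H000123", "H000456"}
regression_suite = [
    ({"hotel_id": "H000123"}, "ok"),
    ({"hotel_id": "hotel-123"}, "invalid_hotel_id"),
    ({"hotel_id": "H999999"}, "invalid_hotel_id"),
    ({}, "invalid_hotel_id"),
]

# Agreement between the code check and the human labels.
matches = sum(
    invalid_hotel_id_check(args, known_ids)["label"] == human_label
    for args, human_label in regression_suite
)
print(f"{matches}/{len(regression_suite)} labels match")
```

Running the check against human-labeled traces first tells you whether it agrees with your annotations before you rely on it at scale.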

## Fix the error mode

If you used Skills or Alyx, pick the most impactful pattern from the output and give it a name (e.g., `invalid_hotel_id`). Then validate a fix before shipping:

* **Save failing traces to a dataset**: those traces become your regression suite for future prompt tweaks, retrieval updates, tool changes, or any other fix.
* **[Run an experiment](/ax/develop/datasets-and-experiments)**: compare the old and new versions against the dataset, and ship the fix only if the experiment shows improvement without regressions on other examples.
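
The "improvement without regressions" rule can be expressed as a simple comparison of per-example results. The sketch below assumes you have exported pass/fail outcomes for both versions as dictionaries keyed by example ID. The `compare_runs` helper and the sample data are hypothetical and only illustrate the decision logic.

```python
def compare_runs(baseline, candidate):
    """Compare per-example pass/fail results from two experiment runs.

    Both arguments map example_id -> True (passed) or False (failed).
    Only examples present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    fixed = [ex for ex in shared if not baseline[ex] and candidate[ex]]
    regressed = [ex for ex in shared if baseline[ex] and not candidate[ex]]
    return {
        "fixed": fixed,
        "regressed": regressed,
        # Ship only if something improved and nothing got worse.
        "ship": bool(fixed) and not regressed,
    }


# Hypothetical results for the old and new versions.
old_version = {"ex1": False, "ex2": False, "ex3": True, "ex4": True}
new_version = {"ex1": True, "ex2": True, "ex3": True, "ex4": False}

result = compare_runs(old_version, new_version)
print(result)
```

In this sample the new version fixes two examples but breaks `ex4`, so the sketch does not recommend shipping. Listing the regressed IDs tells you which traces to inspect next.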

After shipping, track the rate of this error mode as a [custom metric](/ax/observe/projects/custom-metrics-api) and put a [monitor](/ax/observe/production-monitoring) on it so you're alerted if it comes back.
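
To make the idea of a monitored rate concrete, here is a minimal sketch of the underlying calculation: the daily share of traces labeled with the error mode, compared against a threshold. In practice the custom metric and monitor do this for you. The dates, sample data, and the 30% threshold are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical evaluator output: (day, True if the trace was labeled invalid_hotel_id).
labeled_traces = [
    ("2025-01-01", False), ("2025-01-01", False),
    ("2025-01-01", True), ("2025-01-01", False),
    ("2025-01-02", True), ("2025-01-02", True),
    ("2025-01-02", False), ("2025-01-02", False),
]

ALERT_THRESHOLD = 0.30  # assumed value; pick one that fits your baseline rate


def daily_rates(rows):
    """Return {day: share of that day's traces carrying the error label}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for day, is_error in rows:
        totals[day] += 1
        hits[day] += int(is_error)
    return {day: hits[day] / totals[day] for day in sorted(totals)}


rates_by_day = daily_rates(labeled_traces)
alert_days = [day for day, rate in rates_by_day.items() if rate > ALERT_THRESHOLD]
print(rates_by_day)
print("alert on:", alert_days)
```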

***

## Next step

You've found your error modes. Now turn their rates into metrics so you can track whether fixes actually move them:

<Card title="Next: Set Up Custom Metrics" icon="arrow-right" href="/ax/observe/projects/custom-metrics-api" />
