> ## Documentation Index
> Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluate Your Agent

> Set up automated evaluations so every response is scored without manual review

In the [previous guide](/ax/get-started/get-started-tracing), you instrumented your chatbot and explored its traces. That works for a handful of test queries - but you can't read every response yourself.

[**Evaluations**](/ax/evaluate/evals-overview) solve this. An evaluation is an automated check - either an LLM judging another LLM's output, or a deterministic code check - that runs on your production data continuously. By the end of this guide, every response will be automatically scored and you'll be able to filter to find the ones that need attention.

<Frame caption="Evaluate trace data">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/evals-overview.gif" alt="Evaluations overview showing trace list with evaluation score columns and filters" />
</Frame>

<Info>
  This is **Part 2** of the Arize AX Get Started series. You should have completed the [Tracing guide](/ax/get-started/get-started-tracing) first, with traces flowing into your **skyserve-chatbot** project.
</Info>

## Choose how you want to work

Use [Arize Skills](/ax/agents/arize-skills) to have your coding agent run [evaluations](/ax/evaluate/evals-overview) from your editor, [Alyx](/ax/alyx) for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or **Code** to run them programmatically.

<Tabs>
  <Tab title="By Arize Skills">
    Use [Arize Skills](/ax/agents/arize-skills) with your coding agent to create an [evaluator](/ax/evaluate/create-evaluators), run it on traces as a [task](/ax/evaluate/run-evals-on-traces#create-a-task), and export spans to inspect failures. Install the skills plugin and follow [Set up Arize with AI coding agents](/ax/set-up-with-ai-assistants) for authentication and CLI setup. Then, follow the flow below.

    ### Step 1: Create eval

    [`arize-evaluator`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md)

    The skill covers **[LLM-as-a-Judge evaluators](/ax/evaluate/create-evaluators)** only. In your prompt, say which template you need (for example Hallucination) and which project the evaluator is for. For example, you might say:

    > Create a groundedness-check evaluator using the Hallucination template for skyserve-chatbot. The input column is question, the output column is response, and the context column is retrieved\_documents.

    Note that templates are a starting point - most teams customize the prompt criteria to match their specific rubric. Once the evaluator is created, you can ask your agent to revise it, such as:

    > Update the groundedness-check evaluator to only flag hallucinations when the claim contradicts a retrieved document, not when it's just unsupported.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/loaded%20skill-1.png" alt="Terminal showing Claude Code loading the arize-evaluator skill after a prompt to create a groundedness-check evaluator with the Hallucination template, column mapping, and follow-up revision for skyserve-chatbot." />
    </Frame>

    ### Step 2: Create a task to run your evaluator

    [`arize-evaluator`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md)

    A [task](/ax/evaluate/run-evals-on-traces#create-a-task) connects an [evaluator](/ax/evaluate/create-evaluators) to your project and defines cadence and sampling. See [Run online evals on traces](/ax/evaluate/run-evals-on-traces) for the full UI and configuration options.

    For example, you might say:

    > Set up a task to run groundedness-check continuously on incoming traces.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/Screenshot%202026-04-23%20at%202.35.26%E2%80%AFPM.png" alt="Coding agent terminal: choosing how an evaluator task runs—backfill, continuous on new spans, or both—then creating the evaluator and task with CLI commands" />
    </Frame>

    ### Step 3: See evaluation results on your traces

    [`arize-trace`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-trace/SKILL.md)

    After an [eval task](/ax/evaluate/run-evals-on-traces#create-a-task) has written labels to spans, export failures for triage. See [Viewing results](/ax/evaluate/run-evals-on-traces#viewing-results) for where scores appear in the UI.

    For example, you might say:

    > Export spans from skyserve-chatbot where groundedness-check failed this week
  </Tab>

  <Tab title="By Alyx">
    Open [Alyx](/ax/alyx) anywhere in the Arize platform and describe what you want in plain text. Alyx will guide you through each step - creating [evaluators](/ax/evaluate/create-evaluators), setting up [tasks](/ax/evaluate/run-evals-on-traces#create-a-task), and analyzing results - conversationally.

    ### Step 1: Create eval

    Describe what you need in plain text for an **LLM-as-a-Judge** [evaluator](/ax/evaluate/create-evaluators) (for example a Hallucination template), including eval name, judge model, and which span columns map to input, output, and context. For **code-based evaluators**, use **Evaluators** → **New Evaluator** → **Code** in the UI or see [Code evaluations](/ax/evaluate/evaluators/code-evaluations). For example, you might say:

    > Create a groundedness-check hallucination evaluator using GPT-4o as the judge. The input column is question, the output column is response, and the context column is retrieved\_documents.

    Note that templates are a starting point - most teams customize the prompt criteria to match their specific rubric. Once the evaluator is created, you can ask Alyx to revise it, such as:

    > Update the groundedness-check evaluator to only flag hallucinations when the claim contradicts a retrieved document, not when it's just unsupported. The input column is question, the output column is response, and the context column is retrieved\_documents.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/alyx-createevalground.png" alt="Alyx chat creating a groundedness-check hallucination evaluator with GPT-4o as judge, column mapping for question, response, and retrieved_documents, and EXPLANATION and LABEL output format." />
    </Frame>

    ### Step 2: Create a task to run your evaluator

    This mirrors creating a [task](/ax/evaluate/run-evals-on-traces#create-a-task) in the Evaluators UI. See [Run online evals on traces](/ax/evaluate/run-evals-on-traces).

    For example, you might say:

    > Create a task to run groundedness-check on skyserve-chatbot continuously on new traces at 100% sampling

    ### Step 3: See evaluation results on your traces

    Alyx can filter on the same [evaluation](/ax/evaluate/run-evals-on-traces#viewing-results) fields that [tasks](/ax/evaluate/run-evals-on-traces#create-a-task) write to each trace.

    For example, you might say:

    > Show me traces that failed the groundedness check this week and explain what went wrong

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/Screenshot%202026-04-23%20at%202.44.16%E2%80%AFPM.png" alt="Alyx on the Traces view for skyserve-chatbot: user asks for traces that failed the groundedness check this week; Alyx proposes filters and explains eval labels" />
    </Frame>
  </Tab>

  <Tab title="By UI">
    Use the Arize AX UI in **Evaluators** and **Traces** to configure **skyserve-chatbot** [evals](/ax/evaluate/evals-overview) end to end.

    ### Step 1: Understand the two types of evaluations

    Arize AX supports two kinds of evaluators. **[LLM-as-a-Judge evaluators](/ax/evaluate/create-evaluators)** use an LLM to assess quality - great for subjective dimensions like helpfulness or groundedness that are hard to check with code. **Code-based** [evaluators](/ax/evaluate/create-evaluators) are deterministic Python checks, ideal for objective conditions like empty responses or keyword presence. We'll focus on LLM-as-a-Judge for this example. For code-based evaluators, see [Create evaluators](/ax/evaluate/create-evaluators).

    ### Step 2: Create an LLM-as-a-Judge evaluator

    This example uses **LLM-as-a-Judge** only. In the left sidebar, click **Evaluators**, then **New Evaluator**. Select **LLM-as-a-Judge** and choose the **Hallucination** template — this checks whether the response contains claims not supported by the retrieved context.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/05-eval-hub-select-template.png" alt="Evaluator template selection showing Hallucination, Relevance, and other templates" />
    </Frame>

    1. Give it a name: `groundedness-check`
    2. Select your LLM provider and model (e.g., OpenAI GPT-4o)
    3. Review the template and customize the criteria to match your rubric — or leave it as-is to get started quickly
    4. Click **Create Evaluator**

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/06-eval-hubconfigure.png" alt="Configuring the groundedness-check evaluator with LLM provider and template" />
    </Frame>

    ### Step 3: Create a task to run your evaluator

    An evaluator on its own is just a template. To run it on your data, create a [**task**](/ax/evaluate/run-evals-on-traces#what-is-a-task) — an automation that applies your evaluator to incoming traces.

    1. Click **New Task** and select **LLM-as-a-Judge**
    2. Click **Add Evaluator** and select `groundedness-check`
    3. Set data source to your **skyserve-chatbot** project, cadence to **Run continuously on new incoming data**, and sampling to **100%**
    4. Map your span attributes to the template variables
    5. Click **Create Task**

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/07-eval-task-variable_mapping.png" alt="Variable mapping panel showing template variables mapped to span attributes" />
    </Frame>

    <br />

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/08-eval-task_running.png" alt="Running Eval Tasks tab showing the groundedness-check task active" />
    </Frame>

    ### Step 4: See evaluation results on your traces

    Wait a couple of minutes, then go back to your **skyserve-chatbot** project. You'll see [evaluation](/ax/evaluate/run-evals-on-traces#viewing-results) scores on each trace. Filter by score to find failures and click into any trace to see the evaluator's label, score, and explanation.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/10-traces-with-eval-scores.png" alt="Traces list with evaluation score columns showing factual and hallucinated labels" />
    </Frame>

    <br />

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/11-trace-with-eval-detail.png" alt="Trace detail showing hallucinated label with judge explanation" />
    </Frame>
  </Tab>

  <Tab title="By Code">
    Run this workflow from the [Python SDK](/api-clients/python/overview), [TypeScript SDK](/api-clients/typescript/version-1/overview), or [`ax` CLI](/api-clients/cli/overview). Some features are in alpha or beta - please check individual reference pages for details.

    | Step                                  | Python SDK                                                                            | TypeScript SDK                                                                            | CLI                                                      |
    | ------------------------------------- | ------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | -------------------------------------------------------- |
    | Create an evaluator                   | [Link](/api-clients/python/version-8/client-resources/evaluators#create-an-evaluator) | [Link](/api-clients/typescript/version-1/client-resources/evaluators#create-an-evaluator) | [Link](/api-clients/cli/evaluators#ax-evaluators-create) |
    | Create a task to run your evaluator   | [Link](/api-clients/python/version-8/client-resources/tasks#create-a-task)            | [Link](/api-clients/typescript/version-1/client-resources/tasks#create-a-task)            | [Link](/api-clients/cli/tasks#ax-tasks-create)           |
    | See evaluation results on your traces | [Link](/api-clients/python/version-8/client-resources/spans#filter-spans)             | [Link](/api-clients/typescript/version-1/client-resources/spans#list-spans)               | [Link](/api-clients/cli/spans#ax-spans-export)           |
  </Tab>
</Tabs>

## Congratulations!

Every response your chatbot generates is now automatically scored for quality. You've gone from *"I think it's working"* to *"I can measure exactly how well it's working."* Instead of manually reviewing traces, you can filter to just the ones that failed — and you have an explanation of what went wrong.

Your [evaluations](/ax/evaluate/evals-overview) have probably revealed a pattern: some responses score poorly because the chatbot makes claims that aren't in the policy documents. The system prompt says "be helpful," but it doesn't say "only use information from the provided documents." That's a prompt problem — and it's exactly what we'll fix next.

**Next up:** We'll walk through how to improve your agent using Arize's Prompt Playground and Experiments features.

<CardGroup cols={2}>
  <Card title="Next: Improve Your Agent" icon="arrow-right" href="/ax/get-started/get-started-improve-your-agent" />

  <Card title="Learn more about Evaluations" icon="book-open" href="/ax/evaluate/evaluators" />
</CardGroup>
