# Create evaluators

> Define LLM-as-a-judge and code evaluators to measure quality at scale across traces, spans, sessions, and experiments.

## From human review to automated evaluation

Once you understand your failure modes through human review, the next step is automating those checks. Evaluators let you measure quality at scale, turning subjective judgments into measurable results so you can track improvements over time and catch regressions early.

Once you create an evaluator, you run it over your data using a task. Task setup is covered on the [next page](/ax/evaluate/run-evals-on-traces).

<Frame caption="Evaluator detail">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/eval%20detail.png" alt="Arize AX Evaluators UI showing an LLM-as-a-judge evaluator with name and span scope, judge model and prompt template comparing human ground truth to model output, aligned and not aligned choice labels with scores, optimization direction set to maximize, and version history in the sidebar" />
</Frame>

## What is an evaluator

An evaluator looks at your data and returns a structured result: some combination of a label (e.g. correct / incorrect), a numeric score, and an explanation.

Evaluators are versioned so every change is tracked. When you create an evaluator you also set its scope (span, trace, session, or experiment), which determines what unit of data it sees and where results appear.

<Frame caption="How an LLM evaluator scores your data">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/llm_eval.png?v=20260415" alt="Diagram of an LLM-as-a-judge evaluator showing metadata including scope span trace session and experiment, prompt template with query reference and output variables, data injection into the template, and structured output with score label and explanation after Run eval" />
</Frame>

An **LLM-as-a-judge** evaluator defines a name, scope (span, trace, session, or experiment), judge model, and optimization direction. The prompt template references variables like `{query}`, `{reference}`, and `{output}`, which are mapped to your data at runtime. After running, the evaluator returns a structured output: a numeric score, a label (for example **Correct**), and an explanation.
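
As a rough illustration of that structured output, the sketch below attaches a score to the label a judge picked. `CHOICES`, `JudgeResult`, and `to_result` are hypothetical names made up for this example; they are not part of the Arize AX API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical label-to-score choices, similar in spirit to the
# choices you configure on an evaluator.
CHOICES: Dict[str, float] = {"correct": 1.0, "incorrect": 0.0}


@dataclass
class JudgeResult:
    """Illustrative stand-in for an evaluator's structured output."""

    label: str
    score: Optional[float]
    explanation: str


def to_result(
    label: str, explanation: str, choices: Dict[str, float] = CHOICES
) -> JudgeResult:
    """Attach the configured score to the label the judge picked."""
    key = label.strip().lower()
    if key not in choices:
        raise ValueError(f"Unexpected label: {label!r}")
    return JudgeResult(label=key, score=choices[key], explanation=explanation)


result = to_result("Correct", "The output matches the reference answer.")
print(result.label, result.score)
```

Rejecting labels outside the configured choices keeps the score column clean, which matters once results are aggregated over many rows.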

LLM evaluators also have an optimization direction. Choose maximize when higher scores are better and minimize when lower scores are better. This tells Arize how to color results so you can see at a glance what is performing well and what needs attention.

<Frame caption="How a code eval scores your data">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/code_eval_1.png?v=20260415" alt="Diagram of a code evaluator showing metadata with eval column name and scope span trace session and experiment, Python CodeEvaluator class and evaluate method, dataset_row span attribute keys as data inputs, and structured output with score label and explanation after Run eval" />
</Frame>

A **code evaluator** runs a Python class with an `evaluate` method. The `dataset_row` input is a dictionary of span attributes. Common keys include `attributes.output.value`, `attributes.input.value`, and `attributes.llm.token_count.total`. Use `.get()` to handle missing keys gracefully. The evaluator returns a structured output with a score, a label (for example `pass`), and an explanation.
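
To make the `.get()` advice concrete, here is a minimal sketch of the logic that could sit inside an `evaluate` method. It is written as a plain function that returns a dictionary so it runs without the Arize SDK. The function name, the `MAX_TOTAL_TOKENS` budget, and the choice to fail rows with a missing token count are assumptions for this example.

```python
from typing import Any, Dict, Mapping, Optional

MAX_TOTAL_TOKENS = 2000  # assumed budget for this example


def evaluate_token_budget(
    dataset_row: Optional[Mapping[str, Any]],
) -> Dict[str, Any]:
    """Pass when the span's total token count is within the budget."""
    row = dataset_row or {}
    # .get() returns None instead of raising KeyError when the key is absent.
    total = row.get("attributes.llm.token_count.total")

    if total is None:
        return {
            "label": "fail",
            "score": 0.0,
            "explanation": "Token count is missing from the span",
        }
    if total <= MAX_TOTAL_TOKENS:
        return {
            "label": "pass",
            "score": 1.0,
            "explanation": f"{total} tokens is within the budget of {MAX_TOTAL_TOKENS}",
        }
    return {
        "label": "fail",
        "score": 0.0,
        "explanation": f"{total} tokens exceeds the budget of {MAX_TOTAL_TOKENS}",
    }


print(evaluate_token_budget({"attributes.llm.token_count.total": 1500})["label"])
print(evaluate_token_budget({})["label"])
```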

## What kind of eval do I need?

Start with what you learned in error analysis. For each failure mode, ask: is this subjective or deterministic?

**Subjective or nuanced criteria:** use an [LLM-as-a-judge](/ax/evaluate/create-evaluators#llm-as-a-judge) evaluator for high-volume checks with stable column mappings, or [Harness as a Judge](/ax/evaluate/harness-as-a-judge) when the agent should read trace context at run time without upfront mappings. Examples: helpfulness, tone, correctness, agent trajectory quality.

**Objective and rule-based criteria:** use a code evaluator. Examples: JSON validation, regex matching, keyword presence.
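
For a sense of what rule-based checks look like, the sketch below implements the three examples with the Python standard library. These helper functions are illustrative only; they are not the managed evaluators in Arize AX, and details such as case-insensitive keyword matching are assumptions.

```python
import json
import re
from typing import Iterable


def is_json_parseable(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return False
    return True


def matches_regex(text: str, pattern: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


def contains_any_keyword(text: str, keywords: Iterable[str]) -> bool:
    """Case-insensitive check that at least one keyword appears."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def contains_all_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Case-insensitive check that every keyword appears."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


print(is_json_parseable('{"status": "ok"}'))
print(matches_regex("Order #12345 shipped", r"#\d{5}"))
print(contains_any_keyword("Please reset your password", ["refund", "password"]))
```

Each check returns the same answer for the same input, which is what makes code evaluators a good fit for objective criteria.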

Most applications mix evaluator types. You can attach multiple eval tasks to the same project for layered coverage.

<h2 id="scope">
  Scope
</h2>

When you create an evaluator, you define its scope, which tells the evaluator what unit of data to look at and where results appear.

| Scope       | Use when                                                                                                               |
| ----------- | ---------------------------------------------------------------------------------------------------------------------- |
| **Span**    | The evaluation target is self-contained in one operation—for example, whether this retrieval returned relevant results |
| **Trace**   | You need to assess reasoning quality or context retention across the full request flow                                 |
| **Session** | You want to evaluate the overall effectiveness of a multi-turn conversation                                            |
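
One way to picture scope is as a grouping key over spans: the same spans form more or fewer evaluation units depending on the scope you pick. The sketch below is a conceptual illustration only. The `span_id`, `trace_id`, and `session_id` field names and the `units_for_scope` helper are assumptions for this example, not the Arize AX data model.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical flattened spans; field names are assumptions for this sketch.
spans = [
    {"span_id": "s1", "trace_id": "t1", "session_id": "u1"},
    {"span_id": "s2", "trace_id": "t1", "session_id": "u1"},
    {"span_id": "s3", "trace_id": "t2", "session_id": "u1"},
]

SCOPE_KEYS = {"span": "span_id", "trace": "trace_id", "session": "session_id"}


def units_for_scope(rows: List[Dict[str, str]], scope: str) -> Dict[str, List[str]]:
    """Group span IDs into the units an evaluator would see at this scope."""
    key = SCOPE_KEYS[scope]
    units: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        units[row[key]].append(row["span_id"])
    return dict(units)


for scope in ("span", "trace", "session"):
    print(scope, len(units_for_scope(spans, scope)))
```

With these sample spans, span scope yields three units, trace scope two, and session scope one.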

<h2 id="llm-as-a-judge">
  LLM-as-a-Judge
</h2>

Use an LLM to assess outputs based on a prompt and criteria you define. You can create one from wherever you are in your workflow:

* **[Evaluator Hub](/ax/evaluate/create-evaluators#evaluator-hub)**: Create and manage evaluators to reuse across any project or experiment.
* **[Tracing](/ax/observe/tracing)**: Create an eval directly from a trace, span, or session when you spot something worth measuring.
* **[Datasets and experiments](/ax/develop/datasets-and-experiments)**: Set up an eval to score experiment runs against your golden dataset.
* **[Prompt Playground](/ax/prompts/prompt-playground)**: Test and iterate on your eval or run an eval on your prompt experiments.

<Frame caption="Evaluator Hub lists saved LLM judges and their configuration">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/eval%20hub.png" alt="Evaluators page with Evaluator Hub tab selected, showing a table of LLM evaluators with scope, judge model, maintainer, and usage" />
</Frame>

## Setup Instructions

Set up an [AI provider integration](/ax/security-and-settings/integrations-playground/overview), write your eval template, map variables to your data, and save to the Evaluator Hub. For when to use **span**, **trace**, or **session** scope, see **[Scope](#scope)** above.

You can create an LLM-as-a-judge evaluator directly in the UI, or have Alyx or Arize Skills create it for you.

<Tabs>
  <Tab title="By Arize Skills">
    Use the [Arize skills plugin](/ax/agents/arize-skills) in your coding agent and the [arize-evaluator skill](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md) to create evaluators via the `ax` CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:

    * "Create a hallucination evaluator for my project"
    * "Create an evaluator from blank with correct/incorrect labels"
    * "Update the prompt on my correctness evaluator"

    ![Coding agent terminal using the arize-evaluator skill and ax CLI to create an evaluator from natural language](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/eval-new.png)

    <br />

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/eval_skills.png" alt="Create Via Agent (Skills) modal with install command, API key and space ID setup, and an example prompt for your coding agent" />
    </Frame>
  </Tab>

  <Tab title="By Alyx">
    Describe what you want to measure in plain language and Alyx will write the evaluator prompt for you, generate the labels and score mapping, and save it to the Evaluator Hub.

    * "Create an evaluator that checks if customer support responses are empathetic and provide actionable next steps"
    * "Write a hallucination evaluator for my RAG pipeline"
    * "Create a correctness evaluator for my project"

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/evals%20alyx.png" alt="Trace view with Ask Alyx open, including Suggest an eval to catch similar trace errors" />
    </Frame>
  </Tab>

  <Tab title="By UI">
    <h3 id="tutorial-create-eval-from-trace-ui">From a trace or span</h3>

    In the tracing UI, open a trace and use **Add Trace Eval** in the trace header to score the full trace, or select a span and use **Add Span Eval** in the span details panel for span-level judges.

    <Frame caption="Add Trace Eval or Add Span Eval while inspecting a trace">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/spanslideover_eval.png" alt="Trace detail view with span tree and span input or output, highlighting Add Trace Eval in the header and Add Span Eval in the selected span panel, with Ask Alyx open on the right" />
    </Frame>

    <h3 id="tutorial-run-pre-built-evals-on-your-traces">Use a pre-built template</h3>

    **Use a pre-built template** if a generic quality dimension covers your needs. Arize AX includes tested templates for common scenarios:

    | Template         | What it measures                                              |
    | ---------------- | ------------------------------------------------------------- |
    | Hallucination    | Outputs containing information not supported by the reference |
    | Relevance        | Whether responses address the input question                  |
    | Toxicity         | Harmful or inappropriate content                              |
    | Helpfulness      | How useful the response is to the user                        |
    | Q\&A Correctness | Answer accuracy given reference documents                     |
    | Summarization    | Whether summaries capture the source material                 |
    | User Frustration | Signs of frustration in conversations                         |
    | Code Generation  | Code correctness and readability                              |
    | SQL Generation   | SQL query correctness                                         |
    | Tool Calling     | Function call accuracy and parameter extraction               |

    These templates are built into [Phoenix Evals](https://arize.com/docs/phoenix) and tested against benchmark datasets. You can access them directly in the Arize AX UI when creating an evaluation, or use them programmatically in code.<span id="step-2-create-an-evaluation-task" />

    1. Navigate to **New Eval Task** and select **LLM-as-a-Judge**
    2. Click **Add Evaluator** and select a template
    3. Set the scope — span, trace, or session
    4. Configure your judge model and AI provider (use a different model than the one you're evaluating)
    5. Map your trace or dataset attributes to the template variables
    6. Click **Create**

    <Frame caption="Create an eval from a tracing project">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/trace_eval.png" alt="Traces table with span kinds, filters, latency and token summaries, Eval Tasks control, and Ask Alyx panel" />
    </Frame>

    <br />

    <Frame caption="Create an eval from eval hub">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/create_eval.png" alt="Create Evaluator modal for an LLM judge showing Hallucination template, span scope, judge model, prompt rubric, optional test mapping, and Ask Alyx" />
    </Frame>

    <h3 id="tutorial-create-a-custom-llm-as-a-judge-eval">Create from blank</h3>

    **Create from blank** if your application has specific criteria that generic templates can't capture.

    1. Navigate to **New Eval Task** and select **LLM-as-a-Judge**
    2. Click **Add Evaluator**, then **Create From Blank**
    3. Name the evaluator and write a **prompt template**; see below for what makes a successful prompt template.
    4. Define **output labels** (e.g. correct / incorrect) and **scores**
    5. Configure the **judge model** and save

    <Frame caption="Write your own eval template">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/create_blank.png" alt="Create Evaluator modal for a new blank evaluator with name and span scope, eval template placeholder with variable hints, choice rows mapping labels to scores, optimization direction, and optional test evaluator panel with dataset and variable mapping" />
    </Frame>

    <h3 id="tutorial-writing-a-prompt-template">Writing a prompt template</h3>

    A successful prompt template has four elements:

    <h4>Define the judge's role</h4>

    In the first part of your prompt, define the judge's role. Avoid framing like "you are an expert evaluator": it rarely helps and can sometimes make results worse. Instead, focus on giving the judge context: what type of system it is evaluating, what industry or domain that system operates in, and what the judge's task is. For example, telling the judge "you are identifying issues with the relevance of an agent's responses so we can improve the experience for our users" establishes the system under evaluation, the quality dimension you care about, and the goal of the evaluation.

    <h4>Explicit criteria</h4>

    Avoid ambiguous or aspirational instructions like "a good response" or "a helpful answer". Focus on explicit instructions: what specific elements of a response would make it helpful? For example, for a financial agent, one criterion might be "Contains a specific buy/sell/hold recommendation", or for a customer service agent it might be "mentions specific actions to take in the UI to resolve the issue".

    Also include criteria for failure: what would make the response **not helpful**? This is often drawn from inspecting traces.

    Be careful not to over-specify. Modern LLMs follow instructions very closely, so a long list of rigid rules can constrain the judge in ways you don't intend. A criterion like "must contain a specific buy/sell/hold recommendation" may be too strict compared to a more open-ended goal like "consider whether the response provides an appropriate next step when the user asks for advice on whether to buy, sell, or hold an asset" — especially when the judge already has the context that it is evaluating a system inside a financial institution.

    <h4>Include labeled data</h4>

    Include variable names that will be expanded at runtime into the inputs and outputs of the template, e.g. `{input}` and `{output}`. Surround these variables with clear labels to the LLM so that it understands where your instructions end and inputs and outputs begin and end. XML tags are a clear way to mark where blocks begin and end:

    ```
    <user_query>
    {input}
    </user_query>

    <financial_report>
    {output}
    </financial_report>
    ```

    <h4>Don't specify the output format</h4>

    You don't need to tell the LLM what labels to output or describe a response format in your prompt. You define the possible responses externally, as the evaluator's **Choices** in the UI. AX defines a single tool with the available choices and requires the model to call it. For models that don't support tool calling, instructions are appended to the system prompt instead. In both cases, the output specification and parsing are handled for you.
  </Tab>
</Tabs>
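
The prompt template guidance in the **By UI** tab can be sketched in plain Python to preview what the judge will see. The template wording, the sample values, and the `render_prompt` helper are assumptions for this example; Arize AX performs the actual variable substitution when the evaluator runs.

```python
# The role and criteria come first, then XML-tagged inputs.
# No output format is described; labels are defined separately as choices.
TEMPLATE = """You are reviewing reports from a financial research agent so we can improve the experience for our users.
Consider whether the report provides an appropriate next step for the user's question.

<user_query>
{input}
</user_query>

<financial_report>
{output}
</financial_report>
"""


def render_prompt(template: str, **variables: str) -> str:
    """Fill the template variables; a missing variable raises KeyError."""
    return template.format(**variables)


prompt = render_prompt(
    TEMPLATE,
    input="Should I sell my ACME shares?",
    output="ACME revenue grew 4% this quarter. We recommend holding.",
)
print(prompt)
```

If a template needs literal braces with `str.format`, double them (`{{` and `}}`) so they are not treated as variables.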

<span id="tutorial-run-code-evals-on-your-traces" />

<h2 id="code-evaluations">
  Code evaluators
</h2>

Code evaluators run deterministic Python logic against your trace data. They are faster, cheaper, and more consistent than LLM evals for objective checks.

<Tabs>
  <Tab title="By Arize Skills">
    Use the [Arize skills plugin](/ax/agents/arize-skills) in your coding agent and the [arize-evaluator skill](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md) to create code evaluators and tasks via the `ax` CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:

    * "Create a code evaluator that checks if the output is valid JSON"
    * "Set up a regex evaluator that checks for a phone number in the response"
  </Tab>

  <Tab title="By Alyx">
    Ask Alyx to create a code evaluator for you:

    * "Create a code evaluator that checks if the output is valid JSON"
    * "Set up a regex evaluator that checks for a phone number in the response"
  </Tab>

  <Tab title="By UI">
    Arize AX includes managed code evaluators for common patterns:

    | Evaluator                 | What it checks                        | Parameters               |
    | ------------------------- | ------------------------------------- | ------------------------ |
    | **Matches Regex**         | Whether text matches a regex pattern  | span attribute, pattern  |
    | **JSON Parseable**        | Whether the output is valid JSON      | span attribute           |
    | **Contains Any Keyword**  | Whether any specified keywords appear | span attribute, keywords |
    | **Contains All Keywords** | Whether all specified keywords appear | span attribute, keywords |

    1. Navigate to the **Evaluators** page and click **New Evaluator**, then select **Code Evaluator**.
    2. Enter an evaluator name and set the **scope**.
    3. Define imports and write your evaluator class.
    4. Configure variable mapping.
    5. Expand **Advanced Options** if needed.
    6. Click **Create**.

    <Frame caption="Create code evals">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/code_eval.png" alt="Create Evaluator modal showing Python code for a span-level evaluator, sample data mapping, and Alyx in the side panel" />
    </Frame>

    <Info>
      Custom code evaluators are available on Arize AX Enterprise.
    </Info>
  </Tab>

  <Tab title="By Code">
    A custom code evaluator is a Python class that extends `CodeEvaluator` and implements a single `evaluate` method.

    ```python
    # Note: This example uses Python SDK v7
    from typing import Any, Mapping, Optional
    from arize.experimental.datasets.experiments.evaluators.base import (
        EvaluationResult,
        CodeEvaluator,
        JSONSerializable,
    )

    class ContainsHelloEvaluator(CodeEvaluator):
        def evaluate(
            self,
            *,
            dataset_row: Optional[Mapping[str, JSONSerializable]] = None,
            **kwargs: Any,
        ) -> EvaluationResult:
            output = dataset_row.get("attributes.output.value") if dataset_row else None
            text = str(output or "").lower()

            if "hello" in text:
                return EvaluationResult(
                    label="pass",
                    score=1.0,
                    explanation="Output contains 'hello'"
                )

            return EvaluationResult(
                label="fail",
                score=0.0,
                explanation="Output does not contain 'hello'"
            )
    ```

    The `dataset_row` dictionary contains span attributes. Common keys include `attributes.output.value`, `attributes.input.value`, and `attributes.llm.token_count.total`. Access values using `.get()` to handle missing keys gracefully.

    Your `evaluate` method must return an `EvaluationResult` with a `label` (e.g. "pass" / "fail"), an optional `score` (e.g. 1.0 / 0.0), and an `explanation`.

    You can import any of the following supported packages: `numpy`, `pandas`, `scipy`, `pydantic`, `jellyfish`. Contact support to request additional packages.

    While it's possible to write code directly in the UI, it's easier to iterate in a Python script or Colab notebook first. Use the **Test in Code** button in the task creation interface to get starter code, then copy-paste your evaluator into the UI when ready.
  </Tab>
</Tabs>
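
When iterating in a script or notebook, it can help to run the evaluator logic over a few hand-written rows before pasting it into the UI. The sketch below is one way to do that without the Arize SDK. The `contains_phone_number` and `run_locally` helpers, the simplified phone pattern, and the sample rows are all assumptions for this example.

```python
import re
from collections import Counter
from typing import Any, Callable, Dict, Iterable, Mapping

# Simplified US-style phone pattern, for illustration only.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def contains_phone_number(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Pass when the span output contains something shaped like a phone number."""
    text = str(row.get("attributes.output.value") or "")
    found = PHONE_PATTERN.search(text) is not None
    return {
        "label": "pass" if found else "fail",
        "score": 1.0 if found else 0.0,
        "explanation": "Phone number found" if found else "No phone number found",
    }


def run_locally(
    evaluate: Callable[[Mapping[str, Any]], Dict[str, Any]],
    rows: Iterable[Mapping[str, Any]],
) -> Counter:
    """Run an evaluate function over sample rows and count the labels."""
    return Counter(evaluate(row)["label"] for row in rows)


sample_rows = [
    {"attributes.output.value": "Call us at 415-555-0134."},
    {"attributes.output.value": "We will email you shortly."},
    {},  # a row with no output at all
]

summary = run_locally(contains_phone_number, sample_rows)
print(summary)
```

Including a row with missing keys in the sample set is a cheap way to confirm the evaluator does not raise on incomplete spans.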

<h2 id="evaluator-hub">
  Evaluator Hub
</h2>

LLM-as-a-judge evaluators are saved to the Evaluator Hub, your centralized place for managing, versioning, and reusing evaluators. The **Evaluators** page has two tabs:

**Evaluator Hub** is where evaluators are defined and managed. Create an evaluator once and attach it to any task (online monitoring, offline batch runs, or dataset experiments) without rewriting prompts or reconfiguring models. Every change is tracked with a version history and commit messages so you know what changed and why.

**Running Tasks** is where evaluation tasks execute evaluators against your data. A task connects an evaluator to a data source and runs it on a schedule or as a one-time batch.

When you attach an evaluator to a task, you map its template variables to your data source columns. This is what makes evaluators portable: the same evaluator works across datasets and projects with different schemas once you update the column mappings.
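
The portability idea can be sketched as one template rendered against two differently shaped rows, with only the mapping changing. The `render` helper, the template text, and the `question` / `answer` dataset columns are assumptions for this example; in Arize AX you configure the mapping on the task rather than in code.

```python
from typing import Any, Dict, Mapping

# One evaluator template, reused unchanged across data sources.
TEMPLATE = "Question: {input}\nAnswer: {output}"


def render(template: str, row: Mapping[str, Any], mapping: Dict[str, str]) -> str:
    """Fill template variables from a row using a variable-to-column mapping."""
    values = {var: str(row.get(col, "")) for var, col in mapping.items()}
    return template.format(**values)


# A trace-style row keyed by span attributes.
trace_row = {
    "attributes.input.value": "What is the refund window?",
    "attributes.output.value": "Refunds are available within 30 days.",
}
trace_mapping = {
    "input": "attributes.input.value",
    "output": "attributes.output.value",
}

# A dataset-style row with different (hypothetical) column names.
dataset_row = {
    "question": "What is the refund window?",
    "answer": "Refunds are available within 30 days.",
}
dataset_mapping = {"input": "question", "output": "answer"}

print(render(TEMPLATE, trace_row, trace_mapping))
print(render(TEMPLATE, dataset_row, dataset_mapping))
```

Both calls produce the same prompt, so the judge sees identical input regardless of which schema the data came from.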

## Eval best practices

<CardGroup cols={3}>
  <Card title="Binary vs Score Evals" href="https://arize.com/blog/testing-binary-vs-score-llm-evals-on-the-latest-models/" />

  <Card title="Should I Use the Same LLM for my Eval as My Agent?" href="https://arize.com/blog/should-i-use-the-same-llm-for-my-eval-as-my-agent-testing-self-evaluation-bias/" />

  <Card title="Eval Cookbooks" href="/ax/cookbooks/evaluation/tracing-and-evaluating-audio" />
</CardGroup>

## Further reading

* [Run online evals on traces](/ax/evaluate/run-evals-on-traces)
* [Run offline evals on experiments](/ax/evaluate/run-evals-on-experiments)
* [Code evals reference](/ax/evaluate/evaluators/code-evaluations)
