# Code Evals

> Use Python-based evaluators to run deterministic checks against your span data

# Why Use Code Evals?

When your evaluation criteria are deterministic and clear, code-based evaluators provide a consistent and efficient way to assess results. They are useful when you need to check for objective conditions, such as whether a keyword appears, a URL is valid, or a format follows a rule.

Arize offers off-the-shelf code evaluators for common evaluation tasks. When you need more control, you can create custom evaluators that align with your unique business logic or quality criteria.

# Code Evaluators in the Eval Hub

Code evaluators are managed in the **Eval Hub** (navigate to **Evaluators** in the left sidebar, then select the **Eval Hub** tab). Create an evaluator once and reuse it across multiple tasks. Evaluators are versioned with commit messages, so you can track changes over time.

# Creating a Code Evaluator

To create a code evaluator, navigate to the **Eval Hub** tab and click **New Evaluator**, then select **Code Evaluator**. You can start from a pre-built template or write your own.

## Pre-built Templates

Arize provides ready-to-use code evaluators for common checks: **Matches Regex**, **JSON Parseable**, **Contains any Keyword**, **Contains all Keywords**, and **Exact Match**. When you select a template, the full Python source code is displayed in the editor (read-only) so you can see exactly what it does. Configure the evaluator by setting its parameters (e.g., the regex pattern or keyword list) and save it to the Hub.

After you choose a template:

1. Provide a unique **Eval Column Name** in plain text. The name must be distinct from every other evaluator's name across all tasks. You can also set the **Evaluator Scope** and **Filters** in this step.
2. Define any required parameters for the selected code evaluator.

<Frame caption="Create Evaluator: define imports and evaluator class, then map sample data">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/code_eval.png" alt="Create Evaluator modal showing Python code for a span-level evaluator and test mapping against sample data" />
</Frame>

## Pre-built Evaluator Reference

Arize maintains these off-the-shelf code evaluators for you. Select one from the drop-down to load its code, then customize its behavior through parameters. The table below summarizes each built-in template.

<table><thead><tr><th width="182.37109375">Eval</th><th width="199.96484375">Description</th><th>Parameters</th></tr></thead><tbody><tr><td><strong>Matches Regex</strong></td><td>Checks whether the span attribute value matches a specified regex pattern</td><td><ul><li><strong>span attribute</strong>: The span attribute to evaluate.</li><li><strong>pattern</strong>: The regex pattern to match against the span attribute value.</li></ul></td></tr><tr><td><strong>JSON Parseable</strong></td><td>Checks whether the span attribute value is a string that parses as valid JSON</td><td><ul><li><strong>span attribute</strong>: The span attribute to evaluate.</li></ul></td></tr><tr><td><strong>Contains any Keyword</strong></td><td>Checks whether any of the specified keywords is present in the span attribute value</td><td><ul><li><strong>span attribute</strong>: The span attribute to evaluate.</li><li><strong>keywords</strong>: A list of keyword strings to search for. The evaluator flags a match if any keyword is present.</li></ul></td></tr><tr><td><strong>Contains all Keywords</strong></td><td>Checks whether all of the specified keywords are present in the span attribute value</td><td><ul><li><strong>span attribute</strong>: The span attribute to evaluate.</li><li><strong>keywords</strong>: A list of keyword strings to search for. The evaluator flags a match only when every keyword is present.</li></ul></td></tr><tr><td><strong>Exact Match</strong></td><td>Checks whether the span attribute value exactly matches the expected output</td><td><ul><li><strong>span attribute</strong>: The span attribute to evaluate.</li><li><strong>expected output</strong>: The reference string to compare against.</li></ul></td></tr></tbody></table>
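
For intuition, the five built-in checks correspond roughly to the plain-Python functions below. This is a sketch, not the source that Arize displays in the editor: the function names are hypothetical, and details such as matching the regex anywhere in the text (`re.search`) and comparing keywords case-sensitively are assumptions.

```python
import json
import re
from typing import List


def matches_regex(text: str, pattern: str) -> bool:
    # Assumption: the pattern may match anywhere in the text
    return re.search(pattern, text) is not None


def json_parseable(text: str) -> bool:
    # json.loads raises ValueError on bad JSON and TypeError on non-string input
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False


def contains_any_keyword(text: str, keywords: List[str]) -> bool:
    # Assumption: case-sensitive substring comparison
    return any(keyword in text for keyword in keywords)


def contains_all_keywords(text: str, keywords: List[str]) -> bool:
    return all(keyword in text for keyword in keywords)


def exact_match(text: str, expected: str) -> bool:
    return text == expected
```

Each function returns a plain boolean; an evaluator would wrap that result in a label, score, and explanation.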

## Writing Your Own Evaluator

<Info>
  Custom Code Evaluators are only available in [Arize AX Enterprise](https://arize.com/pricing/). Request a demo [here](https://arize.com/request-a-demo/).
</Info>

Select **Create Custom** to write your own evaluation logic in Python. The editor opens with a default template:

```python theme={null}
from typing import Any, Optional
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    CodeEvaluator,
)

class MyEvaluator(CodeEvaluator):
    """Custom evaluator -- edit this class to define your evaluation logic."""

    def evaluate(
        self,
        *,
        input1: Optional[str] = None,
        input2: Optional[str] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        is_valid = input1 is not None and input1.strip() != ""
        return EvaluationResult(
            label="valid" if is_valid else "invalid",
            score=float(is_valid),
            explanation=(
                "input1 is non-empty."
                if is_valid
                else "input1 is empty or missing."
            ),
        )
```

Replace `input1` and `input2` with named arguments that describe the data your evaluator needs. Each named keyword argument in the `evaluate()` method signature becomes a **variable** — when you [use the evaluator in a task](#using-a-code-evaluator), you'll map each variable to a span attribute or dataset column. You can name variables anything you want (e.g., `user_query`, `assistant_response`, `ground_truth`). `self`, `dataset_row`, and `**kwargs` are excluded from mapping.

The `evaluate()` method must return an `EvaluationResult` with:

* **`label`** — A categorical string (e.g., `"pass"`, `"fail"`)
* **`score`** — A numeric value quantifying the result
* **`explanation`** — A brief rationale for the result
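
As an illustration of renamed variables, the sketch below compares an `assistant_response` against a `ground_truth`. To keep it runnable without the Arize SDK, `EvaluationResult` and `CodeEvaluator` are defined here as minimal local stand-ins; in the Arize editor you would import them from the module shown in the default template instead. The class name `GroundTruthMatch` and the case-insensitive comparison are illustrative choices, not part of the product.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EvaluationResult:
    """Local stand-in for the SDK class, for illustration only."""
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None


class CodeEvaluator:
    """Local stand-in for the SDK base class, for illustration only."""


class GroundTruthMatch(CodeEvaluator):
    """Hypothetical evaluator: does the response match the ground truth?"""

    def evaluate(
        self,
        *,
        assistant_response: Optional[str] = None,
        ground_truth: Optional[str] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        if assistant_response is None or ground_truth is None:
            return EvaluationResult(
                label="fail",
                score=0.0,
                explanation="assistant_response or ground_truth is missing.",
            )
        # Compare after trimming whitespace and ignoring case
        matched = assistant_response.strip().lower() == ground_truth.strip().lower()
        return EvaluationResult(
            label="pass" if matched else "fail",
            score=float(matched),
            explanation=(
                "Response matches the ground truth."
                if matched
                else "Response differs from the ground truth."
            ),
        )


result = GroundTruthMatch().evaluate(assistant_response=" Paris ", ground_truth="paris")
print(result.label, result.score, result.explanation)
```

With this signature, the two variables offered for mapping would be `assistant_response` and `ground_truth`.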

### Static Input Parameters

In addition to variables (which change per row), you can define **static input parameters** — configuration values set once that stay the same for every row. This makes evaluators reusable without editing code. For example, a regex evaluator can be reused for different patterns just by changing its `pattern` parameter.

Static parameters are accessed via `self.param_name` and can be typed as `String`, `StringArray` (comma-separated list), or `Regex`.

```python theme={null}
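from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    CodeEvaluator,
)

class MyEvaluator(CodeEvaluator):
    def evaluate(self, *, output=None, **kwargs) -> EvaluationResult:
        # Length-based score: 100 characters maps to 1.0
        score = len(output) / 100 if output else 0.0
        # Assumes `threshold` is configured as a String parameter, so convert it first
        threshold = float(self.threshold)
        return EvaluationResult(
            label="pass" if score >= threshold else "fail",
            score=score,
            explanation=f"Score {score} vs threshold {threshold}",
        )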
```

In this example, `threshold` is a static parameter configured in the UI when creating or editing the evaluator.
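
A list-valued parameter can be handled in a similar way. The sketch below assumes a hypothetical `StringArray` parameter named `keywords`; because this page does not say whether such a value reaches your code as a list or as a comma-separated string, the helper accepts both. `EvaluationResult` and `CodeEvaluator` are local stand-ins so the example runs without the SDK, and assigning `evaluator.keywords` by hand only imitates the value you would set in the UI.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Union


@dataclass
class EvaluationResult:
    """Local stand-in for the SDK class, for illustration only."""
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None


class CodeEvaluator:
    """Local stand-in for the SDK base class, for illustration only."""


def as_list(value: Union[str, List[str], None]) -> List[str]:
    """Normalize a list or comma-separated string into non-empty, trimmed items."""
    if value is None:
        return []
    items = value.split(",") if isinstance(value, str) else value
    return [item.strip() for item in items if item.strip()]


class RequiredKeywords(CodeEvaluator):
    """Hypothetical evaluator: every configured keyword must appear in the output."""

    def evaluate(self, *, output: Optional[str] = None, **kwargs: Any) -> EvaluationResult:
        keywords = as_list(getattr(self, "keywords", None))
        text = (output or "").lower()
        # Case-insensitive substring check for each keyword
        missing = [k for k in keywords if k.lower() not in text]
        return EvaluationResult(
            label="fail" if missing else "pass",
            score=0.0 if missing else 1.0,
            explanation=f"Missing keywords: {missing}" if missing else "All keywords present.",
        )


evaluator = RequiredKeywords()
evaluator.keywords = "refund, order id"  # stand-in for the value set in the UI
result = evaluator.evaluate(output="Your refund for order ID 42 is on its way.")
print(result.label, result.explanation)
```

Changing `keywords` changes what the evaluator checks without touching the class, which is the point of static parameters.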

### Accessing Data via `dataset_row`

For evaluators that need access to more data than the named variables provide, include a `dataset_row` parameter:

```python theme={null}
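def evaluate(self, *, output=None, dataset_row=None, **kwargs) -> EvaluationResult:
    # `output` is a normal variable (mapped via column mapping)
    # `dataset_row` is a dict containing ALL mapped span attributes
    user_id = dataset_row.get("attributes.metadata.user_id") if dataset_row else None
    has_user = user_id is not None
    return EvaluationResult(
        label="has_user" if has_user else "no_user",
        score=float(has_user),
        explanation=f"user_id={user_id!r}",
    )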

When the system detects `dataset_row` in your method signature, the UI displays an **Additional Span Attributes** section where you can add extra attributes to include in the dict.

<Tip>Use **named variables** when you know exactly which data points your evaluator needs — they are cleaner and self-documenting. Use **`dataset_row`** when you need access to a dynamic or large set of attributes that may vary between use cases.</Tip>
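
To show the two styles working together, here is a sketch of a complete evaluator that takes `output` as a named variable and reads the user ID from `dataset_row`. The class name `NoUserIdLeak` and the check itself are hypothetical, and `EvaluationResult` and `CodeEvaluator` are local stand-ins so the example runs without the SDK.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EvaluationResult:
    """Local stand-in for the SDK class, for illustration only."""
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None


class CodeEvaluator:
    """Local stand-in for the SDK base class, for illustration only."""


class NoUserIdLeak(CodeEvaluator):
    """Hypothetical check: the output must not echo the user's ID."""

    def evaluate(
        self,
        *,
        output: Optional[str] = None,
        dataset_row: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Guard against a missing row or a missing attribute
        user_id = dataset_row.get("attributes.metadata.user_id") if dataset_row else None
        if not user_id:
            return EvaluationResult(
                label="pass",
                score=1.0,
                explanation="No user_id on this row; nothing to check.",
            )
        leaked = str(user_id) in (output or "")
        return EvaluationResult(
            label="fail" if leaked else "pass",
            score=0.0 if leaked else 1.0,
            explanation=(
                "Output contains the user_id."
                if leaked
                else "Output does not contain the user_id."
            ),
        )


row = {"attributes.metadata.user_id": "u-1842"}
result = NoUserIdLeak().evaluate(output="Hello u-1842, welcome back.", dataset_row=row)
print(result.label, result.explanation)
```

The guard on `dataset_row` matters because the dict, or any key inside it, may be absent for a given row.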

### Supported Packages

Custom evaluators run in a sandboxed environment with the following packages available:

```
numpy
pandas
scipy
arize[Datasets]==7.25.7
pydantic==2.11.7
jellyfish==1.2.0
```

If you need an additional package, contact the Arize support team.

### Editor Features

* **Real-time validation**: As you write code, the system validates it server-side and extracts variable names from your `evaluate()` method signature automatically. Errors are shown inline.
* **Expand-to-modal**: Click the expand button to open a full-screen editor for complex evaluator code.
* **Split-pane layout**: The left panel contains your code and configuration; the right panel shows variable mappings and a live data preview.

***

# Using a Code Evaluator

Once an evaluator exists in the Hub, you can attach it to a task to run against production data, or use it in the Prompt Playground for experiment runs.

## In a Task

There are two ways to add a code evaluator to a task:

* **From the Hub**: Find the evaluator in the Eval Hub tab and click **Use Evaluator**. This opens the task creation flow with the evaluator pre-selected.
* **During task creation**: Click **New Task**, then use the evaluator selector to pick an evaluator **From Hub**, **Create from Template**, or **Create Custom**.

When you add an evaluator to a task, the evaluator definition is read-only — you configure the [column mappings](#column-mapping) to connect the evaluator's variables to your specific data.

<Accordion title="Task Configuration Details">
  1. **Sampling Rate (%):** Define the percentage of data the task should run on (0–100).
     1. Sampling is applied at the **highest evaluator scope** in the task: `session > trace > span`
     2. Lower-level evaluators will run on all matching data within that sampled set
  2. **Task Filters:** Specify the data this task runs on. Filters match spans, or traces and sessions that contain matching spans.
  3. **Historical data:** When running on historical data, the maximum number of items is based on the highest evaluator scope in the task.
</Accordion>

## In the Prompt Playground

In a dataset-backed Playground session, you can attach code evaluators to experiment runs:

1. Click the evaluator button in the Playground
2. Select a Hub code evaluator or create one from a template
3. A task is auto-created and attached to the playground run

## Column Mapping

When you use an evaluator in a task, you map each of its variables to a span attribute or dataset column. This is what makes evaluators reusable — the same evaluator can work with different data schemas by changing the mappings.

The column mapping UI shows:

* One row per variable, with the variable name displayed as a label
* A dropdown for each variable where you select a span attribute (e.g., `attributes.output.value`, `attributes.input.value`) or dataset column
* The dropdown is populated from the actual columns available in your project or dataset
* You can also type a custom attribute name if it is not in the dropdown
* A **live preview** below each mapping shows the actual value from a sample span, so you can verify the mapping is correct before saving

<Tip>**Auto-mapping**: When a variable name exactly matches a column name in your data source, the system automatically fills in the mapping. You can override this at any time.</Tip>

**How it all fits together:**

1. **Define** — You write your `evaluate()` method with named parameters (e.g., `output`, `expected_output`)
2. **Discover** — The system parses your code and extracts the parameter names as variables
3. **Map** — You map each variable to a span attribute or dataset column in the UI (e.g., `output` → `attributes.output.value`)
4. **Run** — At runtime, the system fetches the mapped attributes and passes them as keyword arguments to `evaluate()`
5. **Result** — The evaluator returns an `EvaluationResult` with `label`, `score`, and `explanation`
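
The five steps above can be imitated locally with the standard library. The sketch below is not Arize's implementation: `discover_variables` and `run_with_mapping` are hypothetical helpers, a plain dict stands in for `EvaluationResult`, and `attributes.metadata.expected` is an invented attribute name used only for the demonstration.

```python
import inspect
from typing import Any, Callable, Dict, List

# Names that are never offered for mapping
EXCLUDED = {"self", "dataset_row"}


def discover_variables(evaluate_fn: Callable[..., Any]) -> List[str]:
    """Discover: return named parameters, skipping self, dataset_row, *args, **kwargs."""
    variadic = (inspect.Parameter.VAR_KEYWORD, inspect.Parameter.VAR_POSITIONAL)
    return [
        param.name
        for param in inspect.signature(evaluate_fn).parameters.values()
        if param.name not in EXCLUDED and param.kind not in variadic
    ]


def run_with_mapping(
    evaluate_fn: Callable[..., Any], mapping: Dict[str, str], row: Dict[str, Any]
) -> Any:
    """Run: fetch each mapped attribute and pass it as a keyword argument."""
    kwargs = {
        variable: row.get(mapping[variable])
        for variable in discover_variables(evaluate_fn)
        if variable in mapping
    }
    return evaluate_fn(**kwargs)


class ExactMatch:
    # Define: named parameters become variables
    def evaluate(self, *, output=None, expected_output=None, **kwargs):
        matched = output is not None and output == expected_output
        # Result: a dict standing in for EvaluationResult
        return {
            "label": "match" if matched else "mismatch",
            "score": float(matched),
            "explanation": f"output={output!r}, expected_output={expected_output!r}",
        }


variables = discover_variables(ExactMatch.evaluate)

# Map: variable name -> attribute name in the row
mapping = {
    "output": "attributes.output.value",
    "expected_output": "attributes.metadata.expected",  # hypothetical attribute
}
row = {"attributes.output.value": "42", "attributes.metadata.expected": "42"}
result = run_with_mapping(ExactMatch().evaluate, mapping, row)
print(variables, result["label"])
```

Because the evaluator only sees keyword arguments, pointing `mapping` at different attribute names is enough to reuse it on a different schema.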

***

# Testing Locally

While you can write and test evaluator code in the UI, it is often easier to iterate in a local Python environment. The **Test in Code** button generates starter code that you can run locally.

<Steps>
  <Step title="Click &#x22;Test in Code&#x22;">
    In the evaluator form, click the **Test in Code** button. The system validates your code first — if there are errors, they are shown before proceeding.
  </Step>

  <Step title="Copy the generated code">
    A slide-over opens with a generated Python notebook containing:

    * Environment setup (`pip install arize`)
    * Imports and data loading code using your actual API keys
    * Your evaluator class and a helper `run_evaluators()` function
  </Step>

  <Step title="Run locally">
    Paste the code into a Python script or Jupyter notebook and run it against your data. Once the results look right, copy the updated evaluator class back into the UI.
  </Step>
</Steps>
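
If you want to iterate before generating the notebook, a small harness over hand-written rows is often enough. The sketch below does not reproduce the generated `run_evaluators()` helper, whose exact shape is not shown on this page: `run_locally` is a hypothetical function, the sample rows are made up, and `EvaluationResult` and `CodeEvaluator` are local stand-ins so the example runs without the SDK.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class EvaluationResult:
    """Local stand-in for the SDK class, for illustration only."""
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None


class CodeEvaluator:
    """Local stand-in for the SDK base class, for illustration only."""


class NonEmptyOutput(CodeEvaluator):
    """Hypothetical evaluator: the output must contain non-whitespace text."""

    def evaluate(self, *, output: Optional[str] = None, **kwargs: Any) -> EvaluationResult:
        ok = bool(output and output.strip())
        return EvaluationResult(
            label="pass" if ok else "fail",
            score=float(ok),
            explanation="Output is non-empty." if ok else "Output is empty or missing.",
        )


def run_locally(
    evaluator: CodeEvaluator, rows: List[Dict[str, Any]], mapping: Dict[str, str]
) -> List[EvaluationResult]:
    """Apply the evaluator to each row, resolving variables through the mapping."""
    results = []
    for row in rows:
        kwargs = {variable: row.get(column) for variable, column in mapping.items()}
        results.append(evaluator.evaluate(**kwargs))
    return results


# Made-up sample rows keyed by span attribute name
rows = [
    {"attributes.output.value": "The order shipped on Monday."},
    {"attributes.output.value": "   "},
    {},
]
results = run_locally(NonEmptyOutput(), rows, {"output": "attributes.output.value"})
pass_rate = sum(r.score for r in results) / len(results)
for r in results:
    print(r.label, "-", r.explanation)
print(f"pass rate: {pass_rate:.2f}")
```

Including edge cases such as whitespace-only and missing values in the sample rows helps catch errors before the evaluator runs on real data.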

***

# Additional Resources

<CardGroup cols={3}>
  <Card title="Online Evals" href="/ax/evaluate/run-evals-on-traces" />

  <Card title="LLM as a Judge" href="/ax/evaluate/evaluators/llm-as-a-judge" />

  <Card title="Datasets and Experiments" href="/ax/develop/datasets-and-experiments/run-experiment" />
</CardGroup>
