
# Run offline evals on experiments

> Run offline evals on datasets and experiments before you ship. Ideal for CI/CD and regression checks.

## Eval-driven development

You have validated evaluators and a golden dataset. Now make them part of your development process. Every time you change a prompt, swap a model, or update your agent logic, score your experiment results against your golden dataset before deploying. Each evaluation runs in a controlled, repeatable environment so you can measure how new versions behave before exposing them to real users.

The same evaluators you use for production monitoring work here. Define the criteria once, apply them everywhere.

This page covers running evaluators on experiments. To create experiments and datasets, see [Datasets and experiments](/ax/develop/datasets-and-experiments).

<Frame caption="View eval results">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/view%20eval%20experiment%20result.png" alt="Experiments tab with summary metrics chart and experiments table, row menu open with View Eval Trace and View Logs" />
</Frame>

## Run on an experiment

Once you have an experiment, attach evaluators and score the results.

<Tabs>
  <Tab title="By Arize Skills">
    Use the [arize-evaluator skill](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md) to create an eval task against an experiment via the `ax` CLI. Install the [Arize skills plugin](/ax/agents/arize-skills) in your coding agent if you have not already. Then ask your agent:

    * "Create an eval task for my v2-prompt-test experiment using my correctness evaluator"
    * "Trigger an eval run on my latest experiment and wait for results"
    * "Score my v2-prompt-test experiment with my hallucination evaluator"

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/experiment-task.png" alt="Terminal showing ax datasets create and help output, ax experiments help, and an in-progress v2-prompt-test experiment with steps to create dataset, run experiment, create evaluator, create eval task, and trigger run" />
    </Frame>

    The skill resolves dataset and experiment IDs, configures column mappings, and triggers the run using `ax tasks create` and `ax tasks trigger-run`.
  </Tab>

  <Tab title="By Alyx">
    Ask [Alyx](/ax/alyx/meet-alyx) to score your experiment results:

    * "Run my correctness evaluator on my latest experiment"
    * "Score my v2-prompt-test experiment with my hallucination eval and show me where it fails"
    * "Compare eval scores between my last two experiments"

    You can also use Alyx directly from your experiment page to write a custom evaluator template.
  </Tab>

  <Tab title="By UI">
    1. Navigate to your experiment page and click **Add Evaluator**.
    2. Define your evaluator or select a pre-built template.

    <Frame caption="Create Evaluator">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/create%20eval%20experiment.png" alt="Create Evaluator modal for a span-level hallucination judge with model and prompt template, optional test on dataset with example preview and variable mapping, and Ask Alyx" />
    </Frame>

    3. Choose the experiments you want to evaluate from the dropdown.
    4. Click **Run** and view results in the experiment table (see [View eval results](#view-results)).

    <Frame caption="Run on Experiment">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/run%20eval%20experiment.png" alt="Run on Experiment modal on a dataset Experiments tab showing selected experiments, options to skip or override existing evaluation labels with the same eval column name, and Cancel and Run actions" />
    </Frame>

    Need help writing a custom evaluator template? Use Alyx to generate one for you directly from the experiment page.
  </Tab>

  <Tab title="By Code">
    **Here's the simplest version of an evaluation function:**

    ```python theme={null}
    def is_true(output):
        # output is the task output
        return output == True
    ```

    ### Evaluation Inputs

    The evaluator function can take the following optional arguments:

    <table><thead><tr><th width="178">Parameter name</th><th width="265">Description</th><th>Example</th></tr></thead><tbody><tr><td><code>dataset\_row</code></td><td>The entire dataset row as a dictionary, with each column as a key</td><td><code>def eval(dataset\_row): ...</code></td></tr><tr><td><code>input</code></td><td>The experiment run input, mapped to <code>attributes.input.value</code></td><td><code>def eval(input): ...</code></td></tr><tr><td><code>output</code></td><td>The experiment run output</td><td><code>def eval(output): ...</code></td></tr><tr><td><code>dataset\_output</code></td><td>The expected output if available, mapped to <code>attributes.output.value</code></td><td><code>def eval(dataset\_output): ...</code></td></tr><tr><td><code>metadata</code></td><td>The <code>dataset\_row</code> metadata, mapped to <code>attributes.metadata</code></td><td><code>def eval(metadata): ...</code></td></tr></tbody></table>
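
    As a rough illustration of how these arguments can be combined, the sketch below reads `output`, `dataset_output`, and `metadata` in one evaluator. The function name `matches_expected` and the `case_sensitive` metadata key are hypothetical and exist only for this example. The function is called directly here for a local check; in an experiment run, the arguments are expected to be filled in based on the parameter names listed above.

    ```python
    def matches_expected(output, dataset_output, metadata=None):
        """Compare the task output with the expected output from the dataset row."""
        # `case_sensitive` is a hypothetical metadata key, not a built-in Arize field.
        case_sensitive = bool((metadata or {}).get("case_sensitive", False))
        actual = str(output).strip()
        expected = str(dataset_output).strip()
        if not case_sensitive:
            actual, expected = actual.lower(), expected.lower()
        return actual == expected


    # Direct call for a local sanity check before attaching it to an experiment.
    print(matches_expected("thomas edison ", "Thomas Edison"))
    ```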

    ### Evaluation Outputs

    We support several types of evaluation outputs. Label must be a string. Score must range from 0.0 to 1.0. Explanation must be a string.

    <table>
      <thead>
        <tr>
          <th>Evaluator Output Type</th>
          <th>Example</th>
          <th>How it appears in Arize AX</th>
        </tr>
      </thead>

      <tbody>
        <tr>
          <td><code>boolean</code></td>
          <td><code>True</code></td>
          <td>label = 'True'<br />score = 1.0</td>
        </tr>

        <tr>
          <td><code>float</code></td>
          <td><code>1.0</code></td>
          <td>score = 1.0</td>
        </tr>

        <tr>
          <td><code>string</code></td>
          <td><code>"reasonable"</code></td>
          <td>label = 'reasonable'</td>
        </tr>

        <tr>
          <td><code>tuple</code></td>
          <td><code>(1.0, "my explanation notes")</code></td>
          <td>score = 1.0<br />explanation = 'my explanation notes'</td>
        </tr>

        <tr>
          <td><code>tuple</code></td>
          <td><code>("True", 1.0, "my explanation")</code></td>
          <td>label = 'True'<br />score = 1.0<br />explanation = 'my explanation'</td>
        </tr>

        <tr>
          <td><code>EvaluationResult</code></td>

          <td>
            <p><code>EvaluationResult(</code></p>
            <p><code>score=1,</code></p>
            <p><code>label='reasonable', explanation='explanation',</code></p>
            <p><code>metadata={}</code></p>
            <p><code>)</code></p>
          </td>

          <td>
            <p>score = 1.0</p>
            <p>label='reasonable'<br />explanation = 'explanation'<br />metadata={}</p>
          </td>
        </tr>
      </tbody>
    </table>
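
    To make the plain-Python output types in the table concrete, here is a small sketch with one evaluator per return shape. The function names and the blank-output and length checks are illustrative assumptions, not part of the SDK; `EvaluationResult` is left out so the block runs without Arize installed.

    ```python
    def returns_boolean(output):
        # bool: recorded as both a label and a score
        return bool(str(output).strip())


    def returns_float(output):
        # float: recorded as a score, kept within 0.0 to 1.0
        return min(len(str(output)) / 100, 1.0)


    def returns_label(output):
        # str: recorded as a label
        return "non-empty" if str(output).strip() else "empty"


    def returns_score_and_explanation(output):
        # (score, explanation) tuple
        score = 1.0 if str(output).strip() else 0.0
        return score, f"output has {len(str(output))} characters"


    def returns_label_score_explanation(output):
        # (label, score, explanation) tuple
        ok = bool(str(output).strip())
        return ("non-empty" if ok else "empty", float(ok), "checked for blank output")
    ```

    Any of these functions can be called directly with a sample output to check the returned value before you attach it to an experiment.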

    To use the [`EvaluationResult` class](https://arize-client-python.readthedocs.io/en/latest/llm-api/types.html#arize.experimental.datasets.experiments.types.EvaluationResult), import it with the statement for your SDK version:

    * **Version 7:** `from arize.experimental.datasets.experiments.types import EvaluationResult`
    * **Version 8:** `from arize.experiments import EvaluationResult`

    An `EvaluationResult` must include a label, a score, or both. An evaluation cannot have no result.

    Here is an example of an evaluator which compares the output to a value in the `dataset_row`.

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      from arize.experiments import EvaluationResult
      import pandas as pd

      # Example dataset
      inventions_dataset = pd.DataFrame({
          "attributes.input.value": ["Telephone", "Light Bulb"],
          "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
      })

      def is_correct(output, dataset_row):
          expected = dataset_row.get("attributes.output.value")
          correct = expected in output
          return EvaluationResult(
              score=int(correct),
              label="correct" if correct else "incorrect",
              explanation="Evaluator explanation here"
          )
      ```

      ```python Python SDK v7 theme={null}
      from arize.experimental.datasets.experiments.types import EvaluationResult
      import pandas as pd

      # Example dataset
      inventions_dataset = pd.DataFrame({
          "attributes.input.value": ["Telephone", "Light Bulb"],
          "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
      })

      def is_correct(output, dataset_row):
          expected = dataset_row.get("attributes.output.value")
          correct = expected in output
          return EvaluationResult(
              score=int(correct),
              label="correct" if correct else "incorrect",
              explanation="Evaluator explanation here"
          )
      ```
    </CodeGroup>

    To run the experiment:

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      experiment, experiment_df = client.experiments.run(
          name="basic-experiment",
          dataset=dataset_id,
          task=answer_question,
          evaluators=[is_correct],
          concurrency=10,
          exit_on_error=False,
          dry_run=False,
      )
      ```

      ```python Python SDK v7 theme={null}
      arize_client.run_experiment(
          space_id="your-arize-space-id",
          dataset_id=dataset_id,
          task=answer_question,
          evaluators=[is_correct],
          experiment_name="basic-experiment",
          concurrency=10,
          exit_on_error=False,
          dry_run=False,
      )
      ```
    </CodeGroup>

    ### Create an LLM Evaluator

    LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.

    Arize AX supports a large number of LLM evaluators out of the box with LLM Classify: [Arize Templates](/ax/evaluate/evaluators/llm-as-a-judge#arize-eval-templates). You can also define custom LLM evaluators.

    Here's an example of an LLM evaluator that checks for correctness in the model output:

    ```python theme={null}
    CORRECTNESS_PROMPT_TEMPLATE = """
    You are given an invention (input) and an inventor (output). Determine whether the inventor correctly corresponds to the invention.

    [BEGIN DATA]
    [Invention]: {invention}
    [Output]: {output}
    [END DATA]

    Explain your reasoning step by step, then provide a single-word LABEL at the end: either "correct" or "incorrect".

    Format:

    EXPLANATION: Your reasoning about why the output is correct or incorrect
    LABEL: "correct" or "incorrect"
    ************
    """
    ```

    #### Run Evaluation

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      from phoenix.evals import llm_classify, OpenAIModel
      from arize.experiments import EvaluationResult
      import pandas as pd

      def correctness_eval(output, dataset_row):
          # Get the invention (the dataset input) that the task was asked about
          invention = dataset_row.get("attributes.input.value")

          eval_df = llm_classify(
              dataframe=pd.DataFrame([{"invention": invention, "output": output}]),
              template=CORRECTNESS_PROMPT_TEMPLATE,
              model=OpenAIModel(model="gpt-4o-mini", api_key="your-openai-api-key"),
              rails=["correct", "incorrect"],
              provide_explanation=True,
          )

          # Map the eval df to EvaluationResult
          label = eval_df["label"][0]
          score = 1 if label == "correct" else 0
          explanation = eval_df["explanation"][0]

          return EvaluationResult(label=label, score=score, explanation=explanation)
      ```

      ```python Python SDK v7 theme={null}
      from phoenix.evals import llm_classify, OpenAIModel
      from arize.experimental.datasets.experiments.types import EvaluationResult
      import pandas as pd

      def correctness_eval(output, dataset_row):
          # Get the invention (the dataset input) that the task was asked about
          invention = dataset_row.get("attributes.input.value")

          eval_df = llm_classify(
              dataframe=pd.DataFrame([{"invention": invention, "output": output}]),
              template=CORRECTNESS_PROMPT_TEMPLATE,
              model=OpenAIModel(model="gpt-4o-mini", api_key="your-openai-api-key"),
              rails=["correct", "incorrect"],
              provide_explanation=True,
          )

          # Map the eval df to EvaluationResult
          label = eval_df["label"][0]
          score = 1 if label == "correct" else 0
          explanation = eval_df["explanation"][0]

          return EvaluationResult(label=label, score=score, explanation=explanation)
      ```
    </CodeGroup>

    In this example, `correctness_eval` evaluates whether the output of an experiment is correct. The `llm_classify` function runs the eval, and the evaluator returns an `EvaluationResult` that includes a score, label, and explanation.

    Once you define your evaluator function, you can use it in your experiment run like this:

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      experiment, experiment_df = client.experiments.run(
          name="test-experiment",
          dataset=dataset_id,
          task=answer_question,
          evaluators=[correctness_eval],
      )
      ```

      ```python Python SDK v7 theme={null}
      arize_client.run_experiment(
          space_id="your-arize-space-id",
          dataset_id=dataset_id,
          task=answer_question,
          evaluators=[correctness_eval],
          experiment_name="test-experiment",
      )
      ```
    </CodeGroup>

    You can customize LLM evaluators to suit your experiment's needs: update the template with your instructions and the rails with the allowed output labels.

    ### Create a Code Evaluator

    Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.

    #### Custom Code Evaluators

    Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.

    For example, let’s say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range:

    ```python theme={null}
    def in_bounds(output):
        return 1 <= output <= 100
    ```

    When you pass the `in_bounds` function in the `evaluators` list of `client.experiments.run` (SDK v8) or `run_experiment` (SDK v7), an evaluation is generated for each experiment run, indicating whether the output is within the allowed range. This lets you quickly check the validity of your experiment’s outputs against custom criteria.
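
    The `in_bounds` example assumes the task output is already numeric. If your task might return a string such as `"42"` or free text, a more defensive variant could look like the sketch below. The name `in_bounds_safe` is hypothetical, and treating unparseable output as a failed check rather than an error is a design choice you may want to change.

    ```python
    def in_bounds_safe(output):
        """Return True only when output parses as a number between 1 and 100."""
        try:
            value = float(output)
        except (TypeError, ValueError):
            # Non-numeric output counts as a failed check instead of raising.
            return False
        return 1 <= value <= 100
    ```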

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      experiment, experiment_df = client.experiments.run(
          name=experiment_name,
          dataset=dataset_id,
          task=answer_question,
          evaluators=[in_bounds],
      )
      ```

      ```python Python SDK v7 theme={null}
      experiment = arize_client.run_experiment(
          space_id="your-arize-space-id",
          dataset_id=dataset_id,
          task=answer_question,
          evaluators=[in_bounds],
          experiment_name=experiment_name,
      )
      ```
    </CodeGroup>

    #### Prebuilt Phoenix Code Evaluators

    You can also use our open-source Phoenix [pre-built code evaluators](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/using-evaluators#code-evaluators).

    Pre-built evaluators can be passed directly to the `evaluators` parameter when running experiments.

    <Tip>
      Use `dry_run=True` to test without logging results. Use `concurrency=10` to speed up large runs. Start with synchronous evaluators when debugging, then switch to async for speed.
    </Tip>

    For class-based evaluators and additional patterns, see [Advanced options for running evals on experiments via code](/ax/develop/datasets-and-experiments/create-an-experiment-evaluator/advanced-options-for-running-evals-on-experiments-via-code). For the full `client.experiments` API, see the Python SDK [Experiments](/api-clients/python/version-8/client-resources/experiments) reference.
  </Tab>
</Tabs>

<h2 id="view-results">
  View eval results
</h2>

Results appear in the experiment table alongside each row. Compare multiple experiment runs side by side, filter by eval label, and drill into individual rows to inspect the task output and evaluator explanation.

Open **View Eval Trace** from an experiment row when you want the evaluated trace. For per-example outputs and eval explanations together, use **Compare Experiments** and the **Table** tab.

<Frame caption="Compare Experiments">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/compare%20exp%20eval%20result.png" alt="Compare Experiments table tab showing dataset rows, experiment outputs, Evals column with correct and incorrect tags, and an open popover with score, label, and explanation for an evaluator" />
</Frame>

## The CI/CD workflow

1. Make your change (prompt update, model swap, logic change).
2. Run your experiment against the golden dataset.
3. Score the experiment results with your evaluators.
4. Compare eval scores against the previous experiment run.
5. If scores regress, fix before merging.

You can integrate this into GitHub Actions or GitLab CI/CD so it runs automatically on every push or pull request. See [GitHub Action basics](/ax/develop/datasets-and-experiments/ci-cd-for-automated-experiments/github-action-basics) and [GitLab CI/CD basics](/ax/develop/datasets-and-experiments/ci-cd-for-automated-experiments/gitlab-ci-cd-basics) for full guides.
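
Steps 4 and 5 of the workflow can be approximated with a small gate script that a CI job runs after the experiment finishes. This is a minimal sketch, not an Arize API: `has_regressed`, the `tolerance` default, and the hard-coded score lists are assumptions for illustration. In practice you would load the scores from the previous and current experiment results.

```python
import sys
from statistics import mean


def has_regressed(baseline_scores, candidate_scores, tolerance=0.02):
    """Return True when the candidate mean drops below the baseline mean by more than tolerance."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance


if __name__ == "__main__":
    # Placeholder scores; replace with values read from your experiment results.
    baseline = [1.0, 1.0, 0.0, 1.0]
    candidate = [1.0, 1.0, 1.0, 1.0]
    if has_regressed(baseline, candidate):
        print("Eval scores regressed; failing the build.")
        sys.exit(1)  # A non-zero exit code fails the CI job.
    print("Eval scores held steady or improved.")
```

A small tolerance keeps the gate from failing on minor score fluctuations between runs; set it to `0` if any drop should block the merge.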

## Related workflows

* Define evaluators once in [Create evaluators](/ax/evaluate/create-evaluators), then attach them to experiments.
* After you ship, use [Run online evals on traces](/ax/evaluate/run-evals-on-traces) to monitor production with the same criteria where possible.
* Follow the **Develop** tutorials: [Defining the dataset](/ax/develop/tutorial/defining-the-dataset), [Run experiments with code evals](/ax/develop/tutorial/run-experiments-with-code-evals), [Run experiments with LLM judge](/ax/develop/tutorial/run-experiments-with-llm-judge), and [Iteration workflow](/ax/develop/tutorial/iteration-workflow-experiments).
