# Run online evals on traces

> Run online evals over your production trace data. Ground what you automate in trace review and a clear failure taxonomy first.

## What is a task?

A task connects your evaluator to a data source and defines what to score and how often. You create an evaluator once and reuse it across tasks — pointing it at different projects, datasets, or experiments. Results attach automatically and surface in your project or experiment.

Most teams start with a one-time backfill on historical data to establish a baseline, then set up an ongoing task from there.

Before creating a task, make sure you have traces flowing into Arize AX and an LLM provider configured. See [AI Provider Integrations](/ax/security-and-settings/integrations-playground/overview).

<Frame caption="Evaluator and task workflow">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/tasks.png" alt="Workflow diagram from create evaluator in Eval Hub through create task with target and sampling, runs over tracing or experiment data with scope, view results with scores and task logs, and investigate with view evals or jump to trace, with a loop back to edit or improve the evaluator" />
</Frame>

## Start from real traces

Before automating, review real interactions in your [tracing project](/ax/observe/tracing) to understand where things go wrong. Group failure patterns into a taxonomy — each category can map to an evaluator or filter. To capture those categories as structured labels, see [Human review](/ax/evaluate/human-review).

<Frame caption="Playground Traces">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/eval%20metric%20traces.png" alt="Arize AX tracing project showing Playground Traces with summary cards for traffic, span latency, tokens and cost, a traces table with LLM rows and input and output columns, filters and date range, and Ask Alyx open on the right" />
</Frame>

<span id="setting-up-online-evals" />

<h2 id="create-a-task">
  Create a task
</h2>

There are several ways to create a task and run your evaluator on traces.

<Frame caption="Create a task to run over your data">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/create%20task.png" alt="Evaluators page with Evaluator Hub tab and New Task side panel showing task name, project and trace source with an LLM span filter, an added span evaluator, Run Continuously on with 100 percent sampling, and Create Task" />
</Frame>

<Tabs>
  <Tab title="By Arize Skills">
    Use the [arize-evaluator skill](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-evaluator/SKILL.md) to create and trigger tasks via the `ax` CLI without leaving your editor. Install the [Arize skills plugin](/ax/agents/arize-skills) in your coding agent if you have not already. Then ask your agent:

    * "Create a continuous task to run my hallucination evaluator on my project"
    * "Trigger a backfill eval run on my project for the last 7 days"
    * "Set up a task that only evaluates LLM spans"

    <Frame caption="Task skill">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/task-skill.png" alt="Terminal showing ax tasks create for a Hallucination Monitor continuous task, success with LLM span filter and input output column mapping, and agent follow-up explaining LLM-only span scoring" />
    </Frame>
  </Tab>

  <Tab title="By Alyx">
    Ask [Alyx](/ax/alyx/meet-alyx) to create a task and run your evaluator on your traces:

    * "Run my correctness evaluator continuously on my production traces"
    * "Backfill my hallucination eval on the last 7 days of spans"
    * "Set up a task to score only LLM spans with my relevance evaluator"

    <Frame caption="Ask Alyx">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/alyx%20task.png" alt="Tracing project with traces table and Ask Alyx open, showing Alyx confirming a continuous eval task with project, sampling, evaluator, and label details" />
    </Frame>
  </Tab>

  <Tab title="By UI">
    You can create a task from several places in Arize AX: from the **Evaluators** page in the left sidebar, from the **Projects** page, or directly from within a span.

    1. **Click New Task** from any of the entry points above.
    2. **Name your task** and select your project as the data source.
    3. **Click Add Evaluator** and select your evaluator from the Eval Hub. You can add multiple evaluators to a single task.
    4. **Configure column mappings** to map template variables to your data.
    5. **Set evaluation granularity:** span, trace, or session.
    6. **Choose cadence:** run continuously on new data or run once on historical data.
    7. **Set sampling rate** and any filters.
    8. **Click Create Task.**

    Once the task is created, results appear automatically in the Tracing view, attached to each evaluated span. To check on a task, go to the **Running Eval Tasks** tab on the **Evaluators** page, open any task, and click **View Logs**. From the logs, you can also click **View Traces** to jump directly to the evaluated spans with the same filters applied.
  </Tab>

  <Tab title="By Code">
    Use this approach when you need to run evals on large datasets, incorporate external data sources, or want full control over execution and cost. Export your spans, run evals using Phoenix Evals, and log results back to Arize AX via the Python SDK.

    ### 1. Export spans

    From the Tracing page, click **Export** and select **Export to Notebook** to get prefilled export code. Or export programmatically:

    ```python
    import os
    from datetime import datetime
    from arize import ArizeClient

    client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])

    primary_df = client.spans.export_to_df(
        space_id=os.environ["ARIZE_SPACE_ID"],
        project_name="your-project-name",
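        start_time=datetime.fromisoformat(''),  # replace '' with an ISO 8601 start time (prefilled by Export to Notebook)
        end_time=datetime.fromisoformat(''),    # replace '' with an ISO 8601 end time (prefilled by Export to Notebook)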
    )
    ```
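
    If you are not starting from the prefilled notebook, the time window can be computed rather than typed by hand. The helper below is a minimal sketch: `last_n_days_window` is a hypothetical name, not part of the Arize SDK, and it assumes that timezone-aware UTC datetimes are acceptable values for `start_time` and `end_time`, which is worth confirming against the SDK version you use.

    ```python
    from datetime import datetime, timedelta, timezone


    def last_n_days_window(days, now=None):
        """Return (start_time, end_time) covering the last `days` days, in UTC.

        Hypothetical helper for illustration; `now` can be pinned for testing.
        """
        if days <= 0:
            raise ValueError("days must be positive")
        end_time = now if now is not None else datetime.now(timezone.utc)
        start_time = end_time - timedelta(days=days)
        return start_time, end_time


    # Example: a 7-day backfill window ending now.
    start_time, end_time = last_n_days_window(7)
    ```

    The two values can then be passed as `start_time` and `end_time` in the export call above.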

    ### 2. Run evals

    Check which attributes are present with `primary_df.columns`, then map your input and output columns:

    ```python
    from phoenix.evals import create_classifier
    from phoenix.evals.evaluators import async_evaluate_dataframe
    from phoenix.evals.llm import LLM

    # Map the exported span attributes to the template variables.
    primary_df["input"] = primary_df["attributes.input.value"]
    primary_df["output"] = primary_df["attributes.output.value"]

    MY_SAMPLE_TEMPLATE = '''
        You are evaluating the positivity or negativity of the responses to questions.
        [BEGIN DATA]
        ************
        [Question]: {input}
        ************
        [Response]: {output}
        [END DATA]

        Please focus on the tone of the response.
        Your answer must be a single word, either "positive" or "negative".
        '''

    llm = LLM(provider="openai", model="gpt-5")

    # The choices must match the labels the template asks for.
    sample_evaluator = create_classifier(
        name="sample-eval",
        llm=llm,
        prompt_template=MY_SAMPLE_TEMPLATE,
        choices={"positive": 1.0, "negative": 0.0},
    )

    # Top-level await works in a notebook; in a script, call this from an
    # async function and run it with asyncio.run().
    results_df = await async_evaluate_dataframe(
        dataframe=primary_df,
        evaluators=[sample_evaluator],
    )
    ```

    <Tip>
      It is easier to iterate on your evaluator in a Python script or Colab notebook first. Use the **Test in Code** button in the task creation interface to get starter code, then copy your evaluator into the UI when ready. For the in-product **Create Evaluator** layout (imports, class, and sample-data mapping), see [Create evaluators](/ax/evaluate/create-evaluators#code-evaluations).
    </Tip>

    ### 3. Log results back to Arize AX

    Results require four columns: `eval.<name>.label`, `eval.<name>.score`, `eval.<name>.explanation`, and `context.span_id`. For trace or session evals, replace the `eval.<name>` prefix with `trace_eval.<name>` or `session_eval.<name>`.

    ```python
    import os
    from arize import ArizeClient
    from phoenix.evals.utils import to_annotation_dataframe

    client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])
    sample_eval_df = to_annotation_dataframe(results_df)

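    # "correctness" is the eval name that appears in Arize AX;
    # substitute the name of your own evaluator.
    sample_eval_df = sample_eval_df.rename(columns={
        "label": "eval.correctness.label",
        "score": "eval.correctness.score",
        "explanation": "eval.correctness.explanation",
    })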

    client.spans.update_evaluations(
        space_id=os.environ["ARIZE_SPACE_ID"],
        project_name="your-project-name",
        dataframe=sample_eval_df,
    )
    ```
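
    Before logging, it can help to confirm that the DataFrame carries the four columns listed above. The check below is a sketch only: `missing_eval_columns` is a hypothetical helper, and it assumes `context.span_id` is still required when the trace or session prefix is used.

    ```python
    REQUIRED_SUFFIXES = ("label", "score", "explanation")
    # Column prefixes for each evaluation level, as described above.
    PREFIXES = {"span": "eval", "trace": "trace_eval", "session": "session_eval"}


    def missing_eval_columns(columns, eval_name, level="span"):
        """Return the required column names that are absent from `columns`.

        Hypothetical helper; assumes context.span_id is needed at every level.
        """
        prefix = PREFIXES[level]
        required = [f"{prefix}.{eval_name}.{suffix}" for suffix in REQUIRED_SUFFIXES]
        required.append("context.span_id")
        present = set(columns)
        return [name for name in required if name not in present]


    # Example with a plain list of column names.
    example_columns = ["context.span_id", "eval.correctness.label", "eval.correctness.score"]
    print(missing_eval_columns(example_columns, "correctness"))
    ```

    With a real DataFrame, pass `list(sample_eval_df.columns)`, adding the index name if the span ID lives in the index. An empty result means every required column is present.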

    Evals can be applied to spans up to 14 days prior to the current day. For older spans, contact [support@arize.com](mailto:support@arize.com).
  </Tab>
</Tabs>

## Task configuration

### Sampling rate

| Rate       | When to use                                                                |
| ---------- | -------------------------------------------------------------------------- |
| **100%**   | Low-volume or critical applications where you want to evaluate every trace |
| **10–50%** | High-volume applications balancing cost and coverage                       |
| **1–5%**   | Very high-volume applications where representative sampling is enough      |

Start at **10–20%** and increase the rate once you have validated that your evaluator works correctly.
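
To get a feel for how the sampling rate trades coverage against cost, a rough estimate of daily evaluator calls can be useful. The sketch below assumes each sampled span triggers one call per attached evaluator; the function name and the traffic figures are hypothetical, and actual billing depends on your LLM provider and prompt sizes.

```python
def estimated_daily_evals(spans_per_day, sampling_rate, evaluators=1):
    """Estimate evaluator calls per day for a task.

    Assumes one call per sampled span per evaluator (an illustration,
    not a description of how Arize AX schedules work internally).
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    return round(spans_per_day * sampling_rate * evaluators)


# Hypothetical project: 200,000 matching spans per day, two evaluators.
for rate in (1.0, 0.2, 0.05):
    calls = estimated_daily_evals(200_000, rate, evaluators=2)
    print(f"{rate:.0%} sampling: about {calls} evaluator calls per day")
```

Multiplying the result by your average cost per evaluator call gives an approximate daily spend for each rate.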

### Filters

Use filters to target specific subsets of your data:

* **Span kind:** Only evaluate specific span types (for example LLM spans)
* **Model name:** Only evaluate spans from a specific model
* **Metadata:** Only evaluate spans with certain metadata tags
* **Span attributes:** Filter on any span attribute

<Frame caption="New Task">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/task%20configs.png" alt="Evaluators page with New Task panel showing target project and traces, a span kind query for LLM spans, Add Evaluator, Run Continuously and sampling, One-Time Backfill, and Advanced options including LLM Override and Enable Tracing" />
</Frame>
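
If you run evals in code rather than through a task, comparable filtering can be applied to exported spans before scoring. The sketch below works on plain dictionaries, one per span. The key names `attributes.openinference.span.kind`, `attributes.llm.model_name`, and `attributes.metadata` are assumptions about the exported column names, so check them against `primary_df.columns`; `matches_filters` is a hypothetical helper.

```python
def matches_filters(span, span_kind=None, model_name=None, metadata=None):
    """Return True when a flattened span record satisfies every given filter.

    Filters left as None are ignored. Key names are assumed, not guaranteed.
    """
    if span_kind is not None and span.get("attributes.openinference.span.kind") != span_kind:
        return False
    if model_name is not None and span.get("attributes.llm.model_name") != model_name:
        return False
    span_metadata = span.get("attributes.metadata") or {}
    for key, value in (metadata or {}).items():
        if span_metadata.get(key) != value:
            return False
    return True


# Hypothetical span records for illustration.
spans = [
    {"context.span_id": "a1", "attributes.openinference.span.kind": "LLM",
     "attributes.llm.model_name": "gpt-5", "attributes.metadata": {"env": "prod"}},
    {"context.span_id": "b2", "attributes.openinference.span.kind": "TOOL"},
    {"context.span_id": "c3", "attributes.openinference.span.kind": "LLM",
     "attributes.llm.model_name": "gpt-5", "attributes.metadata": {"env": "staging"}},
]

# Keep only LLM spans tagged as production.
selected = [s for s in spans if matches_filters(s, span_kind="LLM", metadata={"env": "prod"})]
print([s["context.span_id"] for s in selected])
```

An exported DataFrame can be turned into such records with `primary_df.to_dict("records")`.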

## Run evals continuously

For tasks that use **Run continuously on new data**, evaluators from the Eval Hub (including pre-built LLM judge templates) run on incoming traces on a rolling schedule. When you [create a task](#create-a-task) and add an evaluator, you can pick a template from the hub before mapping columns and saving.

On the **Evaluators** page, the **Running Eval Tasks** tab lists every task, its target and evaluators, a snapshot of the last few runs, and **View Logs** when you need execution details.

<Frame caption="Running Eval Tasks">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/running%20eval%20task.png" alt="Evaluators page on Running Eval Tasks tab showing a table of task names, project or dataset targets, attached evaluators, created and last run times, last five runs status pills, and View Logs actions" />
</Frame>

## Viewing results

Once a task runs, evaluation results attach automatically to your spans. Open any trace in the Tracing view and use the evaluation panel on each span to inspect labels, scores, and explanations.

To check task status, view run timing, see counts of successes and errors, or troubleshoot a failed run, go to the **Running Eval Tasks** tab on the **Evaluators** page and open any task. From the logs, you can also click **View Traces** to jump directly to the evaluated spans with the same filters applied.

<Frame caption="Span Evaluations on traces">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/trace%20eval%20table.png" alt="Tracing project traces table with a Span Evaluations column showing eval labels such as dietary adherence marked correct per trace, plus latency, tokens, and Ask Alyx" />
</Frame>

## Further reading

* [Span-level evaluations](/ax/evaluate/evaluators/trace-and-session-evals/span-level-evals)
* [Trace-level evaluations](/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations)
* [Session-level evaluations](/ax/evaluate/evaluators/trace-and-session-evals/session-level-evaluations)
* [Agent trajectory evaluations](/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations)
* [Retrieval evaluation (RAG)](/ax/cookbooks/evaluation/retrieval-evaluation)
