What is a task?
A task connects your evaluator to a data source and defines what to score and how often. You create an evaluator once and reuse it across tasks — pointing it at different projects, datasets, or experiments. Results attach automatically and surface in your project or experiment.
Most teams start with a one-time backfill on historical data to establish a baseline, then set up an ongoing task from there.
Before creating a task, make sure you have traces flowing into Arize and an LLM provider configured. See AI Provider Integrations.
Start from real traces
Before automating, review real interactions in your tracing project to understand where things go wrong. Group failure patterns into a taxonomy — each category can map to an evaluator or filter. To capture those categories as structured labels, see Human review.
Create a task
There are several ways to create a task and run your evaluator on traces.
By Arize Skills
Use the arize-evaluator skill to create and trigger tasks via the ax CLI without leaving your editor. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
- “Create a continuous task to run my hallucination evaluator on my project”
- “Trigger a backfill eval run on my project for the last 7 days”
- “Set up a task that only evaluates LLM spans”
By Alyx
Ask Alyx to create a task and run your evaluator on your traces:
- “Run my correctness evaluator continuously on my production traces”
- “Backfill my hallucination eval on the last 7 days of spans”
- “Set up a task to score only LLM spans with my relevance evaluator”
By UI
You can create a task from several places in Arize: from the Evaluators page in the left sidebar, from the Projects page, or directly from within a span.
- Click New Task from any of the entry points above.
- Name your task and select your project as the data source.
- Click Add Evaluator and select your evaluator from the Eval Hub. You can add multiple evaluators to a single task.
- Configure column mappings to map template variables to your data (an illustrative mapping follows these steps).
- Set evaluation granularity: span, trace, or session.
- Choose cadence: run continuously on new data or run once on historical data.
- Set sampling rate and any filters.
- Click Create Task.
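For example, a column mapping pairs each template variable with a column from your data source. The snippet below is illustrative only: the attribute names match the export example later on this page and may differ in your project, and in the UI the mapping is configured in the task form rather than in code.

```python
# Hypothetical column mapping: template variable -> span attribute column.
# Verify the attribute names against your own spans before saving the task.
column_mapping = {
    "input": "attributes.input.value",    # fills {input} in the prompt template
    "output": "attributes.output.value",  # fills {output} in the prompt template
}
```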
Once created, results appear automatically in the Tracing view attached to each span. To check on a task, go to the Running Tasks tab, open any task, and click View Logs. From the logs you can also click View Traces to jump directly to the spans that were evaluated, with the same filters applied.
By Code
Use this approach when you need to run evals on large datasets, incorporate external data sources, or want full control over execution and cost. Export your spans, run evals using Phoenix Evals, and log results back to Arize via the Python SDK.
1. Export spans
From the Tracing page, click Export and select Export to Notebook to get prefilled export code. Or export programmatically:

```python
import os
from datetime import datetime

from arize import ArizeClient

client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])

primary_df = client.spans.export_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name="your-project-name",
    start_time=datetime.fromisoformat(''),  # prefilled by export
    end_time=datetime.fromisoformat(''),    # prefilled by export
)
```
2. Run evals
Check which attributes are present with primary_df.columns, then map your input and output columns:

```python
# Map span attributes onto the template variables used by the evaluator
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]

from phoenix.evals import create_classifier
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM

MY_SAMPLE_TEMPLATE = '''
You are evaluating the positivity or negativity of the responses to questions.
[BEGIN DATA]
************
[Question]: {input}
************
[Response]: {output}
[END DATA]
Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative".
'''

llm = LLM(provider="openai", model="gpt-5")

# Choices map the template's allowed answers to numeric scores
sample_evaluator = create_classifier(
    name="sample-eval",
    llm=llm,
    prompt_template=MY_SAMPLE_TEMPLATE,
    choices={"positive": 1.0, "negative": 0.0},
)

# Run inside an async context (for example, a notebook cell)
results_df = await async_evaluate_dataframe(
    dataframe=primary_df,
    evaluators=[sample_evaluator],
)
```
It is easier to iterate on your evaluator in a Python script or Colab notebook first. Use the Test in Code button in the task creation interface to get starter code, then copy your evaluator into the UI when ready. For the in-product Create Evaluator layout (imports, class, and sample-data mapping), see Create evaluators.
3. Log results back to Arize
Results require four columns: eval.<name>.label, eval.<name>.score, eval.<name>.explanation, and context.span_id. For trace or session evals, use the prefixes trace_eval.<name> and session_eval.<name>.

```python
import os

from arize import ArizeClient
from phoenix.evals.utils import to_annotation_dataframe

client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])

# Convert the eval results into annotation format, then rename to Arize's
# eval column pattern; "correctness" is the eval name that appears in Arize.
sample_eval_df = to_annotation_dataframe(results_df)
sample_eval_df = sample_eval_df.rename(columns={
    "label": "eval.correctness.label",
    "score": "eval.correctness.score",
    "explanation": "eval.correctness.explanation",
})

client.spans.update_evaluations(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name="your-project-name",
    dataframe=sample_eval_df,
)
```
Evals can be applied to spans up to 14 days prior to the current day. For older spans, contact support@arize.com.
Task configuration
Sampling rate
| Rate | When to use |
|---|---|
| 100% | Low-volume or critical applications where you want to evaluate every trace |
| 10–50% | High-volume applications balancing cost and coverage |
| 1–5% | Very high-volume applications where representative sampling is enough |
Start at 10–20% and increase once you have validated that your evaluator is working correctly.
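If you run evals by code on exported spans (the workflow above), the equivalent of a sampling rate is subsampling the dataframe before evaluation. A minimal sketch with pandas, reusing primary_df and sample_evaluator from the earlier steps:

```python
# Evaluate roughly 20% of the exported spans; fix the seed for reproducibility.
sampled_df = primary_df.sample(frac=0.2, random_state=42)

results_df = await async_evaluate_dataframe(
    dataframe=sampled_df,
    evaluators=[sample_evaluator],
)
```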
Filters
Use filters to target specific subsets of your data:
- Span kind: Only evaluate specific span types (for example LLM spans); a code-path equivalent is sketched after this list
- Model name: Only evaluate spans from a specific model
- Metadata: Only evaluate spans with certain metadata tags
- Span attributes: Filter on any span attribute
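When running evals by code, the same targeting can be approximated by filtering the exported dataframe before evaluation. A minimal sketch, assuming the export includes a span-kind attribute column (the exact column name is an assumption; check primary_df.columns):

```python
# Keep only LLM spans before evaluating. The column name below is an assumption
# about the export schema; confirm it against primary_df.columns.
llm_spans_df = primary_df[primary_df["attributes.openinference.span.kind"] == "LLM"]
```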
Run evals continuously
For tasks that use Run continuously on new data, evaluators from the Eval Hub (including pre-built LLM judge templates) run on incoming traces on a rolling schedule. When you create a task and add an evaluator, you can pick a template from the hub before mapping columns and saving.
On the Evaluators page, the Running Eval Tasks tab lists every task, its target and evaluators, a snapshot of the last few runs, and View Logs when you need execution details.
Viewing results
Once a task runs, evaluation results attach automatically to your spans. Open any trace in the Tracing view and use the evaluation panel on each span to inspect labels, scores, and explanations.
To check task status, view run timing, see counts of successes and errors, or troubleshoot a failed run, navigate to the Running Tasks tab on the Evaluators page and open any task. From the logs you can also click View Traces to jump directly to the evaluated spans with the same filters applied.
Further reading