Run evaluations in the UI
Create background tasks to run evaluations without code
This guide shows you how to set up online evaluations in the Arize UI. Evaluations run automatically as your application receives new data, or once on historical data. You can configure evaluations with LLM-as-a-Judge or code-based evaluators and run them at the span, trace, or session scope.
Step 1: Create a Task
Navigate to the Tasks page and click New Task.
Select the evaluator type: LLM-as-a-Judge or Code Evaluator.
Enter a task name and choose the associated project.
Note: A single task can include multiple evaluators at different scopes.

Step 2: Configure Sampling and Filters
Sampling Rate (%) – Define the percentage of data the task should run on (0–100).
Sampling is applied at the highest evaluator scope in the task:
session > trace > span
Lower-level evaluators then run on all matching data within that sampled set. For example, if a task contains a session-level and a span-level evaluator with a 10% sampling rate, 10% of sessions are sampled and the span-level evaluator runs on every matching span within those sessions.
Task Filters – Specify the data the task runs on (an example filter appears after this list):
For span-level evaluations, filters directly define which spans are evaluated.
For trace and session evaluations, the filters locate matching spans, and the corresponding traces or sessions are then evaluated.
Schedule Run – Choose whether to:
Run continuously on new data as it arrives (every ~2 minutes).
Backfill historical data. When running on historical data, the maximum number of items is based on the highest eval scope.
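Task filters use the same expression syntax as the evaluator-filter examples in Step 3. As a minimal sketch, assuming you want the task to consider only LLM spans, a span-level task filter could be:
span.kind = LLM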

Step 3: Add Evaluators
You can add one or more evaluators to a task. For each evaluator:
Evaluator Name – This name is used when logging and accessing evaluation results in the platform.
Scope – Choose the level of evaluation:
Span
A single unit of work within a trace. Examples include router function selection, retrieval quality, or QA correctness. 👉 Use when the evaluation target is self-contained in one operation and you only need information from a single span (e.g., “Did the router choose the correct function?”).
Trace
A collection of spans that together represent the full execution of a request or application step sequence. Examples include an agent’s full trajectory or a chain of tool calls. 👉 Use when you need to evaluate behavior across multiple spans in one request—such as consistency, latency distribution, or whether the final answer aligns with all intermediate steps.
Session
The complete conversation between a user and an agent, spanning multiple traces. Examples include end-to-end session coherence, escalation handling, or user satisfaction. 👉 Use when the evaluation requires context across an entire dialogue—for instance, whether the agent stayed on topic across multiple turns or achieved the user’s goal over the whole interaction.
Evaluator Filters (Optional) – Additional filters to define what data is passed to the evaluator template.
For span-level: Task-level filters are reused.
For trace/session: You can specify additional filters to select only relevant spans.
Example (trace):
span.kind = LLM AND message.tool_calls != null
Example (session):
parent_id = null
(root spans only)

Step 4: Define the Evaluator
Pre-built evaluators – Use Arize’s off-the-shelf evaluators.
Copilot (Alyx) – Automatically generate an evaluation template from a plain-language description.
Custom templates – Write your own evaluation template. Use bracket notation to reference fields (e.g., {attributes.llm.input_messages.0.message.content}), which follow OpenInference semantic conventions.
Python evaluators – Write evaluation logic in code.
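For example, a custom LLM-as-a-Judge template for response relevance might look like the sketch below. The attribute paths follow OpenInference conventions, but treat the exact paths and the relevant/irrelevant labels as placeholders to adapt to your own traces:
You are evaluating whether the assistant's response answers the user's question.
[Question]: {attributes.llm.input_messages.0.message.content}
[Response]: {attributes.llm.output_messages.0.message.content}
Respond with a single word: relevant or irrelevant.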
For trace and session evaluations, you don’t need to change how templates or code are written—we handle aggregation for you:
Trace – Values for prompt variables are aggregated and passed as a list.
Session – Data is passed as a list of dictionaries with index-based keys that preserve order, so the LLM sees the conversation in sequence.
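As a purely illustrative sketch (the exact structure Arize passes to your template or code may differ), the values bound to a prompt variable such as {attributes.llm.output_messages.0.message.content} could look like:
# Trace scope: one value per matching span, aggregated into an ordered list
["router selected search_tool", "retrieved 3 documents", "final answer text"]
# Session scope: a list of index-keyed entries that preserves turn order
[{"0": "User: How do I reset my password?"}, {"1": "Assistant: Use the Forgot password link on the login page."}]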
Output Rails: You can define output rails to constrain eval labels. For example, if there are two options, the first maps to 1 and the second to 0, enabling aggregate scoring.
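For instance, with two rails the resulting labels can be averaged into a score. A minimal sketch in Python, using hypothetical relevant/irrelevant labels:
# The first rail maps to 1, the second to 0 (label names are hypothetical)
rails = ["relevant", "irrelevant"]
labels = ["relevant", "irrelevant", "relevant", "relevant"]  # labels returned by the evaluator
scores = [1 if label == rails[0] else 0 for label in labels]
print(sum(scores) / len(scores))  # aggregate score: 0.75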
Explanations: Optionally enable explanations, which capture the LLM’s reasoning or chain of thought for why it assigned a given label. We highly recommend enabling explanations.

Step 5: Run and View Results
Once your task is created, it will automatically run based on your configuration. A green confirmation pop-up will appear, and evaluation labels will show up on the Tracing page associated with the selected project.
✨ Tip: Use Alyx to generate evaluations quickly. Just describe what you want to measure, and Alyx will create a template, guardrails, and naming for you.