Online Evals

Online evaluation runs on production data, providing real-world feedback on how your system performs under actual usage. It connects your application’s live outputs to evaluation results, allowing you to monitor and measure performance in real time. You can use any of the predefined evaluators or define your own to track the metrics that matter most.

Why is Online Evaluation important?

Online evals are a critical part of error analysis and continuous monitoring. As your application scales, it becomes impossible to manually inspect every trace or span. Online evals automate this process, helping you:

  • Discover issues in live behavior

  • Validate that offline improvements hold up in production

Monitoring and running evals on production data often surfaces more valuable insights than unit tests alone. Real-world feedback helps teams catch issues early and maintain quality as the system evolves.

Why Arize for Online Evals?

Arize AX provides the foundation to make online evaluation reliable and scalable:

  • Scales to large volumes of data so you can evaluate continuously without bottlenecks

  • Tasks allow for grouping, filtering, and sampling to target specific subsets of traces or sessions

  • Fine-grained logs and dashboards for tracking evaluator outputs, trends, and anomalies over time

What are Tasks?

Tasks are automations that continuously run your evaluators on incoming data. They ensure your evaluations stay up to date without requiring manual triggers or intervention.

Once configured, tasks automatically execute on new production data every two minutes, enabling continuous monitoring and fast detection of issues as they emerge.
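
As a rough mental model, a task behaves like the scheduler loop sketched below. This is purely illustrative: the Task shape and fetch_new_spans stub are hypothetical stand-ins, not Arize's actual implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

POLL_INTERVAL_SECONDS = 120  # new production data is picked up roughly every two minutes

@dataclass
class Task:
    """Hypothetical stand-in for a configured evaluator task."""
    project: str
    evaluators: List[Callable] = field(default_factory=list)

def fetch_new_spans(project, since):
    """Hypothetical stub for Arize's ingestion of newly arrived spans."""
    return []  # in reality, this would return spans logged since `since`

def run_task(task: Task):
    last_seen = None
    while True:
        new_spans = fetch_new_spans(task.project, since=last_seen)
        for span in new_spans:
            for evaluator in task.evaluators:
                evaluator(span)  # each result attaches back to the span
        if new_spans:
            last_seen = new_spans[-1]["timestamp"]
        time.sleep(POLL_INTERVAL_SECONDS)
```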

How Tasks Work

You can configure tasks to control when, where, and how evaluators run. Common configurations, illustrated in the sketch after this list, include:

  • Sampling rate – Choose how much of your incoming data to evaluate.

  • Filters – Target specific slices of your data, such as LLM spans only.

  • Group Evaluators – Combine multiple evaluators under one task to track related metrics together.
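
To make these knobs concrete, here is a hedged sketch of how such a configuration might be applied to a batch of incoming spans. The TaskConfig shape and its field names are illustrative assumptions, not Arize's actual schema.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class TaskConfig:
    """Illustrative task configuration; field names are assumptions."""
    sampling_rate: float = 0.1                 # evaluate 10% of incoming data
    span_filter: Callable[[Dict[str, Any]], bool] = lambda span: True
    evaluators: List[Callable] = field(default_factory=list)  # grouped evaluators

def evaluate_batch(spans: List[Dict[str, Any]], config: TaskConfig) -> List[Any]:
    results = []
    for span in spans:
        if not config.span_filter(span):            # Filters: e.g. LLM spans only
            continue
        if random.random() > config.sampling_rate:  # Sampling rate
            continue
        for evaluator in config.evaluators:         # Grouped evaluators run together
            results.append(evaluator(span))
    return results

# Example: evaluate 25% of LLM spans with two (hypothetical) evaluators
config = TaskConfig(
    sampling_rate=0.25,
    span_filter=lambda span: span.get("span_kind") == "LLM",
    evaluators=[lambda span: "relevant", lambda span: "factual"],
)
```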

Task Results

When a task runs, the evaluation results are automatically attached to your traces and spans within Arize. Each evaluation produces structured feedback that can be found in the evaluation tab of each span. This lets you analyze performance directly in context.
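
Each result is a small structured record. The sketch below shows a plausible shape, following the label/score/explanation pattern Arize evaluations use; treat the exact field names as illustrative.

```python
# Illustrative shape of one evaluation result attached to a span
evaluation_result = {
    "span_id": "a1b2c3d4",         # the span this task run evaluated
    "evaluator": "Hallucination",  # which evaluator in the task produced it
    "label": "factual",            # categorical judgment
    "score": 1.0,                  # numeric form of the label, where applicable
    "explanation": "The answer is grounded in the retrieved context.",
}
```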

Get Started with Online Evaluation

Create a Task

  • Navigate to the Evaluator Tasks page and click New Task.

  • Enter a task name and choose the associated project or dataset. A single task can include multiple evaluators.

  • Schedule the task to run continuously on new incoming data.

Add and Define Evaluators

You can define multiple evaluators within a Task, and each can run at its own scope (span, trace, or session), so you can assess performance at every level.

There are a few ways to define your evaluator (a prototyping sketch follows this list):

  1. Pre-built Evaluators: Use Arize’s off-the-shelf evaluators by selecting a template.

  2. Existing Evaluators: Use an evaluator you've already defined in the Evaluator Hub.

  3. Use the Alyx Eval Builder: Automatically generate a tailored evaluation template from a plain-language description.
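
If you want to prototype an evaluator before wiring it into a task, the same off-the-shelf templates are available in Arize's open-source phoenix.evals library. A minimal sketch, assuming arize-phoenix-evals is installed, an OpenAI API key is configured, and your data has the columns the template expects:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# A few production-like rows; the hallucination template expects
# "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital of France."],
        "output": ["The capital of France is Paris."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # include a written rationale per row
)
print(results[["label", "explanation"]])
```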

View Task Details & Logs

The Logs tab for each task provides a complete view of every run: when it occurred, who triggered it, the configuration used, and performance metrics, including counts of successes, errors, and skips.

Logs help you answer questions like:

  • Was the task successful or did it fail?

  • Who ran the most recent task and when?

  • What sampling rate and filters were used?

  • Has the evaluation template been updated recently?

If an issue occurs, the logs show exactly what went wrong. Use this view to troubleshoot problems, track changes, and understand how your tasks are performing.

View Traces

The View Traces option in the Logs tab lets you jump directly to the group of spans where the task was executed. When you click it, the system automatically applies the date range and filters that were used, so you can quickly inspect the data tied to that task.

Use this feature to seamlessly connect task results with the actual data for more targeted analysis and troubleshooting.
