This guide walks you through creating your first online evaluation in Arize AX. By the end, you’ll have a task running that automatically evaluates your production data.

Understanding the Basics

Before we dive into setup, let’s clarify the key concepts you’ll be working with.

What is an Evaluator?

An evaluator measures the quality or performance of your LLM application. It takes your application’s output (and optionally its inputs) and produces a score or assessment. An evaluator comes in one of two forms:
  1. Template Evaluators (LLM-as-a-Judge):
    • An LLM template (prompt that instructs the LLM how to evaluate)
    • Rails (the possible output values/scoring scale)
    • Model configuration (which LLM to use for evaluation)
  2. Code Evaluators:
    • Code definition (Python code that performs the evaluation)
    • Column mappings (which trace/span attributes to use as inputs)
Evaluators can measure things like:
  • Relevance: Does the response answer the question?
  • Toxicity: Is the content safe and appropriate?
  • Factual Accuracy: Are the claims in the response correct?
  • Helpfulness: Is the response useful to the user?
  • Your success metrics: Whatever success means for your agent or organization—evaluators can measure any quality dimension that matters to you
💡 Tip: Template evaluators are great for subjective quality measures, while code evaluators excel at objective checks (e.g., checking if a response contains required keywords or follows a specific format).
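For example, the kind of objective check a code evaluator handles well can be written in a few lines of Python. The sketch below is purely illustrative (the function name, inputs, and return format are assumptions for this example, not an Arize API); it labels a response by whether required keywords are present:

    # Illustrative sketch of a code-evaluator-style check (not an Arize API).
    # Labels a response by whether it mentions a set of required keywords.
    def contains_required_keywords(output: str) -> dict:
        required = ["order number", "refund policy"]  # hypothetical requirements
        missing = [kw for kw in required if kw.lower() not in output.lower()]
        return {"label": "pass" if not missing else "fail", "missing": missing}

    print(contains_required_keywords("Your order number is 1234; see our refund policy."))
    # -> {'label': 'pass', 'missing': []}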

What is a Task?

A task is an automation that runs your evaluators on incoming production data. Think of it as a scheduled job that:
  • Continuously evaluates your traces/spans
  • Applies filters to target specific data (e.g., only LLM spans)
  • Runs evaluators at a specified sampling rate
  • Attaches evaluation results directly to your traces
Tasks run automatically every two minutes on new data, ensuring your evaluations stay up-to-date without manual intervention.
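Conceptually, each run of a task does something like the following sketch. This is not Arize code, just an illustration of how the filter, sampling rate, and evaluators fit together; fetch_new_spans, evaluators, and attach_result are hypothetical stand-ins for work the platform does for you:

    import random

    # Conceptual sketch only (not Arize code). The helper functions passed in
    # are hypothetical stand-ins for what the platform handles automatically.
    SAMPLING_RATE = 0.10                 # evaluate ~10% of matching spans
    SPAN_FILTER = {"span_kind": "LLM"}   # e.g., target only LLM spans

    def run_task_once(fetch_new_spans, evaluators, attach_result):
        """One pass of the roughly-every-2-minutes task cycle."""
        for span in fetch_new_spans(SPAN_FILTER):   # filters narrow the data
            if random.random() > SAMPLING_RATE:     # sampling controls cost
                continue
            for evaluator in evaluators:
                result = evaluator(span)            # e.g., an LLM-as-a-judge call
                attach_result(span, result)         # result lands on the trace/span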

Prerequisites

Before setting up your first online evaluation, make sure you have:
  • Data source: You need either:
    • Traces flowing into Arize: Your application should be sending traces/spans to Arize. If you haven’t set this up yet, see our Tracing documentation.
    • OR a Dataset with an Experiment: a dataset that has at least one associated experiment to evaluate.
  • LLM Integration configured: You’ll need an LLM provider integration set up for your evaluators. See AI Provider Integrations.

Step-by-Step Setup

Step 1: Navigate to Evaluator Tasks

  1. In the Arize AX interface, navigate to Evaluate → Online Evals in the left sidebar
  2. You’ll see the Evaluator Tasks page, which lists all your existing tasks

Step 2: Create a New Task and Choose Evaluator Type

  1. Click the New Task button in the top right corner
  2. You’ll see options to choose your evaluator type:
    • LLM-as-a-Judge (Template Evaluator): Use an LLM to evaluate outputs based on a prompt template
    • Code Evaluator: Use custom Python code to evaluate outputs programmatically

Step 3: Enter Task Name

Give your task a descriptive name, such as:
  • Production Quality Check
  • Customer Support Response Evaluation
  • Code Generation Accuracy

Step 4: Add and Configure Evaluators

This is where you define what you want to measure. You can add multiple evaluators to a single task. The options available depend on the evaluator type you chose in Step 2.

If you chose LLM-as-a-Judge (Template Evaluator)

Click the “+ Add Evaluator” button, then choose one of these options:

Option A: Use a Pre-built Evaluator Template

  1. Click “+ Add Evaluator” → Use Template
  2. Browse the available templates:
    • Relevance: Measures if the response is relevant to the input
    • Toxicity: Detects harmful or inappropriate content
    • Helpfulness: Assesses how useful the response is
    • Factual Accuracy: Checks if claims are factually correct
    • And more…
  3. Select a template and click Add
That’s it! Pre-built templates are ready to use with no additional configuration needed.

Option A2: Define a Custom Template Evaluator

If you want to create a custom evaluator instead of using a pre-built template:
  1. Click “+ Add Evaluator” → Create Custom
  2. Configure the evaluator:
    • Name: Give it a descriptive name
    • Template: Write your evaluation prompt template (a sample template follows this list)
    • Rails: Define the possible output values/scoring scale
    • Scope: Choose what to evaluate:
      • Span: Evaluate individual spans
      • Trace: Evaluate entire traces
      • Session: Evaluate across sessions
    • LLM Config: Select which LLM to use for evaluation
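For reference, a custom template and its rails might look like the sketch below. The {input} and {output} placeholders are illustrative; use whichever variable names your template configuration maps to your span attributes.

    Template:
      You are evaluating a customer support response.
      Question: {input}
      Response: {output}
      Does the response directly address the customer's question?
      Answer with exactly one word: addressed or not_addressed

    Rails: addressed, not_addressed

Keeping rails to a small, unambiguous set of labels makes the results easier to aggregate and monitor.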

Option B: Use the Alyx Eval Builder (AI-Powered)

Alyx can help you create custom evaluators from plain language descriptions:
  1. Click the ✨ Alyx icon in the upper right corner
  2. Use a prompt like:
    Write a custom evaluation that checks if customer support responses 
    are empathetic, address the customer's concern, and provide actionable 
    next steps. Score from 1-5.
    
  3. Alyx will generate a tailored evaluator template for you
  4. Review and customize the generated evaluator as needed

If you chose Code Evaluator

  1. Click ”+ Add Evaluator”
  2. Write your evaluation logic in Python (see the sketch after these steps)
  3. Define the expected inputs and outputs
  4. Configure column mappings (which trace/span attributes to use as inputs)
  5. Test your evaluator before saving
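As an illustration, a code evaluator of this kind might look like the sketch below. The evaluate name, the output parameter, and the returned dictionary are assumptions for this example, not a required Arize signature; output stands in for whichever span attribute you map to it in step 4:

    import json

    # Illustrative code evaluator (hypothetical signature, not an Arize API):
    # checks that the model output is valid JSON and contains required keys.
    REQUIRED_KEYS = {"answer", "sources"}  # example requirement

    def evaluate(output: str) -> dict:
        try:
            parsed = json.loads(output)
        except (json.JSONDecodeError, TypeError):
            return {"label": "invalid_json", "score": 0.0}
        if not isinstance(parsed, dict):
            return {"label": "invalid_json", "score": 0.0}
        missing = REQUIRED_KEYS - parsed.keys()
        return {"label": "valid" if not missing else "missing_keys",
                "score": 1.0 if not missing else 0.0}

The test step (item 5 above) is a good place to try logic like this against a few real spans before saving.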

Step 5: Configure Target Data

Now that you’ve defined your evaluators, configure where they should run. You can create tasks that evaluate data from:
  • Project: Evaluate traces from a specific model/project
  • Dataset: Evaluate examples from a dataset with associated experiments
For your first task, we recommend starting with a Project to evaluate production traces.

Selecting Project

  1. In the Target Data section, select Project from the first dropdown
  2. Choose the project/model you want to evaluate from the second dropdown
  3. Continue to Step 6 to configure Project-specific settings

Selecting Dataset

  1. In the Target Data section, select Dataset from the first dropdown
  2. Choose the dataset you want to evaluate
  3. Select the experiments associated with that dataset
  4. Skip to Step 10 (Review and Save) - no additional settings needed for datasets

Step 6: Configure Sampling Rate (Project Only)

⚠️ Note: This step only applies when using a Project as your data source.
Choose what percentage of your data to evaluate:
  • 100%: Evaluate every trace (useful for low-volume, critical applications)
  • 10-50%: Common for high-volume applications to balance cost and coverage
  • 1-5%: For very high-volume applications where you want representative sampling
You can adjust the sampling rate using the slider or by entering a value directly in the input field.
💡 Tip: Start with a lower sampling rate (10-20%) and increase it once you’ve validated your evaluators are working correctly.

Step 7: Configure Project Filter

⚠️ Note: This step only applies when using a Project as your data source.
Filters let you target specific subsets of your data:
  1. In the Project Filter section, click to open the filter selector
  2. Common filters include:
    • Model Name: Only evaluate specific models
    • Span Kind: Only evaluate LLM spans
    • Metadata: Only evaluate spans with certain metadata tags
    • Span Attributes: Filter on any span attribute like name

Step 8: Configure Cadence

⚠️ Note: This step only applies when using a Project as your data source.
Choose when your task should run:
  • Run continuously on new incoming data (recommended for production): Task runs automatically every 2 minutes on new incoming data. Best for ongoing monitoring of production systems.
  • Run historically (run once): Task runs on a batch of existing data. Best for evaluating historical data or running one-time assessments.

Step 9: Advanced Settings

⚠️ Note: This step only applies when using a Project as your data source.
The Advanced section allows you to override the configuration associated with your evaluators:
  • Select a Model: Override the model configuration for your evaluators (if needed)
  • LLM Parameters: Override advanced LLM settings (temperature, etc.) for your evaluators
This is useful when you want to use different LLM settings at the task level than what’s configured in your individual evaluators. Most users can skip this section and use the evaluator-level configurations.

Step 10: Review and Save

  1. Review your task configuration:
    • Task name
    • Evaluators added
    • Target data (Project or Dataset)
    • For Projects: Sampling rate, filters, and cadence
    • For Datasets: Selected dataset and experiments
  2. Click Create Task
  3. If you selected “Run continuously on new incoming data” for a Project-based task, it will start running automatically within 2 minutes. For “Run historically” or Dataset-based tasks, results will start appearing immediately.