Understanding the Basics
Before we dive into setup, let’s clarify the key concepts you’ll be working with.
What is an Evaluator?
An evaluator measures the quality or performance of your LLM application. It takes your application’s output (and optionally its inputs) and produces a score or assessment. An evaluator comes in one of two configurations:
- Template Evaluators (LLM-as-a-Judge):
- An LLM template (prompt that instructs the LLM how to evaluate)
- Rails (the possible output values/scoring scale)
- Model configuration (which LLM to use for evaluation)
- Code Evaluators:
- Code definition (Python code that performs the evaluation)
- Column mappings (which trace/span attributes to use as inputs)
Common dimensions to evaluate include:
- Relevance: Does the response answer the question?
- Toxicity: Is the content safe and appropriate?
- Factual Accuracy: Are the claims in the response correct?
- Helpfulness: Is the response useful to the user?
- Your success metrics: Whatever success means for your agent or organization—evaluators can measure any quality dimension that matters to you
💡 Tip: Template evaluators are great for subjective quality measures, while code evaluators excel at objective checks (e.g., checking if a response contains required keywords or follows a specific format).
What is a Task?
A task is an automation that runs your evaluators on incoming production data. Think of it as a scheduled job that:
- Continuously evaluates your traces/spans
- Applies filters to target specific data (e.g., only LLM spans)
- Runs evaluators at a specified sampling rate
- Attaches evaluation results directly to your traces
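Conceptually, each run of a task does something like the loop below. This is only an illustration of the workflow (matches_filter, attach_label, and the evaluator callables are hypothetical stand-ins), not how Arize actually implements tasks.

```python
import random

def run_task_once(new_spans, evaluators, sampling_rate, matches_filter, attach_label):
    """Illustrative only: roughly what an evaluator task does on each run."""
    for span in new_spans:                    # spans that arrived since the last run
        if not matches_filter(span):          # apply the task's filters (e.g., only LLM spans)
            continue
        if random.random() >= sampling_rate:  # honor the sampling rate (e.g., 0.10 = 10%)
            continue
        for evaluator in evaluators:          # a task can hold multiple evaluators
            result = evaluator(span)          # e.g., a label such as "relevant" / "irrelevant"
            attach_label(span, result)        # the result is attached back onto the span
```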
Prerequisites
Before setting up your first online evaluation, make sure you have:
- Data source: You need either:
- Traces flowing into Arize: Your application should be sending traces/spans to Arize. If you haven’t set this up yet, see our Tracing documentation (a minimal setup sketch also follows this list).
- OR a Dataset with an Experiment: A dataset with an associated experiment to evaluate.
- LLM Integration configured: You’ll need an LLM provider integration set up for your evaluators. See AI Provider Integrations.
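If your application isn’t sending traces yet, the setup is typically only a few lines. The sketch below assumes the arize-otel helper and an OpenAI-based app instrumented with openinference-instrumentation-openai; the credentials and project name are placeholders, and the Tracing documentation remains the authoritative reference.

```python
# Minimal tracing setup sketch (assumes the arize-otel and
# openinference-instrumentation-openai packages; placeholders are yours to fill in).
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Register an OpenTelemetry tracer provider that exports spans to Arize.
tracer_provider = register(
    space_id="YOUR_SPACE_ID",    # found in your Arize space settings
    api_key="YOUR_API_KEY",
    project_name="my-llm-app",   # the project your online eval task will target
)

# Auto-instrument OpenAI calls; swap in the instrumentor for your framework.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```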
Step-by-Step Setup
Step 1: Navigate to Evaluator Tasks
- In the Arize AX interface, navigate to Evaluate → Online Evals in the left sidebar
- You’ll see the Evaluator Tasks page, which lists all your existing tasks
Step 2: Create a New Task and Choose Evaluator Type
- Click the New Task button in the top right corner
- You’ll see options to choose your evaluator type:
- LLM-as-a-Judge (Template Evaluator): Use an LLM to evaluate outputs based on a prompt template
- Code Evaluator: Use custom Python code to evaluate outputs programmatically
Step 3: Enter Task Name
Give your task a descriptive name, such as:
- Production Quality Check
- Customer Support Response Evaluation
- Code Generation Accuracy
Step 4: Add and Configure Evaluators
This is where you define what you want to measure. You can add multiple evaluators to a single task. The options available depend on the evaluator type you chose in Step 2.
If you chose LLM-as-a-Judge (Template Evaluator)
Click the “+ Add Evaluator” button, then choose one of these options:
Option A: Use a Pre-built Evaluator Template
- Click “+ Add Evaluator” → Use Template
- Browse the available templates:
- Relevance: Measures if the response is relevant to the input
- Toxicity: Detects harmful or inappropriate content
- Helpfulness: Assesses how useful the response is
- Factual Accuracy: Checks if claims are factually correct
- And more…
- Select a template and click Add
Option B: Create a Custom Evaluator
- Click “+ Add Evaluator” → Create Custom
- Configure the evaluator:
- Name: Give it a descriptive name
- Template: Write your evaluation prompt template (see the sketch after this list)
- Rails: Define the possible output values/scoring scale
- Scope: Choose what to evaluate:
- Span: Evaluate individual spans
- Trace: Evaluate entire traces
- Session: Evaluate across sessions
- LLM Config: Select which LLM to use for evaluation
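To make these fields concrete, here is a sketch of what a custom evaluator might contain. The name, prompt wording, and the {input}/{output} placeholders are illustrative assumptions; map the placeholders to the span attributes your application actually records.

```python
# Illustrative custom LLM-as-a-Judge evaluator (names and wording are examples, not defaults).
EVALUATOR_NAME = "response_completeness"

# Template: the prompt the judge LLM receives; {input} and {output} are placeholder
# variables assumed to be mapped to the span's question and response attributes.
PROMPT_TEMPLATE = """You are evaluating whether an assistant's response fully answers the user's question.

[Question]: {input}
[Response]: {output}

Answer with exactly one word: "complete" if the response fully answers the question,
otherwise "incomplete"."""

# Rails: the only labels the judge is allowed to return.
RAILS = ["complete", "incomplete"]
```

Keeping the rails to a small, unambiguous label set makes the resulting scores easier to aggregate, monitor, and alert on.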
Option C: Use the Alyx Eval Builder (AI-Powered)
Alyx can help you create custom evaluators from plain language descriptions:
- Click the ✨ Alyx icon in the upper right corner
- Describe the evaluator you want in plain language, for example: “Create an evaluator that checks whether the response fully answers the user’s question”
- Alyx will generate a tailored evaluator template for you
- Review and customize the generated evaluator as needed
If you chose Code Evaluator
- Click “+ Add Evaluator”
- Write your evaluation logic in Python (see the sketch after this list)
- Define the expected inputs and outputs
- Configure column mappings (which trace/span attributes to use as inputs)
- Test your evaluator before saving
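As an illustration, a code evaluator that checks response format might look like the sketch below. The function name, the output argument, and the returned “pass”/“fail” labels are assumptions for this example; match the argument names to your column mappings and follow the signature shown in the code evaluator editor.

```python
import json

def contains_valid_json(output: str) -> str:
    """Example check: does the response parse as JSON?

    `output` is assumed to be mapped to the span's response attribute
    through the evaluator's column mappings.
    """
    try:
        json.loads(output)
        return "pass"
    except (TypeError, json.JSONDecodeError):
        return "fail"
```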
Step 5: Configure Target Data
Now that you’ve defined your evaluators, configure where they should run. You can create tasks that evaluate data from:
- Project: Evaluate traces from a specific model/project
- Dataset: Evaluate examples from a dataset with associated experiments
Selecting Project
- In the Target Data section, select Project from the first dropdown
- Choose the project/model you want to evaluate from the second dropdown
- Continue to Step 6 to configure Project-specific settings
Selecting Dataset
- In the Target Data section, select Dataset from the first dropdown
- Choose the dataset you want to evaluate
- Select the experiments associated with that dataset
- Skip to Step 10 (Review and Save) - no additional settings needed for datasets
Step 6: Configure Sampling Rate (Project Only)
⚠️ Note: This step only applies when using a Project as your data source.
Choose what percentage of your data to evaluate:
- 100%: Evaluate every trace (useful for low-volume, critical applications)
- 10-50%: Common for high-volume applications to balance cost and coverage
- 1-5%: For very high-volume applications where you want representative sampling
💡 Tip: Start with a lower sampling rate (10-20%) and increase it once you’ve validated your evaluators are working correctly.
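To see what a given rate implies for volume and cost, a quick back-of-the-envelope estimate helps. Every number below (traffic, evaluator count, token counts, judge price) is a placeholder assumption; substitute your own.

```python
# Rough cost/volume estimate for an online eval task (all inputs are example assumptions).
traces_per_day = 50_000        # your production traffic
sampling_rate = 0.10           # 10% sampling
evaluators_per_task = 2        # evaluators attached to the task
tokens_per_eval = 1_500        # prompt + completion tokens per judge call (assumed)
price_per_1k_tokens = 0.005    # judge model price in USD (assumed)

evals_per_day = traces_per_day * sampling_rate * evaluators_per_task
cost_per_day = evals_per_day * tokens_per_eval / 1_000 * price_per_1k_tokens

print(f"{evals_per_day:,.0f} evaluations/day, ~${cost_per_day:,.2f}/day in judge LLM cost")
```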
Step 7: Configure Project Filter
⚠️ Note: This step only applies when using a Project as your data source.
Filters let you target specific subsets of your data:
- In the Project Filter section, click to open the filter selector
- Common filters include:
- Model Name: Only evaluate specific models
- Span Kind: Only evaluate LLM spans
- Metadata: Only evaluate spans with certain metadata tags
- Span Attributes: Filter on any span attribute like name
Step 8: Configure Cadence
⚠️ Note: This step only applies when using a Project as your data source.
Choose when your task should run:
- Run continuously on new incoming data (recommended for production): The task runs automatically every 2 minutes as new data arrives. Best for ongoing monitoring of production systems.
- Run historically (run once): Task runs on a batch of existing data. Best for evaluating historical data or running one-time assessments.
Step 9: Advanced Settings
⚠️ Note: This step only applies when using a Project as your data source.
The Advanced section allows you to override the configuration associated with your evaluators:
- Select a Model: Override the model configuration for your evaluators (if needed)
- LLM Parameters: Override advanced LLM settings (temperature, etc.) for your evaluators
Step 10: Review and Save
- Review your task configuration:
- Task name
- Evaluators added
- Target data (Project or Dataset)
- For Projects: Sampling rate, filters, and cadence
- For Datasets: Selected dataset and experiments
- Click Create Task
- If you selected “Run continuously on new incoming data” for a Project-based task, it will start running automatically within 2 minutes. For “Run historically” or Dataset-based tasks, results will start appearing immediately.