In the previous guide, you instrumented your chatbot and explored its traces in Arize AX. You found a wrong answer by clicking through traces and reading the outputs. That works for a handful of test queries — but your chatbot handles hundreds of requests a day. You can’t read every response yourself.

What you need is an automated way to score every response for quality. Did it answer the question? Did it stick to the policy documents, or did it make things up? Is the response actually helpful?

Evaluations solve this. An evaluation is an automated check — either an LLM judging the quality of another LLM’s output, or a deterministic code check — that runs on your production data. Arize AX can run evaluations continuously, scoring every trace as it comes in.

By the end of this guide, every response your chatbot generates will be automatically scored, and you’ll be able to filter your traces to instantly find the ones that need attention.
This is Part 2 of the Arize AX Get Started series. You should have completed the Tracing guide first, with traces flowing into your skyserve-chatbot project.

Step 1: Understand the two types of evaluations

Arize AX supports two kinds of evaluators:
  • LLM-as-a-Judge evaluators use an LLM to assess quality. You provide a prompt template that tells the judge what to look for (helpfulness, groundedness, tone, etc.), and it scores each response. These are great for subjective quality dimensions that are hard to check with code.
  • Code-based evaluators are deterministic checks written in Python. They’re ideal for objective conditions — like checking that a response isn’t empty, contains required keywords, or parses as valid JSON.
We’ll set up one of each.
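To make the code-based kind concrete, here is a minimal sketch of a deterministic check in plain Python that covers two of the conditions mentioned above (non-empty, valid JSON). The function name and return shape are illustrative — this is not the Arize AX evaluator API:

```python
import json


def json_response_check(response: str) -> dict:
    """Deterministic evaluator sketch: pass only if the response is
    non-empty and parses as valid JSON. Illustrative only -- not the
    Arize AX evaluator API."""
    if not response.strip():
        return {"label": "fail", "explanation": "empty response"}
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return {"label": "fail", "explanation": "not valid JSON"}
    return {"label": "pass", "explanation": "valid JSON"}
```

Because the check is pure code, the same input always produces the same label — which is exactly why code evaluators suit objective conditions and LLM judges suit subjective ones.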

Step 2: Create an LLM-as-a-Judge evaluator

We’ll create an evaluator that checks whether the chatbot’s responses are grounded in the retrieved documents — meaning the chatbot isn’t making up information that isn’t in the policy docs.
1. Navigate to Evaluators

In the left sidebar, click Evaluators. Then click New Evaluator in the top right.
2. Choose a pre-built template

Select LLM-as-a-Judge, then browse the pre-built templates. Choose Hallucination — this template checks whether the LLM’s response contains information that isn’t supported by the provided context.
[Screenshot: Evaluator template selection showing Hallucination, Relevance, and other templates]
3. Configure the evaluator

  • Give it a name like groundedness-check
  • Select your LLM provider and model for the judge (e.g., OpenAI GPT-4o)
  • Review the template — it instructs the judge to compare the response against the reference context and flag any unsupported claims
[Screenshot: Configuring the groundedness-check evaluator with LLM provider and template]
Click Save Evaluator to add it to your Eval Hub.
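For intuition, a groundedness judge template looks roughly like the sketch below. The wording here is illustrative, not Arize’s exact pre-built Hallucination template; the two allowed answers (“rails”) become the evaluation labels you’ll see on your traces:

```python
# Illustrative judge prompt -- the wording of Arize's pre-built
# Hallucination template differs. Placeholders are filled per trace.
GROUNDEDNESS_TEMPLATE = """You are comparing a chatbot response to the
reference documents it retrieved. Decide whether every claim in the
response is supported by the reference text.

[Reference]: {context}
[Question]: {question}
[Response]: {output}

Answer with exactly one word: "factual" if the response is fully supported
by the reference, or "hallucinated" if it contains unsupported claims."""

# The judge's output is constrained to these labels ("rails").
RAILS = ["factual", "hallucinated"]
```

Constraining the judge to a fixed label set is what makes its output usable as a score you can sort and filter on.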

Step 3: Create a task to run your evaluator

An evaluator on its own is just a template. To run it on your data, you need to create a task — an automation that applies your evaluator to incoming traces.
1. Create a new task

From the Evaluators page, click New Task. Select LLM-as-a-Judge.
2. Add your evaluator

Click Add Evaluator and select the groundedness-check evaluator you just created from the Eval Hub.
3. Configure the data source

  • Data source: Select your skyserve-chatbot project
  • Cadence: Choose Run continuously on new incoming data — this means the task will check for new traces every 2 minutes
  • Sampling rate: Set to 100% for now (you can lower this later for high-volume production use)
4. Map the variables

The hallucination template expects variables like input, question, and output. Map these to the corresponding span attributes in your traces — for example, input to attributes.input.value, question to attributes.input.value, and output to attributes.output.value.
[Screenshot: Variable mapping panel showing template variables mapped to span attributes]
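Conceptually, the mapping resolves each template variable from a dotted attribute path on the span. A minimal sketch, using the attribute paths from the step above (the resolver helpers are hypothetical, not part of Arize AX):

```python
from functools import reduce

# Template variable -> span attribute path, as configured in this step.
VARIABLE_MAPPING = {
    "input": "attributes.input.value",
    "question": "attributes.input.value",
    "output": "attributes.output.value",
}


def resolve(span: dict, path: str):
    """Walk a dotted path through a nested span dict."""
    return reduce(lambda node, key: node[key], path.split("."), span)


def bind_variables(span: dict) -> dict:
    """Build the template-variable dict for one span."""
    return {var: resolve(span, path) for var, path in VARIABLE_MAPPING.items()}
```

Note that `input` and `question` can both point at the same span attribute — a template variable name doesn’t have to match the attribute it reads from.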
5. Create the task

Click Create Task. Your evaluator is now running! Navigate to the Running Tasks tab to confirm it’s active.
[Screenshot: Running Eval Tasks tab showing the groundedness-check task active]

Step 4: Add a code-based evaluator

Let’s also add a quick deterministic check to catch empty or very short responses — a common failure mode.
1. Create a code evaluator task

Click New Task again, and this time select Code Evaluator.
2. Choose a template

From the pre-built code evaluator templates, select Contains any Keyword. Configure it to check for keywords that indicate a non-answer, like “I don’t know” or “I’m not sure.” Alternatively, you can write a custom check — for example, flag responses shorter than 20 characters as likely failures.
[Screenshot: Code evaluator configuration showing Contains any Keyword template]
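A custom check combining both ideas from this step — the non-answer keywords and the 20-character minimum — could look like this sketch (function name and phrase list are illustrative, not Arize’s built-in template):

```python
NON_ANSWER_PHRASES = ["i don't know", "i'm not sure"]
MIN_LENGTH = 20  # responses shorter than this are flagged as likely failures


def non_answer_check(response: str) -> str:
    """Custom code evaluator sketch: flag non-answers and suspiciously
    short replies. Illustrative only -- not Arize's built-in template."""
    text = response.strip().lower()
    if len(text) < MIN_LENGTH:
        return "fail"
    if any(phrase in text for phrase in NON_ANSWER_PHRASES):
        return "fail"
    return "pass"
```

Because the check is deterministic, it costs nothing per run and catches this failure mode without involving an LLM judge.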
3. Configure and create

Set the data source to your skyserve-chatbot project and create the task.

Step 5: See evaluation results on your traces

Wait a couple of minutes for the evaluation tasks to run. Then go back to your skyserve-chatbot project and look at the traces. You’ll now see evaluation scores attached to each trace. Each trace shows whether it passed or failed the groundedness check, along with the judge’s explanation.
[Screenshot: Traces list with evaluation score columns showing factual and hallucinated labels]

Filter to find problems

Click on the evaluation column header to sort by score. Click into any trace to see the full evaluation detail — each evaluator shows its label, score, and an explanation of its reasoning. For traces that fail the groundedness check, the explanation will tell you exactly what the chatbot said that wasn’t grounded in the context.
[Screenshot: Trace detail showing hallucinated label with judge explanation]
This is far more efficient than manually reading through every response. The evaluator reads them all for you and flags the ones that need attention.

Congratulations!

Every response your chatbot generates is now automatically scored for quality. You’ve gone from “I think it’s working” to “I can measure exactly how well it’s working.” Instead of manually reviewing traces, you can filter to just the ones that failed — and you have an explanation of what went wrong.

Your evaluations have probably revealed a pattern: some responses score poorly because the chatbot makes claims that aren’t in the policy documents. The system prompt says “be helpful,” but it doesn’t say “only use information from the provided documents.” That’s a prompt problem — and it’s exactly what we’ll fix next.

Next up: We’ll use the Prompt Playground to iterate on your prompt using real production data, then save the improved version to Prompt Hub for version control.

Next: Fix What's Broken

Learn more about Evaluations