In the previous guide, you instrumented your chatbot and explored its traces in Arize AX. You found a wrong answer by clicking through traces and reading the outputs. That works for a handful of test queries — but your chatbot handles hundreds of requests a day. You can’t read every response yourself.

What you need is an automated way to score every response for quality. Did it answer the question? Did it stick to the policy documents, or did it make things up? Is the response actually helpful?

Evaluations solve this. An evaluation is an automated check — either an LLM judging the quality of another LLM’s output, or a deterministic code check — that runs on your production data. Arize AX can run evaluations continuously, scoring every trace as it comes in.

By the end of this guide, every response your chatbot generates will be automatically scored, and you’ll be able to filter your traces to instantly find the ones that need attention.
This is Part 2 of the Arize AX Get Started series. You should have completed the Tracing guide first, with traces flowing into your skyserve-chatbot project.

Step 1: Understand the two types of evaluations

Arize AX supports two kinds of evaluators:
  • LLM-as-a-Judge evaluators use an LLM to assess quality. You provide a prompt template that tells the judge what to look for (helpfulness, groundedness, tone, etc.), and it scores each response. These are great for subjective quality dimensions that are hard to check with code.
  • Code-based evaluators are deterministic checks written in Python. They’re ideal for objective conditions — like checking that a response isn’t empty, contains required keywords, or parses as valid JSON.
We’ll set up one of each.
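To make the code-based kind concrete, here is a minimal sketch of a deterministic check in plain Python that covers two of the conditions mentioned above (non-empty, valid JSON). The function name and return shape are illustrative — this is not the Arize AX evaluator API:

```python
import json


def json_response_check(response: str) -> dict:
    """Deterministic evaluator sketch: pass only if the response is
    non-empty and parses as valid JSON. Illustrative only -- not the
    Arize AX evaluator API."""
    if not response.strip():
        return {"label": "fail", "explanation": "empty response"}
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return {"label": "fail", "explanation": "not valid JSON"}
    return {"label": "pass", "explanation": "valid JSON"}
```

Because the check is pure code, the same input always produces the same label — which is exactly why code evaluators suit objective conditions and LLM judges suit subjective ones.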

Step 2: Create an LLM-as-a-Judge evaluator

We’ll create an evaluator that checks whether the chatbot’s responses are grounded in the retrieved documents — meaning the chatbot isn’t making up information that isn’t in the policy docs.
1. Navigate to Evaluators

In the left sidebar, click Evaluators. Then click New Evaluator in the top right.
2. Choose a pre-built template

Select LLM-as-a-Judge, then browse the pre-built templates. Choose Hallucination — this template checks whether the LLM’s response contains information that isn’t supported by the provided context.
[Screenshot: Evaluator template selection showing Hallucination, Relevance, and other templates]
3. Configure the evaluator

  • Give it a name like groundedness-check
  • Select your LLM provider and model for the judge (e.g., OpenAI GPT-4o)
  • Review the template — it instructs the judge to compare the response against the reference context and flag any unsupported claims
[Screenshot: Configuring the groundedness-check evaluator with LLM provider and template]
Click Save Evaluator to add it to your Eval Hub.
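For intuition, a groundedness judge template looks roughly like the sketch below. The wording here is illustrative, not Arize’s exact pre-built Hallucination template; the two allowed answers (“rails”) become the evaluation labels you’ll see on your traces:

```python
# Illustrative judge prompt -- the wording of Arize's pre-built
# Hallucination template differs. Placeholders are filled per trace.
GROUNDEDNESS_TEMPLATE = """You are comparing a chatbot response to the
reference documents it retrieved. Decide whether every claim in the
response is supported by the reference text.

[Reference]: {context}
[Question]: {question}
[Response]: {output}

Answer with exactly one word: "factual" if the response is fully supported
by the reference, or "hallucinated" if it contains unsupported claims."""

# The judge's output is constrained to these labels ("rails").
RAILS = ["factual", "hallucinated"]
```

Constraining the judge to a fixed label set is what makes its output usable as a score you can sort and filter on.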

Step 3: Create a task to run your evaluator

An evaluator on its own is just a template. To run it on your data, you need to create a task — an automation that applies your evaluator to incoming traces.
1. Create a new task

From the Evaluators page, click New Task. Select LLM-as-a-Judge.
2. Add your evaluator

Click Add Evaluator and select the groundedness-check evaluator you just created from the Eval Hub.
3. Configure the data source

  • Data source: Select your skyserve-chatbot project
  • Cadence: Choose Run continuously on new incoming data — this means the task will check for new traces every 2 minutes
  • Sampling rate: Set to 100% for now (you can lower this later for high-volume production use)
4. Map the variables

The hallucination template expects variables like input, question, and output. Map these to the corresponding span attributes in your traces — for example, input to attributes.input.value, question to attributes.input.value, and output to attributes.output.value.
[Screenshot: Variable mapping panel showing template variables mapped to span attributes]
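Conceptually, the mapping resolves each template variable from a dotted attribute path on the span. A minimal sketch, using the attribute paths from the step above (the resolver helpers are hypothetical, not part of Arize AX):

```python
from functools import reduce

# Template variable -> span attribute path, as configured in this step.
VARIABLE_MAPPING = {
    "input": "attributes.input.value",
    "question": "attributes.input.value",
    "output": "attributes.output.value",
}


def resolve(span: dict, path: str):
    """Walk a dotted path through a nested span dict."""
    return reduce(lambda node, key: node[key], path.split("."), span)


def bind_variables(span: dict) -> dict:
    """Build the template-variable dict for one span."""
    return {var: resolve(span, path) for var, path in VARIABLE_MAPPING.items()}
```

Note that `input` and `question` can both point at the same span attribute — a template variable name doesn’t have to match the attribute it reads from.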
5. Create the task

Click Create Task. Your evaluator is now running! Navigate to the Running Tasks tab to confirm it’s active.
[Screenshot: Running Eval Tasks tab showing the groundedness-check task active]

Step 4: Add a code-based evaluator

Let’s also add a quick deterministic check to catch empty or very short responses — a common failure mode.
1. Create a code evaluator task

Click New Task again, and this time select Code Evaluator.
2. Choose a template

From the pre-built code evaluator templates, select Contains any Keyword. Configure it to check for keywords that indicate a non-answer, like “I don’t know” or “I’m not sure.” Alternatively, you can write a custom check — for example, flag responses shorter than 20 characters as likely failures.
[Screenshot: Code evaluator configuration showing Contains any Keyword template]
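A custom check combining both ideas from this step — the non-answer keywords and the 20-character minimum — could look like this sketch (function name and phrase list are illustrative, not Arize’s built-in template):

```python
NON_ANSWER_PHRASES = ["i don't know", "i'm not sure"]
MIN_LENGTH = 20  # responses shorter than this are flagged as likely failures


def non_answer_check(response: str) -> str:
    """Custom code evaluator sketch: flag non-answers and suspiciously
    short replies. Illustrative only -- not Arize's built-in template."""
    text = response.strip().lower()
    if len(text) < MIN_LENGTH:
        return "fail"
    if any(phrase in text for phrase in NON_ANSWER_PHRASES):
        return "fail"
    return "pass"
```

Because the check is deterministic, it costs nothing per run and catches this failure mode without involving an LLM judge.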
3. Configure and create

Set the data source to your skyserve-chatbot project and create the task.

Step 5: See evaluation results on your traces

Wait a couple of minutes for the evaluation tasks to run. Then go back to your skyserve-chatbot project and look at the traces. You’ll now see evaluation scores attached to each trace. Each trace shows whether it passed or failed the groundedness check, along with the judge’s explanation.
[Screenshot: Traces list with evaluation score columns showing factual and hallucinated labels]

Filter to find problems

Click on the evaluation column header to sort by score. Click into any trace to see the full evaluation detail — each evaluator shows its label, score, and an explanation of its reasoning. For traces that fail the groundedness check, the explanation will tell you exactly what the chatbot said that wasn’t grounded in the context.
[Screenshot: Trace detail showing hallucinated label with judge explanation]
This is far more efficient than manually reading through every response. The evaluator reads them all for you and flags the ones that need attention.

Congratulations!

Every response your chatbot generates is now automatically scored for quality. You’ve gone from “I think it’s working” to “I can measure exactly how well it’s working.” Instead of manually reviewing traces, you can filter to just the ones that failed — and you have an explanation of what went wrong.

Your evaluations have probably revealed a pattern: some responses score poorly because the chatbot makes claims that aren’t in the policy documents. The system prompt says “be helpful,” but it doesn’t say “only use information from the provided documents.” That’s a prompt problem — and it’s exactly what we’ll fix next.

Next up: We’ll use the Prompt Playground to iterate on your prompt using real production data, then save the improved version to Prompt Hub for version control.

Next: Fix What's Broken

Learn more about Evaluations