This is Part 2 of the Arize AX Get Started series. You should have completed the Tracing guide first, with traces flowing into your skyserve-chatbot project.
Step 1: Understand the two types of evaluations
Arize AX supports two kinds of evaluators:

- LLM-as-a-Judge evaluators use an LLM to assess quality. You provide a prompt template that tells the judge what to look for (helpfulness, groundedness, tone, etc.), and it scores each response. These are great for subjective quality dimensions that are hard to check with code.
- Code-based evaluators are deterministic checks written in Python. They're ideal for objective conditions, like checking that a response isn't empty, contains required keywords, or parses as valid JSON.

We'll set up one of each.

Step 2: Create an LLM-as-a-Judge evaluator
We'll create an evaluator that checks whether the chatbot's responses are grounded in the retrieved documents, meaning the chatbot isn't making up information that isn't in the policy docs.

Navigate to Evaluators
In the left sidebar, click Evaluators. Then click New Evaluator in the top right.
Choose a pre-built template
Select LLM-as-a-Judge, then browse the pre-built templates. Choose Hallucination — this template checks whether the LLM’s response contains information that isn’t supported by the provided context.

Configure the evaluator
- Give it a name like groundedness-check
- Select your LLM provider and model for the judge (e.g., OpenAI GPT-4o)
- Review the template; it instructs the judge to compare the response against the reference context and flag any unsupported claims

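To make the judge's job concrete, a groundedness/hallucination template generally looks something like the sketch below. This is illustrative only, not the exact Arize AX built-in template; the placeholder names ({context}, {question}, {response}) and the "factual"/"hallucinated" labels are assumptions for the example.

```python
# Illustrative sketch of a hallucination judge prompt template.
# NOT the exact Arize AX built-in; placeholder names and labels
# are assumptions for this example.
JUDGE_TEMPLATE = """\
You are checking whether a response is grounded in the provided context.

[Context]: {context}
[Question]: {question}
[Response]: {response}

If every claim in the response is supported by the context, answer
"factual". If the response contains information not supported by the
context, answer "hallucinated" and explain which claim is unsupported.
"""

def render_judge_prompt(context: str, question: str, response: str) -> str:
    """Fill the template with span data before sending it to the judge LLM."""
    return JUDGE_TEMPLATE.format(
        context=context, question=question, response=response
    )
```

The key design point: the judge never sees your whole app, only the context/question/response triple you map into the template, which is why the variable mapping in the next step matters.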
Step 3: Create a task to run your evaluator
An evaluator on its own is just a template. To run it on your data, you need to create a task: an automation that applies your evaluator to incoming traces.

Add your evaluator
Click Add Evaluator and select the groundedness-check evaluator you just created from the Eval Hub.

Configure the data source
- Data source: Select your skyserve-chatbot project
- Cadence: Choose Run continuously on new incoming data — this means the task will check for new traces every 2 minutes
- Sampling rate: Set to 100% for now (you can lower this later for high-volume production use)
Map the variables
The hallucination template expects variables like input, question, and output. Map these to the corresponding span attributes in your traces. For example, map input to attributes.input.value, question to attributes.input.value, and output to attributes.output.value.
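In code terms, this mapping is just a lookup from template variable to span attribute path. A minimal sketch, where the dict and helper function are our own illustration (the attribute paths mirror the mapping described above):

```python
# Illustrative mapping of judge-template variables to span attribute paths.
# Both "input" and "question" read the same span attribute here, mirroring
# the mapping described above. The helper is not an Arize API.
VARIABLE_MAPPING = {
    "input": "attributes.input.value",
    "question": "attributes.input.value",
    "output": "attributes.output.value",
}

def resolve_variables(span: dict, mapping: dict) -> dict:
    """Pull template variables out of a flattened span record."""
    return {var: span[path] for var, path in mapping.items()}
```

If a mapped attribute is missing from your spans, the evaluator has nothing to judge, so double-check these paths against a real trace in your project before saving the task.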
Step 4: Add a code-based evaluator
Let's also add a quick deterministic check to catch empty or very short responses, a common failure mode.

Choose a template
From the pre-built code evaluator templates, select Contains any Keyword. Configure it to check for keywords that indicate a non-answer, like “I don’t know” or “I’m not sure.”

Alternatively, you can write a custom check. For example, flag responses shorter than 20 characters as likely failures.
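A custom check like the one just described is only a few lines of Python. In this sketch, the function name, the "pass"/"fail" labels, and the 20-character threshold are our own illustration, not part of the Arize AX API:

```python
# Minimal sketch of a custom code-based evaluator: flag empty/short
# responses and canned non-answers. Function name and "pass"/"fail"
# labels are illustrative, not an Arize AX API.
NON_ANSWERS = ("i don't know", "i'm not sure")

def check_response(output: str) -> str:
    text = output.strip()
    # Responses under 20 characters are likely failures.
    if len(text) < 20:
        return "fail"
    # Canned non-answers also count as failures.
    if any(phrase in text.lower() for phrase in NON_ANSWERS):
        return "fail"
    return "pass"
```

Because this check is deterministic, it costs nothing per run and can safely stay at 100% sampling even when you dial the LLM judge down.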

Step 5: See evaluation results on your traces
Wait a couple of minutes for the evaluation tasks to run. Then go back to your skyserve-chatbot project and look at the traces. You’ll now see evaluation scores attached to each trace. Each trace shows whether it passed or failed the groundedness check, along with the judge’s explanation.
Filter to find problems
Click on the evaluation column header to sort by score. Click into any trace to see the full evaluation detail — each evaluator shows its label, score, and an explanation of its reasoning. For traces that fail the groundedness check, the explanation will tell you exactly what the chatbot said that wasn’t grounded in the context.
