Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.
Prefer to use code? See the SDK guide
Step 1: Create an Evaluation Task
Start by creating a task to define what data you want to evaluate.
- Navigate to New Eval Task in the upper right-hand corner
- Click LLM-as-a-Judge
- Give your task a name (ex: “Travel Plan Completeness”)
- Set the Cadence to Run on historical data so we can evaluate our existing traces
Step 2: Create a New Evaluator from Blank
Press Add Evaluator, then select Create New. Instead of choosing a pre-built template, select Create From Blank. This gives you full control over the evaluation prompt, labels, and judge model.
Step 3: Define the Evaluator
Building a custom evaluator involves four configuration steps:
- Name the evaluator. Give it a descriptive name. This name appears in the Evaluator Hub and in your eval results, so choose something that clearly communicates what the evaluator measures.
- Pick a model. Select the LLM that will serve as the judge: the model that reads each trace and applies your evaluation criteria.
  - Select an AI Provider (ex: OpenAI, Azure OpenAI, Bedrock, etc.) and enter your credentials to configure it.
  - Once an AI Provider is configured, choose a model (be sure the model you choose is different from the one used for the agent).
- Define a template. The prompt template is the core of your evaluator. It should clearly describe the judge’s role, the criteria for each label (when to use “correct” vs. “incorrect”), and the data it will see, marked with template variables like {{input}} and {{output}}. Our custom template for the travel agent encodes specific expectations about essential info, budget, and local experiences: criteria that a generic template wouldn’t capture. An illustrative version is sketched after this list.
- Define labels. Output labels constrain what the judge can return. For this evaluator, define two labels:
- correct (score: 1) — the response meets all criteria
- incorrect (score: 0) — the response fails one or more criteria
- Explanations: Toggle Explanations to “On” to have the judge provide a brief rationale for each label, which helps with debugging and understanding why an example was scored correct or incorrect.
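For reference, here is an illustrative version of such a template in code. The wording, the TRAVEL_PLAN_TEMPLATE and RAILS names, and the specific criteria are assumptions sketched from the description above rather than the exact template from the companion notebook; the {{input}} and {{output}} placeholders correspond to the trace attributes mapped in Step 4.

```python
# Illustrative evaluator template: an assumed example, not the exact prompt
# from the companion notebook. {{input}} and {{output}} are filled from the
# trace attributes mapped in Step 4.
TRAVEL_PLAN_TEMPLATE = """
You are evaluating whether a travel agent's trip plan fully answers the user's request.

[User request]: {{input}}
[Agent's trip plan]: {{output}}

Label the plan "correct" only if it:
- covers essential info (dates, destinations, transportation, and lodging),
- stays within, or explicitly addresses, the user's budget, and
- recommends specific local experiences rather than generic suggestions.

If the plan fails one or more of these criteria, label it "incorrect".
Respond with a single word: correct or incorrect.
"""

# Output labels ("rails") constrain what the judge is allowed to return.
RAILS = ["correct", "incorrect"]
```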
Step 4: Task Configuration
With the evaluator defined, configure how it connects to your trace data.
- Set the scope to Trace. Because we are evaluating the agent’s overall performance for one call, trace-level is the right granularity.
- Map your trace attributes to the template’s variables. These mappings tell the evaluator which trace fields to pass into the judge model (a code sketch of the same mapping follows the list):
  - {{input}} ← attributes.input.value (the user’s travel planning query)
  - {{output}} ← attributes.output.value (the agent’s trip plan)
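The same mapping can be expressed in a notebook by renaming exported trace columns to match the template variables. This is a minimal sketch: the spans_df rows are made-up placeholders, and the attribute column names simply mirror the paths listed above.

```python
import pandas as pd

# Stand-in for exported traces; in practice this dataframe would come from the
# companion notebook or an Arize export. Column names mirror the attribute paths above.
spans_df = pd.DataFrame(
    {
        "attributes.input.value": ["Plan a 3-day trip to Lisbon for two people under $1,500."],
        "attributes.output.value": ["Day 1: Alfama walking tour... Day 2: ... Day 3: ..."],
    }
)

# Map the trace attributes onto the template variables {{input}} and {{output}}.
eval_df = spans_df.rename(
    columns={
        "attributes.input.value": "input",    # the user's travel planning query
        "attributes.output.value": "output",  # the agent's trip plan
    }
)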
Step 5: Run the Evaluation
With everything configured, save the evaluator and run the task. Custom evals use the same execution flow as pre-built evals: results appear alongside your traces once the judge model finishes processing.
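In the UI this is a single click. For comparison, here is roughly what the same judge run looks like with the open-source phoenix.evals library, reusing the eval_df, TRAVEL_PLAN_TEMPLATE, and RAILS from the earlier sketches. This is a sketch under those assumptions, not the Arize AX SDK workflow (see the SDK guide for that); phoenix.evals fills single-brace placeholders from dataframe columns, so the double-brace syntax from the UI is converted first.

```python
from phoenix.evals import OpenAIModel, llm_classify

# Judge model: deliberately different from the model powering the travel agent.
judge = OpenAIModel(model="gpt-4o")

# phoenix.evals fills {input} / {output} from the dataframe columns, so convert
# the UI-style double-brace placeholders first.
template = TRAVEL_PLAN_TEMPLATE.replace("{{input}}", "{input}").replace("{{output}}", "{output}")

results = llm_classify(
    dataframe=eval_df,         # one row per trace, with "input" and "output" columns
    template=template,
    model=judge,
    rails=RAILS,               # constrain the judge to "correct" / "incorrect"
    provide_explanation=True,  # ask the judge for a brief rationale per label
)
```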
Step 6: Inspect Results
Review the evaluation results to understand where your agent succeeds and where it falls short (a notebook-style sketch follows the list):
- Filter by label to focus on “incorrect” responses and identify patterns: are most failures missing budgets? Giving generic recommendations?
- Read explanations to understand the judge’s reasoning for each score
- Compare with pre-built eval results if you ran a pre-built template earlier — custom evals often surface issues that generic templates miss
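If you computed the results in a notebook as in the sketch above, the same inspection is a couple of pandas filters; the label and explanation column names assume the llm_classify output from the previous sketch.

```python
# Focus on the failures and read the judge's reasoning for each one.
failures = results[results["label"] == "incorrect"]
print(f"{len(failures)} of {len(results)} trip plans were judged incomplete")

for _, row in failures.iterrows():
    print("-", row["explanation"])
```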
Reusing Evaluators in the Evaluator Hub
The evaluators you create are saved to the Evaluator Hub, making them reusable across tasks and projects. You can find all your evaluators, both pre-built and custom, by navigating to the Evaluators section from the left sidebar. From the Evaluator Hub you can:
- Attach existing evaluators to new evaluation tasks without recreating them
- Version your evaluators — track changes over time with commit messages so you know what changed and why
- Share across projects — apply the same quality criteria to different applications