Follow along with the complete Python notebook.
LLM as a Judge Evaluators
LLM as a Judge evaluators use an LLM to assess output quality. These are particularly useful when correctness is hard to encode with rules, such as evaluating relevance, helpfulness, reasoning quality, or actionability. These evaluators use criteria you define, making them suitable for datasets with or without reference outputs.

LLM as a Judge Evaluator for Overall Agent Performance
This experiment evaluates the overall performance of the support agent using an LLM as a Judge evaluator. This allows us to assess subjective qualities, like actionability and helpfulness, that are difficult to measure with code-based evaluators.

Define the Task Function
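As a rough sketch, a task function for this setup might take the following shape. The dataset keys (`"input"`, `"query"`) and the `run_support_agent` entry point are assumptions standing in for your own agent and dataset schema, not part of the original notebook:

```python
def run_support_agent(query: str) -> str:
    # Stand-in for the real support agent, which would perform
    # tool calls and reasoning before producing a response.
    return f"To resolve '{query}', start by checking the account settings."

def task(example: dict) -> str:
    # Called once per dataset example; the returned value is what
    # the evaluators will score. The "input"/"query" keys are an
    # assumed schema -- adjust them to match your own dataset.
    query = example["input"]["query"]
    return run_support_agent(query)
```

The only contract that matters is the shape: one dataset row in, one evaluatable output out.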
The task function is what the experiment calls for each example in your dataset. It receives the dataset row and returns an output that will be evaluated. In this example, our task function extracts the query from the dataset row, runs the full support agent (which includes tool calls and reasoning), and returns the agent's response.

Define the LLM as a Judge Evaluator
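Below is a library-agnostic sketch of such a judge. Phoenix Evals ships its own helpers for building LLM judges, so treat the template wording, the `call_llm` hook, and the good/bad label scheme here as illustrative assumptions rather than the library's API:

```python
JUDGE_TEMPLATE = """\
You are evaluating a support agent's response to a user query.

Query: {query}
Response: {response}

A good response is actionable (it gives concrete next steps) and helpful
(it directly addresses the query). Answer with exactly one word, "good"
or "bad", followed by a brief explanation on the next line.
"""

def judge(query: str, response: str, call_llm) -> dict:
    # `call_llm` is any callable that takes a prompt string and returns
    # the model's text -- e.g. a thin wrapper around your model SDK.
    raw = call_llm(JUDGE_TEMPLATE.format(query=query, response=response))
    label, _, explanation = raw.partition("\n")
    label = label.strip().lower()
    return {
        "label": label,
        "score": 1.0 if label == "good" else 0.0,
        "explanation": explanation.strip(),
    }
```

Returning a score, a label, and an explanation together is what makes the judge's output useful later: the score aggregates, the explanation tells you what to fix.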
We use the open-source Phoenix Evals library to define our evaluators. It's built for fast LLM-based evaluation and is convenient to use with any model SDK. We create an LLM as a Judge evaluator that assesses whether the agent's response is actionable and helpful. The evaluator uses a prompt template that defines the criteria for a good response.

Run the Experiment
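Stripped of tracing and persistence, the control flow of an experiment run can be sketched in plain Python. The real Phoenix experiment runner also records full traces and uploads results, so this loop is only a mental model, not the library's implementation:

```python
def run_experiment(dataset, task, evaluator):
    # Minimal experiment loop: run the task on every example,
    # score its output, and keep everything for later inspection.
    results = []
    for example in dataset:
        output = task(example)
        evaluation = evaluator(example, output)  # e.g. score/label/explanation
        results.append({"input": example, "output": output, **evaluation})
    return results
```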
Run the experiment on your dataset. The results give you several ways to understand agent quality:

- Complete agent traces let you drill into any run to see the exact inputs, agent reasoning, tool calls, and response. This is useful for understanding agent behavior and debugging when an example scores poorly.
- Scores and labels per example show which inputs the LLM Judge rated highly or poorly, so you can spot patterns and prioritize where to improve.
- Evaluator explanations tell you why the judge gave each score, so you can fix specific failure modes.
- Aggregate metrics across the run let you compare experiments over time and track whether quality is improving.
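To act on per-example scores programmatically, a small helper can compute the run average and surface the weakest examples first. This assumes each result is a dict carrying a numeric `"score"` field, a hypothetical schema for illustration:

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    # Aggregate judge scores for a run and surface the lowest-scoring
    # examples first, so you know where to focus improvements.
    # Assumes each result dict carries a numeric "score" in [0, 1].
    average = mean(r["score"] for r in results)
    to_review = sorted(results, key=lambda r: r["score"])[:5]
    return {"average_score": average, "to_review": to_review}
```

Comparing `average_score` across experiment runs gives a simple trend line for whether quality is improving.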