
Follow with Complete Python Notebook

In this section, you’ll run a repeatable experiment that uses an LLM-as-a-Judge to score agent outputs on specific and subjective criteria. These evaluations are well suited for cases where ground truth is unavailable or where quality expectations can be clearly defined in a prompt.

LLM as a Judge Evaluators

LLM as a Judge evaluators use an LLM to assess output quality. These are particularly useful when correctness is hard to encode with rules, such as evaluating relevance, helpfulness, reasoning quality, or actionability. These evaluators use criteria you define, making them suitable for datasets with or without reference outputs.

LLM as a Judge Evaluator for Overall Agent Performance

This experiment evaluates the overall performance of the support agent using an LLM as a Judge evaluator. This allows us to assess subjective qualities like actionability and helpfulness that are difficult to measure with code-based evaluators.

Define the Task Function

The task function is what the experiment calls for each example in your dataset. It receives the dataset row and returns an output that will be evaluated. In this example, our task function extracts the query from the dataset row, runs the full support agent (which includes tool calls and reasoning), and returns the agent’s response:
def support_agent_task(dataset_row):
    """
    Task function that will be run on each row of the dataset.
    """
    query = dataset_row.get("query")

    # Call the agent with the query
    response = support_agent.run(query)
    return response.content
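Before wiring the task function into an experiment, it can help to sanity-check its shape: it takes a dataset row (a dict) and returns a plain string. The stand-in agent below is hypothetical, used only for this check; the real `support_agent` comes from earlier in this guide.

```python
# Hypothetical stand-in for the real support agent, used only to verify
# that the task function takes a dataset row and returns a string.
class _StubResponse:
    def __init__(self, content):
        self.content = content

class _StubAgent:
    def run(self, query):
        return _StubResponse(f"To resolve '{query}', go to Settings and follow the steps.")

support_agent = _StubAgent()

def support_agent_task(dataset_row):
    """Task function that will be run on each row of the dataset."""
    query = dataset_row.get("query")
    response = support_agent.run(query)
    return response.content

output = support_agent_task({"query": "How do I reset my password?"})
print(output)  # a plain string, ready to hand to the evaluator
```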

Define the LLM as a Judge Evaluator

We use the open-source Phoenix Evals library to define our evaluators. It’s built for fast LLM-based evaluation and is convenient to use with any model SDK. We create an LLM as a Judge evaluator that assesses whether the agent’s response is actionable and helpful. The evaluator uses a prompt template that defines the criteria for a good response:
from phoenix.evals import LLM, create_classifier
from arize.experiments import EvaluationResult

# Define Prompt Template
support_response_actionability_judge = """
You are evaluating a customer support agent's response.

Determine whether the response is ACTIONABLE and helps resolve the user's issue.

Mark the response as CORRECT if it:
- Directly addresses the user's specific question
- Provides concrete steps, guidance, or information
- Clearly routes the user toward a solution

Mark the response as INCORRECT if it:
- Is generic, vague, or non-specific
- Avoids answering the question
- Provides no clear next steps
- Deflects with phrases like "contact support" without guidance

User Query:
{input}

Agent Response:
{output}

Return only one label: "correct" or "incorrect".
"""

# Create Evaluator using a different model than the agent
actionability_judge = create_classifier(
    name="actionability-judge",
    prompt_template=support_response_actionability_judge,
    llm=LLM(model="claude-3-5-haiku-20241022", provider="anthropic"),
    choices={"correct": 1.0, "incorrect": 0.0},
)


def call_actionability_judge(dataset_row, output):
    """
    Wrapper function for the actionability judge evaluator.
    This is needed because client.experiments.run expects a function, not an evaluator object.
    """
    results = actionability_judge.evaluate(
        {"input": dataset_row.get("query"), "output": output}
    )
    result = results[0]
    return EvaluationResult(
        score=result.score,
        label=result.label,
        explanation=result.explanation,
    )
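The wrapper's job is to adapt the evaluator's result list into the score/label/explanation contract the experiment runner consumes. To see that contract in isolation without making an LLM call, here is a sketch with a rule-based stand-in judge; the keyword heuristic and `SimpleNamespace` result are illustrative only, not part of the Phoenix Evals API.

```python
from types import SimpleNamespace

# Rule-based stand-in that mimics the evaluate() -> [result] shape used
# above; score, label, and explanation are the fields the wrapper reads.
class _StubJudge:
    def evaluate(self, inputs):
        actionable = any(
            kw in inputs["output"].lower() for kw in ("step", "click", "go to")
        )
        label = "correct" if actionable else "incorrect"
        return [SimpleNamespace(
            score=1.0 if actionable else 0.0,
            label=label,
            explanation=f"Response judged {label} by keyword heuristic.",
        )]

judge = _StubJudge()
result = judge.evaluate({
    "input": "How do I reset my password?",
    "output": "Go to Settings > Security and click 'Reset password'.",
})[0]
print(result.label, result.score)  # correct 1.0
```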

Run the Experiment

Run the experiment on your dataset.
experiment, experiment_df = client.experiments.run(
    name="support agent performance",
    dataset_id=dataset_id,
    task=support_agent_task,
    evaluators=[call_actionability_judge],
)
In the Arize AX UI, you can click into the experiment to inspect the results:
  • Complete agent traces let you drill into any run to see the exact inputs, agent reasoning, tool calls, and response. This is useful for understanding agent behavior and debugging when an example scores poorly.
  • Scores and labels per example show which inputs the LLM Judge rated highly or poorly, so you can spot patterns and prioritize where to improve.
  • Evaluator explanation tells you why the judge gave each score so you can fix specific failure modes.
  • Aggregate metrics across the run let you compare experiments over time and track whether quality is improving.
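Aggregate metrics can also be computed locally from the dataframe the run returns. The column names below are illustrative, not the guaranteed schema of `experiment_df`, so treat this as a sketch of the analysis pattern rather than exact field names:

```python
import pandas as pd

# Illustrative stand-in for the returned results; the real experiment_df
# schema may use different column names.
experiment_df = pd.DataFrame({
    "example_id": [1, 2, 3, 4],
    "actionability-judge.score": [1.0, 0.0, 1.0, 1.0],
    "actionability-judge.label": ["correct", "incorrect", "correct", "correct"],
})

# Overall pass rate across the run
pass_rate = experiment_df["actionability-judge.score"].mean()
print(f"Actionability pass rate: {pass_rate:.0%}")

# Pull the failing examples to inspect first
failures = experiment_df[experiment_df["actionability-judge.label"] == "incorrect"]
print(failures["example_id"].tolist())
```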

Next Steps

Now that you know how to run experiments with LLM as a Judge evaluators, you can also use code-based evaluators when you have ground truth available.

Run Experiments with Code Evals

Iterating with Experiments in Your Workflow