LLM as a Judge

Judge LLM outputs using LLMs

LLM-as-a-Judge is an evaluation approach that uses an LLM to assess the quality of another model’s outputs. It is extremely flexible because you specify the rules and criteria in mostly plain language, much as you would instruct human evaluators to grade your responses.

You can run thousands of evaluations across curated datasets without human intervention, creating a scalable form of evaluation that measures performance using only scoring or classification prompts.

Arize uses the Phoenix Evals library, which is designed for simple, fast, and accurate LLM-based evaluations.

Components of an LLM Eval

Guide to Creating an Eval

1. Error analysis to understand your data

Start building evaluations wherever you notice the highest frequency of problematic traces. For example, if you see repeated user messages expressing confusion or dissatisfaction, you might create an evaluation that detects user frustration and measures how often it occurs.

2. Choose input data to evaluate

Depending on what you want to measure or critique, your evaluation input data may vary. It can include the application’s inputs, outputs, metadata, and/or prompt variables.

When selecting data, you can pull from existing traces or isolate specific spans within those traces that are most relevant to the behavior you want to evaluate. This flexibility lets you target evaluations at a narrow scope or apply them broadly across the entire workflow.
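
As a rough sketch, evaluation input data is often assembled into a pandas DataFrame whose column names match the variables referenced in the evaluation prompt template. The column names and rows below are illustrative assumptions, not requirements of the library:

```python
import pandas as pd

# Hypothetical evaluation inputs: each row is one trace or span to judge.
# Column names must line up with the variables used in your eval template.
eval_inputs = pd.DataFrame(
    {
        "input": [
            "How do I reset my password?",
            "Why was my card declined?",
        ],
        "output": [
            "Click 'Forgot password' on the login page and follow the emailed link.",
            "I'm not sure. Please try again later.",
        ],
    }
)
```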

3. Define an evaluator or choose a predefined one

The prompt template is where you specify your evaluation criteria, input data, and desired output labels from the LLM Judge (ex: correct/incorrect).

Arize AX provides many pre-built templates for common evaluation scenarios, including hallucination detection, agent planning quality, and function calling accuracy. You can also define a custom evaluation template or ask Alyx to generate one automatically based on your use case.
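
As an illustration, a custom categorical template for the user-frustration case from step 1 might look like the sketch below. The template wording and label names are assumptions of this example, not a built-in Arize or Phoenix template:

```python
# Hypothetical custom evaluation template. Variables in braces ({input},
# {output}) are filled in from each row of the evaluation DataFrame.
FRUSTRATION_TEMPLATE = """
You are evaluating whether the user in the conversation below is frustrated.

[BEGIN DATA]
User message: {input}
Assistant response: {output}
[END DATA]

Answer with a single word, either "frustrated" or "not_frustrated".
"""

# The allowed output labels ("rails") the judge must choose from.
FRUSTRATION_RAILS = ["frustrated", "not_frustrated"]
```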

4. Run eval & understand results

When the evaluation runs, each row of input data is used to populate the variables in the prompt template.

The LLM Judge then executes the evaluation prompt and returns an output label, a numerical score, and (optionally) an explanation.
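
Continuing the sketch above, running the evaluation with Phoenix Evals' llm_classify could look roughly like this. The judge model and its settings are assumptions; llm_classify returns one row per input with the chosen label and, when requested, an explanation:

```python
from phoenix.evals import OpenAIModel, llm_classify

# Judge model: the model name and temperature here are example choices.
judge = OpenAIModel(model="gpt-4o", temperature=0.0)

results = llm_classify(
    dataframe=eval_inputs,          # the DataFrame built in step 2
    template=FRUSTRATION_TEMPLATE,  # the custom template from step 3
    model=judge,
    rails=FRUSTRATION_RAILS,        # constrain the judge to these labels
    provide_explanation=True,       # ask the judge to justify each label
)

print(results[["label", "explanation"]])
```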

Arize Eval Templates

If you don’t want to start from scratch, Arize provides predefined evaluation templates whose prompts have been tested against benchmark datasets.

These are built into Phoenix Evals and are an easy way to get reliable evals up and running fast. You can access these templates directly when creating an evaluation in the Arize AX UI, or use them programmatically in code.
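
For example, the pre-built hallucination template can be used programmatically along these lines. The dataframe name below is a placeholder; the template expects input, reference, and output columns, and the judge model choice is an assumption:

```python
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Allowed labels for the hallucination eval (e.g. factual vs. hallucinated).
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

hallucination_evals = llm_classify(
    dataframe=hallucination_df,  # placeholder: data with input/reference/output columns
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
)
```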

Custom Eval Templates

Why Use Custom Eval Templates?

Custom evaluation criteria and prompt templates let you measure what actually matters for your application—going beyond what generic templates can assess. For example, you might create a custom eval to check for regulatory compliance, tone consistency, or task completion accuracy.

In the guide below, we walk through how to build three types of custom LLM-as-a-Judge evaluators:

  • Categorical Classification Evaluator – for labelling outputs (ex: “Compliant” vs. “Non-compliant”).

  • Numeric Classification Evaluator – for scoring responses (ex: rating helpfulness from 1-10); a minimal sketch follows this list.

  • Fully Custom LLM Evaluator – for more complex evaluations such as multi-step reasoning or domain-specific accuracy.
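
As one way to build the numeric case, you can treat each point on the scale as a label for llm_classify and convert the result to a number afterwards. The template wording, the 1-10 scale, and the judge model below are illustrative assumptions:

```python
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical numeric-scoring template; wording and scale are example choices.
HELPFULNESS_TEMPLATE = """
Rate how helpful the assistant response is to the user's question,
on a scale from 1 (not helpful at all) to 10 (extremely helpful).

Question: {input}
Response: {output}

Respond with a single integer between 1 and 10 and nothing else.
"""

# Treat each integer on the scale as an allowed label.
score_rails = [str(i) for i in range(1, 11)]

scores = llm_classify(
    dataframe=eval_inputs,  # a DataFrame with "input" and "output" columns, as in step 2
    template=HELPFULNESS_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=score_rails,
    provide_explanation=True,
)

# Convert the label strings ("1".."10") into numeric scores for aggregation.
scores["score"] = pd.to_numeric(scores["label"], errors="coerce")
```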
