LLM as a Judge
Judge LLM outputs using LLMs
LLM-as-a-Judge is an evaluation approach that uses an LLM to assess the quality of another model’s outputs. This form of evaluation is extremely flexible because you can specify the rules and criteria mostly in plain language, much as you would instruct human evaluators to grade responses.
You can run thousands of evaluations across curated datasets without human intervention, creating a scalable way to measure performance using only scoring or classification prompts.
Arize uses the Phoenix Evals library, which is designed for simple, fast, and accurate LLM-based evaluations.
Components of an LLM Eval
Guide to Creating an Eval
Error analysis to understand your data
The areas with the highest frequency of problematic traces are a good place to start building evaluations. For example, if you see repeated user messages expressing confusion or dissatisfaction, you might create an evaluation that detects user frustration and measures how often it occurs.
Choose input data to evaluate
Depending on what you want to measure or critique, your evaluation input data may vary. It can include the application’s inputs, outputs, metadata, and/or prompt variables.
When selecting data, you can pull from existing traces or isolate specific spans within those traces that are most relevant to the behavior you want to evaluate. This flexibility lets you target evaluations at a narrow scope or apply them broadly across the entire workflow.
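As a concrete illustration, evaluation input data is often assembled into a table with one row per trace or span. The sketch below builds a small pandas DataFrame by hand; the column names and values are purely illustrative, and in practice you would populate them from your exported traces, spans, or prompt variables.

```python
import pandas as pd

# Hypothetical evaluation input: one row per response to judge.
# Column names here are illustrative; use whatever fields your
# spans or prompt variables actually contain.
eval_df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "output": ["Click 'Forgot password' on the sign-in page and follow the email link."],
        "reference": ["Password resets are initiated from the sign-in page via 'Forgot password'."],
    }
)
```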
Define an evaluator or choose a predefined one
The prompt template is where you specify your evaluation criteria, input data, and desired output labels from the LLM judge (ex: correct/incorrect).
Arize AX provides many pre-built templates for common evaluation scenarios, including hallucination detection, agent planning quality, and function calling accuracy. You can also define a custom evaluation template or ask Alyx to generate one automatically based on your use case.
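Below is a minimal sketch of a custom categorical evaluator using the llm_classify helper documented in Phoenix Evals. The template text, the {input}/{output} variable names, the "correct"/"incorrect" labels, and the judge model are all illustrative assumptions, and exact parameter names may differ slightly by library version.

```python
from phoenix.evals import OpenAIModel, llm_classify

# Illustrative custom template: the {input} and {output} variables are filled
# from same-named columns in the dataframe being evaluated.
CORRECTNESS_TEMPLATE = """
You are evaluating whether the answer correctly addresses the question.

[Question]: {input}
[Answer]: {output}

Respond with a single word, either "correct" or "incorrect".
"""

rails = ["correct", "incorrect"]  # the only labels the judge is allowed to return

results = llm_classify(
    dataframe=eval_df,                       # e.g., the dataframe assembled above
    template=CORRECTNESS_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model; any supported model works
    rails=rails,
    provide_explanation=True,                # ask the judge to justify each label
)
```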
Arize Eval Templates
If you don’t want to start from scratch, Arize has predefined evaluation templates. These prompts are tested against benchmarked datasets.
These are built into Phoenix Evals and are an easy way to get reliable evals up and running fast. You can access these templates directly when creating an evaluation in the Arize AX UI, or use them programmatically in code.
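For example, the built-in hallucination template can be run programmatically roughly as sketched below. This assumes a dataframe with the input, reference, and output columns the template expects; the judge model is an illustrative choice.

```python
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The pre-built template's output labels come with it; pass them as rails.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

hallucination_evals = llm_classify(
    dataframe=eval_df,                       # expects input, reference, output columns
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=rails,
    provide_explanation=True,
)
print(hallucination_evals[["label", "explanation"]].head())
```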
Custom Eval Templates
Why Use Custom Eval Templates?
Custom evaluation criteria and prompt templates let you measure what actually matters for your application—going beyond what generic templates can assess. For example, you might create a custom eval to check for regulatory compliance, tone consistency, or task completion accuracy.
In the guide below, we walk through how to build three types of custom LLM-as-a-Judge evaluators:
Categorical Classification Evaluator – for labeling outputs (ex: “Compliant” vs. “Non-compliant”).
Numeric Classification Evaluator – for scoring responses (ex: rating helpfulness from 1-10); a minimal scoring sketch follows this list.
Fully Custom LLM Evaluator – for more complex evaluations such as multi-step reasoning or domain-specific accuracy.
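As a hedged sketch of the numeric case, Phoenix Evals provides an llm_generate helper that pairs a free-form prompt with a custom output parser. The scoring template, score range, column name, parser, and judge model below are illustrative assumptions rather than a prescribed implementation.

```python
import re

from phoenix.evals import OpenAIModel, llm_generate

# Illustrative numeric scoring template; {input} and {output} map to
# dataframe columns, as in the categorical example above.
HELPFULNESS_TEMPLATE = """
Rate how helpful the answer is to the question on a scale from 1 (not helpful)
to 10 (extremely helpful). Respond with only the number.

[Question]: {input}
[Answer]: {output}
"""

def parse_score(response: str, row_index: int) -> dict:
    # Pull the first integer out of the judge's response; None if no number found.
    match = re.search(r"\d+", response)
    return {"helpfulness_score": int(match.group()) if match else None}

scores = llm_generate(
    dataframe=eval_df,
    template=HELPFULNESS_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    output_parser=parse_score,  # each parsed dict becomes columns in the result
)
```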