LLM as a Judge

Judge LLM outputs using LLMs

LLM-as-a-Judge is an evaluation approach that uses an LLM to assess the quality of another model's outputs. LLM evaluation is extremely flexible because you can specify rules and criteria mostly in plain language, much as you would instruct human evaluators to grade your responses.

You can run thousands of evaluations across curated datasets without the need for human intervention. This creates a scalable form of evaluation using only scoring or classification prompts to measure performance.

Arize uses the Phoenix Evals library, which is designed for simple, fast, and accurate LLM-based evaluations.
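
If you want to run these evaluations locally, a minimal setup might look like the sketch below. The package name and model parameters are assumptions based on the public Phoenix Evals distribution; check the documentation for the version you use.

```python
# Assumed package names; confirm against your Arize/Phoenix docs.
# pip install arize-phoenix-evals openai

from phoenix.evals import OpenAIModel

# The judge model that will score your application's outputs.
# Keyword names can differ slightly between library versions.
judge_model = OpenAIModel(model="gpt-4o")
```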

Components of an LLM evaluation

1. Choose a metric

A good place to start building evaluations is wherever you notice the highest frequency of problematic traces. For example, if you see repeated user messages expressing confusion or dissatisfaction, you might create an evaluation that detects user frustration and measures how often it occurs (see the sketch below).
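
As an illustration of what such a metric could look like, here is a hypothetical prompt template for a user-frustration evaluation. The variable names and labels are illustrative, not a built-in Arize template.

```python
# Hypothetical custom template for a "user frustration" metric.
# {input} and {output} are placeholders filled from your trace data at eval time.
USER_FRUSTRATION_TEMPLATE = """
You are evaluating a conversation between a user and an AI assistant.

[User Message]: {input}
[Assistant Response]: {output}

Does the user message express frustration, confusion, or dissatisfaction
with the assistant? Answer with a single word: "frustrated" or "not_frustrated".
"""

# The allowed output labels ("rails") the judge may return.
FRUSTRATION_RAILS = ["frustrated", "not_frustrated"]
```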

2. Choose input data to evaluate

Depending on what you want to measure or critique, your evaluation input data may vary. It can include the application’s inputs, outputs, metadata, and/or prompt variables.

When selecting data, you can pull from existing traces or isolate specific spans within those traces that are most relevant to the behavior you want to evaluate. This flexibility lets you target evaluations at a narrow scope or apply them broadly across the entire workflow.
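As a concrete (hypothetical) example, the input data can be represented as a pandas DataFrame whose columns match the variables in your eval prompt template. Here the rows are written out by hand; in a real workflow you would export them from your traces or spans.

```python
import pandas as pd

# Each row is one piece of application data to evaluate.
# Column names should match the variables used in your eval prompt template.
eval_df = pd.DataFrame(
    {
        "input": [
            "How do I reset my password?",
            "This is the third time I'm asking -- why doesn't the export work?",
        ],
        "output": [
            "You can reset your password from the account settings page.",
            "Please try clearing your cache and reloading the page.",
        ],
    }
)
```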

3. Define an eval prompt template

The prompt template is where you specify your evaluation criteria, the input data, and the output labels you want the LLM Judge to return (e.g., correct/incorrect).

Arize provides many pre-built templates for common evaluation scenarios, including hallucination detection, agent planning quality, and function calling accuracy. You can also define a custom evaluation template or ask Alyx to generate one automatically based on your use case.
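
For instance, a pre-built hallucination template and its allowed labels can be imported directly from the evals library. The import names below assume the classic phoenix.evals interface and may differ across versions.

```python
# Pre-built hallucination template and its allowed labels ("rails").
# Assumes the classic phoenix.evals interface; names may differ across versions.
from phoenix.evals import (
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

print(HALLUCINATION_PROMPT_TEMPLATE)                   # inspect the criteria and variables
print(list(HALLUCINATION_PROMPT_RAILS_MAP.values()))   # the labels the judge may return
```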

4. Run eval & understand results

When the evaluation runs, each row of input data is used to populate the variables in the prompt template.

The LLM Judge then executes the evaluation prompt and returns an output label, a numerical score, and (optionally) an explanation.
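
Putting the pieces together, a run over the hypothetical frustration example above might look like the sketch below. It assumes the phoenix.evals llm_classify helper and the illustrative template and DataFrame from the earlier steps; argument names can vary between library versions.

```python
from phoenix.evals import OpenAIModel, llm_classify

# Run the judge over every row of the input DataFrame.
# Each row fills the {input}/{output} variables in the template.
results = llm_classify(
    dataframe=eval_df,                      # input data from the earlier sketch
    template=USER_FRUSTRATION_TEMPLATE,     # hypothetical custom template
    model=OpenAIModel(model="gpt-4o"),      # the LLM Judge
    rails=FRUSTRATION_RAILS,                # allowed output labels
    provide_explanation=True,               # ask the judge to justify each label
)

# One row per evaluated input: the label and (optionally) an explanation.
print(results[["label", "explanation"]])
```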
