
LLM as a Judge is a general evaluation concept that applies to both evaluation approaches in Phoenix: you can use it via the SDK (client-side) or configure LLM evaluators directly in the Phoenix UI (server-side). An LLM judge can flag common output problems, such as responses that are:
- not grounded in context
- repetitive, repetitive, repetitive
- grammatically incorrect
- excessively lengthy and characterized by an overabundance of words
- incoherent
How It Works
Here’s the step-by-step process for using an LLM as a judge:

Identify Evaluation Criteria
First, determine what you want to evaluate — faithfulness, toxicity, accuracy, or another characteristic. See our pre-built evaluators for examples of what can be assessed.
Craft Your Evaluation Prompt
Write a prompt template that will guide the evaluation. This template should clearly define what variables are needed from both the initial prompt and the LLM’s response to effectively assess the output.
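As a minimal sketch (the template text and variable names here are illustrative, not one of Phoenix's built-in templates), an evaluation prompt is a template with placeholders for the pieces the judge needs from the original exchange:

```python
# Illustrative evaluation prompt template (not a Phoenix built-in).
# {input} and {output} are filled in per example being evaluated.
EVAL_TEMPLATE = """You are evaluating whether a response is grounded in the
question it was given. Answer with a single word: "grounded" or "ungrounded".

[Question]: {input}
[Response]: {output}

Label:"""

def render_prompt(input: str, output: str) -> str:
    """Fill the template with one example's input and output."""
    return EVAL_TEMPLATE.format(input=input, output=output)

prompt = render_prompt("What is Phoenix?", "Phoenix is an observability tool.")
```

Keeping the template as a plain string with named variables makes it easy to reuse the same rubric across many examples.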
Select an Evaluation LLM
Choose the most suitable LLM from our available options for conducting your specific evaluations.
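The three steps above can be sketched end to end. Note that `fake_judge` below is a stand-in for a call to your chosen evaluation LLM (a toy substring heuristic, not a real model), and the template is illustrative rather than a specific Phoenix API:

```python
# End-to-end sketch: criteria (groundedness) + template + judge.
TEMPLATE = (
    'Is the response grounded in the context? Answer "yes" or "no".\n'
    "Context: {context}\nResponse: {response}\nAnswer:"
)

def fake_judge(prompt: str) -> str:
    # Stand-in for an evaluation LLM call: flags responses containing
    # words absent from the context (toy heuristic, not a real model).
    context = prompt.split("Context: ")[1].split("\n")[0]
    response = prompt.split("Response: ")[1].split("\n")[0]
    grounded = all(w in context.lower() for w in response.lower().split())
    return "yes" if grounded else "no"

def evaluate(context: str, response: str, judge=fake_judge) -> str:
    """Render the template for one example and ask the judge for a label."""
    return judge(TEMPLATE.format(context=context, response=response))
```

In practice you would swap `fake_judge` for a function that calls the evaluation LLM you selected in step 3.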
Using LLM as a Judge in Phoenix
SDK Evaluations
Write custom LLM evaluators in Python or TypeScript. See also: Configuring the LLM for model selection and prompt setup.
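A custom SDK evaluator is essentially a function that renders the prompt, calls the judge model, and parses the reply into one of a fixed set of labels. A minimal Python sketch — the function names and the injected `call_model` callable are assumptions for illustration, not Phoenix's actual SDK signatures:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    label: str        # e.g. "toxic" / "non-toxic"
    explanation: str  # raw judge output, useful for debugging

def make_evaluator(template: str, rails: list[str], call_model: Callable[[str], str]):
    """Build an evaluator that snaps the judge's free-text reply onto `rails`."""
    def evaluator(**variables: str) -> EvalResult:
        raw = call_model(template.format(**variables))
        # Rails are checked in order; list "non-toxic" before its
        # substring "toxic" so the longer label wins.
        label = next((r for r in rails if r in raw.lower()), "unparseable")
        return EvalResult(label=label, explanation=raw)
    return evaluator

# Usage with a stubbed model call (stand-in for a real LLM):
judge = make_evaluator(
    template="Is this toxic? {text}\nAnswer toxic or non-toxic:",
    rails=["non-toxic", "toxic"],
    call_model=lambda prompt: "non-toxic",
)
result = judge(text="have a nice day")
```

Constraining output to rails keeps downstream aggregation simple even when the judge model adds extra wording around its answer.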
Server-Side Evaluators
Configure LLM evaluators in the Phoenix UI — no local code or API key setup required.

