LLM-as-a-Judge Evaluation for GenAI Use-Cases

Hakan Tekgul, ML Solutions Engineer | Published November 11, 2024

Evaluating Large Language Models involves assessing whether their outputs align with established ethical standards, safety requirements, and performance metrics – ensuring they reflect human values and intentions. While certain evaluations focus on testing the model’s accuracy and reliability in specific tasks, others analyze broader aspects like behavioral patterns, inherent biases, and alignment with intended objectives.

Multiple methods exist to measure an LLM application’s effectiveness, ranging from direct user reactions (such as thumbs-up/down ratings or response acceptance/rejection) and comparisons against reference datasets to commercial indicators (like purchase rates of recommended items). This guide will concentrate on using LLMs as evaluators, for reasons detailed in the following sections.

What is LLM-as-a-Judge (LaaJ)? 

The concept of “LLM as a judge” describes the practice of using AI to evaluate other AI systems – specifically employing one LLM to assess and explain another LLM’s outputs.

This evaluation approach becomes necessary due to several factors: limited availability of reliable feedback sources, the high cost of human evaluation, and the increasing complexity of LLM applications.

Thankfully, we can leverage LLMs themselves to streamline the evaluation process. We’ll explore the implementation methods and strategies to ensure reliable results.

Although using AI to evaluate AI might appear redundant, it mirrors our existing practice of humans evaluating other humans (such as during recruitment interviews or academic examinations). AI systems have now evolved to perform similar peer assessments.

The methodology involves LLMs creating synthetic benchmarks to evaluate other systems. This raises the question: why not rely solely on human feedback? The answer lies in its scarcity. Securing human feedback for even 1% of input/output interactions is a challenge most teams struggle to meet. In such scenarios, LLM-assisted evaluations provide valuable testing capabilities during development. For maximum effectiveness, it’s crucial to evaluate every LLM sub-operation, and there can be many of them.

Why is LLM-as-a-Judge an Important Approach? 

LLM as a Judge refers to using LLMs to evaluate the responses of another LLM. It may seem counterintuitive, but it is often easier for an LLM to evaluate an LLM response than to generate that response on its own. This is especially true when the task you present to the Judge LLM is only a piece of the task completed by the original LLM.

Therefore, using LLMs as a Judge to evaluate other LLMs can be a great way to scale the evaluation of the LLM. This approach enables rapid assessment of thousands of responses without relying on human evaluators, significantly reducing both time and cost investments. Moreover, Judge LLMs can provide consistent evaluation criteria across all responses, eliminating the potential variability that comes with multiple human evaluators. The method also allows for detailed feedback and specific scoring across different dimensions of the response, such as accuracy, relevance, clarity, and adherence to given constraints.

The most common LLM evaluation metrics being employed today are evaluations for relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. Evaluation prompt templates are provided to the LLM that aid in the assessment of each of the LLM application’s responses.

Here’s a pre-tested hallucination example:

In this task, you will be presented with a query, a reference text and an answer. The answer is generated to the question based on the reference text. The answer may contain false information. You must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the answer text contains factual information and is not a hallucination. A 'hallucination' refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response.

# Query: {query}
# Reference text: {reference}
# Answer: {response}
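To make this concrete, here is a minimal sketch of how you might fill in such a template and send it to a judge model. It assumes the OpenAI Python client and gpt-4o-mini as the judge; both are illustrative choices rather than requirements, and the template text is condensed from the prompt above.

```python
from openai import OpenAI

# Condensed version of the hallucination eval template shown above.
HALLUCINATION_TEMPLATE = """In this task, you will be presented with a query, a reference text and an answer.
Determine whether the answer is factual or a hallucination with respect to the reference text.
Your response should be a single word: either "factual" or "hallucinated".

# Query: {query}
# Reference text: {reference}
# Answer: {response}
"""

def judge_hallucination(query: str, reference: str, response: str) -> str:
    """Fill the eval template and ask a judge LLM for a single-word label."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = HALLUCINATION_TEMPLATE.format(query=query, reference=reference, response=response)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an illustrative choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the judging deterministic
    )
    label = completion.choices[0].message.content.strip().lower()
    # Guard against the judge drifting off the allowed labels.
    return label if label in {"factual", "hallucinated"} else "unparseable"
```

In practice you would typically rely on an eval library that handles templating, retries, and output parsing for you, but the core loop is as simple as this.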

One main concern with using LLMs to judge LLMs is knowing how good a specific LLM is at evaluating another LLM for a specific use case. In some cases, pre-tested LLM Judge templates can be found, like those in the Arize Phoenix library.

In the absence of pre-tested templates, comparing the responses of an LLM Judge against a golden dataset, aka ground truth data, is typically the best way to measure its efficacy. If you don’t have a golden dataset to use, you can use human labeling to create one. We will go through these methodologies in the following sections. 

What are the Different Data Types of LLM Evals?

There are multiple types of LLM evaluations that can be conducted on LLM applications. Each category is defined by its output type.

Categorical (binary) – The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.

Categorical (Multi-class) – The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.

Score – The evaluation results in a numeric value within a set range (e.g., 1-10), offering a scale of measurement. (A parsing sketch for all three output types follows this list.)
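To illustrate the difference between these output types in code, here is a minimal sketch of how the raw text a judge LLM returns might be parsed for each type. The label sets and score range are hypothetical.

```python
def parse_binary(raw: str) -> int:
    """Categorical (binary): map a single-word label onto 1/0."""
    return 1 if raw.strip().lower() == "factual" else 0

MULTI_CLASS_RAILS = {"fully relevant", "partially relevant", "not relevant"}

def parse_multi_class(raw: str) -> str:
    """Categorical (multi-class): accept the label only if it is one of the allowed classes."""
    label = raw.strip().lower()
    return label if label in MULTI_CLASS_RAILS else "unparseable"

def parse_score(raw: str, low: int = 1, high: int = 10) -> int | None:
    """Score: coerce the output to an integer and clamp it to the allowed range."""
    try:
        return min(max(int(raw.strip()), low), high)
    except ValueError:
        return None  # score outputs are the most likely to come back malformed
```

The extra guard rails around the score parser hint at why we prefer categorical outputs, as discussed next.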

Although score evals are common in the AI space, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.

Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.

To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.

How do you add Explainability to LLM Evals?

Even though LLM Evals come with many advantages for scaling performance tracking of LLM applications, there is still a strong need to understand the Judge LLM’s reasoning behind each decision. Without such visibility, the Judge LLM is simply a black box.

Imagine someone evaluates your writing one day and gives it a failing grade. Without any explanation or feedback, how can you improve your writing? The same applies to LaaJ: every LaaJ metric should provide some sort of explanation of why it thinks the output is hallucinated or toxic. This explanation acts as a guide for understanding where your LLM application failed so that you can work on improving it. For example, a Judge LLM that labels an output as hallucinated should also return a short explanation of which claims are not supported by the reference text.
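Many eval libraries can return an explanation alongside the label. As a sketch, here is roughly how that might look with the Arize Phoenix evals module and a hallucination eval; the imports and arguments reflect the library at the time of writing and may differ in your version, and the example row is hypothetical.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per LLM response to evaluate: the query, the retrieved reference, and the answer.
df = pd.DataFrame(
    [
        {
            "input": "What is the capital of France?",
            "reference": "Paris is the capital and largest city of France.",
            "output": "The capital of France is Lyon.",
        }
    ]
)

eval_results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model is an illustrative choice
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # ask the judge to justify its label
)

print(eval_results[["label", "explanation"]])
```

The returned dataframe contains a label column and, with provide_explanation=True, an explanation column describing why the judge chose that label.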

What are common LLM-as-a-Judge Evaluation Metrics? 

Now that we’ve covered the essentials of LLM-as-a-Judge and how it operates, let’s explore the crucial part – evaluation metrics! One important thing to understand is that when it comes to evaluating LLM responses, there’s no universal checklist. Each application has its own unique requirements, and ultimately, it’s up to the application owners to determine what matters most for their specific use case.

Let’s look at some real-world examples: an internal HR chatbot helping employees with their questions might prioritize accuracy and relevance over perfect politeness. On the flip side, a chatbot designed for children needs to nail both accurate information and a friendly, appropriate tone.

Or consider a text-to-SQL application – here, you’d naturally want to verify that the SQL syntax is correct, but more importantly, you’ll need to ensure that the generated queries are actually fetching the intended data.

While every application needs its own tailored evaluation approach, some common assessment criteria frequently pop up, such as relevance, hallucination, question-answering accuracy, and toxicity (the metrics introduced earlier).

How can LLM-as-a-Judge be used for RAG Applications?

Contextual relevance and faithfulness are two of the most widely used metrics for assessing the accuracy and relevance of retrieved documents in LLM RAG applications.

Contextual relevance looks at the relevance of the retrieved context to the original query. This can be a binary classification of relevant/irrelevant, or ranking metrics can be used (e.g., MRR, Precision@K, MAP, NDCG); a short sketch of these ranking metrics appears after the template below. Let’s look at an example eval prompt template for contextual relevance:

“You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text contains information that can answer the Question. Please focus on whether the very specific question can be answered by the information in the Reference text. Your response must be single word, either “relevant” or “unrelated”, and should not contain any text or characters aside from that word. “unrelated” means that the reference text does not contain an answer to the Question. “relevant” means the reference text contains an answer to the Question.”
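If you record the judge’s relevant/unrelated labels per retrieved document, you can roll them up into the ranking metrics mentioned above. Here is a small, self-contained sketch of Precision@K and MRR computed from binary relevance labels; the labels themselves are made up for illustration.

```python
def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents the judge marked relevant."""
    top_k = relevance[:k]
    return sum(top_k) / k if k else 0.0

def mean_reciprocal_rank(relevance_per_query: list[list[int]]) -> float:
    """Average of 1/rank of the first relevant document for each query."""
    reciprocal_ranks = []
    for relevance in relevance_per_query:
        rr = 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# 1 = judge said "relevant", 0 = "unrelated", ordered by retrieval rank.
labels = [[1, 0, 1, 0], [0, 0, 1, 1]]
print(precision_at_k(labels[0], k=2))   # 0.5
print(mean_reciprocal_rank(labels))     # (1/1 + 1/3) / 2 ≈ 0.67
```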

Next, faithfulness (or groundedness) looks at how well the foundation model’s response aligns with the retrieved context. This can be, for example, a binary classification of faithful or unfaithful. Let’s look at an example that checks for hallucination:

“In this task, you will be presented with a query, a reference text and an answer. The answer is generated to the question based on the reference text. The answer may contain false information. You must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the answer text contains factual information and is not a hallucination. A ‘hallucination’ refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either “factual” or “hallucinated”, and it should not include any other text or characters. “hallucinated” indicates that the answer provides factually inaccurate information to the query based on the reference text. “factual” indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response.

    # Query: {query}

    # Reference text: {reference}

    # Answer: {response}

    Is the answer above factual or hallucinated based on the query and reference text?”

How can you use LLM-as-a-Judge for Agents?

Evaluation is one of the main tools that will help you transform your agent from a simple demo project into a production tool. Using a thoughtful and structured approach to evaluation is one of the easiest ways to streamline this otherwise challenging process.

Let’s explore one such approach. We will cover agent evaluation structures, what you should evaluate your agent on, and techniques for performing those evaluations.

How to structure your agent evaluation process

  1. Break down individual agent steps
  2. Create evaluators for each step
  3. Experiment and iterate

Choosing which steps of your agent to evaluate

First, we split our agent into manageable steps that we want to evaluate individually. These steps should be fairly granular, encompassing individual operations.

Each of your agent’s functions, skills, or execution branches should have some form of evaluation to benchmark their performance. You can get as granular as you’d like. For example, you could evaluate the retrieval step of a RAG skill or the response of an internal API call.

Beyond the skills, it’s critical to evaluate the router on a few axes, which we’ll touch on below. If you’re using a router, this is often where the biggest performance gains can be achieved, and router evaluation is key for agent performance tracking. In a typical agent, the pieces to evaluate include the router, each individual skill (for example, a RAG retrieval step or an internal API call), and the final response generation.

As you can imagine, the list of “pieces” to evaluate can grow quickly, but that’s not necessarily a bad thing. We recommend starting with many evaluations and trimming down over time, especially if you’re new to agent development.

Building evaluators for each step

With our steps defined, we can now build evaluators for each one. Many frameworks, including Arize’s Phoenix library, can help with this. You can also code your own evaluations, which can be as simple as a string comparison, depending on the type of evaluation. We recommend LLM-as-a-Judge for evaluation since it is helpful when there is no ground truth or when you’re aiming for a more qualitative evaluation. In this step, for each skill, you can either build your own custom evals or use a pre-tested open-source LaaJ eval.

Evaluating the skill steps of an agent is similar to evaluating those skills outside of the agent. If your agent has a RAG skill, for example, you would still evaluate both the retrieval and response generation steps, calculating metrics like document relevance and hallucinations in the response.
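Not every agent step needs an LLM judge. As a sketch, a router evaluation can be as simple as a string comparison between the tool call the agent actually made and the tool call a labeled test case expects; the test-case structure below is hypothetical.

```python
import json

def evaluate_router_call(expected: dict, actual: dict) -> dict:
    """Code-based router eval: exact match on tool choice, order-insensitive match on arguments."""
    correct_tool = expected["tool"] == actual["tool"]
    correct_args = json.dumps(expected["args"], sort_keys=True) == json.dumps(actual["args"], sort_keys=True)
    return {"correct_tool": correct_tool, "correct_args": correct_tool and correct_args}

test_case = {"tool": "search_flights", "args": {"origin": "SFO", "destination": "JFK"}}
agent_call = {"tool": "search_flights", "args": {"destination": "JFK", "origin": "SFO"}}

print(evaluate_router_call(test_case, agent_call))
# {'correct_tool': True, 'correct_args': True}
```

For steps that are more qualitative, such as the final response of a RAG skill, an LLM-as-a-Judge eval like the hallucination template shown earlier is usually the better fit.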

Experiment and iterate

Now that you have the different steps, router, and evaluators defined, you can continuously evaluate your agent on a testing dataset as you make changes to the agent itself. After each major modification, run your test cases through the agent, then run each of your evaluators on the output or traces. This approach is called Evaluation-Driven Development (EDD) and is explained in detail in the following section.

How to build your own custom LLM-as-a-Judge Eval 

Let’s discuss how you can build your own LLM-as-a-Judge Eval for your own LLM project. The first step is to build a benchmark for your evaluations. This benchmark would help you validate that your eval is meeting your expectations. 

To do that, you must begin with a metric best suited for your use case. Then, you need a golden dataset, representative of the type of data you expect the LLM eval to see. The golden dataset should have “ground truth” labels so that you can measure the performance of the LLM eval template. Often such labels come from human feedback. Building such a dataset is laborious, but you can often find a standardized one for the most common use cases.

Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

Now comes the core component that we are trying to benchmark and improve: the eval template. If you’re using an existing library like OpenAI or Phoenix, you should start with an existing template and see how that prompt performs.

If there is a specific nuance you want to incorporate, adjust the template accordingly or build your own from scratch. Keep in mind that the template should have a clear structure. Be explicit about the following (a short template sketch follows this list):

  • What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
  • What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query.
  • What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
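As a sketch under those guidelines, here is what a custom multi-class relevance template might look like with the input, the question, and the output format all made explicit; the wording and label names are hypothetical.

```python
# A hypothetical custom relevance template with the three elements made explicit:
# the inputs, the question we are asking the judge, and the allowed outputs ("rails").
CUSTOM_RELEVANCE_TEMPLATE = """You are given a user query and a retrieved document.

# Inputs
Query: {query}
Document: {document}

# Task
Decide whether the document contains information that helps answer the query.

# Output format
Respond with exactly one of: "fully_relevant", "partially_relevant", "not_relevant".
Do not include any other text."""

RAILS = ["fully_relevant", "partially_relevant", "not_relevant"]

prompt = CUSTOM_RELEVANCE_TEMPLATE.format(
    query="What is our parental leave policy?",
    document="Employees are eligible for 16 weeks of paid parental leave.",
)
```

Keeping the allowed labels (the rails) in one place also makes it easy to validate the judge’s raw output before recording it.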

How can you validate an LLM Eval?

You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1-score, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
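As a sketch, once you have the judge’s labels and the golden labels side by side, standard classification metrics apply. The example below uses scikit-learn and hypothetical label lists.

```python
from sklearn.metrics import classification_report

# Ground-truth labels from the golden dataset vs. labels produced by the judge LLM.
golden_labels = ["factual", "hallucinated", "factual", "factual", "hallucinated"]
judge_labels  = ["factual", "hallucinated", "hallucinated", "factual", "hallucinated"]

# Accuracy plus per-class precision, recall, and F1.
print(classification_report(golden_labels, judge_labels, digits=2))
```

Looking at precision and recall per class tells you, for example, whether your judge tends to over-flag or under-flag hallucinations, which overall accuracy alone would hide.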

If you are not satisfied with the performance of your LLM evaluation template, you need to change the prompt to make it perform better. This is an iterative process informed by hard metrics. As is always the case, it is important to avoid overfitting the template to the golden dataset. Make sure to have a representative holdout set or run a k-fold cross-validation.

Finally, you arrive at your benchmark. The optimized performance on the golden dataset represents how confident you can be in your LLM eval. It will not be as accurate as your ground truth, but it will be accurate enough, and it will cost much less than having a human label in the loop on every example.

Evaluation-Driven Development with LLM-as-a-Judge Evals

If you’re familiar with software engineering, you might know about Test Driven Development (TDD), where developers create tests before building the actual software. This approach ensures continuous improvement while maintaining performance standards as new changes are implemented.

A similar concept has emerged in the LLM space called Evaluation-Driven Development (EDD), which builds upon LLM-as-a-Judge evaluations. The core principle is straightforward: before diving into your LLM application’s MVP, first establish clear evaluation criteria and testing datasets to measure the impact of future modifications. Here’s a breakdown of how EDD works:

  1. Dataset curation: To effectively evaluate your LLM application’s changes, you’ll need a testing dataset. This could be as straightforward as collecting 50-100 typical user interactions and their expected responses.
  2. Experiment Set-Up: With your testing dataset ready, you can run various experiments. Whether you’re tweaking prompts or switching to a different base model, you can test these changes against your dataset to gather experimental results.
  3. Experiment Evaluation: Once your testing framework is in place, you can assess any modifications to your application by running them through the dataset. Using LLM-as-a-Judge evaluation, you can then quantify the accuracy of each experiment.

Let’s look at a practical example: Say you initially built your LLM project using gpt-3.5, and testing showed a 30% hallucination rate (measured using LaaJ methodology). Now you’re considering upgrading to gpt-4. Using EDD, you can test this new version against the same dataset and measure any changes in the hallucination rate. This methodical approach ensures continuous quality improvement throughout your project’s lifecycle.
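As a sketch of what this EDD loop might look like in code, the snippet below runs the same test dataset through two candidate application models and compares their hallucination rates. It reuses the hypothetical judge_hallucination helper from the earlier sketch, and generate_answer stands in for your own application logic.

```python
from openai import OpenAI

client = OpenAI()

def generate_answer(model_name: str, query: str, reference: str) -> str:
    """Hypothetical stand-in for your application: answer the query from the reference text."""
    completion = client.chat.completions.create(
        model=model_name,
        messages=[{
            "role": "user",
            "content": f"Answer using only this reference:\n{reference}\n\nQuestion: {query}",
        }],
        temperature=0,
    )
    return completion.choices[0].message.content

def hallucination_rate(model_name: str, test_cases: list[dict]) -> float:
    """Run every test case through the app and judge each answer with the LaaJ evaluator."""
    labels = [
        judge_hallucination(c["query"], c["reference"], generate_answer(model_name, c["query"], c["reference"]))
        for c in test_cases
    ]
    return labels.count("hallucinated") / len(labels)

test_cases = [{"query": "...", "reference": "..."}]  # your curated 50-100 examples go here
print(f"gpt-3.5-turbo: {hallucination_rate('gpt-3.5-turbo', test_cases):.1%}")
print(f"gpt-4:         {hallucination_rate('gpt-4', test_cases):.1%}")
```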

What are Important Things to Consider for LLM Evaluation?

Offline vs Online Evaluation

Offline LLM evaluation generally happens in code, with results pushed to an observability platform and used for testing prompt changes, development tests, and complex evals over batches of data. Online LLM evaluation, by contrast, runs as data is received, while the application serves production traffic. The former makes sense for a testing pipeline, while production use cases are well served by online evaluation.

With online evaluation, you can also apply production monitoring and alerting to your LLM application in order to detect any issues and minimize the time to resolution (TTR). 

Offline evaluation is generally used during Evaluation-Driven Development, where you run evaluations on a large batch of test data and experiment outputs. Even though both approaches would use the same LaaJ evaluators, it is important to have the right infrastructure in place to support both mechanisms as your LLM projects make it to production. 

Human Validation Loop for LaaJ Improvement

As your LLM projects make it to production and start to reveal gaps in performance, your LaaJ evaluators will also need to be iterated on continuously. To achieve this, a human-in-the-loop workflow is key: subject matter experts provide feedback on a sampled set of evaluation results so that you can quantify your LaaJ performance over time.

To provide human feedback for your evals, you need the ability to annotate evaluations as they are produced. Observability platforms like Arize provide the capability to annotate evaluations so that developers can find examples where LLM evals and humans agree or disagree.

As those inconsistent human-vs-eval examples surface, you can use them to improve your testing dataset for evaluation-driven development. You can use metrics like precision and recall to quantify the performance of your LLM evaluators over time.

Cost and Latency of LLM Evals 

Another important factor to consider for LLM Evaluation is the cost and latency of such evaluations. Since you will be leveraging another LLM to evaluate your own LLM responses, all the evaluations will come with a cost. 

First, even though there is a cost associated with every LaaJ evaluation, it is still estimated to be significantly cheaper than having a human evaluate every LLM response in production. Additionally, the selection of the model used for LaaJ evaluation is important from a cost perspective; some teams might simply decide to use gpt-4o-mini for evaluation when their output context windows are not too large. Finally, you always have the option to apply a sampling rate and only evaluate a fixed sample of the responses coming from your LLM application. Model selection and sampling approaches are explained below.

Which model to use for LaaJ 

It is impossible to say that one model works best for all cases. Instead, you should run model evaluations to understand which model is right for your application. You may also need to consider tradeoffs of recall vs. precision, depending on what makes sense for your application. Again, it is very important to have human validation in the loop so that you can continuously check whether LaaJ evaluators are working as expected. It is good practice to pick 3-4 different LaaJ models and generate precision/recall values for each on a human-labeled testing dataset. Then, you can pick the best model for your use case and continue iterating on the eval prompts in production.
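A minimal sketch of that comparison, assuming you already have a human-labeled dataset and a run_judge helper that calls each candidate judge (both are hypothetical here):

```python
from sklearn.metrics import precision_score, recall_score

CANDIDATE_JUDGES = ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-latest", "llama-3.1-70b-instruct"]  # illustrative list

labeled_examples = [
    {"query": "...", "reference": "...", "output": "...", "human_label": "hallucinated"},
    # ... the rest of your human-labeled golden dataset
]

def score_judge(judge_model: str, examples: list[dict]) -> dict:
    """Compare a candidate judge's labels against human labels on the same examples."""
    human = [ex["human_label"] for ex in examples]
    judged = [run_judge(judge_model, ex) for ex in examples]  # run_judge is a hypothetical helper
    return {
        "model": judge_model,
        "precision": precision_score(human, judged, pos_label="hallucinated"),
        "recall": recall_score(human, judged, pos_label="hallucinated"),
    }

results = [score_judge(m, labeled_examples) for m in CANDIDATE_JUDGES]
for r in sorted(results, key=lambda r: r["precision"], reverse=True):
    print(r)
```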

Sampling Data for LaaJ 

While LLM-as-a-Judge Evaluation involves certain costs, it’s still far more efficient and cost-effective than relying solely on human labeling. The key is finding the right sampling rate that gives you broader coverage than human labeling while keeping LLM API costs manageable.

Let’s consider a real-world scenario: imagine your LLM application handles 1M daily interactions in production. Evaluating every single interaction isn’t financially feasible, and you need to scale down. Say you have a team of 10 annotators who can process about 10k interactions per day. In this case, sampling 10% of your interactions (100k daily) would still give you efficient automated evaluation while examining 10 times more responses than your human team could handle – all at a fraction of the cost.
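The arithmetic behind that scenario is simple enough to sanity-check in a few lines; all figures below come from the example above and are illustrative.

```python
daily_traffic = 1_000_000      # production interactions per day
human_capacity = 10_000        # interactions your annotation team can label per day
sampling_rate = 0.10           # fraction of traffic routed to LLM-as-a-Judge

llm_judged = int(daily_traffic * sampling_rate)     # 100,000 responses evaluated per day
coverage_vs_humans = llm_judged / human_capacity    # 10x the coverage of human labeling

print(f"LLM-judged per day: {llm_judged:,} ({coverage_vs_humans:.0f}x human coverage)")
```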

Conclusion 

LLM-as-a-Judge evaluation represents a significant advancement in how we assess and improve AI systems. While it comes with its own set of challenges, including costs, model selection considerations, and the need for continuous refinement, it offers an efficient and scalable solution for evaluating LLM applications. The key to success lies in finding the right balance: selecting appropriate evaluation metrics for your specific use case, implementing a thoughtful sampling strategy, maintaining human oversight for validation, and following evaluation-driven development practices. As LLM applications continue to evolve and become more complex, having a robust evaluation framework becomes not just beneficial but essential for ensuring quality, reliability, and continuous improvement. By combining automated LLM evaluation with strategic human validation, organizations can build more trustworthy and effective AI systems while managing resources efficiently.