The Definitive Guide to

LLM Evaluation

A practical guide to building and implementing evaluation strategies for AI applications

Overview

Chapter Summary

The introduction emphasizes the importance of evaluating large language model (LLM) applications to ensure they meet user expectations and perform reliably. It highlights the shift from traditional software testing to dynamic, context-sensitive evaluations that account for LLMs’ non-deterministic nature.

Get started with LLM evaluation by following our product documentation, which provides step-by-step guidance to begin implementing effective evaluation strategies.

The Definitive Guide to LLM Evaluation provides a structured approach to building, implementing, and optimizing evaluation strategies for applications powered by large language models (LLMs). As the use of LLMs expands across industries, this guide outlines the tools, frameworks, and best practices necessary to evaluate and improve these systems effectively.

The guide begins by introducing LLM as a Judge, an approach where LLMs assess their own or other models’ outputs. This method automates evaluations, reducing the reliance on costly human annotations while providing scalable and consistent assessments. The foundational discussion then moves to different evaluation types, including token classification and synthetic data evaluation, emphasizing the importance of selecting the right approach for specific use cases.

In pre-production stages, curating high-quality datasets is essential for reliable evaluations. The guide details methods like synthetic data generation, human annotation, and benchmarking LLM evaluation metrics to create robust test datasets. These datasets help establish LLM evaluation benchmarks, ensuring that the evaluation process aligns with real-world scenarios.

For teams integrating LLMs into production workflows, we explore CI/CD testing frameworks, which enable continuous iteration and validation of updates. By incorporating experiments and automated tests into pipelines, teams can maintain stability and performance while adapting to evolving requirements.

As applications move into production, LLM guardrails play a critical role in mitigating risks, such as hallucinations, toxic responses, or security vulnerabilities. This section covers input and output validation strategies, dynamic guards, and few-shot prompting techniques for addressing edge cases and attacks.

Finally, we highlight practical use cases, including RAG evaluation, which focuses on assessing retrieval-augmented generation systems for document relevance and response accuracy, to ensure seamless performance across all components. By combining insights from metrics, AI guardrails, and benchmarks, teams can holistically assess their applications’ performance and ensure alignment with business goals.

This guide provides everything needed to evaluate LLMs effectively, from pre-production dataset preparation to production-grade safeguards and ongoing improvement strategies. It is an essential resource for AI teams aiming to deliver reliable, safe, and impactful LLM-powered solutions.

LLM evals approaches diagram
There are many ways to quantify how your LLM application is doing, from user-provided feedback (e.g., thumbs-up/down, accept/reject response) to golden datasets to business metrics (e.g., purchases of recommended items).

Introduction

Why are LLM Evals Important?

LLMs are an incredible tool for developers and business leaders to create new value for consumers. They make personal recommendations, translate between structured and unstructured data, summarize large amounts of information, and do so much more.

As the applications multiply, so does the importance of measuring the performance of LLM-powered systems.

Developers using LLMs build applications to respond to user queries, transform or generate content, and classify and structure data.

It’s easy to start building an AI application using LLMs because developers no longer have to collect labeled data or train a model. They only need to create a prompt to ask the model for what they want. However, this comes with tradeoffs. LLMs are generalized models that aren’t fine-tuned for a specific task. With a standard prompt, these applications demo really well, but in production environments, they often fail in more complex scenarios.

“As the applications multiply, so does the importance of measuring the performance of LLM-powered systems.”

You need a way to judge the quality of your LLM outputs. For example, you might judge the quality of chat outputs on relevance, hallucination percentage, and latency.

When you adjust your prompts or retrieval strategy, evaluation tells you whether your application has improved and by how much. The dataset you evaluate against determines how trustworthy and generalizable your evaluation metrics are to production use. A limited dataset could show high scores on evaluation metrics yet perform poorly in real-world scenarios.

Paradigm Shift: Integration Testing and Unit Testing to LLM Evaluations

While at first glance the shift from traditional software testing methods like integration and unit testing to LLM application evaluations may seem drastic, both approaches share a common goal: ensuring that a system behaves as expected and delivers consistent, reliable outcomes. Fundamentally, both testing paradigms aim to validate the functionality, reliability, and overall performance of an application.

In traditional software engineering:

  • Unit Testing isolates individual components of the code, ensuring that each function works correctly on its own.
  • Integration Testing focuses on how different modules or services work together, validating the correctness of their interactions.

In the world of LLM applications, these goals remain, but the complexity of behavior increases due to the non-deterministic nature of LLMs.

  • Dynamic Behavior Evaluation: Rather than testing isolated code components, LLM evaluations focus on how the application responds to various inputs in real-time, examining not just accuracy but also context relevance, coherence, and user experience.
  • Task-Oriented Assessments: Evaluations are now centered on the application’s ability to complete user-specific tasks, such as resolving queries, generating coherent responses, or interacting seamlessly with external systems (e.g., function calling).

Both paradigms emphasize predictability and consistency, with the key difference being that LLM applications require dynamic, context-sensitive evaluations, as their outputs can vary with different inputs. However, the underlying principle remains: ensuring that the system (whether it’s traditional code or an LLM-driven application) performs as designed, handles edge cases, and delivers value reliably.

 

LLM Eval Types

In this section, we’ll review a number of different ways to approach LLM evaluations: LLM as a Judge, code-based evaluations, and online & offline LLM evaluations.

LLM as a Judge Eval

How LLM as a Judge works

Often called LLM as a judge, LLM-assisted evaluation uses AI to evaluate AI — with one LLM evaluating the outputs of another and providing explanations.

LLM-assisted evaluation is often needed because user feedback or any other “source of truth” is extremely limited and often nonexistent (and even when human labeling is possible, it is expensive), while LLM applications themselves can quickly become complex.

LLM as Judge diagram

Fortunately, we can use the power of LLMs to automate the evaluation. In this eBook, we will delve into how to set this up and make sure it is reliable.

While using AI to evaluate AI may sound circular, we have always had human intelligence evaluate human intelligence (for example, at a job interview or your college finals). Now AI systems can finally do the same for other AI systems.

The process here is for LLMs to generate synthetic ground truth that can be used to evaluate another system. This raises a question: why not use human feedback directly? Put simply, because you often do not have enough of it.

Getting human feedback on even one percent of your input/output pairs is a gigantic feat. Most teams don’t even get that. In such cases, LLM-assisted evals help you benchmark and test in development prior to production. But in order for this process to be truly useful, it is important to have evals on every LLM sub-call, of which we have already seen there can be many.

You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Reference]: {context}
    ************
    [Answer]: {sampled_answer}
    [END DATA]
Your response must be a single word, either “correct” or “incorrect”, and should not contain any text or characters aside from that word. “correct” means that the question is correctly and fully answered by the answer. “incorrect” means that the question is not correctly or only partially answered by the answer.
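To make this concrete, below is a minimal sketch of running a correctness template like the one above through a judge model. It assumes the OpenAI Python client; the model name and the judge_correctness helper are illustrative, and any capable LLM or evaluation framework could be substituted.

# Minimal sketch of an LLM-as-a-judge call (assumes the OpenAI Python client).
from openai import OpenAI

client = OpenAI()

# The correctness prompt shown above, with {question}, {context}, and
# {sampled_answer} placeholders (abbreviated here for space).
CORRECTNESS_TEMPLATE = """You are given a question, an answer and reference text. ...
[Question]: {question}
[Reference]: {context}
[Answer]: {sampled_answer}
Your response must be a single word, either "correct" or "incorrect"."""

def judge_correctness(question: str, context: str, sampled_answer: str) -> str:
    prompt = CORRECTNESS_TEMPLATE.format(
        question=question, context=context, sampled_answer=sampled_answer
    )
    response = client.chat.completions.create(
        model="gpt-4o",                                  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                                   # keep judgments consistent
    )
    return response.choices[0].message.content.strip().lower()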

Code-Based Eval

Code-based LLM evaluations are methods that use programming code to assess the performance, accuracy, or behavior of large language models (LLMs). These evaluations typically involve creating automated scripts or CI/CD test cases to measure how well an LLM performs on specific tasks or datasets. A code-based eval is essentially a Python or JS/TS unit test.

Code-based evaluation is sometimes preferred as a way to reduce costs, since it does not introduce token usage or latency. When evaluating a task such as code generation, a code-based eval is often the preferred method, since it can be hard-coded and follows a set of rules. However, for evaluations that are inherently subjective, like hallucination, no code evaluator can provide that label reliably, in which case LLM as a Judge needs to be used.

Common use cases for code-based evaluators include:

LLM Application Testing

Code-based evaluations can test the LLM’s performance at various levels—focusing on ensuring that the output adheres to the expected format, includes necessary data, and passes structured, automated tests.

  • Test Correct Structure of Output: In many applications, the structure of the LLM’s output is as important as the content. For instance, generating JSON-like responses, specific templates, or structured answers can be critical for integrating with other systems (a sketch of such a check appears after this list).
  • Test Specific Data in Output: Verifying that the LLM output contains or matches specific data points is crucial in domains such as legal, medical, or financial fields where factual accuracy matters.

  • Structured Tests: Automated structured tests can be employed to validate whether the LLM behaves as expected across various scenarios. This might involve comparing the outputs to expected responses or validating edge cases.
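As a simple illustration of the structure and data checks above, here is a pytest-style sketch; generate_invoice_summary and the expected fields are hypothetical stand-ins for your own application code.

# Hypothetical code-based eval: validate structure and specific data, no LLM judge required.
import json

def generate_invoice_summary(document: str) -> str:
    # Stand-in for your LLM application call; assumed to return a JSON string.
    return '{"invoice_id": "123", "total": 42.5, "currency": "USD"}'

def test_output_structure_and_data():
    raw_output = generate_invoice_summary("Invoice #123 ...")
    parsed = json.loads(raw_output)                              # fails if output is not valid JSON
    assert {"invoice_id", "total", "currency"} <= parsed.keys()  # required fields are present
    assert parsed["invoice_id"] == "123"                         # specific data point matches the source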

Evaluating Your Evaluator

Evaluating the effectiveness of your evaluation strategy ensures that you’re accurately measuring the model’s performance and not introducing bias or missing crucial failure cases. Code-based evaluation for evaluators typically involves setting up meta-evaluations, where you evaluate the performance or validity of the evaluators themselves.

In order to evaluate your evaluator, teams need to create a set of hand-annotated test datasets. These test datasets do not need to be large; 100+ examples are typically enough to evaluate your evals. In Arize Phoenix, we include test datasets with each evaluator to help validate performance for each model type.
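A minimal sketch of such a meta-evaluation appears below; the hand-annotated examples and the evaluate_hallucination function are hypothetical placeholders for your own labels and evaluator.

# Hypothetical meta-evaluation: measure agreement between your evaluator and human labels.
hand_annotated = [
    {"input": "Summarize the contract.", "output": "The contract runs for 12 months.", "human_label": "factual"},
    {"input": "Summarize the contract.", "output": "The contract includes a yacht.", "human_label": "hallucinated"},
    # ... roughly 100+ examples is typically enough
]

def evaluate_hallucination(input_text: str, output_text: str) -> str:
    # Stand-in for the evaluator under test (LLM-as-a-judge or code-based).
    return "hallucinated" if "yacht" in output_text else "factual"

matches = sum(
    evaluate_hallucination(ex["input"], ex["output"]) == ex["human_label"]
    for ex in hand_annotated
)
print(f"Evaluator agreement with human labels: {matches / len(hand_annotated):.0%}")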

We recommend this guide for a more in-depth review of how to improve and check your evaluators.

“Evaluating the effectiveness of your evaluation strategy ensures that you’re accurately measuring the model’s performance and not introducing bias or missing crucial failure cases.”

Eval Output Type

Depending on the situation, the evaluation can return different types of results:

  • Categorical (Binary): The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.
  • Categorical (Multi-class): The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.

  • Continuous Score: The evaluation results in a numeric value within a set range (e.g. 1-10), offering a scale of measurement. We don’t recommend using this approach.

  • Categorical Score: A value of either 1 or 0. Categorical scores can be quite useful: you can average them without the disadvantages of a continuous range.

Although score evals are an option, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.

Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.

class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        # Returns a full result: a score, a label, and an explanation
        print("Evaluator Using All Inputs")
        return EvaluationResult(score=1.0, label="good", explanation="explanation of the judgment")

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> float:
        # Returns a bare numeric score
        print("Evaluator Using A float")
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> str:
        # Returns a bare categorical label
        print("Evaluator label")
        return "good"

Online vs Offline Evaluation

Evaluating LLM applications across their lifecycle requires a two-pronged approach: offline and online. Offline LLM evaluation generally happens during pre-production and uses curated or outside datasets to test the performance of your application. Online LLM evaluation happens once your app is in production and runs on production data. The same evaluator can be used to run online and offline evaluations.

Offline LLM Evaluation

Offline LLM evaluation generally occurs during the development and testing phases of the application lifecycle. It involves evaluating the model or system in controlled environments, isolated from live, real-time data. The primary focus of offline evaluation is pre-deployment validation in CI/CD, enabling AI engineers to test the model against a predefined set of inputs (like golden datasets) and gather insights on performance consistency before the model is exposed to real-world scenarios. This process is crucial for:

  • Prompt and Output Validation: Offline tests allow teams to evaluate prompt engineering changes and different model versions before committing them to production. AI engineers can experiment with prompt modifications and evaluate which variants produce the best outcomes across a range of edge cases.

  • Golden Datasets: Evaluating LLMs using golden datasets (high-quality, annotated data) ensures that the LLM application performs optimally in known scenarios. These datasets represent a controlled benchmark, providing a clear picture of how well the LLM application processes specific inputs, and enabling engineers to debug issues before deployment.

  • Pre-production Check: Offline evaluation is well-suited for running CI/CD tests on datasets that reflect complex user scenarios. Engineers can check the results of offline tests and changes prior to pushing those changes to production.

“Having one unified system for both offline and online evaluation allows you to easily use consistent evaluators for both techniques.”

Note: The “offline” part of “offline evaluations” refers to the data that is being used to evaluate the application. In offline evaluations, the data is pre-production data that has been curated and/or generated, instead of production data captured from runs of your application. Because of this, the same evaluator can be used for offline and online evaluations. Having one unified system for both offline and online evaluation allows you to easily use consistent evaluators for both techniques.
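As a rough sketch of that reuse, the snippet below defines one evaluator and applies it to a curated dataset offline and to a production response online; all names here are hypothetical.

# Hypothetical sketch: one evaluator definition, reused offline and online.
def relevance_eval(query: str, response: str) -> str:
    # Stand-in for an LLM-as-a-judge or code-based relevance check; returns a categorical label.
    return "relevant" if any(word in response.lower() for word in query.lower().split()) else "irrelevant"

# Offline: run against a curated golden dataset before release.
golden_dataset = [
    {"query": "What is the refund policy?", "response": "Refunds are available within 30 days."},
]
offline_labels = [relevance_eval(row["query"], row["response"]) for row in golden_dataset]

# Online: run the exact same evaluator on responses captured from production traffic.
def on_production_response(query: str, response: str) -> None:
    label = relevance_eval(query, response)   # identical evaluator, different data source
    print(f"online eval label: {label}")      # in practice, attach the label to the trace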

Online LLM Evaluation

Online LLM evaluation, by contrast, takes place in real-time, during production. Once the application is deployed, it starts interacting with live data and real users, where performance needs to be continuously monitored. Online evaluation provides real-world feedback that is essential for understanding how the application behaves under dynamic, unpredictable conditions. It focuses on:

  • Continuous Monitoring: Applications deployed in production environments need constant monitoring to detect issues such as degradation in performance, increased latency, or undesirable outputs (e.g., hallucinations, toxicity). Automated online evaluation systems can track application outputs in real time, alerting engineers when specific thresholds or metrics fall outside acceptable ranges.

  • Real-Time Guardrails: LLMs deployed in sensitive environments may require real-time guardrails to monitor for and mitigate risky behaviors like generating inappropriate, hallucinated, or biased content. Online evaluation systems can incorporate these guardrails to ensure the LLM application is protected proactively rather than reactively.

Read More About Online vs Offline Evaluations

Check out our quickstart guide to evaluations including online, offline, and code evaluations with templates.

Choosing Between Online and Offline Evaluation

While it may seem advantageous to apply online evaluations universally, they introduce additional costs in production environments. The decision to use online evaluations should be driven by the specific needs of the application and the real-time requirements of the business. AI engineers can typically group their evaluation needs into three categories: offline evaluation, guardrails, and online evaluation.

  • Offline evaluation: Offline evaluations are used for checking LLM application results prior to releasing to production. Use offline evaluations for CI/CD checks of your LLM application.

    Example: Customer service chatbot where you want to make certain changes to a prompt do not break previously correct responses.

  • Guardrail: AI engineers want to know immediately if something isn’t right and block or revise the output. These evaluations run in real-time and block or flag outputs when they detect that the system is veering off-course.

    Example: An LLM application generates automated responses for a healthcare system. Guardrails check for critical errors in medical advice, preventing harmful or misleading outputs from reaching users in real time.

  • Online evaluation: AI engineers don’t want to block or revise the output, but want to know immediately if something isn’t right. This approach is useful when it’s important to track performance continuously but where it’s not critical to stop the model’s output in real time. (A sketch contrasting guardrails with online evaluation appears after this list.)

    Example: An LLM application generates personalized marketing emails. While it’s important to monitor and ensure the tone and accuracy are correct, minor deviations in phrasing don’t require blocking the message. Online evaluations flag issues for review without stopping the email from being sent.
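The sketch below contrasts the last two categories in code; detect_issue is a hypothetical stand-in for whatever real-time check you actually run.

# Hypothetical sketch of the difference between a guardrail and an online evaluation.
def detect_issue(output: str) -> bool:
    # Stand-in for any real-time check (toxicity, PII leakage, hallucination, off-topic, ...).
    return "social security number" in output.lower()

def apply_guardrail(output: str) -> str:
    # Guardrail: block or revise the response before it reaches the user.
    if detect_issue(output):
        return "Sorry, I can't share that information."
    return output

def run_online_eval(output: str) -> None:
    # Online evaluation: never blocks the response; flags it for review instead.
    if detect_issue(output):
        print("flagged for review")  # in practice, log a label or metric on the trace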

What Is the Difference Between LLM Model Evaluation and LLM System Evaluation (AKA Task Evaluations)?

LLM model evaluations look at the overall, macro-level performance of LLMs across an array of tasks. LLM system evaluations, also referred to as LLM task evaluations, are more system- and use-case-specific, evaluating the components an AI engineer building an LLM app can control (e.g., the prompt template or context).

Since the term “LLM evals” gets thrown around interchangeably, this distinction is sometimes lost in practice. It’s critical to know the difference, however.

Why? Often, teams consult LLM leaderboards and libraries when such benchmarks may not be helpful for their particular use case. Ultimately, AI engineers building LLM apps that plug into several models or frameworks or tools need a way to objectively evaluate everything at highly specific tasks – necessitating system evals that reflect that fact.

“LLM system evaluations — also referred to as LLM task evaluations — are more system and use-case specific, evaluating components an AI engineer building an LLM app can control.”

LLM model evals are focused on the overall performance of the foundational models. The companies launching the original customer-facing LLMs needed a way to quantify their effectiveness across an array of different tasks.

LLM Model Evals diagram
In this case, we are evaluating two different open-source foundation models. We are testing the same dataset across the two models and seeing how their metrics, like HellaSwag or MMLU, stack up.

LLM Model Evaluation Metrics

One popular library that has LLM model evals is the OpenAI Eval library, which was originally focused on the model evaluation use case. There are many metrics out there, like HellaSwag (which evaluates how well an LLM can complete a sentence), TruthfulQA (measuring truthfulness of model responses), and MMLU (which measures how well the LLM can multitask). There’s even a leaderboard that looks at how well the open-source LLMs stack up against each other.

LLM system evaluation, also sometimes referred to as LLM task evaluation, is the complete evaluation of components that you have control of in your system. The most important of these components are the prompt (or prompt template) and context. LLM system evals assess how well your inputs can determine your outputs.

LLM system evaluation may, for example, hold the LLM constant and change the prompt template. Since prompts are more dynamic parts of your system, this evaluation makes a lot of sense throughout the lifetime of the project. For example, an LLM can evaluate your chatbot responses for usefulness or politeness, and the same eval can give you information about performance changes over time in production.
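For instance, a system eval that holds the model constant and swaps only the prompt template might look like the sketch below; run_app, usefulness_eval, and the templates are illustrative stand-ins, not a prescribed API.

# Hypothetical sketch: hold the LLM constant, vary only the prompt template.
def run_app(template: str, question: str) -> str:
    # Stand-in for your application call to a fixed foundation model.
    return "Our refund window is 30 days."

def usefulness_eval(question: str, response: str) -> str:
    # Stand-in for an LLM-as-a-judge or code-based usefulness eval.
    return "useful" if response else "not useful"

golden_dataset = [{"question": "What is the refund policy?"}]

templates = {
    "v1": "Answer the customer's question:\n{question}",
    "v2": "You are a polite support agent. Answer concisely:\n{question}",
}

results = {}
for name, template in templates.items():
    labels = [usefulness_eval(r["question"], run_app(template, r["question"])) for r in golden_dataset]
    results[name] = labels.count("useful") / len(labels)   # fraction judged useful per template

print(results)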

LLM System Evals diagram
In this case, we are evaluating two different prompt templates on a single foundational model. We are testing the same dataset across the two templates and seeing how their metrics like precision and recall stack up.

When To Use LLM System Evaluations versus LLM Model Evaluations: It Depends On Your Role

There are distinct personas who make use of LLM evaluations. One is the model developer or an engineer tasked with fine-tuning the core LLM, and the other is the practitioner assembling the user-facing system.

There are very few LLM model developers, and they tend to work for places like OpenAI, Anthropic, Google, Meta, and elsewhere. Model developers care about LLM model evals, as their job is to deliver a model that caters to a wide variety of use cases.

For ML practitioners, the task also starts with model evaluation. One of the first steps in developing an LLM system is picking a model (e.g., GPT-3.5 vs. GPT-4 vs. PaLM). The LLM model eval for this group, however, is often a one-time step. Once the question of which model performs best for your use case is settled, the majority of the rest of the application’s lifecycle will be defined by LLM system evals. Thus, practitioners care about both LLM model evals and LLM system evals but likely spend much more time on the latter.

LLM System Evaluation Metrics

Having worked with other systems, your first question is likely this: “What should the outcome metric be?” The answer depends on what you are trying to evaluate.

  • Extracting structured information: You can look at how well the LLM extracts information. For example, you can look at completeness (is there information in the input that is not in the output?).
  • Question answering: How well does the system answer the user’s question? You can look at the accuracy, politeness, or brevity of the answer—or all of the above.
  • Retrieval Augmented Generation (RAG): Are the retrieved documents and final answer relevant?

As a system designer, you are ultimately responsible for system performance, so it is up to you to understand which aspects of the system need to be evaluated. For example, if you have an LLM interacting with children, like a tutoring app, you would want to make sure that the responses are age-appropriate and not toxic.

What Are the Top LLM System Evaluation Metrics?

The most common LLM evaluation metrics employed today are relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. Each of these LLM system evals will have different templates based on what you are trying to evaluate. A fuller list of LLM system evaluation metrics appears below.

Type | Description | Example Metrics
Diversity | Examines the versatility of foundation models in responding to different types of queries | Fluency, Perplexity, ROUGE scores
User Feedback | Goes beyond accuracy to look at response quality in terms of coherence and usefulness | Coherence, Quality, Relevance
Ground Truth-Based Metrics | Compares a RAG system’s responses to a set of predefined, correct answers | Accuracy, F1 score, Precision, Recall
Answer Relevance | How relevant the LLM’s response is to a given user query | Binary classification (Relevant/Irrelevant)
QA Correctness | Based on retrieved data, is an answer to a question correct? | Binary classification (Correct/Incorrect)
Hallucinations | Looking at LLM hallucinations with regard to retrieved context | Binary classification (Factual/Hallucinated)
Toxicity | Are responses racist, biased, or toxic? | Disparity Analysis, Fairness Scoring, Binary classification (Non-Toxic/Toxic)

What are you Evaluating?

When evaluating LLM applications, the primary focus is on three key areas: the task, historical performance, and golden datasets. Task-level evaluation ensures that the application is performing well on specific use cases, while historical traces provide insight into how the application has evolved over time. Meanwhile, golden datasets act as benchmarks, offering a consistent way to measure performance against well-established ground truth data.