LLM Evaluation: Everything You Need To Run and Benchmark LLM Evals

Aparna Dhinakaran,  Co-founder & Chief Product Officer  | Published January 01, 2024

This piece is co-authored by senior machine learning engineer Ilya Reznik.

Large language models (LLMs) are an incredible tool for developers and business leaders to create new value for consumers. They make personal recommendations, translate between structured and unstructured data, summarize large amounts of information, and do so much more.

As the applications multiply, so does the importance of measuring the performance of LLM-powered systems.

LLM Evaluation

LLM evaluation refers to the discipline of ensuring a language model’s outputs are consistent with the desired ethical, safety, and performance criteria — ultimately aligning with human values and intents. Some LLM evaluations look at the model’s ability to perform specific tasks accurately and reliably, while others measure overall behavior, biases, and adherence to alignment objectives.

This piece explores the essentials of LLM evaluation, LLM evaluation metrics, and a concrete exercise with everything you need to get started.

Approaches

[Diagram: approaches to LLM evals]

There are many ways to quantify how your LLM application is doing, from user-provided feedback (e.g., thumbs-up/down or accept/reject responses) to golden datasets to business metrics (e.g., recommended item purchases). This article focuses on LLM as a judge for the reasons specified below.

Why Use LLM-Assisted Evaluation?

[Diagram: how LLM as a judge works]

LLM As a Judge

Often called LLM as a judge, LLM-assisted evaluation uses AI to evaluate AI — with one LLM evaluating the outputs of another and providing explanations.

LLM-assisted evaluation is often needed because user feedback or any other “source of truth” is extremely limited and often nonexistent; even when it is available, human labeling is expensive; and LLM applications quickly become complex.

Fortunately, we can use the power of LLMs to automate the evaluation. In this article, we will delve into how to set this up and make sure it is reliable.

While using AI to evaluate AI may sound circular, we have always had human intelligence evaluate human intelligence (for example, at a job interview or your college finals). Now AI systems can finally do the same for other AI systems.
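Concretely, the pattern looks something like this. Below is a minimal sketch assuming the openai Python client (v1+); the judge prompt is a hypothetical placeholder, and later sections show benchmarked templates you should prefer in practice:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical output from the application LLM that we want to grade
question = "What is the capital of France?"
app_answer = "The capital of France is Paris."

# Hypothetical judge prompt; a production eval should use a benchmarked template
judge_prompt = f"""You are grading an AI assistant's answer to a question.
Question: {question}
Answer: {app_answer}
Respond with a single word, either "correct" or "incorrect"."""

judgment = client.chat.completions.create(
    model="gpt-4",
    temperature=0.0,  # keep the judge deterministic
    messages=[{"role": "user", "content": judge_prompt}],
)
print(judgment.choices[0].message.content)  # e.g. "correct"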

The process here is for LLMs to generate synthetic ground truth that can be used to evaluate another system. This raises a question: why not use human feedback directly? Put simply, because you often do not have enough of it.

Getting human feedback on even one percent of your input/output pairs is a gigantic feat. Most teams don’t even get that. In such cases, LLM-assisted evals help you benchmark and test in development prior to production. But in order for this process to be truly useful, it is important to have evals on every LLM sub-call, of which we have already seen there can be many.

[Screenshot: a complex LangChain trace with many spans to debug]

What Is the Difference Between LLM Model Evaluation and LLM System Evaluation (AKA Task Evaluations)?

LLM_model_evals != LLM_System_evals

LLM model evaluations look at the overall macro performance of LLMs across an array of tasks, while LLM system evaluations — also referred to as LLM task evaluations — are more system- and use-case-specific, evaluating the components an AI engineer building an LLM app can control (e.g., the prompt template or context).

Since the term “LLM evals” gets used interchangeably for both, this distinction is sometimes lost in practice. It’s critical to know the difference, however.

Why? Often, teams consult LLM leaderboards and libraries when such benchmarks may not be helpful for their particular use case. Ultimately, AI engineers building LLM apps that plug into several models or frameworks or tools need a way to objectively evaluate everything at highly specific tasks – necessitating system evals that reflect that fact.

What Are LLM Model Evaluations (Evals)?

LLM model evals are focused on the overall performance of the foundational models. The companies launching the original customer-facing LLMs needed a way to quantify their effectiveness across an array of different tasks.

[Diagram: LLM model evals]
In this case, we are evaluating two different open-source foundation models. We are testing the same dataset across the two models and seeing how their metrics, like HellaSwag or MMLU, stack up.

LLM Model Evaluation Metrics

One popular library that has LLM model evals is the OpenAI Evals library, which was originally focused on the model evaluation use case. There are many metrics out there, like HellaSwag (which evaluates how well an LLM can complete a sentence), TruthfulQA (which measures the truthfulness of model responses), and MMLU (which measures an LLM’s multitask accuracy across a broad range of subjects). There’s even a leaderboard that looks at how well the open-source LLMs stack up against each other.

[Image: Hugging Face OpenLLM Leaderboard]

What Are LLM System Evaluations (Evals)?

Up to this point, we have discussed LLM model evaluation. In contrast, LLM system evaluation, also sometimes referred to as LLM task evaluation, is the complete evaluation of components that you have control of in your system. The most important of these components are the prompt (or prompt template) and context. LLM system evals assess how well your inputs can determine your outputs.

LLM system evaluation may, for example, hold the LLM constant and change the prompt template. Since prompts are more dynamic parts of your system, this evaluation makes a lot of sense throughout the lifetime of the project. For example, an LLM can evaluate your chatbot responses for usefulness or politeness, and the same eval can give you information about performance changes over time in production.

[Diagram: LLM system evaluation]
In this case, we are evaluating two different prompt templates on a single foundation model. We are testing the same dataset across the two templates and seeing how their metrics, like precision and recall, stack up.

When To Use LLM System Evaluations versus LLM Model Evaluations: It Depends On Your Role

There are distinct personas who make use of LLM evaluations. One is the model developer or an engineer tasked with fine-tuning the core LLM, and the other is the practitioner assembling the user-facing system.

There are very few LLM model developers, and they tend to work at places like OpenAI, Anthropic, Google, and Meta. Model developers care about LLM model evals, as their job is to deliver a model that caters to a wide variety of use cases.

For ML practitioners, the task also starts with model evaluation. One of the first steps in developing an LLM system is picking a model (e.g., GPT-3.5 vs. GPT-4 vs. PaLM). The LLM model eval for this group, however, is often a one-time step. Once the question of which model performs best for your use case is settled, the majority of the rest of the application’s lifecycle will be defined by LLM system evals. Thus, ML practitioners care about both LLM model evals and LLM system evals but likely spend much more time on the latter.

LLM System Evaluation Metrics

Having worked with other ML systems, your first question is likely this: “What should the outcome metric be?” The answer depends on what you are trying to evaluate.

  • Extracting structured information: You can look at how well the LLM extracts information. For example, you can look at completeness (is there information in the input that is not in the output?).
  • Question answering: How well does the system answer the user’s question? You can look at the accuracy, politeness, or brevity of the answer—or all of the above.
  • Retrieval Augmented Generation (RAG): Are the retrieved documents and final answer relevant?

As a system designer, you are ultimately responsible for system performance, and so it is up to you to understand which aspects of the system need to be evaluated. For example, if you have an LLM interacting with children, like a tutoring app, you would want to make sure that the responses are age-appropriate and not toxic.
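As a sketch of what such a custom check might look like, here is a hypothetical age-appropriateness template; the wording and output labels are illustrative only, not taken from any library:

# Hypothetical eval template for the tutoring-app example above; the wording
# and output labels are illustrative only.
AGE_APPROPRIATE_EVAL_TEMPLATE = """You are reviewing a response written for a child.
[Response]: {response}
Determine whether the response is appropriate for children: no profanity,
no adult themes, and no unsafe instructions.
Answer with a single word, either "appropriate" or "inappropriate"."""

def build_eval_prompt(response: str) -> str:
    """Fill the template for a single application output."""
    return AGE_APPROPRIATE_EVAL_TEMPLATE.format(response=response)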

What Are the Top LLM System Evaluation Metrics?

The most common LLM evaluation metrics being employed today are relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. Each one of these LLM system evals will have different templates based on what you are trying to evaluate. A fuller list of LLM system evaluation metrics appears below.

  • Diversity: examines the versatility of foundation models in responding to different types of queries. Example metrics: Fluency, Perplexity, ROUGE scores.
  • User Feedback: goes beyond accuracy to look at response quality in terms of coherence and usefulness. Example metrics: Coherence, Quality, Relevance.
  • Ground Truth-Based Metrics: compares a RAG system’s responses to a set of predefined, correct answers. Example metrics: Accuracy, F1 score, Precision, Recall.
  • Answer Relevance: how relevant the LLM’s response is to a given user query. Example metric: binary classification (Relevant/Irrelevant).
  • QA Correctness: based on retrieved data, is the answer to a question correct? Example metric: binary classification (Correct/Incorrect).
  • Hallucinations: looks at LLM hallucinations with regard to the retrieved context. Example metric: binary classification (Factual/Hallucinated).
  • Toxicity: are responses racist, biased, or toxic? Example metrics: Disparity Analysis, Fairness Scoring, binary classification (Non-Toxic/Toxic).

What LLM Evaluation Metrics are Useful for Retrieval Augmented Generation?

Contextual relevance, faithfulness, and needle in a haystack tests can be useful when evaluating the retrieval portion of LLM RAG evaluation. These metrics and tests help in navigating the potential complexity of these LLM systems when used in tandem with task evaluations.

Prevailing LLM Retrieval Metrics

Contextual relevance and faithfulness are two of the most widely used metrics for assessing the accuracy and relevance of retrieved documents when leveraging LLM RAG.

  • Contextual relevance looks at the relevance of the retrieved context to the original query. This can be a binary classification of relevant/irrelevant, or ranking metrics can be used (e.g., MRR, Precision@K, MAP, NDCG; see the sketch after this list).
  • Faithfulness or groundedness looks at how much the foundation model’s response aligns with retrieved context. For example, this can be a binary classification of faithful or unfaithful.
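If you opt for ranking metrics, here is a minimal sketch of Precision@K and MRR computed over per-query relevance judgments (which would come from your eval labels or ground truth):

from typing import List

def precision_at_k(relevant: List[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks judged relevant."""
    top_k = relevant[:k]
    return sum(top_k) / k if k else 0.0

def mean_reciprocal_rank(relevance_lists: List[List[bool]]) -> float:
    """Average of 1/rank of the first relevant chunk across queries (0 if none)."""
    reciprocal_ranks = []
    for relevant in relevance_lists:
        rr = 0.0
        for rank, is_relevant in enumerate(relevant, start=1):
            if is_relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: relevance judgments for the top retrieved chunks of two queries
print(precision_at_k([True, False, True, False], k=3))       # 0.666...
print(mean_reciprocal_rank([[False, True, False], [True]]))  # (0.5 + 1.0) / 2 = 0.75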

Concrete Example: Evaluating Relevance of Context

Here is an example of evaluating the relevance of context, using the open-source Arize Phoenix tool for simplicity. Within the Phoenix tool, there exist default templates for the most common use cases. Here is the one we will use for this example:

Relevance Eval Template:
You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either “relevant” or “irrelevant”,
and should not contain any text or characters aside from that word.
“irrelevant” means that the reference text does not contain an answer to the Question.
“relevant” means the reference text contains an answer to the Question.

First, we will import all necessary dependencies:

from phoenix.experimental.evals import (
   RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
   RAG_RELEVANCY_PROMPT_RAILS_MAP,
   OpenAIModel,
   download_benchmark_dataset,
   llm_eval_binary,
)
from sklearn.metrics import precision_recall_fscore_support

Now, let’s bring in the dataset:

# Download a "golden dataset" built into Phoenix
benchmark_dataset = download_benchmark_dataset(
   task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
# For the sake of speed, we'll just sample 100 examples in a repeatable way
benchmark_dataset = benchmark_dataset.sample(100, random_state=2023)
benchmark_dataset = benchmark_dataset.rename(
   columns={
       "query_text": "query",
       "document_text": "reference",
   },
)
# Match the label between our dataset and what the eval will generate
y_true = benchmark_dataset["relevant"].map({True: "relevant", False: "irrelevant"})

Now let’s conduct our evaluation:

# Any general purpose LLM should work here, but it is best practice to keep the temperature at 0
model = OpenAIModel(
   model_name="gpt-4",
   temperature=0.0,
)
# Rails will define our output classes
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())


benchmark_dataset["eval_relevance"] = \
   llm_eval_binary(benchmark_dataset,
                   model,
                   RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
                   rails)
y_pred = benchmark_dataset["eval_relevance"]


# Calculate evaluation metrics
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
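Note that with string labels and no averaging argument, precision_recall_fscore_support returns per-class arrays. For a quick, readable summary, one option (a sketch reusing y_true and y_pred from above) is:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support for "relevant" and "irrelevant"
print(classification_report(y_true, y_pred))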

How To Get AI To Evaluate AI

There are two distinct steps to the process of evaluating your LLM-based system with an LLM. First, establish a benchmark for your LLM evaluation metric. To do this, you put together a dedicated LLM-based eval whose only task is to label data as effectively as the humans who labeled your “golden dataset,” and you then benchmark that eval against the golden dataset. Second, run this LLM evaluation metric against the results of your LLM application (more on this below).

How To Build An LLM Evaluation

The first step, as we covered above, is to build a benchmark for your evaluations.

To do that, you must begin with a metric best suited for your use case. Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the “ground truth” label so that we can measure performance of the LLM eval template. Often such labels come from human feedback. Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above).

[Diagram: start with a golden dataset for your LLM eval]

Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

[Diagram: selecting an LLM for the eval]

Now comes the core component that we are trying to benchmark and improve: the eval template. If you’re using an existing library like OpenAI or Phoenix, you should start with an existing template and see how that prompt performs.

If there is a specific nuance you want to incorporate, adjust the template accordingly or build your own from scratch. Keep in mind that the template should have a clear structure. Be explicit about the following:

  1. What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
  2. What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query.
  3. What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

[Diagram: the eval template]

You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1-score, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.

If you are not satisfied with the performance of your LLM evaluation template, you need to change the prompt to make it perform better. This is an iterative process informed by hard metrics. As is always the case, it is important to avoid overfitting the template to the golden dataset. Make sure to have a representative holdout set or run a k-fold cross-validation.
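For example, here is a minimal sketch of holding out part of the golden dataset before iterating on the template, assuming a pandas DataFrame like the benchmark dataset above:

from sklearn.model_selection import train_test_split

# Tune the eval template against the dev split only; score the final template
# once on the untouched holdout split to check for overfitting.
dev_set, holdout_set = train_test_split(
    benchmark_dataset,
    test_size=0.2,
    random_state=2023,
    stratify=benchmark_dataset["relevant"],  # keep class balance similar in both splits
)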

[Diagram: generate metrics]

Finally, you arrive at your benchmark. The optimized performance on the golden dataset represents how confident you can be in your LLM eval. It will not be as accurate as your ground truth, but it will be accurate enough, and it will cost much less than having a human labeler in the loop on every example.

[Diagram: precision and recall]

LLM Benchmarks: Why Is It Important To Use Precision and Recall Rather Than Just Accuracy?

Using accuracy alone when benchmarking an LLM prompt template can be misleading when there is class imbalance in the data or you need to optimize against specific business goals (e.g., minimizing costly false negatives); using precision, recall, and F-score can offer important insight.

The industry has not fully standardized best practices on LLM evals. Given that LLM-assisted task evals built around a golden dataset are essentially assessing a fancy classification model for cognitive tasks, it pays to heed lessons from this area of machine learning.

This is one of the most common problems in data science in action: very significant class imbalance makes accuracy an impractical metric.

Thinking about it in terms of the relevance metric is helpful. Say you go through all the trouble and expense of putting together the most relevant chatbot you can. You pick an LLM and a template that are right for the use case. This should mean that significantly more of your examples should be evaluated as “relevant.” Let’s pick an extreme number to illustrate the point: 99.99% of all queries return relevant results. Hooray!

Now look at it from the point of view of the LLM eval template. If the template simply output “relevant” in all cases, without even looking at the data, it would be right 99.99% of the time. But it would simultaneously miss all of the (arguably most) important cases — ones where the model returns irrelevant results, which are the very ones we must catch.

In this example, accuracy would be high, but precision and recall (or a combination of the two, like the F1 score) would be very low. Precision and recall are a better measure of your model’s performance here.

The other useful visualization is the confusion matrix, which lets you see what percentage of relevant and irrelevant examples were predicted correctly and incorrectly.

In this example, we see that the highest percentage of predictions are correct: a relevant example in the golden dataset has an 88% chance of being labeled as such by our eval. However, we see that the eval performs significantly worse on “irrelevant” examples, mislabeling them more than 27% of the time.
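Continuing the Phoenix relevance example above, a sketch of computing such a confusion matrix, normalized by true class so each row shows how a true label was predicted:

from sklearn.metrics import confusion_matrix

labels = ["relevant", "irrelevant"]
# Each row is a true class, normalized so that it sums to 1
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm)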

How To Run LLM Evals On Your Application

At this point you should have both your LLM application and your tested LLM eval. You have proven to yourself that the eval works and have a quantifiable understanding of its performance against a golden dataset.

Now we can actually use our eval on our application. This will help us measure how well our LLM application is doing and figure out how to improve it.

[Diagram: how LLM evaluation works]

The LLM system eval runs your entire system with one extra step (a code sketch follows the steps below). For example:

  1. You retrieve your input docs and add them to your prompt template, together with sample user input.
  2. You provide that prompt to the LLM and receive the answer.
  3. You provide the prompt and the answer to your eval, asking it if the answer is relevant to the prompt.
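Putting those three steps into code, a minimal sketch might look like the following; retrieve_documents, PROMPT_TEMPLATE, and EVAL_TEMPLATE are hypothetical stand-ins for your own retriever and templates, and the openai Python client is assumed:

from openai import OpenAI

client = OpenAI()

def retrieve_documents(query: str) -> list[str]:
    """Hypothetical retriever; replace with your own vector store lookup."""
    return ["...retrieved document text..."]

PROMPT_TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {question}"
EVAL_TEMPLATE = (
    "Is the answer relevant to the prompt? Respond with a single word, "
    '"relevant" or "irrelevant".\n[Prompt]: {prompt}\n[Answer]: {answer}'
)

def answer_and_evaluate(user_query: str) -> dict:
    # 1. Retrieve context and add it to the prompt template with the user input
    docs = retrieve_documents(user_query)
    prompt = PROMPT_TEMPLATE.format(context="\n".join(docs), question=user_query)

    # 2. Provide the prompt to the LLM and receive the answer
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # 3. Provide the prompt and answer to the eval and ask if the answer is relevant
    verdict = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(prompt=prompt, answer=answer)}],
    ).choices[0].message.content

    return {"answer": answer, "eval_relevance": verdict}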

It is a best practice not to do LLM evals with one-off code but rather to use a library that has built-in prompt templates. This increases reproducibility and allows for more flexible evaluation where you can swap out different pieces.

These evals need to work in three different environments:

  1. Pre-production when you’re doing the benchmarking.
  2. Pre-production when you’re testing your application. This is somewhat similar to the offline evaluation concept in traditional ML. The idea is to understand the performance of your system before you ship it to customers.
  3. Production when it’s deployed. Life is messy. Data drifts, users drift, models drift, all in unpredictable ways. Just because your system worked well once doesn’t mean it will do so on Tuesday at 7 p.m. Evals help you continuously understand your system’s performance after deployment.

[Diagram: where you need LLM evaluation across the LLM system lifecycle]

Questions To Consider When Building an LLM Evaluation Strategy

How many rows should you sample?

The LLM-evaluating-LLM paradigm is not magic. You cannot evaluate every example you have ever run across—that would be prohibitively expensive. However, you already have to sample data during human labeling, and having more automation only makes this easier and cheaper. So you can sample more rows than you would with human labeling.

Which evals should you use?

This depends largely on your use case. For retrieval augmented generation, relevancy-type evals work best. Toxicity and hallucinations have specific eval patterns.

Some of these evals are important in the troubleshooting flow. Question-answering accuracy might be a good overall metric, but if you dig into why this metric is underperforming in your system, you may discover it is because of bad retrieval, for example. There are often many possible reasons, and you might need multiple metrics to get to the bottom of it.

Should I build my own custom LLM evaluation template?

Many AI engineers — particularly in industries where the data is highly proprietary — may find their use case is best served by building their own evaluation template, pre-tested with a golden dataset. Others find that leveraging or adapting pre-tested LLM task eval templates — like these from Phoenix, which are tested against golden datasets available as part of the LLM evaluation library’s benchmarked datasets — can be useful for common LLM system tasks. As always, it pays to test and iterate often.

As Lou Kratz, Principal Research Engineer at Bazaarvoice, notes: when “you have some unstructured data that you want the LLM to process, really take the time to think through what you’re going to ask it and make one eval that asks that and only that…write your prompt and give it examples if you can. Obviously LLMs are better when they’re few-shot learners than when they’re zero-shot learners; ask it to output something in the category in a format you can parse.”

What is the difference between online and offline LLM evaluations?

Offline LLM evaluation generally happens in code, with results pushed to the platform and used for testing prompt changes, development tests, and complex evals of online data. Online LLM evaluation, by contrast, runs as data is received. The former makes sense for a testing pipeline, while production use cases are well served by online evaluation.

What model should I use — and how should I evaluate model changes or new models?

It is impossible to say that one model works best for all cases. Instead, you should run model evaluations to understand which model is right for your application. You may also need to consider tradeoffs of recall vs. precision, depending on what makes sense for your application. In other words, do some data science to understand this for your particular case.


💡 This video walks through a structured, easy way to experiment with different models and prompt changes in LLM apps and compare the results.

As always, digesting the latest AI research may be helpful in assessing different foundation models. When using LLMs for time series, for example, Anthropic’s Claude 3 Opus appears to have an edge over OpenAI’s GPT-4.

When and Where Should I Use the Needle In a Haystack Test?

The Needle in a Haystack Test, first created by Gregory Kamradt to help with RAG LLM use cases, is an approach for understanding how well different LLMs can retrieve information buried in context of varying lengths.

While it’s a relatively new technique, the test is very useful for dealing with the relatively common problem faced by AI engineers of LLMs “forgetting” information or overlooking parts of context included in a prompt. It functions by putting specific, tailored information (the “needle”) within a larger, more complex body of text (the “haystack”). The end goal is assessing the LLM’s ability to identify and utilize a specific tidbit in a vast sea of data.
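Here is a minimal sketch of constructing a single needle-in-a-haystack trial; the filler text, needle, and question are all hypothetical:

# Hypothetical filler ("haystack"), planted fact ("needle"), and retrieval question
FILLER = "The committee reviewed the quarterly logistics report in detail. "
NEEDLE = "The secret passphrase for the demo is 'blue-canary-42'. "
QUESTION = "What is the secret passphrase for the demo?"

def build_haystack(num_sentences: int, needle_depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    sentences = [FILLER] * num_sentences
    sentences.insert(int(needle_depth * num_sentences), NEEDLE)
    return "".join(sentences)

# Vary context length and needle depth, send the prompt to the model under test,
# and check whether the response contains the planted passphrase.
prompt = build_haystack(num_sentences=500, needle_depth=0.5) + "\n\n" + QUESTION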

What About Numeric Evaluations?

[Diagram: categorical versus numeric evaluations]

Numeric evaluations involve an LLM responding with a number based on defined evaluation criteria, such as how satisfied a client is on a scale from one to ten based on the transcript of a recent customer service call. Categorical evaluations are where an LLM chooses from predefined, often text-based options, like positive/negative or correct/incorrect.

Recent research of major foundation models — including OpenAI’s GPT-4, Anthropic’s Claude, and Mistral AI’s Mixtral-8x7b — suggests that LLMs do not handle continuous ranges well enough yet to be used for numeric score evaluations. Instead, categorical evaluations tend to perform more consistently. These can be binary or multi-output.
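For example, a categorical eval can be kept parseable by mapping the model’s raw output onto a small set of allowed labels; the rails and labels below are hypothetical:

# Hypothetical rails for a multi-class categorical eval; anything outside the
# allowed set is treated as unparseable rather than coerced into a score.
RELEVANCE_RAILS = {
    "fully relevant": "fully_relevant",
    "partially relevant": "partially_relevant",
    "not relevant": "not_relevant",
}

def parse_eval_output(raw_output: str) -> str:
    return RELEVANCE_RAILS.get(raw_output.strip().lower(), "unparseable")

print(parse_eval_output("Partially relevant"))  # "partially_relevant"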

Conclusion

Being able to evaluate the performance of your application is very important when it comes to production code. In the era of LLMs, the problems have gotten harder, but luckily we can use the very technology of LLMs to help us in running evaluations. Such evaluation should test the whole system and not just the underlying LLM model—think about how much a prompt template matters to user experience. Best practices, standardized tooling, and curated datasets simplify the job of developing LLM systems.