LLM Evaluation
LLM evaluation measures how well LLM-generated responses adhere to desired standards of performance, ethics, and safety. The goal of LLM evaluation is to ensure that an LLM's outputs align with its intended behavior.
LLM evaluations can take many forms, from code-based comparisons against ground-truth data, to LLM as a Judge queries to validate outputs. This resource covers different types of LLM evals, how they are used, and important factors to consider when structuring your LLM evaluation system.
Why is LLM Evaluation Important?
LLMs are non-deterministic by nature: the same input can produce a wide range of outputs. This means that applications built on LLMs require a different approach to testing than traditional software applications.
In many ways, LLM evaluation is the integration testing of the AI world. Evals are designed to check for regressions in your application, screen for large performance issues, and provide confidence in your application’s performance.
LLM evaluation helps ensure that despite the variability of the LLM, the application maintains a consistent level of quality and reliability. Without evaluation, it is very difficult to know if the application is performing as intended.
Additionally, LLM evaluation allows for both qualitative and quantitative understanding of an application's performance. For example, evaluating responses for hallucinations is a common qualitative measurement, and tracking the hallucination rate over time is key to understanding whether a model's performance is improving or degrading.
LLM evaluations also allow for meaningful comparisons between different versions of the application. For example, changing the model or prompt will have downstream effects, and evaluations make it possible to compare those effects. The application developer can run these experiments to decide which model and prompt lead to the best application performance.
Furthermore, LLM evaluation is very useful for identifying unknown edge cases. For high-volume applications, it isn't possible to inspect every individual interaction, but evaluation can reveal previously unseen scenarios where the LLM performs poorly. This allows developers to implement safeguards or make changes to the application to account for these cases.
What are the Key LLM Evaluations and Metrics?
The specific aspects of LLM responses that need to be evaluated are unique for each application, as each use case is different. Ultimately, application owners have to decide what is important for their specific use case.
For example, an internal RAG chatbot application built for answering HR-related questions for employees might be more concerned about providing relevant answers and less concerned about tone of the responses than a customer-facing application that is designed for interacting with children.
For a text-to-SQL application, it would make sense to evaluate whether the returned SQL syntax is correct. It might also make sense to evaluate whether the LLM-generated SQL actually returns the intended results.
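As a rough, framework-agnostic illustration, one way to check both syntax and results is to execute the generated SQL against a small fixture database and compare its result set to a hand-written reference query. The schema and queries below are hypothetical:

```python
import sqlite3

def sql_results_match(generated_sql: str, reference_sql: str) -> bool:
    """Execute both queries against a small in-memory fixture DB and
    compare their result sets (order-insensitive)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical fixture schema and rows, for illustration only.
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.25)],
    )
    try:
        generated = conn.execute(generated_sql).fetchall()  # also catches syntax errors
    except sqlite3.Error:
        return False
    reference = conn.execute(reference_sql).fetchall()
    return sorted(generated) == sorted(reference)

# Example: does the generated query return the intended results?
print(sql_results_match(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer",
    "SELECT customer, SUM(total) AS t FROM orders GROUP BY customer",
))  # -> True
```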
While choosing what to evaluate for each application is unique, there are some standard evaluations that we see:
| Evaluator | What does it do? |
| --- | --- |
| Hallucinations | Evaluates whether an output contains information not present in the reference text, given an input query |
| Question Answering | Evaluates whether an output fully and correctly answers a question, given an input query and reference documents |
| Retrieved Document Relevancy | Evaluates whether a retrieved reference document is relevant or irrelevant to the corresponding input |
| Toxicity | Evaluates whether text contains racist, sexist, chauvinistic, biased, or otherwise toxic content |
| Summarization | Evaluates whether an output summary provides an accurate synopsis of an input document |
| Code Generation | Evaluates whether generated code correctly implements the query |
| Human vs AI | Compares human-written text against LLM-generated text |
| Citation | Checks whether a citation correctly answers the question by looking at the text on the cited page and the conversation |
| User Frustration | Checks whether the user is frustrated in the conversation |
| SQL Generation | Checks whether generated SQL is correct for the question |
| Tool Calling | Checks whether tool calls and their extracted parameters are correct (a minimal sketch follows this table) |
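To make the tool-calling row concrete, here is a minimal sketch of a code-based check that compares an actual tool call against an expected one; the call format and function name are hypothetical:

```python
def tool_call_correct(expected: dict, actual: dict) -> bool:
    """Check that the model called the expected function with the expected
    arguments. Extra or missing arguments count as a failure."""
    return (
        expected["name"] == actual["name"]
        and expected["arguments"] == actual["arguments"]
    )

expected_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
actual_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
print(tool_call_correct(expected_call, actual_call))  # -> True
```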
How to Run LLM Evaluations
There are countless evaluations that can be used to measure the performance of an LLM application. While selecting the right evaluations for the use case is a key decision, it is equally important to select the right approach to running them.
One unique factor to consider within LLM evaluation is that regressions can occur even when your underlying application hasn't changed. The LLM your application uses may begin to give different responses, or the inputs to your application may drift. Because of this, it is important to run LLM evaluations regularly, rather than only tying runs to application updates.
What are the Prevailing Evaluation Methods?
Methods for evaluating LLM applications include human labeling, user feedback, LLM as a judge, and ground truth and business metric comparisons. Each has benefits and drawbacks relevant to different use cases.
In practice, we recommend combining multiple LLM evaluation methods. Because each has its own advantages and drawbacks, the best route to full coverage is to sample from each technique.
How is Human Labeling Used in LLM Systems?
Human labeling is a widely used method for evaluating LLMs. This approach involves having humans assess LLM outputs based on criteria that the application owners deem to be important indicators of performance. These criteria can include the standard evaluations mentioned above, as well as countless other use case-specific evaluations. Human labelers can even combine multiple evaluations into an overall score to rate performance.
One advantage of this approach is the nuanced, context-aware evaluation that human labelers provide. Humans who are experts on a topic will generally provide the most accurate evaluations for that topic. Humans can also think critically and creatively, which is needed for judging some LLM responses, and they can identify potential biases or errors that automated metrics might miss.
However, human labeling is incredibly time consuming and expensive. Many LLM applications generate millions of responses each day, and labeling that volume by hand is infeasible from both a cost and time perspective. Labeling a sample of the LLM responses is an alternative, but this introduces sampling bias and can miss the edge cases that lead to critical issues.
It’s best to use human labeling either early on in the development stage of your application, or as a complement to other evaluation methods. For example, human labeled data can serve as great few-shot examples to augment LLM-based evaluations.
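For instance, a handful of human-labeled examples can be appended to a judge prompt as few-shot demonstrations. A minimal sketch, in which the labeled records and template wording are hypothetical:

```python
# A few human-labeled (query, answer, label) records collected during development.
labeled_examples = [
    {"query": "What is our PTO policy?", "answer": "Employees accrue 1.5 days per month.", "label": "factual"},
    {"query": "When was the company founded?", "answer": "It was founded on the moon in 1802.", "label": "hallucinated"},
]

def build_fewshot_block(examples: list[dict]) -> str:
    """Render human-labeled examples as few-shot demonstrations for a judge prompt."""
    lines = []
    for ex in examples:
        lines.append(f"Query: {ex['query']}\nAnswer: {ex['answer']}\nLabel: {ex['label']}\n")
    return "\n".join(lines)

judge_prompt = (
    "Label each answer as 'factual' or 'hallucinated'.\n\n"
    "Here are labeled examples:\n\n"
    + build_fewshot_block(labeled_examples)
    + "\nQuery: {query}\nAnswer: {answer}\nLabel:"
)
```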
How is User Feedback Used in Evaluating LLM Systems?
User feedback can be another useful approach for evaluating the performance of an LLM application. This approach leverages the application’s actual users to assess various aspects such as relevance, usefulness, and overall quality of the LLM outputs. Generally, applications will offer the ability to provide a positive (thumbs up) or negative (thumbs down) review of the application at the conversation (session) level. However, some applications offer the ability for the user to rate each individual message, and others allow for free-text feedback at the session or message level.
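As an illustration, this kind of feedback is typically stored as simple records tied to a session and, optionally, to an individual message; the field names below are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackRecord:
    session_id: str
    message_id: Optional[str]       # None for session-level feedback
    rating: int                     # +1 for thumbs up, -1 for thumbs down
    comment: Optional[str] = None   # optional free-text feedback

feedback = [
    FeedbackRecord("sess-1", None, +1),
    FeedbackRecord("sess-2", "msg-7", -1, "Answer ignored my question"),
]

# Simple aggregate: share of positive ratings among all submitted feedback.
positive_rate = sum(1 for f in feedback if f.rating > 0) / len(feedback)
print(f"{positive_rate:.0%} positive")  # -> 50% positive
```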
User feedback is a great way to scalably capture nuanced, context-dependent evaluations from the actual end users of the application. These evaluations provide valuable information about how the responses and application as a whole are being received by its consumers. User feedback can help identify issues that may not be apparent in controlled testing environments. It can also help gauge user satisfaction/frustration.
However, using user feedback to measure LLM performance has drawbacks. It can be subjective, potentially biased by individual user expectations or expertise levels, and it may not always align with objective measures of performance. Additionally, collecting and analyzing user feedback can be inconsistent, since some users choose not to provide feedback at all. Generally, very happy or very frustrated users provide feedback more often than neutral users, so user feedback can paint a biased picture of an application's performance.
User feedback is most appropriate when evaluating aspects of the LLM application that directly impact user experience, such as response relevance and task completion success. It’s particularly valuable for long-term performance monitoring and iterative improvements. However, it may not be suitable for assessing model accuracy, evaluating potential biases, or ensuring safety and ethical compliance.
Why is LLM as a Judge an Important Approach?
LLM as a Judge refers to using LLMs to evaluate the responses of another LLM. It may seem counterintuitive, but it is often easier for an LLM to evaluate an LLM response than to generate that response on its own. This is especially true when the task you present to the Judge LLM is only a piece of the task completed by the original LLM.
Therefore, using LLMs as a Judge to evaluate other LLMs can be a great way to scale the evaluation of the LLM.
The most common LLM-as-a-Judge evaluations employed today cover relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. An evaluation prompt template is provided to the judge LLM to guide its assessment of each of the LLM application's responses.
Here's a pre-tested hallucination evaluation template:
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.
# Query: {query}
# Reference text: {reference}
# Answer: {response}
Is the answer above factual or hallucinated based on the query and reference text?
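Below is a minimal sketch of running this template as an LLM-as-a-Judge check. It assumes the OpenAI Python client purely for illustration; any chat-completion API would work the same way, and the model name is an arbitrary choice:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paste the hallucination template shown above here; it uses {query},
# {reference}, and {response} as placeholders.
HALLUCINATION_TEMPLATE = "..."

def judge_hallucination(query: str, reference: str, response: str) -> str:
    prompt = HALLUCINATION_TEMPLATE.format(query=query, reference=reference, response=response)
    completion = client.chat.completions.create(
        model="gpt-4o",   # illustrative model choice
        temperature=0,    # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().lower()  # "factual" or "hallucinated"
```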
A main concern with using LLMs to judge LLMs is knowing how well a given judge model evaluates responses for your specific use case. In some cases, pre-tested LLM Judge templates can be found, like those in our Phoenix library.
In the absence of pre-tested templates, comparing the responses of an LLM Judge against a golden dataset, aka ground truth data, is typically the best way to measure its efficacy. If you don’t have a golden dataset to use, you can use human labeling to create one.
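One straightforward way to quantify a judge's efficacy is to run it over the golden dataset and measure agreement with the ground-truth labels. A minimal sketch with made-up labels:

```python
# Ground-truth labels from the golden dataset and the judge's labels for the same examples.
golden = ["factual", "hallucinated", "factual", "factual", "hallucinated"]
judged = ["factual", "hallucinated", "hallucinated", "factual", "hallucinated"]

agreement = sum(g == j for g, j in zip(golden, judged)) / len(golden)

# Precision/recall for the "hallucinated" class, which is usually the one that matters.
tp = sum(g == j == "hallucinated" for g, j in zip(golden, judged))
precision = tp / sum(j == "hallucinated" for j in judged)
recall = tp / sum(g == "hallucinated" for g in golden)

print(f"agreement={agreement:.2f} precision={precision:.2f} recall={recall:.2f}")
# -> agreement=0.80 precision=0.67 recall=1.00
```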
Ground Truth Comparisons
On the subject of ground truth or golden datasets, another common approach to LLM Evaluation is to test the performance of an LLM-powered application on a set of inputs with known outputs. This approach does require having that golden dataset with expected outputs.
LLMs cannot be relied upon to produce exactly the same output for a given input across repeated runs; however, they should produce semantically similar outputs. As a result, ground truth comparisons often rely on semantic similarity to determine whether an output is the "same" as the expected output.
Similarity can be calculated in a number of ways, for example with Levenshtein distance (also known as edit distance) over the raw strings, or with cosine similarity over embedding vectors. Each of these techniques produces a numeric score for how similar two strings are. If that score crosses a threshold, the strings are considered to represent the same meaning.
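A minimal sketch of both techniques, using a normalized edit-distance score and a cosine similarity helper; the threshold and example strings are hypothetical, and the embedding model is assumed to come from whatever provider you already use:

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0-1 similarity score."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

THRESHOLD = 0.8  # hypothetical cutoff; tune per application

output = "The refund takes 5-7 business days."
expected = "Refunds take 5 to 7 business days."
print(edit_similarity(output, expected))  # compare against THRESHOLD to decide a match

# For embedding-based comparison, embed both strings first (embedding call not shown)
# and check: cosine_similarity(embed(output), embed(expected)) >= THRESHOLD
```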
Business Metrics
A final approach to LLM evaluation is to use business metrics to determine the model’s performance. Evaluating LLMs through business metrics offers a pragmatic approach that directly links model performance to real-world impact. This method focuses on measuring how LLMs affect key performance indicators (KPIs) crucial to an organization’s success.
For example, in a chat-to-purchase application, success can be determined by whether the user purchases the product recommended by the LLM. In a customer support use case, an example of using business metrics to evaluate the LLM would be whether the LLM chatbot fully resolved the user’s issue, without needing to involve a human representative.
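As a simple illustration, a chat-to-purchase conversion rate could be computed from session logs like the hypothetical records below:

```python
# Hypothetical session logs: which product the assistant recommended and what the user purchased.
sessions = [
    {"recommended": "sku-123", "purchased": "sku-123"},
    {"recommended": "sku-456", "purchased": None},
    {"recommended": "sku-789", "purchased": "sku-789"},
]

conversions = sum(s["purchased"] == s["recommended"] for s in sessions if s["recommended"])
conversion_rate = conversions / len(sessions)
print(f"chat-to-purchase conversion rate: {conversion_rate:.0%}")  # -> 67%
```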
Aligning model performance with business objectives provides tangible evidence of the impact the LLM is having on the business.
While valuable for measuring economic impact, the business metrics approach has a significant blind spot when it comes to providing a full understanding of other aspects of the LLM responses such as accuracy, relevance, bias, and toxicity. Additionally, isolating the LLM’s impact from other factors can be complex and challenging.
Using the chat-to-purchase application as an example, knowing someone didn’t purchase the recommended product doesn’t offer any insights into whether this was due to a poor LLM response or for some other reason. Perhaps the LLM didn’t recommend a product that was relevant to the user’s query?
Therefore, it’s best to combine business metrics with other evaluation methods to get a holistic view of an LLM’s performance.
Putting it All Together: Final Thoughts on LLM Evaluations
In the unpredictable world of generative AI, it can be difficult to control and predict the responses of an LLM. Evaluating the performance of an LLM application gives developers visibility into edge cases, lets them apply safeguards against unintended responses, and guides the changes that bring the application's behavior under greater control.