LLM Eval Types

Chapter Summary

This chapter explores various evaluation methods, including LLM-as-a-Judge and code-based evaluations, detailing how these approaches address challenges like limited user feedback and the high cost of human annotation. The focus is on automating evaluation while maintaining reliability.

Learn about the foundational methods for evaluating LLMs, including “LLM as a Judge” and code-based evaluations, and when to use each approach. For direct guidance on setting up evaluations, go to product documentation.

LLM as a Judge Eval

Often called LLM as a judge, LLM-assisted evaluation uses AI to evaluate AI — with one LLM evaluating the outputs of another and providing explanations.

LLM-assisted evaluation is often needed because user feedback or any other “source of truth” is extremely limited and often nonexistent; even when human labeling is possible, it is expensive, and LLM applications can quickly grow complex, with many calls that each need evaluation.

LLM as Judge diagram

Fortunately, we can use the power of LLMs to automate the evaluation. In this eBook, we will delve into how to set this up and make sure it is reliable.

While using AI to evaluate AI may sound circular, we have always had human intelligence evaluate human intelligence (for example, at a job interview or your college finals). Now AI systems can finally do the same for other AI systems.

The process here is for LLMs to generate synthetic ground truth that can be used to evaluate another system. This raises a question: why not use human feedback directly? Put simply, because you often do not have enough of it.

Getting human feedback on even one percent of your input/output pairs is a gigantic feat. Most teams don’t even get that. In such cases, LLM-assisted evals help you benchmark and test in development prior to production. But for this process to be truly useful, it is important to have evals on every LLM sub-call, of which, as we have already seen, there can be many. Below is an example LLM-as-a-judge prompt template for evaluating question-answering correctness:

You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Reference]: {context}
    ************
    [Answer]: {sampled_answer}
    [END DATA]
Your response must be a single word, either “correct” or “incorrect”, and should not contain any text or characters aside from that word. “correct” means that the question is correctly and fully answered by the answer. “incorrect” means that the question is not correctly or only partially answered by the answer.
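As a minimal sketch of how a template like this gets run, the snippet below fills in the placeholders and asks a judge model for a single-word label. The use of the OpenAI Python client and the gpt-4o-mini model name are assumptions for illustration; any LLM client works, and the template is abbreviated to its key parts.

# Minimal sketch of running the Q&A correctness judge template above.
# Assumes the OpenAI Python client and a hypothetical judge model choice.
from openai import OpenAI

client = OpenAI()

QA_CORRECTNESS_TEMPLATE = """\
You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
[Question]: {question}
[Reference]: {context}
[Answer]: {sampled_answer}
Your response must be a single word, either "correct" or "incorrect".
"""  # abbreviated version of the template shown above

def judge_qa_correctness(question: str, context: str, sampled_answer: str) -> str:
    prompt = QA_CORRECTNESS_TEMPLATE.format(
        question=question, context=context, sampled_answer=sampled_answer
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumed judge model
        temperature=0.0,       # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    # Anything other than the two allowed words is treated as unparseable.
    return label if label in ("correct", "incorrect") else "unparseable"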

Code-Based Eval

Code-based LLM evaluations are methods that use programming code to assess the performance, accuracy, or behavior of large language models (LLMs). These evaluations typically involve creating automated scripts or CI/CD test cases to measure how well an LLM performs on specific tasks or datasets. A code-based eval is essentially a Python or JS/TS unit test.

Code-based evaluation is sometimes preferred as a way to reduce cost, since it does not introduce additional token usage or latency. When evaluating a task such as code generation, a code-based eval is often the preferred method, since the check can be hard coded and follows a set of rules. However, for subjective evaluations like hallucination detection, there is no code evaluator that can provide that label reliably, so LLM as a Judge needs to be used instead.

Common use cases for code-based evaluators include:

LLM Application Testing

Code-based evaluations can test the LLM’s performance at various levels—focusing on ensuring that the output adheres to the expected format, includes necessary data, and passes structured, automated tests.

  • Test Correct Structure of Output: In many applications, the structure of the LLM’s output is as important as the content. For instance, generating JSON-like responses, specific templates, or structured answers can be critical for integrating with other systems (see the sketch after this list).
  • Test Specific Data in Output: Verifying that the LLM output contains or matches specific data points is crucial in domains such as legal, medical, or financial fields where factual accuracy matters.

  • Structured Tests: Automated structured tests can be employed to validate whether the LLM behaves as expected across various scenarios. This might involve comparing the outputs to expected responses or validating edge cases.
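To make the structure and data checks above concrete, here is a minimal sketch of a code-based eval written as a plain Python unit test. The required fields and the canned output are assumptions for illustration; in a real CI/CD check the output would come from your LLM application.

import json

def eval_json_structure(llm_output: str) -> dict:
    # Categorical, code-based check: is the output valid JSON with the required fields?
    required_fields = {"customer_id", "total", "currency"}  # assumed schema
    try:
        parsed = json.loads(llm_output)        # 1. output must be valid JSON
    except json.JSONDecodeError:
        return {"label": "incorrect", "score": 0}
    missing = required_fields - parsed.keys()  # 2. output must include the required data
    return {"label": "correct" if not missing else "incorrect", "score": int(not missing)}

def test_llm_output_structure():
    llm_output = '{"customer_id": "C-123", "total": 42.5, "currency": "USD"}'  # canned output
    assert eval_json_structure(llm_output)["label"] == "correct"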

Evaluating Your Evaluator

Evaluating the effectiveness of your evaluation strategy ensures that you’re accurately measuring the model’s performance and not introducing bias or missing crucial failure cases. Code-based evaluation for evaluators typically involves setting up meta-evaluations, where you evaluate the performance or validity of the evaluators themselves.

To evaluate your evaluator, teams need to create a set of hand-annotated test datasets. These datasets do not need to be large; 100+ examples are typically enough to evaluate your evals. In Arize Phoenix, we include test datasets with each evaluator to help validate performance for each model type.
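As a rough sketch of what such a meta-evaluation can look like, you can run your evaluator over the hand-annotated examples and measure how often it agrees with the human labels. The evaluator logic and the two example rows below are hypothetical placeholders.

def my_hallucination_eval(output: str, context: str) -> str:
    # Placeholder for your real evaluator (LLM-as-a-judge or code-based).
    return "factual" if output in context else "hallucinated"

hand_annotated = [  # in practice, 100+ hand-labeled examples
    {"output": "Paris", "context": "The capital of France is Paris.", "label": "factual"},
    {"output": "Lyon", "context": "The capital of France is Paris.", "label": "hallucinated"},
]

matches = sum(
    my_hallucination_eval(row["output"], row["context"]) == row["label"]
    for row in hand_annotated
)
print(f"Evaluator agreement with human labels: {matches / len(hand_annotated):.0%}")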

We recommend this guide for a more in-depth review of how to improve and check your evaluators.


Eval Output Type

Depending on the situation, the evaluation can return different types of results:

  • Categorical (Binary): The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.
  • Categorical (Multi-class): The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.

  • Continuous Score: The evaluation results in a numeric value within a set range (e.g. 1-10), offering a scale of measurement. We don’t recommend using this approach.

  • Categorical Score: A value of either 1 or 0. Categorical scores can be quite useful because you can average them without the disadvantages of a continuous range.

Although score evals are an option, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.

Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.

# Evaluator and EvaluationResult come from your evaluation framework
# (for example, the Arize / Phoenix experiments SDK).

class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        # Returns a full result: numeric score, categorical label, and explanation.
        print("Evaluator Using All Inputs")
        return EvaluationResult(
            score=1.0, label="good", explanation="Output matched the expected answer."
        )

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> float:
        # Returns only a numeric score.
        print("Evaluator Using A Float")
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> str:
        # Returns only a categorical label.
        print("Evaluator Using A Label")
        return "good"

Online vs Offline Evaluation

Evaluating LLM applications across their lifecycle requires a two-pronged approach: offline and online. Offline LLM evaluation generally happens during pre-production and involves using curated or outside datasets to test the performance of your application. Online LLM evaluation happens once your app is in production and runs on production data. The same evaluator can be used to run online and offline evaluations.

Offline LLM Evaluation

Offline LLM evaluation generally occurs during the development and testing phases of the application lifecycle. It involves evaluating the model or system in controlled environments, isolated from live, real-time data. The primary focus of offline evaluation is pre-deployment validation in CI/CD, enabling AI engineers to test the model against a predefined set of inputs (like golden datasets) and gather insights on performance consistency before the model is exposed to real-world scenarios. This process is crucial for:

  • Prompt and Output Validation: Offline tests allow teams to evaluate prompt engineering changes and different model versions before committing them to production. AI engineers can experiment with prompt modifications and evaluate which variants produce the best outcomes across a range of edge cases.

  • Golden Datasets: Evaluating LLMs using golden datasets (high-quality, annotated data) ensures that the LLM application performs optimally in known scenarios. These datasets represent a controlled benchmark, providing a clear picture of how well the LLM application processes specific inputs, and enabling engineers to debug issues before deployment.

  • Pre-production Check: Offline evaluation is well-suited for running CI/CD tests on datasets that reflect complex user scenarios. Engineers can check the results of offline tests and changes prior to pushing those changes to production.
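A pre-production check of this kind can be as simple as a test that runs your application and an evaluator over a golden dataset and fails CI if accuracy drops below a threshold. In the sketch below, run_app, correctness_eval, the golden rows, and the 90% threshold are all hypothetical stand-ins for your own application, evaluator, and data.

golden_dataset = [  # small, hand-curated golden examples
    {"input": "What is our refund window?", "expected": "30 days"},
    {"input": "Do you ship internationally?", "expected": "yes"},
]

def run_app(user_input: str) -> str:
    # Canned response for illustration; call your LLM application here.
    return "We offer a 30 days refund window and yes, we ship internationally."

def correctness_eval(output: str, expected: str) -> int:
    # Simple code-based check; an LLM-as-a-judge eval could be swapped in here.
    return int(expected.lower() in output.lower())

def test_golden_dataset_accuracy():
    scores = [correctness_eval(run_app(row["input"]), row["expected"]) for row in golden_dataset]
    accuracy = sum(scores) / len(scores)
    assert accuracy >= 0.9, f"Golden-dataset accuracy regressed: {accuracy:.0%}"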


Note: The “offline” part of “offline evaluations” refers to the data that is being used to evaluate the application. In offline evaluations, the data is pre-production data that has been curated and/or generated, instead of production data captured from runs of your application. Because of this, the same evaluator can be used for offline and online evaluations. Having one unified system for both offline and online evaluation allows you to easily use consistent evaluators for both techniques.

Online LLM Evaluation

Online LLM evaluation, by contrast, takes place in real-time, during production. Once the application is deployed, it starts interacting with live data and real users, where performance needs to be continuously monitored. Online evaluation provides real-world feedback that is essential for understanding how the application behaves under dynamic, unpredictable conditions. It focuses on:

  • Continuous Monitoring: Applications deployed in production environments need constant monitoring to detect issues such as degradation in performance, increased latency, or undesirable outputs (e.g., hallucinations, toxicity). Automated online evaluation systems can track application outputs in real time, alerting engineers when specific thresholds or metrics fall outside acceptable ranges (see the sketch after this list).

  • Real-Time Guardrails: LLMs deployed in sensitive environments may require real-time guardrails to monitor for and mitigate risky behaviors like generating inappropriate, hallucinated, or biased content. Online evaluation systems can incorporate these guardrails so the LLM application is protected proactively rather than reactively.
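A minimal sketch of the threshold-based alerting described above might look like the following. The hallucination label, the 5% threshold, and the send_alert stand-in are assumptions; in practice the labels would come from an online evaluator running over production traces, and the alert would go to your paging or observability system.

HALLUCINATION_RATE_THRESHOLD = 0.05  # alert if more than 5% of recent outputs are flagged

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for paging, Slack, or email

def check_hallucination_rate(recent_eval_labels: list[str]) -> None:
    # recent_eval_labels: labels produced by an online evaluator over recent responses.
    flagged = sum(label == "hallucinated" for label in recent_eval_labels)
    rate = flagged / max(len(recent_eval_labels), 1)
    if rate > HALLUCINATION_RATE_THRESHOLD:
        send_alert(f"Hallucination rate at {rate:.1%} over the last {len(recent_eval_labels)} responses")

check_hallucination_rate(["factual"] * 90 + ["hallucinated"] * 10)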

Read More About Online vs Offline Evaluations

Check out our quickstart guide to evaluations including online, offline, and code evaluations with templates.

Choosing Between Online and Offline Evaluation

While it may seem advantageous to apply online evaluations universally, they introduce additional costs in production environments. The decision to use online evaluations should be driven by the specific needs of the application and the real-time requirements of the business. AI engineers can typically group their evaluation needs into three categories: offline evaluation, guardrails, and online evaluation.

  • Offline evaluation: Offline evaluations are used for checking LLM application results prior to releasing to production. Use offline evaluations for CI/CD checks of your LLM application.

    Example: A customer service chatbot where you want to make certain that changes to a prompt do not break previously correct responses.

  • Guardrail: AI engineers want to know immediately if something isn’t right and block or revise the output. These evaluations run in real time and block or flag outputs when they detect that the system is veering off-course (the sketch after this list contrasts blocking with flagging).

    Example: An LLM application generates automated responses for a healthcare system. Guardrails check for critical errors in medical advice, preventing harmful or misleading outputs from reaching users in real time.

  • Online evaluation: AI engineers don’t want to block or revise the output, but want to know immediately if something isn’t right. This approach is useful when it’s important to track performance continuously but where it’s not critical to stop the model’s output in real time.

    Example: An LLM application generates personalized marketing emails. While it’s important to monitor and ensure the tone and accuracy are correct, minor deviations in phrasing don’t require blocking the message. Online evaluations flag issues for review without stopping the email from being sent.
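The practical difference between a guardrail and an online evaluation comes down to what happens when a check fails: block (or revise) the output versus flag it and let it through. The sketch below is hypothetical; both check functions are simple placeholders for real evaluators.

def contains_medical_error(output: str) -> bool:
    return "double the dose" in output.lower()  # placeholder safety check

def off_tone(output: str) -> bool:
    return output.isupper()  # placeholder check for off-brand marketing copy

def guardrail(output: str) -> str:
    # Guardrail: runs in the request path and blocks risky outputs in real time.
    if contains_medical_error(output):
        return "I'm sorry, I can't help with that. Please consult a clinician."
    return output

def online_eval(output: str) -> str:
    # Online eval: never blocks; it only records a label for monitoring and review.
    label = "off_tone" if off_tone(output) else "ok"
    print(f"online eval label: {label}")  # stand-in for logging to your eval system
    return output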

Model Benchmarks vs Task Evals

LLM model evaluations look at the overall, macro-level performance of LLMs across an array of tasks, while LLM system evaluations (also referred to as LLM task evaluations) are more system- and use-case-specific, evaluating the components an AI engineer building an LLM app can control (i.e., the prompt template or context).

Since the term “LLM evals” gets thrown around interchangeably, this distinction is sometimes lost in practice. It’s critical to know the difference, however.

Why? Often, teams consult LLM leaderboards and libraries when such benchmarks may not be helpful for their particular use case. Ultimately, AI engineers building LLM apps that plug into several models or frameworks or tools need a way to objectively evaluate everything at highly specific tasks – necessitating system evals that reflect that fact.


LLM model evals are focused on the overall performance of the foundational models. The companies launching the original customer-facing LLMs needed a way to quantify their effectiveness across an array of different tasks.

LLM Model Evals diagram

In contrast, LLM system evaluation, also sometimes referred to as LLM task evaluation, is the complete evaluation of components that you have control of in your system. The most important of these components are the prompt (or prompt template) and context. LLM system evals assess how well your inputs can determine your outputs.

LLM system evaluation may, for example, hold the LLM constant and change the prompt template. Since prompts are more dynamic parts of your system, this evaluation makes a lot of sense throughout the lifetime of the project. For example, an LLM can evaluate your chatbot responses for usefulness or politeness, and the same eval can give you information about performance changes over time in production.

LLM System Evals diagram

There are a lot of common LLM evaluation metrics being employed today, such as relevance, hallucinations, question-answering accuracy, toxicity, and retrieval-specific metrics. However, most teams handcraft metrics based on their business use cases. Each one of these LLM system evals will have different templates based on what is being evaluated.

What are you Evaluating?

When evaluating LLM applications, the primary focus is on three key areas: the task, historical performance, and golden datasets. Task-level evaluation ensures that the application is performing well on specific use cases, while historical traces provide insight into how the application has evolved over time. Meanwhile, golden datasets act as benchmarks, offering a consistent way to measure performance against well-established ground truth data.