Evaluation

Grading the performance of your LLM application

An evaluation example for a Q&A chatbot in Arize

To ensure application performance, you need a way to judge the quality of your LLM outputs.

Without evals, AI engineers can't tell whether changing a prompt, LLM parameter, agentic loop, or retrieval step will improve performance or break a use case.

With evals, when you adjust your prompts, agents, and retrieval, you learn whether your application performance has improved or not.

Choose your evaluation method

There are three types of evaluators you can build — LLM, code, and human annotations. Each has its uses depending on what you want to measure.

LLM-as-a-judge

  • How it works: One LLM evaluates the outputs of another and provides explanations.

  • Ideal use cases: Great for qualitative evaluation and direct labeling on mostly objective criteria. Poor for quantitative scoring, subject matter expertise, and pairwise preference.

Code-based evaluation

  • How it works: Code assesses the performance, accuracy, or behavior of LLMs.

  • Ideal use cases: Great for reducing cost and latency, and for evaluations that can be hard-coded (e.g., code generation). Poor for qualitative measures such as summarization quality.

Human annotations

  • How it works: Humans provide custom labels to LLM traces.

  • Ideal use cases: Great for evaluating the evaluator, labeling with subject matter expertise, and directional application feedback. The most costly and time-intensive option.
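A code-based evaluator can be as simple as a plain function. Here is a minimal sketch (the function name and return shape are illustrative, not a specific Arize API) that labels an LLM output by whether it is valid JSON:

```python
import json

def json_validity_eval(output: str) -> dict:
    """Hard-coded evaluator: label an LLM output by whether it parses as JSON.

    Cheap and deterministic -- no extra LLM call, no added latency.
    """
    try:
        json.loads(output)
        return {"label": "valid", "score": 1.0, "explanation": ""}
    except json.JSONDecodeError as exc:
        return {"label": "invalid", "score": 0.0, "explanation": str(exc)}
```

Because the check is deterministic, it can run on every trace in production without adding cost.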

Choose your metric

Wherever you find the highest frequency of problematic traces is where you should start building evaluations. We have many pre-built templates for common evaluation cases such as user frustration, hallucination, agent planning, and function calling.

You can set up these evaluations to create a set of metrics that measure your application's performance. In the example below, we judge the quality of a customer support chatbot on relevance, hallucination, and latency.

Components of an LLM evaluation

The components of an evaluation

LLM evaluation requires you to define:

  1. The input data: Depending on what you are trying to measure or critique, the input to your evaluation can consist of your application's input, output, and prompt variables.

  2. The eval prompt template: This is where you specify your criteria, input data, and output labels used to judge the quality of the LLM output.

  3. The output: The LLM evaluator generates eval labels and explanations that show why it assigned a certain label or score.

  4. The aggregate metric: When you run thousands of evaluations across a large dataset, aggregate metrics summarize the quality of your responses over time across different prompts, retrieval strategies, and LLMs.

LLM evaluation is extremely flexible, because you can specify the rules and criteria in mostly plain language, similar to how you would ask human evaluators to grade your responses. You can run thousands of evaluations across curated data without the need for human annotation. This speeds up your prompt iteration and ensures you can deploy your applications to production with confidence.
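Put together, a single LLM evaluation can be sketched as one function: format the template with the input data, call a judge model, and parse the label. In the sketch below, `complete` is a placeholder for whichever LLM client you use, and the template and label parsing are illustrative, not a specific Arize API:

```python
RELEVANCE_TEMPLATE = """You are comparing a document to a question.
[Question]: {query}
[Document]: {document}
Answer with one word, "relevant" or "irrelevant", then explain why."""

def llm_relevance_eval(query: str, document: str, complete) -> dict:
    """Run one LLM-as-a-judge evaluation.

    `complete` is any callable that sends a prompt string to an LLM and
    returns its text response (provider-specific, not shown here).
    """
    raw = complete(RELEVANCE_TEMPLATE.format(query=query, document=document))
    # Take the judge's first word as the label; keep the full text as explanation.
    first = raw.strip().split()[0].strip('".,:').lower() if raw.strip() else ""
    label = first if first in {"relevant", "irrelevant"} else "unparseable"
    return {"label": label, "explanation": raw}
```

Mapping anything outside the expected label set to "unparseable" makes it easy to spot judge responses that drifted from the requested output format.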

Building good evaluation templates

You can adjust an existing template or build your own from scratch. Experiment with different models and LLM parameters. Be explicit about the following:

  • What is the input? In our example, it is the documents/context that was retrieved and the query from the user.

  • What are we asking? In our example, we’re asking the LLM to tell us whether the retrieved document was relevant to the query.

  • What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will become. Here is an example of a custom template which classifies a response to a question as positive or negative.

MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative".
    '''
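To use the template, fill in the prompt variables and normalize the model's one-word reply. The question, response, and parsing helper below are illustrative sample data, not part of a specific SDK:

```python
# The template from above, repeated so this snippet runs standalone.
MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative".
    '''

prompt = MY_CUSTOM_TEMPLATE.format(
    question="How was your support experience?",
    response="The agent fixed my issue in minutes -- thank you!",
)
# Send `prompt` to your LLM of choice, then normalize its one-word reply:
def parse_sentiment(raw: str) -> str:
    """Map the judge's answer to a clean label; anything else is unparseable."""
    word = raw.strip().strip('."').lower()
    return word if word in {"positive", "negative"} else "unparseable"
```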

Then benchmark your evaluation template against your own data. The golden dataset should have "ground truth" labels so that you can measure the performance of the LLM eval template. Such labels often come from human feedback.

Building such a dataset is laborious, but you can often find standardized datasets for common use cases. Then, run the eval across your golden dataset and generate metrics (overall accuracy, precision, recall, F1, etc.) to determine your benchmark.
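Given the eval labels and the golden labels, the benchmark metrics are straightforward to compute. A minimal sketch, treating "relevant" as the positive class (the function name and label values are illustrative):

```python
def benchmark(golden: list[str], predicted: list[str],
              positive: str = "relevant") -> dict:
    """Compare eval labels against golden ground-truth labels."""
    pairs = list(zip(golden, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

These aggregate scores become the benchmark against which you compare each revision of your eval template.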

Learn more

Types of LLM evals
