Pre-Production

Chapter Summary

This chapter discusses building high-quality, curated datasets as the foundation for reliable evaluations. It explores techniques such as human annotation, synthetic dataset generation, and benchmarking to create robust evaluation strategies for pre-deployment testing.

Learn how to build and curate golden datasets and benchmark LLM evals with Arize through our product documentation.

Human Annotation and Curated Datasets

In the process of evaluating large language models (LLMs) or other AI systems, building a high-quality, curated golden dataset is essential. A curated golden dataset refers to a collection of examples that are carefully crafted and validated to serve as the “ground truth” or benchmark for evaluating the model’s performance. This dataset forms the backbone of many evaluation strategies, ensuring that the results are reliable and consistent.

  1. Creating a Hand-Crafted Dataset

    The simplest form of a curated golden dataset starts with hand-crafted examples. In this approach, subject matter experts or dataset designers create examples manually to capture different aspects of the task or domain being evaluated. These examples represent various inputs that the model is expected to handle, along with their correct or expected outputs.

    For example, if evaluating a model for summarization, you could manually create a series of text passages with corresponding summaries that are ideal representations of what the model should generate. The strength of this approach is that it allows for the creation of nuanced and challenging examples tailored to specific use cases or edge cases.

  2. Annotating the Dataset with Ground Truth

    Once the hand-crafted dataset is created, human annotators play a key role in adding ground truth. Ground truth refers to the correct answers or labels that serve as the standard for evaluation. In some cases, annotators may need to modify or refine the original labels if additional examples are included later.

    For instance, annotators could be asked to label the output of a language model as “correct” or “incorrect” based on whether it matches the expected behavior. This ground truth will then serve as the reference point when evaluating how well the model performs in comparison.

  3. Multi-Annotator Validation: Ensuring Consensus and Accuracy

    To ensure the quality and reliability of the ground truth data, it’s crucial to validate the annotations across multiple annotators. In this process, several annotators independently label or validate each example. A common approach is to use the consensus of at least two out of three annotators to confirm the correct label. If two out of three annotators agree on the same label, that label becomes the confirmed ground truth.

    This multi-annotator strategy helps to reduce bias or errors that could arise from individual perspectives and ensures that the dataset is robust and reliable. Additionally, it is common to perform checks to ensure that annotators have a high level of agreement (called inter-annotator agreement), which further strengthens the trustworthiness of the dataset.
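To make the two-out-of-three consensus rule concrete, here is a minimal sketch in Python that resolves each example's ground truth by majority vote and reports a simple average agreement rate. The data and field names are illustrative, not tied to any particular annotation tool.

from collections import Counter

# Illustrative annotations: three annotators labeling the same examples.
annotations = {
    "example_1": ["correct", "correct", "incorrect"],
    "example_2": ["incorrect", "incorrect", "incorrect"],
    "example_3": ["correct", "incorrect", "correct"],
}

def majority_label(labels, min_votes=2):
    """Return the label agreed on by at least min_votes annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None

# Confirmed ground truth: the label two out of three annotators agree on.
ground_truth = {ex: majority_label(labels) for ex, labels in annotations.items()}

# A simple proxy for inter-annotator agreement: the average fraction of
# annotators who match the majority label on each example.
agreement = sum(
    Counter(labels).most_common(1)[0][1] / len(labels)
    for labels in annotations.values()
) / len(annotations)

print(ground_truth)
print(f"Average agreement: {agreement:.2f}")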

Annotations and User Feedback

The rise of Reinforcement Learning from Human Feedback (RLHF) has highlighted the importance of human feedback in training and refining LLM applications. Whether you’re manually labeling subtle response variations, curating datasets for experimentation, or logging real-time user feedback, having a robust system for capturing and cataloging annotations is critical to improving the performance and accuracy of your LLM application.

Annotations are custom labels that can be added to LLM traces or spans, allowing AI engineers to track performance and gather insights at a granular level. Annotations help to:

  • Annotate Production Span Data: Teams want to annotate data directly on top of production responses, allowing those annotations to be used for filtering, analytics, or production data analysis.

  • Categorize Spans or Traces: Assign categories to specific parts of a conversation or output, enabling more detailed analysis of where a model succeeds or fails.

  • Annotate Datasets for Experimentation: Use human-labeled data to create high-quality datasets for testing and refining LLM applications, for handcrafted CI/CD tests, few-shot prompting, or targeted evaluations.

  • Annotation Queues: Annotation queues have become increasingly common in LLM observability tools. Queues in Arize do not move data; they assign labeling tasks to annotators on top of existing data, whether production data or dataset data. When labels are needed on very specific types of data, that data is added to a queue, and annotators work through it. The added labels then appear on the original data (a minimal sketch of this pattern follows this list).

  • Log Real-Time Feedback: Collect feedback from live applications through APIs, allowing for dynamic, continuous improvements based on actual user interactions. These signals are not always viewed as annotations, but they are worth mentioning in this section.
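To make these ideas concrete, the sketch below shows one way an annotation record and a simple annotation queue could be represented in code. The class and field names are illustrative assumptions, not the Arize API.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotation:
    span_id: str            # the production span or trace being labeled
    label: str              # e.g. "correct" / "incorrect", or a category name
    annotator: str
    note: Optional[str] = None

@dataclass
class AnnotationQueue:
    """Assigns labeling tasks over existing data without moving that data."""
    name: str
    span_ids: List[str] = field(default_factory=list)         # spans needing labels
    completed: List[Annotation] = field(default_factory=list)

    def next_task(self) -> Optional[str]:
        return self.span_ids[0] if self.span_ids else None

    def submit(self, annotation: Annotation) -> None:
        # The label is recorded against the original span rather than a copy.
        self.span_ids.remove(annotation.span_id)
        self.completed.append(annotation)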

“Having a robust system for capturing and cataloging annotations is critical to improving the performance and accuracy of your LLM application.”

Annotations are particularly valuable for:

  • Evaluating Agreement/Disagreement: Identifying where human evaluators and LLM evaluations align or diverge can reveal areas for improving automated evaluations.

  • Subject Matter Expertise: In complex domains (e.g., medical, legal, or customer service applications), expert feedback is crucial to determining the quality of the application. This input complements automated metrics and provides deeper insight into how well the application performs in specialized contexts.

  • Gathering Direct Application Feedback: Integrate feedback mechanisms directly into live applications, capturing user responses and experiences to continuously improve LLM outputs.

A well-implemented annotation and feedback system is essential for refining LLM applications, ensuring that human expertise and real-world use cases are properly incorporated into the evaluation process.

Human Annotation in LLM Evaluation

Human annotation is a critical component in evaluating LLMs and other AI systems. By combining hand-crafted datasets, ground truth labels, and multi-annotator validation, organizations can create a golden dataset that is rich in accuracy and diversity. This dataset allows for a more meaningful and comprehensive evaluation of model performance, providing the context needed to interpret automated metrics and improve model outputs.

Creating and Validating Synthetic Datasets as Golden Datasets

Synthetic datasets are artificially created datasets that are designed to mimic real-world information. Unlike naturally occurring data, which is gathered from actual events or interactions, synthetic datasets are generated using algorithms, rules, or other artificial means. These datasets are carefully created to represent specific patterns, distributions, or scenarios that developers and researchers want to study or use for testing.

“By using synthetic data, developers can create controlled environments for experimentation, ensure coverage of edge cases, and protect privacy by avoiding the use of real user data.”

In the context of large language models, synthetic datasets might include:

  • Generated text conversations simulating customer support interactions.

  • Artificial question-answer pairs covering a wide range of topics.

  • Fabricated product reviews with varying sentiments and styles.

  • Simulated code snippets with intentional bugs or specific patterns.

By using synthetic data, developers can create controlled environments for experimentation, ensure coverage of edge cases, and protect privacy by avoiding the use of real user data.
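As a rough sketch of how such data might be generated, the snippet below asks an LLM to produce synthetic question-answer pairs for a customer support use case. The prompt wording, the model choice, and the use of the OpenAI Python client are illustrative assumptions, not a prescribed setup.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate 5 synthetic question-answer pairs about billing issues for a "
    "customer support assistant. Return only a JSON list of objects with "
    "'question' and 'answer' fields."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example choice; any capable model works
    messages=[{"role": "user", "content": prompt}],
)

# In practice the response may need light cleanup (e.g., stripping code fences)
# before it parses cleanly as JSON.
synthetic_pairs = json.loads(response.choices[0].message.content)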

The applications of synthetic datasets are varied and valuable:

  • They allow us to test and validate model performance, especially for assessing how well models perform specific tasks.

  • Synthetic data helps generate initial traces of application behavior, facilitating debugging in tools like Arize.

  • Perhaps most importantly, synthetic datasets serve as “golden data” for consistent experimental results. This is particularly useful when developing and experimenting with applications that haven’t yet launched.

Combining Synthetic Datasets with Human Evaluation

While synthetic datasets offer many advantages, they may sometimes miss key use cases or types of inputs that humans would naturally consider. Human-in-the-loop processes are valuable for dataset improvement.

Recent research has shown that including even a small number of human-annotated examples can significantly improve the overall quality and effectiveness of a synthetic dataset. This hybrid approach combines the scalability of synthetic data with the understanding that human evaluators provide.

Human annotators can add targeted examples to synthetic datasets to address gaps or underrepresented scenarios. This process of augmenting synthetic data with human-curated examples can be easily implemented using tools like Arize.

The addition of these human-annotated examples can be particularly effective in improving the dataset’s performance in few-shot learning scenarios, where models need to generalize from a small number of examples.
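A minimal sketch of this augmentation step is shown below, assuming each example is a simple dictionary with a "source" field added to track provenance; the human-curated slice is then reused as few-shot exemplars.

# Hypothetical examples; in practice these come from generation and annotation.
synthetic_examples = [
    {"input": "Summarize: The quarterly report shows...", "output": "Revenue grew...", "source": "synthetic"},
]
human_examples = [
    {"input": "Summarize: The incident postmortem explains...", "output": "An expired cert...", "source": "human"},
]

# Merge, keeping provenance so the human-curated slice can be tracked separately.
golden_dataset = synthetic_examples + human_examples

# Human-curated examples often make strong few-shot exemplars.
few_shot_block = "\n\n".join(
    f"Input: {ex['input']}\nOutput: {ex['output']}"
    for ex in golden_dataset
    if ex["source"] == "human"
)
prompt = f"{few_shot_block}\n\nInput: {{new_input}}\nOutput:"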

Best Practices for Synthetic Dataset Use

Synthetic datasets are not static, one-time creations; they are dynamic tools that require ongoing attention. To maintain their usefulness, developers must do several things:

First, implement a regular refresh cycle. Revisit and update your datasets periodically to keep pace with model improvements and account for data drift in real-world applications.

Second, transparency is key in synthetic data generation. Maintain detailed records of the entire process, including the prompts used, models employed, and any post-processing steps applied.
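One lightweight way to keep such records is a small generation manifest stored alongside each dataset version. The fields below are an illustrative assumption rather than a required schema.

import json
from datetime import date

generation_manifest = {
    "dataset_name": "support_qa_synthetic_v2",   # hypothetical dataset name
    "created": str(date.today()),
    "generation_model": "gpt-4o-mini",
    "generation_prompt": "Generate question-answer pairs about billing issues...",
    "post_processing": ["deduplicated near-identical questions", "removed PII-like strings"],
    "human_examples_added": 25,
}

with open("support_qa_synthetic_v2.manifest.json", "w") as f:
    json.dump(generation_manifest, f, indent=2)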

Third, regular evaluation is important. Assess the performance of your synthetic datasets against real-world data and newer models on an ongoing basis.

Finally, when augmenting synthetic datasets with human-curated examples, take a balanced approach. Add enough human input to enhance the dataset’s quality and coverage, but be careful not to overwhelm the synthetic component.

By adhering to these practices, you can maximize the long-term value and reliability of your synthetic datasets, making them powerful tools for ongoing model evaluation and experimentation.

Learn more here.

Benchmarking LLM Evaluation

Benchmarking LLM evaluation is critical to ensure your evaluation strategy addresses the following core areas:

  • Evaluation Accuracy: Does the evaluation process correctly capture the quality of outputs generated by the application? This includes testing whether your evaluation metrics (e.g., relevance, accuracy) accurately reflect real-world outcomes like user satisfaction or task completion.

  • Consistency Across Scenarios: Benchmark the evaluation process for consistency across a variety of application scenarios, including edge cases, stress tests, and diverse input types. The goal is to ensure that evaluations remain reliable and do not favor certain cases over others.

  • Evaluation Latency: How fast can evaluations be conducted in real-time environments? Latency in generating evaluation results is critical for applications where fast feedback loops are essential.

  • Human vs. Automated Evaluations: Compare automated evaluation systems (such as those scoring relevance or accuracy) against human evaluators (annotators) to ensure alignment. This benchmarking helps ensure that automated processes reliably approximate human judgment.
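For the last point, agreement between automated and human labels can be quantified directly. The sketch below uses scikit-learn's Cohen's kappa on hypothetical labels for ten examples; kappa corrects raw agreement for chance, with 1.0 meaning perfect agreement.

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for the same ten examples.
human_labels = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant",
                "relevant", "irrelevant", "irrelevant", "relevant", "relevant"]
llm_judge_labels = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant",
                    "relevant", "irrelevant", "relevant", "relevant", "relevant"]

raw_agreement = sum(h == a for h, a in zip(human_labels, llm_judge_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, llm_judge_labels)
print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")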

Benchmarking LLM as a Judge

Begin with the metric best suited to your use case. Then you need a golden dataset. This should be representative of the type of data you expect the LLM eval to see, and it should carry the “ground truth” label so that you can measure the performance of the LLM eval template.
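For a document-relevance eval, the golden dataset can be as simple as a table of queries, retrieved documents, and human-confirmed labels. Here is a minimal illustrative sketch using pandas; the column names are assumptions, not a required schema.

import pandas as pd

golden_dataset = pd.DataFrame([
    {"query": "How do I reset my password?",
     "document": "To reset your password, click 'Forgot password' on the login page...",
     "ground_truth": "relevant"},
    {"query": "How do I reset my password?",
     "document": "Our refund policy allows returns within 30 days of purchase...",
     "ground_truth": "irrelevant"},
])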

Benchmarking LLM as a Judge diagram

Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

Benchmarking LLM as a Judge diagram

Now comes the core component that we are trying to benchmark and improve: the eval template. If you’re using an existing library like OpenAI or Arize Phoenix, you should start with an existing template and see how that prompt performs.

  • What is the input? In our example, it is the documents/context that was retrieved and the query from the user.

  • What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query.

  • What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will become. Here is an example of a custom template which classifies a response to a question as positive or negative.

MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative".
    '''

You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1-score, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
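A sketch of that scoring step is shown below, using scikit-learn on hypothetical results; classification_report prints precision, recall, F1, and support per class, along with overall accuracy.

from sklearn.metrics import classification_report

# Hypothetical results: human ground truth vs. the labels produced by the LLM eval.
ground_truth = ["relevant", "irrelevant", "relevant", "irrelevant", "relevant"]
eval_labels  = ["relevant", "irrelevant", "irrelevant", "irrelevant", "relevant"]

print(classification_report(ground_truth, eval_labels, labels=["relevant", "irrelevant"]))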

Benchmarking LLM as a Judge diagram

If you are not satisfied with the performance of your LLM evaluation template, you need to change the prompt to make it perform better. This is an iterative process informed by hard metrics. As is always the case, it is important to avoid overfitting the template to the golden dataset.

Benchmarking LLM as a Judge diagram

Finally, you arrive at your benchmark. The optimized performance on the golden dataset represents how confident you can be in your LLM eval. It will not be as accurate as your ground truth, but it will be accurate enough, and it will cost much less than having a human labeler in the loop on every example.

              Precision    Recall
Relevant        0.70        0.70
Irrelevant      0.89        0.89

Evals with Explanations

In many cases it can be hard to understand why an LLM responds in a specific way. Explanations show why the LLM judge decided on a specific score for your evaluation criteria, and asking for them may even improve the accuracy of the evaluation.
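One way to elicit explanations is to extend the custom template above so the judge must justify its label before giving it. The exact wording below is just one illustrative option.

MY_CUSTOM_TEMPLATE_WITH_EXPLANATION = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Please focus on the tone of the response.
    First write EXPLANATION: followed by one or two sentences explaining your reasoning.
    Then write LABEL: followed by a single word, either "positive" or "negative".
    '''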

Arize UI screenshot

Eval Hierarchy

Evaluations can occur at different levels of granularity. At the most basic level, span-level evaluations assess the performance of specific components within an application’s response. A trace-level evaluation looks at trends across multiple full runs of your application, while session-level evaluation expands the scope to multiple interactions. Understanding this hierarchy allows for a more structured and comprehensive evaluation strategy.

Building a Robust Evaluation Approach

In order to ensure that LLMs are performing optimally and reliably, a well-thought-out evaluation framework is essential. A robust evaluation approach needs to account for several factors that influence both the practical usability of the framework and its ability to scale with evolving models. There are many existing evaluation frameworks, including our evals library in Arize Phoenix, and there is always the option to build your own system. Below are the key aspects to consider when choosing or building a solid and comprehensive evaluation system.

 “A robust evaluation approach needs to account for several factors that influence both the practical usability of the framework and its ability to scale with evolving models.”

Ergonomics: How Easy is it to Use?

A user-friendly evaluation framework is crucial for broad adoption and frequent usage. If the framework is too complex or cumbersome, it discourages experimentation and regular assessments. Good ergonomics ensure that setting up and running evaluations is intuitive, even for non-technical users or stakeholders. A good evaluation system should allow for quick setup, intuitive workflows, and ease of collaboration between different stakeholders.

Parallelization: Can the Framework Support Parallel Evaluation Calls?

The ability to run evaluations in parallel is a game-changer when dealing with large-scale applications or high volumes of requests. Parallelization speeds up the evaluation process, allowing the framework to assess multiple models or multiple datasets simultaneously, leading to more efficient workflows.
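As a sketch of what parallel evaluation calls can look like in plain Python, the snippet below fans eval calls out over a thread pool. The run_eval function is a hypothetical stand-in for whatever judge call your framework provides; LLM eval calls are I/O-bound, so threads give a simple, effective speedup.

from concurrent.futures import ThreadPoolExecutor

def run_eval(example: dict) -> str:
    """Hypothetical stand-in: call your LLM judge on one example, return its label."""
    ...

examples = [{"query": "...", "document": "..."} for _ in range(100)]  # illustrative

with ThreadPoolExecutor(max_workers=8) as pool:
    labels = list(pool.map(run_eval, examples))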

Online/Offline Evaluation Support

An effective evaluation framework should be adaptable to both online (real-time) and offline (batch-processed) environments. While offline evaluations are crucial for testing in a controlled environment, online evaluations offer insights into real-time performance under dynamic, real-world conditions.

Flexibility: Custom Templates for LLM as a Judge

The evaluation framework should be flexible enough to accommodate a wide range of use cases, tasks, and metrics. Customizable evaluation templates are key to this flexibility, enabling teams to tailor the evaluation process to specific requirements, including different task types or performance goals.

UI vs. Non-UI: Does the Framework Provide a UI or Is it Pure Code?

Having both a UI and a non-UI (code-based) option in an evaluation framework caters to a wider audience. A UI allows for easy setup and management of evaluations, especially for non-technical users, while a code-based option offers more customization and control for developers.

Scale: Handling Increasing Complexity and Calls

As the complexity of LLM applications grows, so does the need for a scalable evaluation framework. Scalability encompasses the ability to handle increasing calls, higher throughput, and more intricate evaluation criteria without sacrificing performance.

Including Explanations: Providing Interpretability

Including explanations in the evaluation results adds a layer of interpretability to the evaluation process. Rather than simply providing a pass/fail result or a numerical score, it is essential to explain why the application performed well or poorly on specific tasks.

Choosing an Evaluation Model

Selecting the right evaluation model is crucial to ensuring that your LLM evaluation approach is effective. The choice depends on how often you are iterating on evaluations (often quite frequently), your cost versus accuracy trade-offs, and your flexibility requirements.

  • Base Models: Many teams starting on evaluations leverage the same models they are using for the LLM application. If a team is using GPT-4o, Claude Sonnet, or Gemini, they will use the same model for evaluations. This provides a lot of flexibility and a baseline before moving to a different approach that trades off accuracy versus cost.

  • Base SLM Models: The small language model (SLM) versions of base models are a great choice for both evals and guardrails. GPT-4o mini or Gemini/Gemma cost roughly a tenth of what the base models do and are incredibly fast for guardrail-type applications. They are a natural place to test before moving to something more complex and more rigid like a fine-tuned model.

  • LLM Fine-Tune: A fine-tune of a 1B Llama or Phi model allows teams to apply a very specific eval with good language generalization at a reduced cost. We recommend moving to this once you have truly scaled out your application and have a set of very focused evaluations whose total cost you are working to reduce.

  • Fine-Tuned Classifier (BERT-Based): These models are incredibly cheap, but they trade away generalization (BERT does not work well across languages or situations) and flexibility (any change requires retraining) in exchange for that low cost. If cost is your main objective and you are laser-focused on a single language and use case, a BERT fine-tune might make sense, though there are enough better options that we generally recommend against it.