Continuous Improvement

Chapter Summary

This chapter explores self-improving evaluations and iterative processes for refining evaluation frameworks. Techniques such as feedback loops, few-shot prompting, and fine-tuning the evaluation model are discussed as ways to improve evaluation accuracy and adapt to evolving use cases.

Develop self-improving evaluations and feedback loops using tools detailed in our product documentation.

Self-Improving Evaluations

In LLM evaluation, self-improving evaluations are an approach in which the evaluation framework not only tests the model but also learns from the model's mistakes and continuously improves its own evaluation methods. The goal is to create an adaptive evaluation system that refines itself over time, leading to more accurate assessments of LLM performance.

“Create an adaptive evaluation system that refines itself over time, leading to more accurate assessments of LLM performance.”

1. Continuous Learning from Errors

Self-improving evaluations work by systematically identifying areas where the model performs poorly and using that feedback to adjust future evaluations. For example, if a model frequently makes errors in answering complex questions, the evaluation framework can highlight those areas and provide focused feedback to guide future adjustments.

This approach allows the evaluation framework to learn and evolve, enabling:

  • Identification of weak points: Pinpointing tasks or datasets where the model struggles the most.
  • Dynamic refinement: Adjusting criteria or datasets by incorporating harder examples or edge cases.
  • Self-correction: Updating metrics or test cases based on performance, continuously raising the standard for evaluation.
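
To make the first two points concrete, the sketch below groups evaluation results by task category to surface weak points and then promotes the failing examples from the weakest categories back into the evaluation dataset. It is a minimal illustration in Python; the record fields, categories, and threshold are hypothetical stand-ins for whatever your evaluation harness actually produces.

from collections import defaultdict

# Hypothetical eval results: one record per test case, with a category and a pass flag.
eval_results = [
    {"category": "multi-hop reasoning", "input": "Which author was born first?", "passed": False},
    {"category": "multi-hop reasoning", "input": "Who cited whom?", "passed": False},
    {"category": "summarization", "input": "Summarize the article.", "passed": True},
    {"category": "date math", "input": "How many days between the events?", "passed": True},
]

# Identification of weak points: compute the pass rate per category.
by_category = defaultdict(list)
for record in eval_results:
    by_category[record["category"]].append(record["passed"])
pass_rates = {cat: sum(flags) / len(flags) for cat, flags in by_category.items()}

# Dynamic refinement: promote failing examples from weak categories into the eval set.
WEAK_THRESHOLD = 0.5  # arbitrary cutoff for this sketch
weak_categories = {cat for cat, rate in pass_rates.items() if rate < WEAK_THRESHOLD}
new_hard_cases = [r for r in eval_results if r["category"] in weak_categories and not r["passed"]]

print(pass_rates)
print(f"Promoting {len(new_hard_cases)} hard cases into the evaluation dataset")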

2. Feedback Loops

A crucial aspect of self-improving evaluations is the use of feedback loops. These loops allow models to learn from the evaluations themselves. Feedback on where the model underperforms can guide fine-tuning or retraining efforts, leading to more effective performance improvements.

By embedding feedback loops into the evaluation process, models become more robust over time. This is particularly important for models that will encounter changing environments or evolving datasets, as they need to adapt in real time.
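
As a rough sketch of such a loop, the code below runs an evaluation suite, collects the cases where the model underperforms, and appends them to a feedback file that can seed later fine-tuning, retraining, or prompt adjustments. The run_eval_suite function is a placeholder for whatever harness you actually use, not a real library call.

import json

def run_eval_suite(model_version: str) -> list[dict]:
    # Placeholder for your evaluation harness: in practice this would call the
    # model and the evaluators, returning one record per test case.
    return [
        {"input": "What year was the treaty signed?", "output": "1820",
         "expected": "1815", "passed": False},
        {"input": "Summarize the paragraph.", "output": "A short summary.",
         "expected": "A short summary.", "passed": True},
    ]

# Run the suite and keep only the failures as feedback.
results = run_eval_suite(model_version="v1")
failures = [r for r in results if not r["passed"]]

# Persist the failures; they become fine-tuning or prompt-iteration material
# for the next cycle, closing the loop.
with open("feedback_queue.jsonl", "a") as f:
    for record in failures:
        f.write(json.dumps(record) + "\n")

print(f"{len(failures)} cases routed into the feedback loop")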

Improving LLM Evaluation Systems

Improving LLM evaluation systems is often achieved by iterating on the prompt template used, adding few-shot examples, or fine-tuning the underlying evaluation model.

Few-Shot Prompting Evaluations

Few-shot prompting refers to the practice of providing an LLM with a small number of examples (or shots) to help it understand the task at hand. The goal is to assess how well the model can generalize and perform based on limited input data.

1. Setting Up Few-Shot Prompting Evaluations:

The model is given a few labeled examples in the prompt to guide its response. For example, in evaluating a summarization model, you might provide two or three examples of text along with their summaries before asking the model to generate a summary for a new text. The evaluation then measures how well the model generalizes to unseen data.

This approach is essential for:

  • Rapid prototyping: Testing how the model handles new tasks with minimal input.
  • Generalization: Assessing the model’s ability to extrapolate from limited data.
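
A minimal sketch of this setup for a summarization model might look like the following: two worked examples in the prompt, followed by a new text for the model to summarize. The example texts, the model name, and the use of the OpenAI client are assumptions for illustration; substitute your own provider, and add whatever grading step (human reference, LLM judge, or overlap metric) your evaluation uses.

from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

# Two labeled examples (the "shots"), followed by the new input to summarize.
few_shot_examples = [
    ("The city council approved the new bike lane budget after a long debate.",
     "City council approves bike lane budget."),
    ("Researchers found the new alloy resists corrosion twice as long as steel.",
     "New alloy doubles corrosion resistance over steel."),
]
new_text = "The library extended weekend hours after a surge in student demand."

prompt_parts = ["Summarize each text in one sentence.\n"]
for text, summary in few_shot_examples:
    prompt_parts.append(f"Text: {text}\nSummary: {summary}\n")
prompt_parts.append(f"Text: {new_text}\nSummary:")
prompt = "\n".join(prompt_parts)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
candidate_summary = response.choices[0].message.content

# The evaluation step then compares candidate_summary against a reference
# summary to measure how well the model generalizes to unseen data.
print(candidate_summary)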

2. Curating a Dataset for Few-Shot Evaluations:

Curating a dataset for few-shot evaluations involves balancing variety and representativeness. Few-shot datasets must be carefully selected to represent the full scope of the task. Approaches include hand-selecting examples from existing data, writing them from scratch, or using an LLM to generate synthetic examples.

  1. Diverse examples: Cover different challenges the model may encounter.
  2. Edge cases: Include rare or tricky scenarios to test limits.
  3. Consistency in ground truth: Ensure correct and well-defined answers for evaluation.

By carefully curating a few-shot dataset, you allow the model to demonstrate its ability to quickly learn and adapt to new information, which is a critical component in measuring its overall robustness.

“Few-shot datasets must be carefully selected to represent the full scope of the task.”
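
One way to keep a curated few-shot set honest about diversity, edge cases, and ground-truth consistency is to store every example with explicit tags and check coverage before the set is used. The schema below is a hypothetical sketch rather than a prescribed format; adapt the fields to your task.

from collections import Counter
from dataclasses import dataclass

@dataclass
class FewShotExample:
    query: str
    reference: str
    answer: str
    label: str        # consistent ground truth, e.g. "factual" or "hallucinated"
    scenario: str     # the kind of challenge this example covers
    edge_case: bool   # True for rare or tricky scenarios

curated_set = [
    FewShotExample("Who wrote the report?", "The report was authored by Dr. Lee.",
                   "Dr. Lee wrote it.", "factual", "attribution", False),
    FewShotExample("When does the policy start?", "The policy takes effect in March.",
                   "It starts in January.", "hallucinated", "dates", False),
    FewShotExample("What was the budget?", "The budget was not disclosed.",
                   "The budget was $2 million.", "hallucinated", "unanswerable", True),
]

# Simple coverage checks: scenario variety, label balance, and edge-case presence.
print(Counter(ex.scenario for ex in curated_set))
print(Counter(ex.label for ex in curated_set))
print("edge cases:", sum(ex.edge_case for ex in curated_set))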

3. Updating Prompts with Examples:

Examples for prompts can be selected based on cosine similarity to the query or synthesized by an LLM. These examples help refine the prompt, address edge cases, and improve precision.

You can also use an LLM to summarize the examples and insert the summary as additional instructions. As you add examples, you catch more and more edge cases; add those to your prompt and re-test against your golden dataset to ensure reliability.
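
The sketch below shows one way to do the similarity-based selection: embed the curated examples and the incoming query, rank the examples by cosine similarity, and format the closest matches into the {examples} slot of a template such as the hallucination prompt shown after this sketch. The OpenAI embedding model is an assumption; any embedding model can be substituted.

import numpy as np
from openai import OpenAI  # assumed provider for this sketch

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Curated candidate examples (query, answer, label) to draw from.
candidates = [
    ("When does the policy start?", "It starts in January.", "hallucinated"),
    ("Who wrote the report?", "Dr. Lee wrote it.", "factual"),
    ("What was the budget?", "The budget was $2 million.", "hallucinated"),
]
incoming_query = "What month does the new policy take effect?"

candidate_vectors = embed([query for query, _, _ in candidates])
query_vector = embed([incoming_query])[0]

# Cosine similarity between the incoming query and each candidate example.
similarities = candidate_vectors @ query_vector / (
    np.linalg.norm(candidate_vectors, axis=1) * np.linalg.norm(query_vector)
)

top_indices = np.argsort(similarities)[::-1][:2]  # keep the two closest examples
examples = "\n\n".join(
    f"Query: {candidates[i][0]}\nAnswer: {candidates[i][1]}\nLabel: {candidates[i][2]}"
    for i in top_indices
)
# `examples` is what gets substituted into the {examples} slot of the template below.
print(examples)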

In this task, you will be presented with a query, a reference text and an answer. The answer is generated to the question based on the reference text. The answer may contain false information. You must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. A 'hallucination' refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information.

Use the examples below for reference:
{examples}

Here is the query, reference, and answer.
   # Query: {query}
   # Reference text: {reference}
   # Answer: {response}

Is the answer above factual or hallucinated based on the query and reference text?

4. Fine-Tuning the Evaluation Model:

The last step is fine-tuning the evaluator. Fine-tuning the evaluator model is similar to fine-tuning the LLM used for the application, and it can be done using the data points collected earlier.

This also allows teams to use smaller language models, which reduces latency and cost while maintaining similar levels of performance. As the dataset of corrections grows, AI engineers can connect their evaluator to the CI/CD pipeline and continuously run fine-tuning jobs to increase the precision of their evaluator.
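
As a hedged sketch of what that pipeline step might look like, the code below converts collected corrections into a chat-format JSONL file and launches a fine-tuning job through the OpenAI API. The field names, file paths, and model choice are assumptions; the same pattern applies to other providers or to open-weight evaluator models trained with your own tooling.

import json
from openai import OpenAI  # assumed provider for this sketch

client = OpenAI()

# Corrections collected from earlier evaluation runs: the evaluator's input
# plus the human-corrected label it should have produced.
corrections = [
    {"prompt": "Query: When does the policy start?\nReference: The policy takes effect in March.\nAnswer: It starts in January.",
     "corrected_label": "hallucinated"},
    {"prompt": "Query: Who wrote the report?\nReference: The report was authored by Dr. Lee.\nAnswer: Dr. Lee wrote it.",
     "corrected_label": "factual"},
]

# Write the corrections as chat-format training examples.
with open("evaluator_corrections.jsonl", "w") as f:
    for record in corrections:
        example = {
            "messages": [
                {"role": "system", "content": "Label the answer as factual or hallucinated."},
                {"role": "user", "content": record["prompt"]},
                {"role": "assistant", "content": record["corrected_label"]},
            ]
        }
        f.write(json.dumps(example) + "\n")

# Upload the file and start a fine-tuning job for a smaller evaluator model.
training_file = client.files.create(
    file=open("evaluator_corrections.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; choose a model that supports fine-tuning
)
print("fine-tuning job started:", job.id)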