Best Practices for Selecting the Right Model for LLM-as-a-Judge Evaluations
When building and scaling LLM-based applications, ensuring model performance is critical. One powerful method for evaluating that performance is using an LLM as a judge: a second model systematically assesses your application's output and flags issues like hallucinations, incorrect responses, or inconsistencies that arise once the application starts handling real-world data. In this post, I'll walk through best practices for selecting the right evaluation model for LLM-as-a-judge scenarios.
- Want to skip to the code? Check out the notebook here!
Why Use LLMs to Evaluate Other LLMs?
Using an LLM to evaluate other models can save significant time and effort, especially when scaling. For instance, an LLM as a judge can automatically identify hallucinated responses or irrelevant content by leveraging specific evaluation prompts. These prompts can be pulled from Arize’s Phoenix library, which provides templates tailored for different evaluation needs.
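For instance, Phoenix ships a relevance-classification template along with the set of allowed output labels ("rails") that go with it. Here is a minimal sketch, assuming a recent arize-phoenix release where these helpers live under phoenix.evals (import paths have moved between versions):

```python
# A minimal sketch: pull Phoenix's relevance-classification prompt and its rails.
# Assumes a recent arize-phoenix release; import paths differ in older versions.
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,   # prompt that asks the judge to label a document's relevance
    RAG_RELEVANCY_PROMPT_RAILS_MAP,  # the allowed output labels the judge must pick from
)

print(RAG_RELEVANCY_PROMPT_TEMPLATE)                  # inspect the prompt the judge will see
print(list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()))  # the relevant / irrelevant label set
```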
Step 1: Start with a Golden Dataset
The foundation of any evaluation process is a golden dataset. This dataset contains user queries, responses, and ground truth values, often human-annotated answers that serve as a baseline for comparison. By comparing the model's output against this ground truth, I can measure metrics such as accuracy, precision, and recall.
For example, I recently worked on a document relevancy task where the application generates responses based on a user’s query and a set of reference documents. My goal was to determine whether the reference document was relevant to the query. This golden dataset provided the necessary framework to evaluate how well the model performed on this task.
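Concretely, a golden dataset for this task can be a small table of queries, reference documents, and human labels. The sketch below uses pandas; the `input` and `reference` column names follow what Phoenix's relevance template expects (worth double-checking against the template text), while `true_label` is my own naming for the ground-truth column:

```python
import pandas as pd

# Illustrative golden dataset for document relevancy. "input" holds the user
# query and "reference" the retrieved document; "true_label" is an assumed
# column name for the human-annotated ground truth, and its values should use
# the same vocabulary as the evaluation template's rails.
golden_df = pd.DataFrame(
    {
        "input": [
            "How do I reset my password?",
            "What is the refund policy?",
        ],
        "reference": [
            "To reset your password, open Settings and choose 'Reset password'.",
            "Our offices are closed on public holidays.",
        ],
        "true_label": ["relevant", "irrelevant"],
    }
)
```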
Step 2: Choosing the Evaluation Model
One of the key decisions in the process is choosing the right model to act as the evaluator. In my recent evaluations, I compared three candidates: GPT-4, GPT-3.5 Turbo, and Claude 3.5 Sonnet (a code sketch of the comparison follows this list).
- GPT-4 emerged as the top performer, achieving an accuracy of 81%. It balanced precision and recall across both relevant and irrelevant categories, making it a reliable choice for production-scale applications.
- GPT-3.5 Turbo, while more cost-effective, lagged in performance, with an accuracy of 55%. It particularly struggled to detect irrelevant items, with a recall score of just 7%. This model may work well when cost is a primary concern, but it’s prone to missing irrelevant cases, which can be problematic depending on the task.
- Claude 3.5 Sonnet landed between the two, with an accuracy of 77%. However, it tended to over-predict the relevance of examples, potentially increasing the number of false positives.
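Below is a rough sketch of how this comparison might be run with Phoenix's `llm_classify`, reusing the golden dataset from above. The model identifiers and constructor arguments are assumptions; adjust them to whatever your Phoenix and provider SDK versions expose.

```python
from phoenix.evals import (
    AnthropicModel,
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    llm_classify,
)

rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# Candidate judge models; the identifiers below are assumptions, so swap in
# whichever model names your accounts actually expose.
judges = {
    "gpt-4": OpenAIModel(model="gpt-4"),
    "gpt-3.5-turbo": OpenAIModel(model="gpt-3.5-turbo"),
    "claude-3.5-sonnet": AnthropicModel(model="claude-3-5-sonnet-20240620"),
}

results = {}
for name, judge in judges.items():
    # Runs the relevance prompt over every row of the golden dataset and
    # constrains each judge's answer to the allowed rails.
    results[name] = llm_classify(
        dataframe=golden_df,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=judge,
        rails=rails,
    )
```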
Step 3: Analyzing the Results
Once the models have been run through the evaluations, the next step is to compare their performance metrics. For my document relevancy task, GPT-4 stood out, with strong metrics across the board. It demonstrated high accuracy, precision, and recall, making it the clear choice for this particular use case.
On the other hand, GPT-3.5 Turbo’s significantly lower accuracy and poor recall for irrelevant examples made it less suitable for the task, despite its lower cost. Meanwhile, Claude 3.5 Sonnet was a decent middle ground but had a tendency to overestimate relevancy, which could lead to misleading results.
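One way to produce these numbers is to compare each judge's predicted labels against the golden labels with standard classification metrics. Here is a sketch using scikit-learn, assuming the `results` and `golden_df` objects from the previous step and that `llm_classify` returns a `label` column aligned row-for-row with its input:

```python
from sklearn.metrics import classification_report

# Per-judge accuracy, precision, and recall against the human ground truth.
for name, eval_df in results.items():
    print(f"=== {name} ===")
    print(
        classification_report(
            y_true=golden_df["true_label"],
            y_pred=eval_df["label"],
            zero_division=0,
        )
    )
```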
Step 4: Adding Explanations for Transparency
After selecting the evaluation model, I often like to go a step further by enabling explanations in the evaluation process. Explanations provide more clarity into why the model made certain decisions. In my case, I enabled the "provide explanation" flag, which extends the prompt template so that GPT-4 generates a justification for each of its evaluations.
This added layer of transparency is especially useful when spot-checking the model in production, as it helps to ensure the model is making accurate decisions. With explanations enabled, I can easily see the reasoning behind each evaluation, which gives me greater confidence in the results.
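With Phoenix's `llm_classify`, this can be done with the `provide_explanation` flag. A short sketch continuing the earlier comparison (the `golden_df` and `rails` objects come from the previous snippets):

```python
from phoenix.evals import OpenAIModel, RAG_RELEVANCY_PROMPT_TEMPLATE, llm_classify

# Re-run the winning judge with explanations enabled; the extra "explanation"
# column makes spot-checking individual judgments much easier.
relevance_evals = llm_classify(
    dataframe=golden_df,                     # golden dataset from the earlier sketch
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4"),
    rails=rails,                             # rails list from the earlier sketch
    provide_explanation=True,                # asks the judge to justify each label
)

print(relevance_evals[["label", "explanation"]].head())
```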
Step 5: Monitoring Performance in Production
Once the evaluation model has been chosen, the work doesn't stop. Continuous monitoring is essential to ensure the model keeps performing well as new data flows in. Using tools like Phoenix's tracing page, I can visualize and track evaluation results in real time, making it easy to catch potential issues early and make adjustments as needed.
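If your application's traces already flow into Phoenix, the judge's results can be attached to those traces so they appear alongside each span in the tracing UI. A rough sketch, assuming the evaluation DataFrame is indexed by the span IDs of the traced LLM calls (an indexing requirement to verify against the docs for your Phoenix version):

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Assumes eval_df holds label / score / explanation columns and is indexed by
# the "context.span_id" of each traced call; this is an assumption to confirm
# against your Phoenix version before relying on it.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Document Relevance", dataframe=eval_df)
)
```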
Wrapping Up
Selecting the right LLM to act as an evaluator is a crucial step in ensuring the quality of your application’s output. By starting with a golden dataset, carefully selecting and comparing models, adding explanations for transparency, and monitoring performance in production, you can confidently choose the best model for your use case.
In my evaluations, GPT-4 consistently demonstrated the best overall performance, balancing accuracy, precision, and recall. However, depending on your specific needs, whether that's cost efficiency or a tolerance for lower accuracy, other models like GPT-3.5 Turbo or Claude 3.5 Sonnet may also be worth considering.
If you’re looking for pre-built prompt templates or want to start running LLM-as-a-judge evaluations, check out Arize’s Phoenix library for a variety of helpful resources.