Why You Should Not Use Numeric Evals for LLM as a Judge

Aparna Dhinakaran,  Co-founder & Chief Product Officer  | Published March 03, 2024

This article is co-authored by Evan Jolley

In addition to generating text across an increasing number of industry use cases, LLMs are widely being used as evaluation tools. Models quantify the relevance of retrieved documents in retrieval systems, gauge the sentiment of comments and posts, and more – evaluating both human- and AI-generated text in a wide variety of contexts. These evaluations are often either numeric or categorical.

Evaluation types: categorical and numeric

Numeric evaluations involve an LLM returning a number based on a set of evaluation criteria. For example, a model could be tasked with rating how relevant a document is to a user query on a scale of one to ten.

A categorical evaluation is different in that it asks an LLM to choose from a set of pre-defined, often text-based options in its evaluation. For example, a prompt might ask if a passage was “happy,” “sad,” or “neutral” rather than trying to quantify the passage’s happiness level.
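To make the distinction concrete, here is a minimal sketch of the two prompt styles in Python. These templates are our own illustration, not the exact prompts used in the experiments below.

```python
# Minimal sketch of the two evaluation styles (illustrative only, not the
# exact templates used in the experiments described in this article).

NUMERIC_EVAL_TEMPLATE = """On a scale of 1 to 10, rate how relevant the
following document is to the user query. Respond with a single number.

Query: {query}
Document: {document}
Score:"""

CATEGORICAL_EVAL_TEMPLATE = """Classify the tone of the following passage.
Respond with exactly one of: "happy", "sad", or "neutral".

Passage: {passage}
Label:"""
```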

We set out to test several major LLMs (OpenAI’s GPT-4, Anthropic’s Claude, and Mistral AI’s Mixtral-8x7b) on how well they conduct numeric evaluations. The results are clear: continuous-scale score evaluations do not usually work.

All code run to complete these tests can be found in this GitHub repository.

Research Takeaways

  • Numeric score evaluations across LLMs are not consistent, and small differences in prompt templates can lead to massive discrepancies in results.
  • Even holding all independent variables (model, prompt template, context) constant can lead to varying results across multiple rounds of testing. LLMs are not deterministic, and some are not at all consistent in their numeric judgements.
  • We don’t believe GPT-4, Claude, or Mixtral (the three models we tested) handle continuous ranges well enough to use them for numeric score evals.

The Research

Spelling Corruption Experiment

The first experiment was designed to assess an LLM’s ability to assign scores between 0 and 10 to documents based on the percentage of words containing spelling errors.

We took a passage of correctly spelled words, edited the text to include misspelled words at varying frequencies, and then fed this corrupted text to an LLM using this prompt template:

Spelling corruption evaluation prompt template

We then asked the model to return a numeric eval corresponding to the percentage of words in the passage that are misspelled (3 → 30% misspelled, 8 → 80%, etc.). Ideally, a score of 10 would indicate that every word in a document is misspelled, while a score of 0 would mean there are no spelling errors at all. The results of the experiment across the three LLMs (GPT-4, Claude, and Mixtral) are less than stellar.
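Before turning to the results, here is a rough sketch of how this kind of corruption experiment can be wired up. The corruption function (a simple adjacent-letter swap) and the prompt wording are stand-ins of our own, not the exact code from the repository linked above.

```python
import random

def corrupt_passage(passage: str, error_rate: float, seed: int = 0) -> str:
    """Misspell roughly `error_rate` of the words by swapping two adjacent letters."""
    rng = random.Random(seed)
    words = passage.split()
    n_to_corrupt = int(len(words) * error_rate)
    for idx in rng.sample(range(len(words)), n_to_corrupt):
        word = words[idx]
        if len(word) >= 4:  # skip very short words
            i = rng.randrange(1, len(word) - 2)
            words[idx] = word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return " ".join(words)

# Stand-in prompt; the actual template is shown in the figure above.
SPELLING_EVAL_TEMPLATE = """Score the document below from 0 to 10, where 0 means
no words are misspelled and 10 means every word is misspelled.
Respond with a single integer.

Document: {document}
Score:"""

def expected_score(error_rate: float) -> int:
    """A perfectly linear judge maps a 30% error rate to 3, 80% to 8, and so on."""
    return round(error_rate * 10)
```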

GPT-4 spelling corruption results

Above, you can see that the observed results are far from the expected perfect linear range; the scoring system does not consistently reflect the proportion of spelling errors in the documents. In fact, GPT-4 returns 10 (which represents a 100% error rate) for every document with a corruption rate at or above 10%. The reported scores are the median of multiple trials conducted at each specified level of error.

GPT-4, Claude, and Mixtral spelling corruption results

The results from Claude are slightly better, but still not perfect or at a level acceptable for deployment. Mixtral, the smallest of these three models, performs best. After strong performances here and in our needle-in-a-haystack LLM analysis, Mixtral continues to impress!

So why does this matter? Given LLMs are currently being used as numeric evaluators in a variety of settings, there are good reasons to wonder whether the models are up to the task. Products that use LLMs in this way may run into roadblocks with performance and customer satisfaction. If you are currently using LLMs for numeric evaluation or are considering doing so, we strongly recommend going a different direction.

Emotional Qualifier Experiments

The second and third experiments were designed to assess an LLM’s ability to assign scores between 0 and 10 to documents based on the proportion of sentences within the text that contained words indicating sadness or frustration.

In these tests we embedded phrases and words into text that imparted a sense of sadness or frustration within the passage. The model was asked to quantify how prevalent the emotion was in the text, with 0 corresponding to no sentences conveying the emotion and 10 corresponding to 100% of sentences conveying the emotion.

These experiments were conducted alongside the spelling test to determine if shifting the model’s focus from word count to sentence count would impact the results. While the spelling test scored based on the percentage of misspelled words, the sadness/frustration tests scored based on the percentage of emotional sentences.
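Below is a minimal sketch of how such passages can be constructed. The qualifier phrases are assumptions for illustration and may differ from the wording embedded in our actual test passages.

```python
import random

# Assumed qualifier phrases, for illustration only.
SADNESS_QUALIFIERS = ["Unfortunately", "With a heavy heart", "Sadly"]
FRUSTRATION_QUALIFIERS = ["Infuriatingly", "To my great annoyance", "Frustratingly"]

def embed_qualifiers(sentences: list[str], rate: float,
                     qualifiers: list[str], seed: int = 0) -> str:
    """Prefix roughly `rate` of the sentences with an emotional qualifier."""
    rng = random.Random(seed)
    out = list(sentences)
    for idx in rng.sample(range(len(sentences)), int(len(sentences) * rate)):
        out[idx] = f"{rng.choice(qualifiers)}, {out[idx][0].lower()}{out[idx][1:]}"
    return " ".join(out)

# As in the spelling test, a perfectly linear judge would map a 40% rate of
# emotional sentences to a score of 4, 100% to 10, and 0% to 0.
```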

The instruction at the beginning of the prompt template varied between tests, while everything from the context onward remained the same, as indicated by the ellipses:

Frustration qualifier test template
Sadness qualifier test template

Again, a score of 10 should indicate that every sentence in a document contains sadness or frustration qualifiers, while a score of 0 would mean there are none present. Scores in between were expected to correspond to varying degrees of emotion frequency, with higher scores representing a greater proportion of emotional sentences. As before, we ran these experiments across GPT-4, Claude, and Mixtral, and the results were worse than we expected.

GPT-4 spelling corruption, sadness, and frustration results

Similar to the spelling corruption experiment, the results show a significant discrepancy from the expected outcomes. GPT-4 gives every document with a sadness rate above 30% or a frustration rate above 70% a score of 10. Remarkably, out of all of the tests run with GPT-4, the only times the median answer satisfies a perfect linear range are when there are no qualifiers or misspelled words present at all.

Mixtral spelling corruption, sadness, and frustration results

Mixtral performs surprisingly well across the emotional qualifier experiments. While there are good reasons to doubt any of these models currently handle continuous ranges well enough to use them for numeric score evals, Mixtral is the closest to accomplishing that feat.

While the models tested may handle score evals in sentence-based tasks marginally better than in word-based ones, they are still not able to perform at a level we are confident in. As a result, we do not recommend score evals in production code.

Variance In Results

It is worth noting that we ran these tests several times for each model and charted the distribution of their responses.
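As a rough sketch of what that looks like in code, the helper below tallies repeated judgments at each corruption or qualifier rate. Here `judge_score` is a hypothetical stand-in for one full round trip to the model under test (build the document, send the prompt, parse the integer reply); it is not a function from the repository linked above.

```python
from collections import Counter

def score_distribution(judge_score, rates, n_trials: int = 10):
    """Call the judge repeatedly at each rate and tally the scores it returns.

    `judge_score(rate) -> int` is a hypothetical stand-in for one call to the
    model under test: construct the document at `rate`, send the prompt, and
    parse the integer reply.
    """
    return {rate: Counter(judge_score(rate) for _ in range(n_trials))
            for rate in rates}

# Example usage (my_judge is a placeholder for a real judge function):
# distributions = score_distribution(my_judge, [i / 10 for i in range(11)], n_trials=20)
# An ideal judge would concentrate nearly all of its mass on round(rate * 10)
# at the extremes, with wider spread only in the middle of the range.
```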

Comparison of evaluation results across many tests with a 1 to 10 range

An ideal distribution would be tight around the low and high ends (high confidence when all or none of the words/sentences are affected) and perhaps have a longer transition region in the middle (e.g., lower confidence differentiating between a 4 and a 5).

Two things stand out here. First, the tightness of the distributions differs considerably across models and tasks. Claude’s distributions range widely over our trials; we have examples of the model consistently assigning scores of 1–4 at 80% corruption. GPT-4, on the other hand, has much tighter distributions, albeit at values that for the most part did not satisfy reasonable expectations.

Second, some models are better at handling transitions in continuous ranges than others. Mixtral’s distributions look like they are getting close to where an acceptable performance might be, but all three models seem to have a ways to go before they are ready for production.

Industry Implications for LLM Evals

There is currently a lot of research being done on LLM evaluations. Microsoft’s GPT Estimation Metric Based Assessment (GEMBA), for example, examines the ability of different large language models to evaluate the quality of different translation segments and finds that GPT-3.5 and larger models, such as Davinci-002 and Davinci-003, demonstrated “state-of-the-art capabilities” in translation quality assessment. We were surprised to see multiple papers reporting high success rates in numeric evals and would love to dive in deeper with the authors at some point.

While some research papers use probabilities and numeric scores as part of evaluation output – with GEMBA and others even reporting promising results – the way we see customers applying score evals in the real world is often quite different from current research. With that in mind, we attempted to tailor our research to these more practical, real-world applications, and the results highlight why using scores directly for decisions can be problematic. Considering GPT-4’s responses in our score evals research, it seems as though the model wants to choose one of two options: 1 or 10, all or nothing.

Ultimately, categorical evaluation (either binary or multi-class) likely has a lot of promise and it will be interesting to watch this space.

Conclusion

Using LLMs to conduct numeric evals, while increasingly popular, is finicky and unreliable. Switching between models and making small changes in prompt templates can lead to vastly different results, making it hard to endorse LLMs as consistently reliable arbiters of numeric evaluation criteria. Furthermore, large distributions of results across continued testing showcase that these models are often not consistent in their responses, even when independent variables remain unchanged.

We discourage any readers building with LLM evals from using numeric evaluations in the manner used in this research. Further articles detailing LLM evaluation methods will follow.