Navigating LLM Leaderboards: An Overview of LLM Benchmark Evaluations

sallyann delucia arize team member
Sally-Ann DeLucia,  ML Solutions Engineer  | Published October 10, 2023

Introduction

Say you’ve identified a compelling use-case and maybe even successfully tested a proof of concept (POC). Now, the real challenge begins – choosing the right model for production deployment. Navigating the ever-expanding universe of large language models (LLMs) can seem overwhelming. New models appear weekly, each boasting about its superior performance. But discerning the true state of the art model in this fast-paced landscape and what model best suits your specific use-case can be a difficult task. You might have encountered the numerous LLM learboards that exist, but what do they truly mean? How can they guide you in making an informed choice? What additional factors should you keep in mind? Let’s delve deeper into these questions and provide you with practical tools to make informed decisions.

What Are LLM Leaderboards?

LLM leaderboards aim to track, rank, and evaluate LLMs and chatbots as they are released. This is a crazy time in machine learning where new models are being released almost weekly. This makes it difficult to decide which model(s) to evaluate for your task. Due to the number of available options, testing each one is unrealistic as it would take too much time. The leaderboards provide you with the insights that allow you to make that initial decision. Leaderboards typically provide engineers with a variety of benchmarks and the model’s associated performance. These benchmarks give you an indication of how that given model might perform on your task.

These LLM leaderboards are the end product but it’s important to remember that the benchmarks powering them are invaluable to the community. Benchmarks such as the ElthuerthAI Eval Harness and StanfordHelm are open, standardized, and reproducible — without them, comparing results across models and papers would be difficult; research on improving LLMs would similarly certainly be stifled without them.

Are LLM Leaderboards Misleading?

While they encaspulate the latest results on the “general fitness” of a given model from standardized evaluation frameworks and how well it does at a variety of tasks, LLM leaderboards are not necessarily built for evaluating how well a model is suited for a specific application or your narrower industry use case. Often, task-specific LLM benchmarks and system evaluations are needed before any production deployment of an LLM application to ensure effectiveness. With a goal of analyzing the performance of the model using LLM as a judge (AI-assisted evals), task-based evaluations are often pragmatic necessities for AI engineers.

Category Model Assessments Task Assessments
Foundation of Truth Relying on established benchmark datasets. Depends on a golden dataset curated internally, augmented with LLMs.
Nature of Questions Utilizes standardized questions for broad capability evaluation. Adapts task-specific prompts for diverse data scenarios, simulating real-world conditions.
Frequency and Purpose Conducted as a one-time evaluation for general abilities using benchmarks. An iterative process, applied repeatedly for system refinement and real-world adaptability.
Value of Explanations Typically, explanations lack actionable insights; emphasis is on outcomes. Provide actionable insights for performance understanding and enhancement in specific contexts.
Persona LLM researcher assessing new models; ML practitioner selecting models. ML practitioner throughout the application’s lifespan, refining and adapting as

For additional insight, here’s a piece to dive deeper into LLM model versus task evaluations for LLM benchmarking.

LLM Benchmarks

A robust LLM benchmark relies on a golden dataset with ground truth labels to serve as a standard of comparison for a given task.

Generally, new models are tested on a variety of benchmarks but papers rarely include the necessary code or sufficient detail to replicate their evaluations. The performance reported is often governed by minor implementation details and it’s unrealistic to expect results from one codebase to transfer directly to another.

To address this major challenge, unified frameworks have been developed to allow for a ground truth location to evaluate new LLMs. This protects practitioners time but also ensures that results can be compared against any previous work. In a way these evaluation harnesses are setting the benchmark for auditing LLMs. ElthuerthAI Eval Harness and StanfordHelm are two community favorites that provide a unified way to evaluate these models in an transparent and reproducible way.

HuggingFace Leaderboard

One LLM Leaderboard that is popular amongst engineers and data scientists is hosted by Hugging Face. This resource not only allows you to quickly evaluate open source LLMs with up-to-date metrics, but also allows anyone from the community to submit a model to be evaluated. This means that if you’re opting to build your model from scratch, Hugging Face will allow you to evaluate it making it easy to see how it holds up against other open source models out there. Another great benefit of the Hugging Face LLM Leaderboard is that all the results are easily reproducible and well documented.

Benchmarks Used

HuggingFace relies on Eleuther AI Language Model Evaluation Harness, a unified framework designed to test generative language models developed by EleutherAI, a non profit AI research label. Essentially, the leaderboard operates as an interface for this open-source framework. HuggingFace assesses open LLMs based on four main benchmarks from the Eleuther framework, selecting them for their ability to gauge reasoning and general knowledge across diverse fields.

AI2 Reasoning Challenge

The AI2 Reasoning Challenge (ARC) is designed as a benchmark dataset for advanced question answering, focusing on questions that require deeper knowledge and reasoning rather than simple retrieval-based methods. The dataset consists of 7,787 grade-school science questions, divided into a Challenge Set and an Easy Set. The Challenge Set contains questions that are difficult to answer with basic retrieval or co-occurrence algorithms, making them suitable for testing advanced QA methods.

ARC aims to address the limitations of previous datasets that mainly focused on factoid-style questions with surface-level cues. By providing challenging questions that require reasoning, commonsense knowledge, and deeper text comprehension, ARC encourages researchers to explore more complex AI challenges.

Types of knowledge suggested by ARC Challenge Set questions
Types of reasoning suggested by ARC Challenge Set questions

HellaSwag

HellaSwag is a challenge dataset for commonsense natural language inference (NLI). It aims to evaluate the ability of LLM models to perform commonsense reasoning. The dataset is constructed using Adversarial Filtering (AF), making it difficult for state-of-the-art models while remaining easy for humans. HellaSwag includes contexts from ActivityNet Captions and WikiHow articles, offering diverse and complex scenarios. It serves as a benchmark to assess the robustness of models in handling real-world reasoning tasks, particularly when dealing with long and complex contexts. Researchers interested in testing the commonsense reasoning capabilities of LLMs models can use HellaSwag as a challenging evaluation dataset.

Example questions with answers from BERT Large.

MMLU

The Multitask Multidomain Language Understanding (MMLU) benchmark introduces a comprehensive test to evaluate text models’ knowledge across 57 diverse subjects, covering humanities, social sciences, STEM, and more. It assesses models’ ability to learn from pre-training by evaluating them in zero-shot and few-shot settings, resembling human learning. The benchmark includes questions from freely available sources, and human-level accuracy on MMLU varies, with unspecialized humans obtaining 34.5% accuracy. MMLU can be used as a benchmark to identify knowledge gaps and limitations in large language models.

Examples from the Conceptual Physics and College Mathematics STEM tasks.

TruthfulQA

TruthfulQA is a benchmark designed to evaluate the truthfulness of language models in generating answers to diverse questions. The benchmark includes 817 questions covering 38 categories and targets “imitative falsehoods,” which are false answers that resemble common misconceptions found in the models’ training data. The goal is to assess the likelihood of models producing false or deceptive information without task-specific fine-tuning.

Researchers and developers can use TruthfulQA to measure how truthful their language models are and identify potential weaknesses that lead to false answers. The benchmark allows for model comparison, enabling users to understand which models perform better in providing accurate and truthful responses. It also serves as a test of generalization, examining models’ ability to maintain truthfulness across various contexts and topics. Overall, TruthfulQA provides valuable insights into the truthfulness of language models and aids in ensuring their reliability and trustworthiness in generating factual information.

TruthfulQA questions with answers from GPT-3-175B with default prompt.
TruthfulQA questions with answers from GPT-3-175B with default prompt.

Evaluating Benchmark Performance

When looking at the LLM Leaderboards, you may want to focus on the benchmark(s) that make the most sense for your particular use case. For example, if you’re interested in a model’s reasoning abilities and tackling more complex question-answering tasks, you can use the ARC dataset and corpus as a benchmark and starting point for evaluating the model. If you want to assess a model’s limitations, you might want to observe the TruthfulQA or MMLU score. If you want to understand the robustness of the model, you might pay more attention to the HellaSwag. It’s important to note that for HuggingFace’s evaluations, the higher the score, the better the model performed on that task. These benchmarks are not all encompassing, so you’ll likely want to evaluate multiple benchmarks before choosing your model.

Licenses

When evaluating models on the LLM leaderboard you’re going to want to pay attention to the license associated with the model. Different licenses have different provisions so you’ll want to make sure the model you choose is in line with your intended use of it. Below is a helpful table to help you make sense of some of the licenses you might find on LLM leaderboards.

License Name Description
apache-2.0 End-users can utilize the Apache 2.0 license in any commercially licensed software or enterprise application for free
bigcode-openrail-m Designed to permit free and open access, re-use, and downstream distribution of derivatives of AI artifacts, for research, commercial or non-commercial purposes, as long as the use restrictions present in the license always apply.
bsd-3-clause Allows you almost unlimited freedom with the software so long as you include the BSD copyright and license notice in it.
cc Creative Commons license family
cc-by-nc4.0 This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.
cc-by-nc-nd-4.0 This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
cc-by-nc-sa-4.0 This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
cc-by-sa-3.0 This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
cc0-1.0 Public dedication tool, which allows creators to give up their copyright and put their works into the worldwide public domain. CC0 allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, with no conditions.
creativeml-openrail-m Same provisions as bigcode-openrail-m
gpl GNU General Public License family
gpl-3.0 A free copyleft license that ensures users’ freedom to share, modify, and distribute software and other works. It allows users to run, copy, distribute, modify, and propagate the covered works, subject to the terms and conditions specified in the license.
llama2 Allows you to use, reproduce, distribute, modify, and create derivative works of the Llama 2 large language models and software provided by Meta. The license is non-exclusive, worldwide, non-transferable, and royalty-free, but if you have more than 700 million monthly active users for your products or services, you need to request an additional commercial license from Meta.
MIT An MIT licensed software can be redistributed under any other license if the original copyright and license notice is included in the software license. The licensed software can be used, distributed, and modified to any extent; even for proprietary purposes.
openrail OpenRAIL license family

For more information on licensing check out this awesome primer.

Steps for Submitting Your Own Model To the Hugging Face Open LLM Leaderboard For Evaluation

The first thing you’ll want to consider before submitting your model to Hugging Face’s leaderboard is your license. Their leaderboard is for open source models so you’ll want to make sure the model you developed has an open source license.

The next step is to load your model to Hugging Face and fill out your model card to add extra information about models to the leaderboard. You’ll also want to make sure the model is set to Public.

Make sure you can load your model and tokenizer using AutoClasses. If you run into issues with this part, follow the error you received to debug your model. There is likely something wrong with how the model was uploaded to your Hugging Face repo.

from transformers import AutoConfig, AutoModel, AutoTokenizer
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)

Convert your model weights to safetensors, a new format for storing weights which is safer and faster to load and use. It will also allow the number of weights of your model to be added to the Extended Viewer.

Finally, submit your model to the leaderboard for evaluation.

Summary

Selecting the ideal model for your unique application is a complex process. Amidst the near weekly influx of new models, distinguishing the state-of-the-art models is extremely challenging. Fortunately, the community has developed benchmarks and leaderboards that streamline the process of comparing and evaluating LLMs. These leaderboards present up-to-date metrics based on unified performance benchmarks that test on a wide array of reasoning abilities and general knowledge tasks. You’ll want to be sure you keep your use case in mind while reviewing the benchmarks and licenses for models to ensure you are choosing the one that is best aligned with your desired application.