Evals from OpenAI: Simplifying and Streamlining LLM Evaluation

Trevor LaViale,  ML Solutions Engineer |  Published May 05, 2023

This blog is co-authored by Aparna Dhinakaran, Chief Product Officer and Co-Founder of Arize AI

OpenAI’s Eval Framework Is a Tool for Evaluating Large Language Models

Artificial Intelligence has made significant advancements over the past few months, with foundational models achieving impressive results on various use cases. The age of LLMs is definitely upon us; however, evaluating these models is often challenging, and researchers need to develop reliable methods for comparing different models’ performance.

A few months ago, OpenAI open-sourced their framework for evaluating LLMs against a series of benchmarks. This framework was used internally at OpenAI to ensure new versions of their models were performing adequately. OpenAI’s Eval Framework is a tool designed to help researchers and practitioners evaluate their LLMs and compare them to other state-of-the-art models.


How Does the Eval Framework Work?

What Is An Eval?

Now at this point, you may be thinking, “Wow, this seems like a really useful tool for evaluating LLMs, but what is an eval and how do I use it?” Let’s dive into the specifics!

An “eval” refers to a specific evaluation task that is used to measure the performance of a language model in a particular area, such as question answering or sentiment analysis. These evals are typically standardized benchmarks that allow for the comparison of different language models. The Eval framework provides a standardized interface for running these evals and collecting the results.

At its core, an eval is a dataset paired with an eval class, both declared in a YAML file. An example eval is shown below (taken from the GitHub repository for evals):

  test-match:
    id: test-match.s1.simple-v0
    description: Example eval that checks sampled text matches the expected output.
    disclaimer: This is an example disclaimer.
    metrics: [accuracy]
  test-match.s1.simple-v0:
    class: evals.elsuite.basic.match:Match
    args:
      samples_jsonl: test_match/samples.jsonl

Let’s break down what the above means:

  • test-match: This is the name of the eval
  • id: This is the full name of the eval that test-match is an alias for
  • description: Description of the eval.
  • metrics: The metrics for the eval.
  • class: Specifies the path for the module/class for the eval.
  • args: Anything you want to pass to the class constructor.
    • samples_jsonl: points to the location of where the samples are, which in this case are in test_match/samples.jsonl

To give you an idea of what the samples look like, these are the samples in the test_match/samples.jsonl file:

{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Once upon a "}], "ideal": "time"}
{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "The first US president was "}], "ideal": "George Washington"}
{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "OpenAI was founded in 20"}], "ideal": "15"}

Within the JSONL file (a text file with one JSON object per line), we can see the samples for the eval. Each JSON object represents a task for the model to complete and counts as one data point in the eval. For more examples of JSONL files, you can go to registry/data/README.md in the Eval GitHub repository.
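As a minimal sketch of that format (this is not part of the Evals library), the Python below parses JSONL text one line at a time and checks that each sample carries the two fields used in the samples above. The helper name load_samples is hypothetical.

```python
import json


def load_samples(jsonl_text):
    """Parse JSONL text into a list of sample dicts (hypothetical helper)."""
    samples = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between samples
        sample = json.loads(line)  # each line is its own JSON object
        for field in ("input", "ideal"):
            if field not in sample:
                raise ValueError(f"line {line_number} is missing '{field}'")
        samples.append(sample)
    return samples


SYSTEM = {"role": "system", "content": "Complete the phrase as concisely as possible."}

# Two of the samples shown above, serialized one JSON object per line.
EXAMPLE_JSONL = "\n".join(
    json.dumps({"input": [SYSTEM, {"role": "user", "content": prompt}], "ideal": ideal})
    for prompt, ideal in [("Once upon a ", "time"), ("OpenAI was founded in 20", "15")]
)

samples = load_samples(EXAMPLE_JSONL)
print(len(samples), [s["ideal"] for s in samples])
```

Each parsed sample is a plain dictionary, so the prompt messages live under sample["input"] and the expected answer under sample["ideal"].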

In the section below, we’ll go over how to run the test-match eval.

How To Run An Eval

We can run the above eval with a simple command:

oaieval gpt-3.5-turbo test-match

Here we’re using the oaieval CLI to run this eval. We specify the name of the completion function (gpt-3.5-turbo) and the name of the eval (test-match). It’s as easy as that! We’ll dive deeper into completion functions and how to build your own evals in the sections below. After running that command, you’ll see the final accuracy report printed to the console, along with the path to a temporary file that contains the full report. Next, let’s learn how to build our own evals instead of using one already in the registry.
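To make the reported accuracy concrete, here is a simplified sketch of what a match-style eval computes. It is not the framework’s implementation: fake_model is a hypothetical stand-in for a real completion function, and the rule that an answer counts as correct when it starts with the ideal string is an assumption made for illustration.

```python
def fake_model(messages):
    """Hypothetical stand-in for a model: returns canned answers."""
    canned = {
        "Once upon a ": "time",
        "The first US president was ": "Abraham Lincoln",  # deliberately wrong
    }
    return canned[messages[-1]["content"]]


def match_accuracy(samples, model):
    """Fraction of samples whose completion starts with the ideal answer."""
    correct = 0
    for sample in samples:
        completion = model(sample["input"]).strip()
        if completion.startswith(sample["ideal"]):
            correct += 1
    return correct / len(samples)


SAMPLES = [
    {"input": [{"role": "user", "content": "Once upon a "}], "ideal": "time"},
    {"input": [{"role": "user", "content": "The first US president was "}],
     "ideal": "George Washington"},
]

accuracy = match_accuracy(SAMPLES, fake_model)
print(f"accuracy: {accuracy}")
```

One of the two canned answers is wrong on purpose, so the sketch reports an accuracy of 0.5.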

How Can I Build My Own Eval?

In this section, we’ll go over how to build an eval from an existing template, then explain completion functions and how to build your own.

Building Evals

Building Samples

Here we’ll walk through how to build a custom eval using an existing template to speed up the work. (If you want to build a completely custom eval, here is a README from the Eval GitHub repository.)

The first step in building the eval is constructing the samples. Every sample needs an “input” field holding the prompt, ideally in chat format. The remaining fields depend on the template you choose for the eval. As an example, let’s use the Match template, which requires “input” in chat format plus “ideal”. A sample could look like this:

{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Arize AI is a company specializing in ML "}], "ideal": "observability"}

This tells the model to complete the phrase as concisely as possible. The phrase we provided is “Arize AI is a company specializing in ML ” and the expected answer is “observability.”
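If you are generating samples programmatically, a short script can produce this format. The sketch below uses only the Python standard library; the helper names make_match_sample and to_jsonl are hypothetical and not part of the Evals repository.

```python
import json


def make_match_sample(system_prompt, user_prompt, ideal):
    """Build one Match-style sample in chat format (hypothetical helper)."""
    return {
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "ideal": ideal,
    }


def to_jsonl(samples):
    """Serialize samples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(sample) for sample in samples) + "\n"


sample = make_match_sample(
    "Complete the phrase as concisely as possible.",
    "Arize AI is a company specializing in ML ",
    "observability",
)
jsonl_text = to_jsonl([sample])
print(jsonl_text, end="")
```

Writing jsonl_text to a file gives you a samples file with one data point per line.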

If you have the samples in a file format other than JSONL, OpenAI provides a CLI to convert them to a JSONL file. You can use the command below, provided by the Evals repository, to accomplish that:

openai tools fine_tunes.prepare_data -f data[.csv, .json, .txt, .xlsx or .tsv]

Great, we have our samples in a JSONL file! The next step is to register our eval.

Registering Your Eval

To register the eval, we need to add a file to evals/registry/evals/.yaml. The format of this file is the same format as the example test-match eval above. It needs to contain the eval name, id, optional description, metrics, class, and args that specify where the sample file is. Once we register the eval, we can go ahead and run it just like we ran the test-match eval. That’s all it takes to set up your own evals!
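As a rough sketch of what such a registry entry could contain, the Python below renders one as text. Several things here are assumptions for illustration: the eval name arize-match, the id suffix .dev.v0, the samples path, and the two-part layout in which the short name maps to an id and the id maps to the class and args. Compare the output against the test-match entry in the repository before relying on it.

```python
def render_registry_entry(name, description, samples_path):
    """Render a Match-template registry entry as YAML text (hypothetical helper)."""
    full_id = f"{name}.dev.v0"  # assumed naming scheme, for illustration only
    return (
        f"{name}:\n"
        f"  id: {full_id}\n"
        f"  description: {description}\n"
        f"  metrics: [accuracy]\n"
        f"{full_id}:\n"
        f"  class: evals.elsuite.basic.match:Match\n"
        f"  args:\n"
        f"    samples_jsonl: {samples_path}\n"
    )


entry = render_registry_entry(
    name="arize-match",  # hypothetical eval name
    description="Checks that the completion matches the expected phrase.",
    samples_path="arize_match/samples.jsonl",  # hypothetical path
)
print(entry, end="")
```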

Building Completion Functions

What Is a Completion Function?

In the How To Run An Eval section, we briefly mentioned that you need to specify a completion function to run the oaieval command. First, let’s start with what a completion is. A completion is a model’s output in response to a prompt. For example, if the prompt we give the model is “Tell me what the capital of California is”, we expect the completion to be “Sacramento”. However, some prompts may require access to the internet or other operations that help the model answer the question accurately, and this is where completion functions come into play. Completion functions allow you to define the operations the model may need to perform. The completion function argument in the oaieval command can be a CompletionFn URL, the name of a model in the OpenAI API, or a key in the registry. More information on completion functions can be found here.

How Can I Build My Own Completion Function?

In this section, we’ll go over how to build your own completion function. In order to make your completion function compatible with all evals, it needs to implement a few interfaces. These interfaces essentially just standardize the inputs and outputs for the eval. If you’d like to get more information on these interfaces, check out the docs on the Completion Function Protocol here.
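To give a feel for what those interfaces standardize, here is a self-contained sketch that does not import the evals package. It assumes the protocol boils down to a callable that takes a prompt and returns a result object exposing get_completions(); confirm the exact names and signatures against the Completion Function Protocol docs. UppercaseCompletionFn and EchoResult are toy classes invented for this example.

```python
from typing import Protocol


class CompletionResult(Protocol):
    """Assumed shape of a result: exposes the generated strings."""

    def get_completions(self) -> list[str]:
        ...


class CompletionFn(Protocol):
    """Assumed shape of a completion function: prompt in, result out."""

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        ...


class EchoResult:
    """Toy result object wrapping a single completion string."""

    def __init__(self, text):
        self.text = text

    def get_completions(self):
        return [self.text]


class UppercaseCompletionFn:
    """Toy completion function: upper-cases the prompt instead of calling a model."""

    def __call__(self, prompt, **kwargs):
        # Accept either a plain string or a chat-format list of messages.
        text = prompt if isinstance(prompt, str) else prompt[-1]["content"]
        return EchoResult(text.upper())


completion_fn: CompletionFn = UppercaseCompletionFn()
result = completion_fn([{"role": "user", "content": "Once upon a "}])
print(result.get_completions())
```

Because every eval only relies on this narrow contract, anything that satisfies it (a single model call, a retrieval step followed by a model call, or a chain of prompts) can be swapped in without changing the eval itself.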

Once your completion functions have been implemented, you need to register them similarly to how we registered our eval. Registering the completion function allows it to become available to the oaieval CLI. An example registration taken from the Evals repository is shown below:

  cot/gpt-3.5-turbo:
    class: evals.completion_fns.cot:ChainOfThoughtCompletionFn
    args:
      cot_completion_fn: gpt-3.5-turbo

Let’s break down the above:

  • cot/gpt-3.5-turbo: This is the full name of the completion function that oaieval will use
  • class: This is the class path to the implementation of the completion function
  • args: Arguments passed to your completion function when initialized
    • cot_completion_fn: This is an argument passed to the ChainOfThoughtCompletionFn class
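The class and args fields can be read as "import this class, then call its constructor with these keyword arguments." The sketch below illustrates that idea with the standard library only. It is not the framework’s actual loader, build_from_class_path is a hypothetical helper, and fractions.Fraction stands in for a completion function class so the example runs without the evals package installed.

```python
import importlib


def build_from_class_path(class_path, args):
    """Instantiate 'package.module:ClassName' with keyword args (hypothetical helper)."""
    module_name, class_name = class_path.split(":")
    module = importlib.import_module(module_name)  # import the module part
    cls = getattr(module, class_name)              # look up the class by name
    return cls(**args)                             # args become constructor kwargs


# Stand-in for a registry entry: the class path plus its args mapping.
value = build_from_class_path(
    "fractions:Fraction",
    {"numerator": 3, "denominator": 4},
)
print(value)
```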

What Are the Advantages of Using the Eval Framework?

The Eval Framework provides several benefits to researchers and practitioners.

  • Standardized evaluation metrics and benchmarks: The Eval Framework provides a standardized set of evaluation metrics that researchers can use to compare their models’ performance. This allows researchers to compare their models to other state-of-the-art models on the same benchmarks.
  • Easy to use: The Eval Framework is designed to be easy to use. You can use existing templates to quickly build your own evals and get up and running with only a few lines of code as we’ve shown above.
  • Flexibility: The Eval Framework is flexible and can be used to evaluate models on a wide range of tasks and different benchmarks.
  • Open-source: The Eval Framework is open-source, which means that researchers and practitioners can use and modify it for their specific needs. Additionally, anyone can contribute to the openai/evals GitHub repository, which will help crowdsource even more benchmarks that can be shared across the community.


The Eval Framework is a powerful tool for evaluating AI models. It provides researchers with a standardized set of evaluation metrics and tasks that they can use to compare their models to other state-of-the-art models. The framework is easy to use and flexible, and it supports a wide range of tasks. As LLMs continue to improve, the Eval Framework will be an essential tool for evaluating and comparing their performance. Happy evaluating!