Offline Evals

Offline evals run on datasets to help you test changes before deploying to production. Each evaluation runs in a controlled, repeatable environment, allowing you to measure how new versions behave before exposing them to real users.

Why is Offline Evaluation important?

Offline evals are a cornerstone of evaluation-driven development. They let you track how changes to your prompts, models, or logic affect quality as you build, making it easier to catch regressions early, validate improvements, and ship each new iteration with confidence.

By continuously running offline evals during development, you can:

  • Specify evaluation criteria that align with your expectations or use case

  • Compare versions side by side to see what is improving (or not)

  • Move faster with a structured workflow that is consistent and measurable

[Image: Evaluate experiment]

Getting Started

To get started with offline evaluations, run experiment evals to systematically test changes to your application. Create an experiment by defining a dataset and one or more evaluators. As your system runs against the dataset, the evaluators score its outputs, so you can make changes and measure how each update affects results.
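The workflow above can be sketched in plain Python. This is an illustrative example, not a specific SDK's API: the dataset, `task`, `exact_match` evaluator, and `run_experiment` helper are all hypothetical names standing in for your system under test and your evaluation criteria.

```python
# Minimal sketch of an offline experiment eval loop.
# All names below are illustrative placeholders, not a real SDK's API.

dataset = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "WORLD"},
]

def task(item):
    # Stand-in for the system under test (e.g. a prompt + model call).
    return item["input"].upper()

def exact_match(output, item):
    # Evaluator: scores one output against the expected answer.
    return 1.0 if output == item["expected"] else 0.0

def run_experiment(dataset, task, evaluators):
    # Runs the system on every dataset item and applies each evaluator.
    results = []
    for item in dataset:
        output = task(item)
        scores = {name: fn(output, item) for name, fn in evaluators.items()}
        results.append({"input": item["input"], "output": output, "scores": scores})
    return results

results = run_experiment(dataset, task, {"exact_match": exact_match})
avg = sum(r["scores"]["exact_match"] for r in results) / len(results)
print(f"exact_match avg: {avg:.2f}")
```

Re-running this loop after each change to `task` gives you a repeatable, side-by-side view of whether the change improved or regressed your scores.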
