Offline Evals

Offline evaluations help you understand model or application performance using existing data. Instead of testing in production, you can experiment safely on curated datasets or historical project data in Arize. Each evaluation runs in a controlled, repeatable environment, allowing you to measure how new versions behave before exposing them to real users.

Why is Offline Evaluation important?

Offline evals are a cornerstone of evaluation-driven development. They let you track how changes to your prompts, models, or logic affect quality as you build, making it easier to catch regressions early, validate improvements, and ship each new iteration with confidence.

By continuously running offline evals during development, you can:

  • Specify evaluation criteria that align with your expectations or use case

  • Compare versions side by side to see what is improving (or not)

  • Move faster with a structured workflow that is consistent and measurable


Getting Started

To get started with offline evaluations, run experiment evals to systematically test changes to your application. Create an experiment by defining a dataset and one or more evaluators. Each time you run your application over that dataset, the evaluators score the outputs, so you can make a change and see exactly how it affects results.
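
Conceptually, an experiment is a loop: for each example in the dataset, run your task, score the output with each evaluator, and aggregate the scores. The sketch below illustrates that loop with hypothetical `run_task` and `exact_match` stand-ins rather than the Arize SDK; in practice you would swap in your own application code and the evaluators you configure in Arize.

```python
# Minimal sketch of an offline experiment loop.
# run_task and exact_match are hypothetical placeholders, not Arize APIs.

from dataclasses import dataclass


@dataclass
class EvalResult:
    example_id: str
    output: str
    score: float


def run_task(input_text: str) -> str:
    """Hypothetical task: call your model or application here."""
    return input_text.upper()  # placeholder logic


def exact_match(output: str, expected: str) -> float:
    """Hypothetical evaluator: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(dataset: list[dict]) -> list[EvalResult]:
    """Run the task over every dataset example and score each output."""
    results = []
    for example in dataset:
        output = run_task(example["input"])
        score = exact_match(output, example["expected"])
        results.append(EvalResult(example["id"], output, score))
    return results


if __name__ == "__main__":
    dataset = [
        {"id": "1", "input": "hello", "expected": "HELLO"},
        {"id": "2", "input": "world", "expected": "earth"},
    ]
    results = run_experiment(dataset)
    print(f"Mean score: {sum(r.score for r in results) / len(results):.2f}")
```

Because the dataset and evaluators stay fixed between runs, rerunning the same loop after a prompt or model change gives you a direct, apples-to-apples comparison of the aggregate scores.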
