Evaluations (evals)

Evals, or evaluations, are structured tests for measuring the quality of a system, process, or outcome.

In AI applications, evals help teams measure whether a model, agent, or workflow is behaving as expected. An eval usually includes the thing being evaluated, criteria for judging it, and a scoring method that returns a label, score, or explanation.

Concretely, an eval takes an input, the output produced for that input, and a scoring method, then returns a label or a score, often with an explanation. A hallucination eval, for example, might label each output "Hallucinated" (1) or "Not Hallucinated" (0). The scoring method might be an LLM judge, a deterministic code check, a human review workflow, or a combination of these.
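The input, output, and scoring method described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API: `EvalResult`, `Scorer`, `unsupported_numbers`, and `run_eval` are hypothetical names, and the scorer is a deliberately simple deterministic code check that treats any number in the output that does not appear in the input as a hallucination.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Matches integers and simple decimals such as "30" or "12.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass
class EvalResult:
    label: str        # "hallucinated" or "not_hallucinated"
    score: int        # 1 = hallucinated, 0 = not hallucinated
    explanation: str  # short reason for the label


# Hypothetical scorer type: any callable mapping (input, output) to a result.
# An LLM judge or a human review step could sit behind the same signature.
Scorer = Callable[[str, str], EvalResult]


def unsupported_numbers(input_text: str, output_text: str) -> EvalResult:
    """Deterministic check: flag numbers in the output that the input never mentions."""
    known = set(NUMBER.findall(input_text))
    missing = [n for n in NUMBER.findall(output_text) if n not in known]
    if missing:
        return EvalResult(
            label="hallucinated",
            score=1,
            explanation="numbers not found in the input: " + ", ".join(missing),
        )
    return EvalResult(
        label="not_hallucinated",
        score=0,
        explanation="every number in the output appears in the input",
    )


def run_eval(input_text: str, output_text: str, scorer: Scorer) -> EvalResult:
    """Apply one scoring method to one input/output pair."""
    return scorer(input_text, output_text)
```

Calling `run_eval(source, answer, unsupported_numbers)` returns a label, a score, and an explanation together, which is the shape most eval results take regardless of which scoring method produced them.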

Running a few evals is not the same as having an evaluation practice. A real practice is repeatable, connected to traces and datasets, and wired into the places teams already ship software: development, pull requests, CI/CD, staging, production monitoring, and incident response.

Most importantly, evals are a foundational way to determine whether your agent still works the way you intended. Without evals, agents can regress quietly: a prompt change, model upgrade, tool update, or retrieval change can look harmless and still break the workflow users depend on. Evals can also run within CI/CD (continuous integration and continuous delivery) pipelines as automated checks that catch regressions before a change reaches production.
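One way to use evals as an automated pipeline check is a small gate that runs a fixed set of cases and fails the build when the pass rate drops below a threshold. The sketch below assumes a hypothetical `app` callable standing in for the agent, a case-insensitive exact-match check, and an arbitrary 90% default threshold; none of these come from a specific tool.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical case format: (input, expected output).
Case = Tuple[str, str]


def pass_rate(cases: Sequence[Case], app: Callable[[str], str]) -> float:
    """Fraction of cases where the app's output matches the expected output."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passed = sum(
        1
        for question, expected in cases
        if app(question).strip().lower() == expected.strip().lower()
    )
    return passed / len(cases)


def ci_gate(
    cases: Sequence[Case],
    app: Callable[[str], str],
    threshold: float = 0.9,  # assumed value; tune per workflow
) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    rate = pass_rate(cases, app)
    print(f"eval pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


def stub_app(question: str) -> str:
    """Stand-in for the real agent, used only to make the sketch runnable."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


CASES = [
    ("2+2", "4"),
    ("capital of France", "paris"),
    ("capital of Peru", "Lima"),
]
```

In a pipeline, a script could end with `sys.exit(ci_gate(CASES, stub_app))` so that a nonzero exit code blocks the merge or deploy. Rerunning the same cases after a prompt, model, tool, or retrieval change is what makes a quiet regression visible.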


What Are Evaluations (Evals)?