An evaluation harness is the operational system that turns evaluations (evals) into repeatable workflows and follow-up actions.
A harness is more than a single evaluator or judge prompt. It defines what gets evaluated, how scoring happens, where results are stored, and what happens after those results are produced.
An evaluation harness has three main parts:
- **Evaluation inputs** define what gets evaluated. Inputs can include traces, spans, agent trajectories, sessions, production examples, or offline datasets.
- **Evaluation execution** defines how scoring happens. Execution can use LLM judges, deterministic checks, embedding similarity, custom scorers, human review, or agent-based evaluation.
- **Evaluation actions** define what happens next. Actions can trigger alerts, route examples to annotation queues, block deployments, open tickets, start experiments, or feed failures into improvement workflows.
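The three parts can be sketched in a few lines of Python. This is a minimal illustration, not a reference to any particular framework: the names `Example`, `Result`, `run_harness`, `non_empty`, and `route_failures` are all hypothetical, the inputs are a tiny in-memory offline dataset, the only scorer is a deterministic check, and the only action routes failing examples to an in-memory list standing in for an annotation queue. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One evaluation input (hypothetical shape): a prompt and the system's output."""
    example_id: str
    input: str
    output: str


@dataclass
class Result:
    """One stored score for one example from one scorer."""
    example_id: str
    scorer: str
    score: float


Scorer = Callable[[Example], float]
Action = Callable[[list[Result]], None]


def non_empty(example: Example) -> float:
    """Deterministic check: 1.0 if the output contains non-whitespace text."""
    return 1.0 if example.output.strip() else 0.0


def run_harness(
    examples: list[Example],
    scorers: dict[str, Scorer],
    actions: list[Action],
) -> list[Result]:
    """Score every example with every scorer, then hand the results to each action."""
    # Execution: apply each scorer to each input.
    results = [
        Result(ex.example_id, name, scorer(ex))
        for ex in examples
        for name, scorer in scorers.items()
    ]
    # Actions: each action decides what to do with the stored results.
    for action in actions:
        action(results)
    return results


# Stand-in for an annotation queue (assumption: a real harness would call a queue or API).
review_queue: list[str] = []


def route_failures(results: list[Result]) -> None:
    """Action: send every example that failed a check to human review."""
    for result in results:
        if result.score < 1.0:
            review_queue.append(result.example_id)


# Inputs: a small offline dataset.
examples = [
    Example("ex-1", "What is 2 + 2?", "4"),
    Example("ex-2", "Name a prime number.", "   "),
]

results = run_harness(examples, {"non_empty": non_empty}, [route_failures])
print(review_queue)
```

Keeping scorers and actions as plain callables means an LLM judge or an alerting hook could be swapped in without changing `run_harness`; only the functions passed to it would differ.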
A good evaluation harness makes evals repeatable, comparable, and connected to the way teams actually build and operate AI systems. It helps teams detect failures, compare changes, prevent regressions, and improve behavior over time.
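As one illustration of comparing changes and preventing regressions, the sketch below compares mean scores from a baseline run and a candidate run over the same examples. The function name `has_regressed` and the `tolerance` value are assumptions made for this example; a real harness might use per-example comparisons or a statistical test instead.

```python
from statistics import mean


def has_regressed(
    baseline_scores: list[float],
    candidate_scores: list[float],
    tolerance: float = 0.02,  # hypothetical allowance for run-to-run noise
) -> bool:
    """Return True if the candidate's mean score falls below the baseline's by more than the tolerance."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance


# Scores from the same four examples under two versions of a system (made-up values).
baseline = [1.0, 1.0, 0.0, 1.0]
candidate = [1.0, 0.0, 0.0, 1.0]

if has_regressed(baseline, candidate):
    print("Regression detected: block the deployment or open a ticket.")
```

A check like this only works when both runs use the same inputs and the same scorers, which is the kind of repeatability the harness is meant to provide.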