Your evals are only as good as your criteria
Automated evals are only as good as what they measure. Before writing criteria, review real interactions and understand how a human would judge them - then build evals that reflect that standard.
Start from human labels
Use Human review to define annotation configs, review traces or dataset rows, and build a labeled set that reflects your rubric. These labels become the ground truth you compare eval scores against.
Measure agreement
On a fixed sample of examples (typically 50 to a few hundred, covering edge cases):
- Annotate the dataset
- Configure your evals with the same choices as your annotation config
- Navigate to the Prompt Playground and select your dataset and evaluator
- Set up a second eval to compare results, either the Exact Match code eval or the Human vs AI eval. You will be comparing the experiment output to the ground truth annotation column in your dataset
- Run the experiment
- Refine the evaluator prompt where it disagrees with the human labels, then re-run
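The core of this loop is comparing the evaluator's output to the ground truth column, row by row. As a minimal sketch (the row data and field names here are hypothetical, not a product API), agreement rate plus a confusion count tells you not just how often the eval disagrees with humans, but in which direction:

```python
from collections import Counter

# Hypothetical calibration sample: each row pairs the human ground-truth
# label with the LLM evaluator's score for the same example.
rows = [
    {"human": "pass", "eval": "pass"},
    {"human": "pass", "eval": "pass"},
    {"human": "fail", "eval": "pass"},  # evaluator too lenient here
    {"human": "fail", "eval": "fail"},
    {"human": "pass", "eval": "fail"},  # evaluator too strict here
]

def agreement_rate(rows):
    """Fraction of rows where the evaluator matches the human label."""
    return sum(r["human"] == r["eval"] for r in rows) / len(rows)

def confusion(rows):
    """Counts of (human, eval) label pairs - shows *where* they disagree."""
    return Counter((r["human"], r["eval"]) for r in rows)

print(agreement_rate(rows))  # 0.6 on this toy sample
print(confusion(rows))
```

The confusion counts are what make prompt refinement actionable: a pile of ("fail", "pass") rows means the evaluator is too lenient, so that is the behavior to target in the next prompt revision.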
Common issues
- High human disagreement: if annotators disagree with each other, evals cannot align to a single standard until the rubric is clarified
- Small calibration sets: a handful of rows can miss long-tail failures. Aim for at least 50 to 100 labeled examples before trusting metrics or changing production monitors
- Criteria mismatch: your evals may be scoring a different dimension than your annotations (e.g. fluency vs factual accuracy)
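High human disagreement is worth quantifying before you blame the evaluator. Raw agreement overstates consistency when one label dominates, so a chance-corrected statistic such as Cohen's kappa is a common check. A self-contained sketch (the example labels are invented):

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random according
    # to their own marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:  # every label identical for both annotators
        return 1.0
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # raw agreement is 0.67, kappa only 0.33
```

A low kappa between annotators means the rubric, not the evaluator prompt, is the thing to fix first.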