
Your evals are only as good as your criteria

Automated evals are only as good as what they measure. Before writing criteria, review real interactions and understand how a human would judge them, then build evals that reflect that standard.
*Screenshot: Arize AX Playgrounds view of an align-eval task with a prompt editor (GPT-3.5 classifying clarity and tone) and an experiment table comparing human annotation labels to Human vs AI alignment badges, aligned or not aligned per row, with an average agreement score.*

Start from human labels

Use Human review to define annotation configs and label traces or dataset rows according to your rubric. These human labels become the ground truth you compare eval scores against.
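Conceptually, a ground-truth set is just rows pairing model output with a human label. A minimal Python sketch of that shape (the field names, including `annotation.correctness.label`, follow the example later on this page but are otherwise illustrative, not a fixed schema):

```python
# Hypothetical labeled dataset rows; field names are illustrative,
# not the exact Arize AX export schema.
ground_truth = [
    {
        "input": "How do I reset my password?",
        "output": "Go to Settings > Security and click Reset.",
        "annotation.correctness.label": "correct",
    },
    {
        "input": "What is your refund policy?",
        "output": "We only sell hats.",
        "annotation.correctness.label": "incorrect",
    },
]

# The label choices here must match the choices in your eval config.
labels = sorted({row["annotation.correctness.label"] for row in ground_truth})
print(labels)  # ['correct', 'incorrect']
```

Keeping the annotation label set identical to the evaluator's label set is what makes the comparison in the next section a straightforward match.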

Measure agreement

On a fixed sample of examples (typically 50 to a few hundred, covering edge cases):
  1. Annotate the dataset
  2. Configure your eval with the same label choices as your annotation config
  3. Navigate to the Prompt Playground and select your dataset and evaluator
  4. Set up a second eval to compare results: either the Exact Match code eval or the Human vs AI eval. This compares the experiment output to the ground truth annotation column in your dataset
  5. Run the experiment
  6. Refine the evaluator prompt and repeat until agreement is acceptable
Use a second LLM-as-judge or exact match code eval to compare the output your primary eval produces with the ground truth column from your dataset.
*Screenshot: Create-eval UI for a Human vs AI alignment eval with span scope, a Claude judge prompt comparing expert ground truth to model output, choice labels correct and incorrect, and optional test mapping to a dataset.*
You can also compute alignment using an exact match code eval.
*Screenshot: Edit-eval view for an exact_match code eval showing the ExactMatch eval class, imports, signature, and Configure Task Mappings with a dataset preview and output column mapping.*
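Under the hood, an exact match check reduces to a per-row string comparison plus an aggregate rate. A minimal sketch of that idea, assuming simple string labels (illustrative only, not the ExactMatch class shipped in the product):

```python
def exact_match(eval_label: str, human_label: str) -> str:
    """Return 'aligned' if the eval's label matches the human label."""
    if eval_label.strip().lower() == human_label.strip().lower():
        return "aligned"
    return "not aligned"

# (eval output, human annotation) pairs -- toy data
rows = [("correct", "correct"), ("incorrect", "correct"), ("correct", "correct")]
results = [exact_match(e, h) for e, h in rows]

# Aggregate agreement score: fraction of rows where eval and human agree
agreement = results.count("aligned") / len(results)
print(round(agreement, 2))  # 0.67
```

The aggregate number corresponds to the average agreement score shown in the experiment table.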
*Screenshot: Playground align-eval session with a prompt editor and model selector, the Evaluator menu open with a From Evaluator Hub list for loading a saved evaluator, and a dataset table with input variables and annotation labels.*
*Screenshot: Playground experiment table comparing annotation labels to Human vs AI alignment tags, aligned or not aligned per row, with an aggregate agreement score.*
You can also ask Alyx to compare your eval results against your human annotations and flag where they differ. For example: “I want to align my evals. Compare the experiment output against the ground truth column in my dataset (annotation.correctness.label) and return aligned or not aligned for each row.”

Common issues

  • High human disagreement: if annotators disagree with each other, evals cannot align to a single standard until the rubric is clarified
  • Small calibration sets: a handful of rows can miss long-tail failures. Aim for at least 50 to 100 labeled examples before trusting metrics or changing production monitors
  • Criteria mismatch: your evals may be scoring a different dimension than your annotations (e.g. fluency vs factual accuracy)
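One way to quantify human disagreement before blaming the eval is Cohen's kappa between two annotators: a chance-corrected score near zero means the rubric needs clarification before alignment work can proceed. A from-scratch sketch (not an Arize API):

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled at random
    # with their own observed label frequencies
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators disagree on one of four rows
annotator_a = ["correct", "correct", "incorrect", "correct"]
annotator_b = ["correct", "incorrect", "incorrect", "correct"]
print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

As a rough rule of thumb, kappa below ~0.6 on your calibration set suggests clarifying the rubric before tuning the evaluator prompt.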

Troubleshooting

If agreement is low but humans are consistent, iterate the judge prompt and confirm your variable mappings match the fields humans reviewed. If scores look good on average but fail on a specific slice, stratify your sample by product area, language, or tool-use path and recheck alignment per slice.
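Stratified agreement is just the alignment rate computed per group. A sketch with hypothetical slice names, assuming each row carries a slice key plus the eval's label and the human label:

```python
from collections import defaultdict

# Toy rows: slice key, eval output, human annotation (names are hypothetical)
rows = [
    {"slice": "billing", "eval": "correct",   "human": "correct"},
    {"slice": "billing", "eval": "correct",   "human": "incorrect"},
    {"slice": "search",  "eval": "correct",   "human": "correct"},
    {"slice": "search",  "eval": "incorrect", "human": "incorrect"},
]

per_slice = defaultdict(lambda: [0, 0])  # slice -> [aligned, total]
for r in rows:
    per_slice[r["slice"]][1] += 1
    per_slice[r["slice"]][0] += r["eval"] == r["human"]

for s, (aligned, total) in sorted(per_slice.items()):
    print(s, aligned / total)
# billing 0.5
# search 1.0
```

A slice whose rate falls well below the overall average is where to focus prompt iteration or collect more labels.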

Further reading