The three types
| Type | How it works | Where it shines | Where it doesn’t |
|---|---|---|---|
| Human | A reviewer scores outputs against a rubric. | Capturing nuance and subject-matter expertise; defining golden datasets that calibrate other evaluators. | Slow, expensive, inconsistent across reviewers, doesn’t scale. |
| Code-based | Deterministic Python code (or a built-in template) checks the output. | Anything that can be programmatically checked: string contains, JSON shape, regex match, deterministic comparison. | Anything subjective, such as tone, helpfulness, or factuality. |
| LLM-as-a-judge | An LLM reads the data and emits a label and explanation. | Subjective qualities at scale, with minimal up-front data; flexible across domains. | Non-deterministic; requires prompt design; vulnerable to known biases. |
Human evaluation
Human evaluation is the calibration target for everything else. When you want to know whether your other evaluators are correct, you compare them to human-labeled data.
What humans are good at:
- Nuance and context. A human reviewer with domain knowledge can spot subtle factual errors, tone problems, or context-specific failures that pattern-matching can’t.
- Defining ground truth. Golden datasets (small, carefully labeled collections of input/output pairs) are how every other evaluator gets validated. They almost always start with human labeling.
- Edge cases. Rare failure modes that no automated system was designed to catch.
What humans are bad at:
- Scale. A human reviewer can read maybe a few hundred traces a day; a production application generates orders of magnitude more.
- Speed. Human turnaround is hours-to-days; automated turnaround is seconds-to-minutes.
- Consistency. Two reviewers, given the same rubric, will disagree on a non-trivial fraction of cases. One reviewer, given the same trace twice on different days, will sometimes disagree with themselves.
- Bandwidth. Humans cannot evaluate against four million tokens of context. Pattern-matching tasks like “find spans that match X criteria across the last month of data” are not human tasks.
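Comparing an evaluator to human-labeled data, and measuring how much two reviewers disagree, both reduce to the same agreement arithmetic. The following is a minimal sketch of that calculation, not a prescribed method: the function names and the six-example dataset are hypothetical, and Cohen's kappa is offered as one common chance-corrected agreement measure rather than anything this page mandates.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which the two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("need at least one labeled example")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Probability that both raters pick the same label if they labeled independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical golden dataset: human labels vs. an automated evaluator's labels.
human = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
judge = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]

print(round(agreement_rate(human, judge), 3))  # 0.667
print(round(cohens_kappa(human, judge), 3))    # 0.333
```

The same two functions work for reviewer-versus-reviewer comparisons: pass two humans' labels for the same traces to see how consistent the rubric is in practice.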
Code-based evaluators
Code-based evaluators run pure Python (or a built-in template) against the output. They are fast, cheap, deterministic, and repeatable, which makes them the right choice whenever the question can be expressed as a check.
What code is good at:
- Deterministic criteria. “Does the output contain a competitor’s name?” “Is the JSON shape correct?” “Did the function call use the right parameter names?” Anything where a correct/incorrect answer is fully determined by the data.
- Reliability. A code evaluator gives the same answer every time. No retry loops, no rate limits, no model drift.
- Speed and cost. Running a code evaluator is essentially free compared to an LLM call.
What code is bad at:
- Anything subjective. Tone, factuality, helpfulness, coherence: these can’t be checked by pattern-matching.
- Anything that needs reasoning. “Is the response consistent with the retrieved context?” requires understanding both pieces, which code can’t do.
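The deterministic checks named in this section (string contains, JSON shape, regex match) can be sketched in plain Python using only the standard library. This is an illustrative sketch, not a built-in template: the competitor names, the required JSON keys, and the `ORD-` order ID format are all hypothetical.

```python
import json
import re

COMPETITOR_NAMES = ("Acme", "Globex")  # hypothetical names for illustration
ORDER_ID = re.compile(r"\bORD-\d{6}\b")  # hypothetical ID format


def mentions_competitor(output: str) -> bool:
    """String-contains check: does the output name a competitor (case-insensitive)?"""
    lowered = output.lower()
    return any(name.lower() in lowered for name in COMPETITOR_NAMES)


def has_json_shape(output: str, required: dict) -> bool:
    """JSON-shape check: output parses to an object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )


def contains_order_id(output: str) -> bool:
    """Regex check: does the output contain a well-formed order ID?"""
    return ORDER_ID.search(output) is not None


print(mentions_competitor("You could also try Acme."))
print(has_json_shape('{"label": "ok", "score": 1}', {"label": str, "score": int}))
print(contains_order_id("Your order ORD-123456 has shipped."))
```

Each function returns the same boolean for the same input on every run, which is the property that makes this evaluator type reliable and nearly free to run.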
LLM-as-a-judge
LLM-as-a-judge uses a pre-trained LLM to read the data and emit a label, score, and explanation.
What LLM-as-a-judge is good at:
- Subjective evaluation at scale. Tone, helpfulness, factuality, coherence: questions that need reasoning.
- Flexibility. A new evaluator is a new prompt, not a new model.
- No training data required up front. You’ll want a small golden dataset to validate the judge, but you don’t need thousands of examples to get started.
What LLM-as-a-judge is bad at:
- Determinism. The same judge, run twice on the same input, can disagree with itself.
- Cost-free operation. Every eval is an LLM call. At production volume, this matters — see Filters, scope, and cadence for the sampling levers.
- Resistance to bias. Verbosity bias, position bias, self-enhancement bias — see Evaluator best practices for the catalog and how to guard against them.
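The shape of a judge (a prompt in, a label and explanation out) can be sketched without committing to any model provider. In the sketch below, `call_llm` is a hypothetical callable that takes a prompt string and returns the model's raw text; the prompt wording, the label set, and the `self_consistency` helper are likewise assumptions made for illustration, not an API from any particular library.

```python
import json
from collections import Counter
from typing import Callable

# Hypothetical prompt template; doubled braces are literal braces in str.format.
JUDGE_PROMPT = """You are evaluating a response for helpfulness.
Question: {question}
Response: {response}
Reply with JSON only: {{"label": "helpful" or "unhelpful", "explanation": "<one sentence>"}}"""

ALLOWED_LABELS = {"helpful", "unhelpful"}


def judge(question: str, response: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the model for a label and explanation; return label=None if unusable."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": None, "explanation": f"unparseable judge output: {raw!r}"}
    label = parsed.get("label") if isinstance(parsed, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return {"label": None, "explanation": f"unexpected judge output: {raw!r}"}
    return {"label": label, "explanation": str(parsed.get("explanation", ""))}


def self_consistency(question: str, response: str,
                     call_llm: Callable[[str], str], runs: int = 5) -> float:
    """Fraction of repeated runs that agree with the most common label."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    labels = [judge(question, response, call_llm)["label"] for _ in range(runs)]
    return Counter(labels).most_common(1)[0][1] / runs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"label": "helpful", "explanation": "Answers the question directly."}'


print(judge("How do I reset my password?", "Click 'Forgot password'.", fake_llm))
print(self_consistency("How do I reset my password?", "Click 'Forgot password'.", fake_llm))
```

Validating the output against a fixed label set keeps a malformed reply from silently becoming a score, and running the same input several times gives a rough measure of how often the judge disagrees with itself.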
Choosing between types
A decision tree:
- Is the question deterministic? (Yes = code-based. Done.)
- Do you need to evaluate at production scale? (No = human is fine for small samples.)
- Default for everything else = LLM-as-a-judge.
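The decision tree above can be written as a small function. This is only a sketch of the three branches as stated; the function name, parameter names, and return strings are hypothetical.

```python
def choose_evaluator(is_deterministic: bool, needs_production_scale: bool) -> str:
    """Walk the decision tree in order and return the evaluator type."""
    if is_deterministic:
        return "code-based"      # a programmatic check settles the question
    if not needs_production_scale:
        return "human"           # small samples are fine for human review
    return "llm-as-a-judge"      # default for everything else


print(choose_evaluator(is_deterministic=True, needs_production_scale=True))
print(choose_evaluator(is_deterministic=False, needs_production_scale=False))
print(choose_evaluator(is_deterministic=False, needs_production_scale=True))
```

The order of the checks matters: a deterministic question goes to code even at small scale, because the check is cheaper and more consistent than a reviewer.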