The level of an evaluator tells you what slice of data it looks at. The type tells you how it makes its judgment. Three kinds of evaluators exist, and they aren’t interchangeable — each is suited to a different class of question.

The three types

| Type | How it works | Where it shines | Where it doesn’t |
| --- | --- | --- | --- |
| Human | A reviewer scores outputs against a rubric. | Capturing nuance and subject-matter expertise; defining golden datasets that calibrate other evaluators. | Slow, expensive, inconsistent across reviewers, doesn’t scale. |
| Code-based | Deterministic Python (or built-in templates) check the output. | Anything that can be programmatically checked: string contains, JSON shape, regex match, deterministic comparison. | Anything subjective: tone, helpfulness, factuality. |
| LLM-as-a-judge | An LLM reads the data and emits a label and explanation. | Subjective qualities at scale, with minimal up-front data; flexible across domains. | Non-deterministic; requires prompt design; vulnerable to known biases. |

Human evaluation

Human evaluation is the calibration target for everything else. When you want to know whether your other evaluators are correct, you compare them to human-labeled data.
What humans are good at:
  • Nuance and context. A human reviewer with domain knowledge can spot subtle factual errors, tone problems, or context-specific failures that pattern-matching can’t.
  • Defining ground truth. Golden datasets — small, carefully-labeled collections of input/output pairs — are how every other evaluator gets validated. They almost always start with human labeling.
  • Edge cases. Rare failure modes that no automated system was designed to catch.
What humans aren’t good at:
  • Scale. A human reviewer can read maybe a few hundred traces a day; a production application generates orders of magnitude more.
  • Speed. Human turnaround is hours-to-days; automated turnaround is seconds-to-minutes.
  • Consistency. Two reviewers, given the same rubric, will disagree on a non-trivial fraction of cases. One reviewer, given the same trace twice on different days, will sometimes disagree with themselves.
  • Bandwidth. A human cannot review four million tokens of context in one pass. Bulk scanning tasks like “find spans that match X criteria across the last month of data” are not human tasks.
Use humans to define and validate golden datasets, and to spot-check production. Don’t use humans as a continuous evaluator for any meaningful volume.
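Comparing an automated evaluator against human labels can be sketched as follows. This is a minimal illustration, not a prescribed method: it computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) between two label sequences. The same functions could be applied to two human reviewers to quantify how often they disagree. The label values and sample data are hypothetical.

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label sequences agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical golden dataset: human labels vs. an automated evaluator's labels.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
evaluator_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print("agreement:", agreement_rate(human_labels, evaluator_labels))
print("kappa:", cohens_kappa(human_labels, evaluator_labels))
```

Kappa is worth reporting alongside raw agreement because two raters who both say "pass" most of the time will agree often by chance alone.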

Code-based evaluators

Code-based evaluators run pure Python (or a built-in template) against the output. They are fast, cheap, deterministic, and repeatable, which makes them the right choice whenever the question can be expressed as a check.
What code is good at:
  • Deterministic criteria. “Does the output contain a competitor’s name?” “Is the JSON shape correct?” “Did the function call use the right parameter names?” Anything where a correct/incorrect answer is fully determined by the data.
  • Reliability. A code evaluator gives the same answer every time. No retry loops, no rate limits, no model drift.
  • Speed and cost. Running a code evaluator is essentially free compared to an LLM call.
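The deterministic criteria above might look like this in plain Python. This is a sketch using only the standard library; the function names, the competitor names, and the expected JSON keys are all hypothetical and not part of any built-in template.

```python
import json
import re

def contains_competitor(output, competitors):
    """True if the output mentions any listed name as a whole word (case-insensitive)."""
    return any(
        re.search(rf"\b{re.escape(name)}\b", output, flags=re.IGNORECASE)
        for name in competitors
    )

def has_json_shape(output, required):
    """True if output parses as a JSON object whose required keys have the expected types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), expected) for key, expected in required.items())

def uses_expected_params(call_args, expected_params):
    """True if a function call's argument names exactly match the expected names."""
    return set(call_args) == set(expected_params)

# Hypothetical inputs for illustration.
print(contains_competitor("Have you tried Acme instead?", ["Acme", "Globex"]))
print(has_json_shape('{"answer": "42", "sources": []}', {"answer": str, "sources": list}))
print(uses_expected_params({"city": "Paris", "units": "metric"}, ["city", "units"]))
```

Each function returns the same result for the same input, which is what makes these checks safe to run on every output.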
What code isn’t good at:
  • Anything subjective. Tone, factuality, helpfulness, coherence — these can’t be checked by pattern-matching.
  • Anything that needs reasoning. “Is the response consistent with the retrieved context?” requires understanding both pieces, which code can’t do.
A common pattern: combine code evaluators (for the deterministic checks) with LLM-as-a-judge evaluators (for the subjective ones) and use the code ones as cheap pre-filters before paying for the LLM ones.
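A minimal sketch of that pre-filter pattern, assuming the judge is any callable that takes the output and returns a label. The `evaluate` function, the result keys, and the stub judge are hypothetical; a real judge would make an LLM call where the stub returns a fixed label.

```python
import json
from typing import Callable, Sequence

def is_json_object(output: str) -> bool:
    """Cheap deterministic check: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False

def is_non_empty(output: str) -> bool:
    """Cheap deterministic check: the output must contain non-whitespace text."""
    return bool(output.strip())

def evaluate(output: str,
             cheap_checks: Sequence[Callable[[str], bool]],
             judge: Callable[[str], str]) -> dict:
    """Run code checks first; call the (expensive) judge only if all of them pass."""
    for check in cheap_checks:
        if not check(output):
            name = getattr(check, "__name__", "check")
            return {"label": "fail", "source": f"code:{name}", "judge_called": False}
    return {"label": judge(output), "source": "llm_judge", "judge_called": True}

# Stand-in for an LLM judge so the example runs without any API access.
def stub_judge(output: str) -> str:
    return "pass"

checks = [is_non_empty, is_json_object]
print(evaluate("", checks, stub_judge))                 # rejected before the judge runs
print(evaluate('{"answer": "42"}', checks, stub_judge))  # reaches the judge
```

Ordering the checks from cheapest to most expensive keeps the common failure cases from ever incurring an LLM call.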

LLM-as-a-judge

LLM-as-a-judge uses a pre-trained LLM to read the data and emit a label, score, and explanation.
What LLM-as-a-judge is good at:
  • Subjective evaluation at scale. Tone, helpfulness, factuality, coherence — questions that need reasoning.
  • Flexibility. A new evaluator is a new prompt, not a new model.
  • No training data required up front. You’ll want a small golden dataset to validate the judge, but you don’t need thousands of examples to get started.
What LLM-as-a-judge isn’t good at:
  • Determinism. The same judge, run twice on the same input, can disagree with itself.
  • Cost-free operation. Every eval is an LLM call. At production volume, this matters — see Filters, scope, and cadence for the sampling levers.
  • Resistance to bias. Verbosity bias, position bias, self-enhancement bias — see Evaluator best practices for the catalog and how to guard against them.
For the canonical LLM-as-a-judge survey paper, see Gu et al., 2024 — A Survey on LLM-as-a-Judge.
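The shape of a judge, and one way to dampen its non-determinism, can be sketched as below. Everything here is an assumption made for illustration: the prompt wording, the label set, the `call_llm` parameter (any function that takes a prompt string and returns the model's text), and the majority-vote strategy. It is not the API of any particular product or library.

```python
import json
from collections import Counter
from typing import Callable

# Hypothetical prompt template; a real rubric would be more detailed.
JUDGE_PROMPT = """You are evaluating a response for helpfulness.
Question: {question}
Response: {response}
Reply with JSON only: {{"label": "helpful" or "unhelpful", "explanation": "<one sentence>"}}"""

ALLOWED_LABELS = {"helpful", "unhelpful"}

def judge_once(question: str, response: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the model for a label and explanation; mark malformed replies as unparseable."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": "unparseable", "explanation": raw}
    label = parsed.get("label") if isinstance(parsed, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return {"label": "unparseable", "explanation": raw}
    return {"label": label, "explanation": str(parsed.get("explanation", ""))}

def judge_with_votes(question: str, response: str,
                     call_llm: Callable[[str], str], runs: int = 3) -> dict:
    """Run the judge several times and return the majority label with its vote share."""
    results = [judge_once(question, response, call_llm) for _ in range(runs)]
    counts = Counter(result["label"] for result in results)
    label, votes = counts.most_common(1)[0]
    explanation = next(r["explanation"] for r in results if r["label"] == label)
    return {"label": label, "agreement": votes / runs, "explanation": explanation}

# Stand-in for a real model call so the example runs offline.
def fake_llm(prompt: str) -> str:
    return '{"label": "helpful", "explanation": "The response answers the question directly."}'

print(judge_with_votes("How do I reset my password?",
                       "Click 'Forgot password' on the login page.", fake_llm))
```

Repeated runs multiply the cost of each evaluation, so a vote like this is a trade-off rather than a default; a low `agreement` value is itself a useful signal that a case deserves human review.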

Choosing between types

A decision tree:
  1. Is the question deterministic? If yes, use a code-based evaluator and stop.
  2. If not, do you need to evaluate at production scale? If no, human review is fine for small samples.
  3. Otherwise, default to LLM-as-a-judge.
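The decision tree translates directly into a small function. The enum and parameter names below are hypothetical and exist only to make the branching explicit.

```python
from enum import Enum

class EvaluatorType(Enum):
    CODE = "code-based"
    HUMAN = "human"
    LLM_JUDGE = "llm-as-a-judge"

def choose_evaluator(is_deterministic: bool, needs_production_scale: bool) -> EvaluatorType:
    """Apply the three-step decision tree in order."""
    if is_deterministic:
        return EvaluatorType.CODE       # step 1: a check can answer the question
    if not needs_production_scale:
        return EvaluatorType.HUMAN      # step 2: small samples, human review is fine
    return EvaluatorType.LLM_JUDGE      # step 3: the default for everything else

print(choose_evaluator(is_deterministic=True, needs_production_scale=True).value)
print(choose_evaluator(is_deterministic=False, needs_production_scale=False).value)
print(choose_evaluator(is_deterministic=False, needs_production_scale=True).value)
```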
Most production setups end up running multiple types in parallel — code evaluators for the cheap deterministic checks, LLM-as-a-judge evaluators for the subjective questions, periodic human review to validate the judges. The three types complement each other; they don’t compete.

Next step

Whichever type you pick, every evaluator has the same shape — inputs, outputs, and metadata. The next page covers that anatomy:

Next: Anatomy of an Evaluator