# Evaluator Types — Human, Code, and LLM-as-a-Judge

> Three kinds of evaluators — human, code-based, and LLM-as-a-judge. What each is good at, what each isn't, and how to choose between them.

The level of an evaluator tells you what slice of data it looks at. The **type** tells you how it makes its judgment. Three kinds of evaluators exist, and they aren't interchangeable — each is suited to a different class of question.

# The three types

| Type               | How it works                                                   | Where it shines                                                                                                     | Where it doesn't                                                       |
| :----------------- | :------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------- |
| **Human**          | A reviewer scores outputs against a rubric.                    | Capturing nuance and subject-matter expertise; defining golden datasets that calibrate other evaluators.            | Slow, expensive, inconsistent across reviewers, doesn't scale.         |
| **Code-based**     | Deterministic Python (or a built-in template) checks the output. | Anything that can be checked programmatically: substring match, JSON shape, regex match, deterministic comparison. | Anything subjective: tone, helpfulness, factuality.                    |
| **LLM-as-a-judge** | An LLM reads the data and emits a label and explanation.       | Subjective qualities at scale, with minimal up-front data; flexible across domains.                                 | Non-deterministic; requires prompt design; vulnerable to known biases. |

# Human evaluation

Human evaluation is the calibration target for everything else. When you want to know whether your other evaluators are *correct*, you compare them to human-labeled data.
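
As a rough illustration of that comparison, the sketch below computes how often an automated evaluator's labels match human labels on a small golden dataset. The function name `agreement_with_humans` and the label values are hypothetical, chosen only for this example, and plain agreement is just one of several metrics you could use.

```python
def agreement_with_humans(evaluator_labels, human_labels):
    """Return the fraction of examples where the evaluator matches the human label."""
    if len(evaluator_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical golden dataset: human labels are treated as ground truth.
human = ["correct", "incorrect", "correct", "correct"]
judge = ["correct", "incorrect", "incorrect", "correct"]

score = agreement_with_humans(judge, human)
print(score)  # 0.75
```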

What humans are good at:

* **Nuance and context.** A human reviewer with domain knowledge can spot subtle factual errors, tone problems, or context-specific failures that pattern-matching can't.
* **Defining ground truth.** Golden datasets (small, carefully labeled collections of input/output pairs) are how every other evaluator gets validated. They almost always start with human labeling.
* **Edge cases.** Rare failure modes that no automated system was designed to catch.

What humans aren't good at:

* **Scale.** A human reviewer can read perhaps a few hundred traces a day; a production application generates orders of magnitude more.
* **Speed.** Human turnaround takes hours to days; automated turnaround takes seconds to minutes.
* **Consistency.** Two reviewers, given the same rubric, will disagree on a non-trivial fraction of cases. One reviewer, given the same trace twice on different days, will sometimes disagree with themselves.
* **Bandwidth.** Humans cannot evaluate against four million tokens of context. Pattern-matching tasks like "find spans that match X criteria across the last month of data" are not human tasks.
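
The consistency point above can be made measurable. One common statistic for inter-reviewer agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This page does not prescribe a particular statistic, so treat the following as an illustrative sketch; the reviewer labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items where both reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two reviewers applying the same rubric.
reviewer_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on four of six items, but half of that agreement is expected by chance, so kappa comes out at roughly 0.33.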

Use humans to **define and validate** golden datasets, and to spot-check production. Don't use humans as a continuous evaluator for any meaningful volume.

# Code-based evaluators

Code-based evaluators run pure Python (or a built-in template) against the output. They are fast, cheap, deterministic, and repeatable — which makes them the right choice whenever the question can be expressed as a check.

What code is good at:

* **Deterministic criteria.** "Does the output contain a competitor's name?" "Is the JSON shape correct?" "Did the function call use the right parameter names?" Anything where a correct/incorrect answer is fully determined by the data.
* **Reliability.** A code evaluator gives the same answer every time. No retry loops, no rate limits, no model drift.
* **Speed and cost.** Running a code evaluator is essentially free compared to an LLM call.
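
Two of the deterministic checks named above might be sketched as follows. This is plain standard-library Python, not the product's built-in templates; the competitor names and the required JSON keys are hypothetical placeholders.

```python
import json
import re

# Hypothetical competitor list, for illustration only.
COMPETITOR_NAMES = ["Acme", "Globex"]
_COMPETITOR_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COMPETITOR_NAMES) + r")\b",
    flags=re.IGNORECASE,
)


def mentions_competitor(output: str) -> bool:
    """True if the output contains any competitor name as a whole word."""
    return _COMPETITOR_PATTERN.search(output) is not None


def has_json_shape(output: str, required: dict) -> bool:
    """True if the output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )


print(mentions_competitor("We beat acme on price"))  # True
print(has_json_shape('{"label": "ok", "score": 1}', {"label": str, "score": int}))  # True
```

Both functions return the same answer for the same input every time, which is the property that makes code evaluators cheap to trust.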

What code isn't good at:

* **Anything subjective.** Tone, factuality, helpfulness, coherence — these can't be checked by pattern-matching.
* **Anything that needs reasoning.** "Is the response consistent with the retrieved context?" requires understanding both pieces, which code can't do.

A common pattern is to combine the two: run code evaluators for the deterministic checks and LLM-as-a-judge evaluators for the subjective ones, using the code evaluators as cheap pre-filters before paying for the LLM calls.
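
A minimal sketch of that pre-filter pattern, assuming the judge is passed in as an ordinary callable. The names `evaluate_with_prefilter`, `is_non_empty`, and `stub_judge` are hypothetical, and the stub stands in for a real LLM call so the example runs offline.

```python
from typing import Callable


def evaluate_with_prefilter(
    output: str,
    code_check: Callable[[str], bool],
    llm_judge: Callable[[str], str],
) -> dict:
    """Run the cheap code check first; call the judge only if it passes."""
    if not code_check(output):
        return {"label": "fail", "source": "code"}
    return {"label": llm_judge(output), "source": "llm"}


def is_non_empty(output: str) -> bool:
    """Deterministic check: the output contains something besides whitespace."""
    return bool(output.strip())


judge_calls = []


def stub_judge(output: str) -> str:
    """Stand-in for an LLM call; records each invocation."""
    judge_calls.append(output)
    return "helpful"


results = [
    evaluate_with_prefilter(text, is_non_empty, stub_judge)
    for text in ["", "Here is your answer."]
]
print(len(judge_calls))  # 1
```

Only the output that passed the code check reached the judge, so the empty output cost nothing beyond the string check.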

# LLM-as-a-judge

LLM-as-a-judge uses a pre-trained LLM to read the data and emit a label, score, and explanation.

What LLM-as-a-judge is good at:

* **Subjective evaluation at scale.** Tone, helpfulness, factuality, coherence — questions that need reasoning.
* **Flexibility.** A new evaluator is a new prompt, not a new model.
* **No training data required up front.** You'll want a small golden dataset to validate the judge, but you don't need thousands of examples to get started.
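
To make "a new evaluator is a new prompt" concrete, here is a sketch of a judge built from a prompt template and a parser. It assumes the model client is injected as a `complete` callable that takes a prompt string and returns text; that callable, the prompt wording, and the label set are all hypothetical and not tied to any particular SDK.

```python
import json
from typing import Callable

# Hypothetical prompt; swapping this text is what defines a new evaluator.
JUDGE_PROMPT = """You are evaluating a response for helpfulness.
Question: {question}
Response: {response}
Reply with JSON: {{"label": "helpful" or "unhelpful", "explanation": "<one sentence>"}}"""

ALLOWED_LABELS = {"helpful", "unhelpful"}


def judge(question: str, response: str, complete: Callable[[str], str]) -> dict:
    """Ask the model for a label and explanation, rejecting malformed replies."""
    raw = complete(JUDGE_PROMPT.format(question=question, response=response))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": None, "explanation": "unparseable judge output"}
    if not isinstance(parsed, dict):
        return {"label": None, "explanation": "judge output was not a JSON object"}
    label = parsed.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return {"label": None, "explanation": "judge returned an unknown label"}
    return {"label": label, "explanation": str(parsed.get("explanation", ""))}


def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return '{"label": "helpful", "explanation": "Answers the question directly."}'


result = judge("How do I reset my password?", "Click 'Forgot password'.", fake_complete)
```

Returning `None` for malformed replies, rather than guessing a label, keeps parsing failures visible instead of silently counting them as a verdict.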

What LLM-as-a-judge isn't good at:

* **Determinism.** The same judge, run twice on the same input, can disagree with itself.
* **Cost-free operation.** Every eval is an LLM call. At production volume, this matters — see [Filters, scope, and cadence](/ax/concepts/evaluators/filters-scope-and-cadence) for the sampling levers.
* **Resistance to bias.** Verbosity bias, position bias, self-enhancement bias — see [Evaluator best practices](/ax/concepts/evaluators/evaluator-best-practices) for the catalog and how to guard against them.
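
One way to soften the determinism problem is to run the judge several times and take the majority label. This technique is an assumption added for illustration, not something this page prescribes, and it multiplies the per-eval cost noted above. The names `majority_label` and `flaky_judge` are hypothetical, and the scripted judge stands in for a real LLM call.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def majority_label(judge_once: Callable[[str], str], item: str, runs: int = 5) -> Tuple[str, float]:
    """Call the judge `runs` times; return the most common label and its share of votes."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    labels = [judge_once(item) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / runs


# Scripted stand-in for a non-deterministic judge: disagrees with itself once in five runs.
_scripted = itertools.cycle(["pass", "pass", "fail", "pass", "pass"])


def flaky_judge(item: str) -> str:
    return next(_scripted)


label, agreement = majority_label(flaky_judge, "some model output", runs=5)
print(label, agreement)  # pass 0.8
```

The vote share doubles as a rough confidence signal: items where the judge splits closely are reasonable candidates for human review.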

For the canonical LLM-as-a-judge survey paper, see Gu et al., 2024 — [A Survey on LLM-as-a-Judge](https://arxiv.org/abs/2411.15594).

# Choosing between types

A decision tree:

1. **Is the question deterministic?** If yes, use a code-based evaluator and stop.
2. **Do you need to evaluate at production scale?** If no, human review is fine for small samples.
3. **Everything else:** default to LLM-as-a-judge.
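
The decision tree above can be written as a small function. The function name, parameter names, and return strings are hypothetical labels for this sketch.

```python
def choose_evaluator_type(is_deterministic: bool, needs_production_scale: bool) -> str:
    """Walk the three-step decision tree and return the suggested evaluator type."""
    if is_deterministic:
        return "code"  # step 1: anything checkable in code stops here
    if not needs_production_scale:
        return "human"  # step 2: small samples can go to human reviewers
    return "llm_as_a_judge"  # step 3: default for everything else


print(choose_evaluator_type(is_deterministic=False, needs_production_scale=True))  # llm_as_a_judge
```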

Most production setups end up running multiple types in parallel — code evaluators for the cheap deterministic checks, LLM-as-a-judge evaluators for the subjective questions, periodic human review to validate the judges. The three types complement each other; they don't compete.

***

## Next step

Whichever type you pick, every evaluator has the same shape — inputs, outputs, and metadata. The next page covers that anatomy:

<Card title="Next: Anatomy of an Evaluator" icon="arrow-right" href="/ax/concepts/evaluators/anatomy-of-an-evaluator" />
