DO: define explicit evaluation criteria
The biggest gap between evaluators that work and evaluators that don’t is the prompt. A vague prompt (“score this response from 1 to 5”) produces vague scores; an explicit prompt (“score 1 if the response cites the retrieved context, 0 otherwise”) produces useful ones.
The mental model that helps: treat the LLM judge like an intern. An intern has no implicit context about your application, your users, or what good means in your domain. Everything they need to know has to be in the rubric.
Concretely, a good evaluator prompt:
- States the criteria in plain language.
- Names the inputs explicitly (“the user’s question”, “the agent’s response”, “the retrieved context”).
- Specifies the output shape (the exact labels, the score scale, whether to emit an explanation).
- Gives a worked example for non-obvious cases.
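
As an illustration of the four properties listed above, here is a minimal sketch of an evaluator prompt for a hypothetical "groundedness" check. The rubric wording, the label names, and the `build_prompt` helper are all assumptions made for this example, not part of any particular platform.

```python
# Hypothetical rubric for a "groundedness" check; the wording and label names
# are illustrative assumptions, not taken from any specific platform.
EVALUATOR_PROMPT = """\
You are grading whether an agent's response is grounded in retrieved context.

Inputs:
- The user's question: {question}
- The agent's response: {response}
- The retrieved context: {context}

Criteria:
Label the response "grounded" if every factual claim in it is supported by
the retrieved context. Label it "ungrounded" otherwise.

Output:
Return exactly one label, "grounded" or "ungrounded", followed by a
one-sentence explanation.

Worked example:
Context says the store closes at 9pm. Response says "we close at 9pm and
offer free parking". Label: ungrounded (free parking is not in the context).
"""


def build_prompt(question: str, response: str, context: str) -> str:
    """Fill the rubric template with the three named inputs."""
    return EVALUATOR_PROMPT.format(
        question=question, response=response, context=context
    )


prompt = build_prompt(
    question="When do you close?",
    response="We close at 9pm.",
    context="Store hours: 10am to 9pm daily.",
)
```

The template names each input, states the criterion in plain language, fixes the output to two labels plus an explanation, and includes one worked example for the non-obvious case of a partly supported answer.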
DO: break complex evaluations into structured steps
When the evaluation question is complex (“is this response good?”), break it down into sub-criteria. “Is the response factually correct AND well-formatted AND on-topic?” becomes three separate evaluators, or one evaluator with a structured rubric. This works better than asking the model for a single holistic score for two reasons:
- Each sub-criterion is easier to judge accurately than the holistic question.
- Aggregating the sub-scores gives you a more interpretable composite — you can see which dimension dragged the score down.
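
A minimal sketch of the aggregation step, assuming each sub-criterion has already been judged as a binary pass (1) or fail (0). The criterion names and the unweighted average are assumptions for illustration; a weighted composite would work the same way.

```python
from typing import Dict, List


def composite_score(sub_scores: Dict[str, int]) -> float:
    """Average binary sub-criterion results (1 = pass, 0 = fail)."""
    return sum(sub_scores.values()) / len(sub_scores)


def failed_dimensions(sub_scores: Dict[str, int]) -> List[str]:
    """List the sub-criteria that pulled the composite down."""
    return [name for name, score in sub_scores.items() if score == 0]


# Hypothetical judge outputs for one response, one verdict per sub-criterion.
result = {"factually_correct": 1, "well_formatted": 0, "on_topic": 1}

print(composite_score(result))    # 2 of 3 criteria passed
print(failed_dimensions(result))  # ['well_formatted']
```

Because the composite is computed mechanically from named sub-scores, a low number can be traced to the specific dimension that failed.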
DO: validate judges against human labels
A judge that runs is not the same as a judge that’s correct. Before trusting an LLM-as-a-judge in production, validate it against a golden dataset: a small set of input/output pairs with human-labeled ground truth. The validation workflow:
- Collect 20–100 examples spanning the score range you care about.
- Have humans label them.
- Run the evaluator against the same examples.
- Compute alignment (accuracy, precision/recall, F1) between the judge and the humans.
- Iterate the prompt until alignment is high enough.
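
The alignment step in the workflow above can be sketched in plain Python. This assumes categorical labels and treats one label as the positive class; the `alignment_metrics` helper and the sample labels are hypothetical.

```python
from typing import Dict, Sequence


def alignment_metrics(
    human: Sequence[str], judge: Sequence[str], positive: str
) -> Dict[str, float]:
    """Compare judge labels to human labels, treating `positive` as the target class."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    agree = sum(h == j for h, j in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": agree / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical golden dataset: human ground truth vs. the judge's labels.
human_labels = ["correct", "correct", "incorrect", "incorrect", "correct"]
judge_labels = ["correct", "incorrect", "incorrect", "correct", "correct"]

metrics = alignment_metrics(human_labels, judge_labels, positive="correct")
```

Here the judge agrees with the humans on 3 of 5 examples (accuracy 0.6), with one false positive and one false negative. What counts as "high enough" alignment depends on the evaluator and is not something this sketch decides.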
DO: narrow the scope of each evaluator
Smaller context windows produce better evaluations. When the judge has to wade through ten thousand tokens of context to find the thing it’s scoring, it loses signal. When the judge has just the relevant excerpt, it scores cleanly. This is one of the reasons the two-filter model matters: the evaluator data filter exists specifically to narrow what gets passed to the prompt. A heuristic: if a human reviewer would only need to read 200 tokens of context to make this call, the LLM-as-a-judge shouldn’t be reading 5,000.
DO: guard against verbosity bias
LLM judges systematically prefer longer responses. Two outputs that are equally correct can get different scores purely because one is more verbose. The foundational paper is Zheng et al., 2023 (Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena), which names and quantifies verbosity bias, position bias, and self-enhancement bias as the three canonical judge biases. How to guard against it:
- Normalize lengths before scoring when you can: pre-trim or summarize long outputs.
- Tell the judge explicitly in the prompt that length should not influence the score.
- Validate against humans so you can spot the bias if it shows up in your specific evaluator.
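
One way to spot verbosity bias during human validation is to check whether the judge's disagreement with humans grows with response length. The sketch below is an assumed approach, not a method from the paper cited above: it correlates length with the judge-minus-human score gap, using made-up data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical data: response length in words, plus judge and human scores (0 or 1).
lengths = [40, 55, 120, 300, 450, 600]
judge_scores = [0, 0, 1, 1, 1, 1]
human_scores = [0, 1, 1, 0, 1, 0]

# Positive gap = the judge was more generous than the human.
gaps = [j - h for j, h in zip(judge_scores, human_scores)]
length_bias = pearson(lengths, gaps)
```

A clearly positive `length_bias` on a real golden dataset suggests the judge is rewarding length beyond what humans do. With only a handful of examples the number is noisy, so treat it as a prompt to look closer rather than as proof.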
DON’T: use the same model for the application and the judge
LLM judges prefer outputs that look like what they would have written themselves. If you use gpt-5.4-mini to power your chatbot and gpt-5.4-mini to judge the chatbot’s responses, you get self-preference bias: the judge systematically rates the chatbot higher than it deserves.
The mechanism is subtle. Wataoka et al., 2024 — Self-Preference Bias in LLM-as-a-Judge — shows the bias is driven by perplexity, not authorship recognition. The judge model assigns lower perplexity to text that looks like its own distribution, and lower perplexity reads as “more fluent” to the judge. This is why simple workarounds like “don’t tell the judge who wrote the response” don’t fix it — the bias is built into the scoring mechanism.
The rule: use a different model for the judge than for the application. Different provider when possible (Anthropic judging OpenAI, or vice versa); at minimum, a different model from the same provider.
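
The rule can be enforced with a small configuration check. This is a sketch under assumptions: the `ModelRef` type, the provider strings, and the three category names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    provider: str  # e.g. "openai" or "anthropic" (hypothetical identifiers)
    name: str


def judge_independence(app: ModelRef, judge: ModelRef) -> str:
    """Classify a judge/application pairing by how independent the judge is."""
    if app == judge:
        return "same_model"          # self-preference risk: avoid
    if app.provider == judge.provider:
        return "same_provider"       # acceptable minimum
    return "different_provider"      # preferred


app_model = ModelRef("openai", "gpt-5.4-mini")
judge_model = ModelRef("anthropic", "claude-haiku-4-5")
print(judge_independence(app_model, judge_model))  # different_provider
```

A check like this could run when an evaluator is created, so that a same-model pairing is flagged before any scores are collected.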
DON’T: use a single underspecified score for a complex task
The classic failure: “rate this response from 1 to 5”. It looks fine, it produces numbers, and the numbers are useless. The problem is that LLMs can’t anchor relative scales across independent calls. Each evaluation is a fresh inference; the judge has no memory of how it scored the last response, so it can’t reason that “this one is better than the previous one, so it should get a 4”. Numeric ranges only work for properties the model can score in isolation against an absolute reference. Two ways out:
- Use categorical labels (correct/incorrect, relevant/partially_relevant/irrelevant) and map them to scores via --classification-choices. The model picks a category; the platform handles the numeric mapping.
- Use multi-dimensional rubrics (per the LLM-Rubric paper above). Each dimension is a focused judgment; the composite is computed mechanically.
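
To make the label-to-score idea concrete, here is a minimal sketch of the mapping done in plain Python. It does not reproduce the syntax of --classification-choices, which is not shown in this section; the score values and the `score_from_label` helper are assumptions.

```python
from typing import Dict

# Hypothetical label-to-score mapping; the exact syntax that
# --classification-choices expects is not shown here.
RELEVANCE_CHOICES: Dict[str, float] = {
    "relevant": 1.0,
    "partially_relevant": 0.5,
    "irrelevant": 0.0,
}


def score_from_label(raw_label: str, choices: Dict[str, float]) -> float:
    """Map the judge's categorical label to a number, rejecting unknown labels."""
    label = raw_label.strip().lower()
    if label not in choices:
        raise ValueError(f"unexpected label from judge: {raw_label!r}")
    return choices[label]


print(score_from_label("Partially_Relevant\n", RELEVANCE_CHOICES))  # 0.5
```

The judge only ever chooses among named categories; the numbers are assigned afterwards by a fixed table, so scores stay comparable across independent calls.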
DON’T: set a large numerical scale
A specific case of the previous rule. 1–100 scales are worse than 1–10, which are worse than 1–5, which are worse than 1–3. LLMs are bad at math; asking them to consistently distinguish a 67 from a 73 doesn’t work. If you need a granular score, get it by combining multiple binary judgments, not by asking for a single graded one. Five binary sub-criteria yield 2^5 = 32 distinct pass/fail combinations, which is enough granularity for most analysis, without asking the model to do math it can’t do reliably.
DON’T: make evaluation prompts too long or complex
Eval prompts compound two costs: token cost (long prompts are expensive at production volume) and quality cost (long prompts dilute the judge’s focus). When in doubt, trim:
- Cut examples that are illustrative but not essential.
- Cut hedging language (“please carefully consider all aspects of…”).
- Cut anti-patterns (“don’t be biased, don’t favor long responses, don’t…”) — these get ignored anyway and inflate the prompt.
DON’T: start with an expensive reasoning model
For evaluators, start with a cheap, fast model. Eval tasks are bounded by an explicit rubric; they don’t need a reasoning model’s planning capabilities. gpt-5.4-mini, claude-haiku-4-5, or equivalent budget models are almost always the right starting point.
Reasons:
- Eval prompts are constrained. The judge isn’t deciding what to do — it’s applying a fixed rubric you wrote.
- Reasoning models are slow. At production volumes, evaluator latency matters.
- Reasoning models are expensive. A budget model at 100% sampling often beats a reasoning model at 10% sampling for the same total cost.
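
The cost trade-off in the last bullet can be worked through with simple arithmetic. Every number below is a made-up placeholder (traffic, tokens per evaluation, and both prices), chosen only so the two configurations cost the same.

```python
def daily_eval_cost(
    traces_per_day: int,
    sampling_rate: float,
    tokens_per_eval: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimated daily spend for one evaluator at a given sampling rate."""
    evaluated = traces_per_day * sampling_rate
    return evaluated * tokens_per_eval * usd_per_million_tokens / 1_000_000


# All numbers below are made-up placeholders, not real prices.
TRACES_PER_DAY = 100_000
TOKENS_PER_EVAL = 1_500

budget = daily_eval_cost(TRACES_PER_DAY, 1.00, TOKENS_PER_EVAL, usd_per_million_tokens=0.50)
reasoning = daily_eval_cost(TRACES_PER_DAY, 0.10, TOKENS_PER_EVAL, usd_per_million_tokens=5.00)
```

Under these assumptions both setups cost about 75 USD per day, but the budget model scores every trace while the reasoning model scores one in ten. The sketch also assumes equal token usage per evaluation, which likely understates the reasoning model's cost if it emits extra reasoning tokens.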
Further reading
| Paper | What it argues |
|---|---|
| Liu et al., 2023 — G-Eval (EMNLP) | Chain-of-thought + form-filling improves human alignment. Documents LLM-bias-toward-LLM-output. |
| Zheng et al., 2023 — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS) | The foundational LLM-as-a-judge paper. Names verbosity, position, and self-enhancement biases. |
| Hashemi et al., 2024 — LLM-Rubric (ACL) | Multidimensional rubrics with calibration outperform single-score judges. |
| Gu et al., 2024 — A Survey on LLM-as-a-Judge | Taxonomy of LLM-as-a-judge systems, biases, and benchmarks. |
| Chiang & Lee, 2023 — Can Large Language Models Be an Alternative to Human Evaluations? (ACL) | LLM judges can substitute for humans in many settings, but are prompt-sensitive — validation is essential. |
| Wataoka et al., 2024 — Self-Preference Bias in LLM-as-a-Judge | Self-preference bias is a perplexity artifact, not authorship recognition. |