The hard part of LLM-as-a-judge isn’t getting the evaluator to run — it’s getting it to score in a way that matches what a human reviewer would say. Most evaluator failures fall into the same handful of patterns, each with a research paper to back up the recommendation. This page collects them. The DOs are what well-designed evaluators look like; the DON’Ts are the failure modes that show up most often.

DO: define explicit evaluation criteria

The biggest gap between evaluators that work and evaluators that don’t is the prompt. A vague prompt — “score this response from 1 to 5” — produces vague scores; an explicit prompt — “score 1 if the response cites the retrieved context, 0 otherwise” — produces useful ones. The mental model that helps: treat the LLM judge like an intern. An intern has no implicit context about your application, your users, or what good means in your domain. Everything they need to know has to be in the rubric. Concretely, a good evaluator prompt:
  • States the criteria in plain language.
  • Names the inputs explicitly (“the user’s question”, “the agent’s response”, “the retrieved context”).
  • Specifies the output shape (the exact labels, the score scale, whether to emit an explanation).
  • Gives a worked example for non-obvious cases.
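As a rough illustration of the four points above, an evaluator prompt might be assembled like this. The criterion, the labels, the example inputs, and the `build_prompt` helper are all hypothetical; adapt them to your own rubric.

```python
from string import Template

# Hypothetical rubric for a groundedness evaluator (not a platform default).
EVALUATOR_PROMPT = Template("""\
You are grading one response from a support agent.

Criterion: the response is grounded if every factual claim in it is
supported by the retrieved context.

Inputs:
- The user's question: $question
- The retrieved context: $context
- The agent's response: $response

Output: exactly one label, "grounded" or "ungrounded", followed by a
one-sentence explanation.

Worked example for a non-obvious case: a response that declines to answer
because the context does not cover the question is "grounded".
""")


def build_prompt(question: str, context: str, response: str) -> str:
    """Fill the rubric template with the three named inputs."""
    return EVALUATOR_PROMPT.substitute(
        question=question, context=context, response=response
    )


prompt = build_prompt(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    response="You can request a refund within 30 days.",
)
print(prompt)
```

The template states the criterion, names each input, fixes the output labels, and includes one worked edge case, so nothing is left for the judge to infer.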

DO: break complex evaluations into structured steps

When the evaluation question is complex — “is this response good?” — break it down into sub-criteria. “Is the response factually correct AND well-formatted AND on-topic?” becomes three separate evaluators, or one evaluator with a structured rubric. This works better than asking the model for a single holistic score for two reasons:
  1. Each sub-criterion is easier to judge accurately than the holistic question.
  2. Aggregating the sub-scores gives you a more interpretable composite — you can see which dimension dragged the score down.
The empirical backing for this pattern: Hashemi et al., 2024, “LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation”. Their calibrated multidimensional rubric predicts human ratings of overall satisfaction on a 1–4 scale with RMS error below 0.5, roughly a 2× improvement over the uncalibrated baseline.
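One way to compute an interpretable composite from sub-criteria, as a minimal sketch. The criterion names, the `SubScore` type, and the equal weights are assumptions for illustration; this is plain weighted averaging, not the calibrated method from the LLM-Rubric paper.

```python
from dataclasses import dataclass


@dataclass
class SubScore:
    name: str            # hypothetical criterion name
    passed: bool         # the judge's verdict for this criterion
    weight: float = 1.0  # equal weights unless you have a reason to differ


def composite(sub_scores: list[SubScore]) -> float:
    """Weighted fraction of sub-criteria that passed, in [0, 1]."""
    total = sum(s.weight for s in sub_scores)
    earned = sum(s.weight for s in sub_scores if s.passed)
    return earned / total


def failing_dimensions(sub_scores: list[SubScore]) -> list[str]:
    """Names of the criteria that dragged the composite down."""
    return [s.name for s in sub_scores if not s.passed]


scores = [
    SubScore("factually_correct", True),
    SubScore("well_formatted", False),
    SubScore("on_topic", True),
]
print(composite(scores), failing_dimensions(scores))
```

Because the composite is computed outside the model, a low score always comes with the list of criteria that caused it.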

DO: validate judges against human labels

A judge that runs is not the same as a judge that’s correct. Before trusting an LLM-as-a-judge in production, validate it against a golden dataset — a small set of input/output pairs with human-labeled ground truth. The validation workflow:
  1. Collect 20–100 examples spanning the score range you care about.
  2. Have humans label them.
  3. Run the evaluator against the same examples.
  4. Compute alignment (accuracy, precision/recall, F1) between the judge and the humans.
  5. Iterate the prompt until alignment is high enough.
The foundational paper on this approach: Chiang & Lee, 2023 — Can Large Language Models Be an Alternative to Human Evaluations?. They show that LLM judgments track human ratings well overall, but are sensitive to instruction wording — making human-grounded validation essential. A practical rule: use the same rubric with the humans that you give to the LLM. If the humans need a clarification to apply the rubric, the LLM needs that same clarification in its prompt.
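Step 4 of the workflow can be sketched in plain Python. The label names and the five example pairs below are made up for illustration; in practice the two lists come from your golden dataset and from running the evaluator over it.

```python
def alignment(human: list[str], judge: list[str], positive: str) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 of judge labels against human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("human and judge must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    correct = sum(h == j for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical golden labels and judge outputs for the same five examples.
human_labels = ["correct", "correct", "incorrect", "incorrect", "correct"]
judge_labels = ["correct", "incorrect", "incorrect", "correct", "correct"]
metrics = alignment(human_labels, judge_labels, positive="correct")
print(metrics)
```

Precision and recall are reported separately because the two kinds of disagreement usually call for different prompt fixes: a judge that passes bad outputs needs a stricter criterion, one that fails good outputs needs a clearer one.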

DO: narrow the scope of each evaluator

Smaller inputs produce better evaluations. When the judge has to wade through ten thousand tokens of context to find the thing it’s scoring, it loses signal. When the judge sees only the relevant excerpt, it scores cleanly. This is one of the reasons the two-filter model matters: the evaluator data filter exists specifically to narrow what gets passed to the prompt. A heuristic: if a human reviewer would need to read only 200 tokens of context to make the call, the LLM judge shouldn’t be reading 5,000.

DO: guard against verbosity bias

LLM judges systematically prefer longer responses. Two outputs that are equally correct can get different scores purely because one is more verbose. The foundational paper: Zheng et al., 2023 — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. They name and quantify verbosity bias, position bias, and self-enhancement bias as the three canonical judge biases. How to guard against it:
  • Normalize lengths before scoring when you can — pre-trim or summarize long outputs.
  • Tell the judge explicitly in the prompt that length should not influence the score.
  • Validate against humans so you can spot the bias if it shows up in your specific evaluator.
For a closer look at prompt structure, the G-Eval paper (Liu et al., 2023, arXiv:2303.16634) shows that chain-of-thought prompting and structured form-filling improve agreement with human judgments, and it documents a judge bias toward LLM-generated text.
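A simple way to spot verbosity bias in your own evaluator, sketched under assumptions: take pairs of outputs that human reviewers labeled as equally correct, score both with the judge, and check how often the longer one wins. The `ScoredOutput` type, the example pairs, and the scores are hypothetical.

```python
from typing import NamedTuple, Optional


class ScoredOutput(NamedTuple):
    text: str
    judge_score: float


def longer_wins_rate(
    pairs: list[tuple[ScoredOutput, ScoredOutput]],
) -> Optional[float]:
    """Among pairs where the judge expressed a preference, the fraction
    in which the longer output received the higher score."""
    longer_wins = 0
    decided = 0
    for a, b in pairs:
        if len(a.text) == len(b.text) or a.judge_score == b.judge_score:
            continue  # no length difference, or no preference to attribute
        decided += 1
        longer, shorter = (a, b) if len(a.text) > len(b.text) else (b, a)
        if longer.judge_score > shorter.judge_score:
            longer_wins += 1
    return longer_wins / decided if decided else None


# Each pair was judged equally correct by humans (hypothetical data).
pairs = [
    (ScoredOutput("Paris.", 3), ScoredOutput("The capital of France is Paris.", 5)),
    (ScoredOutput("It is 42, as computed in step two.", 4), ScoredOutput("42", 4)),
    (ScoredOutput("Yes.", 5), ScoredOutput("Yes, that is correct.", 4)),
    (ScoredOutput("No.", 2), ScoredOutput("No, it is not supported.", 4)),
]
rate = longer_wins_rate(pairs)
print(rate)
```

For equally correct pairs, a rate near 0.5 is what an unbiased judge would produce; a rate well above that on a reasonably sized sample suggests length is influencing the score.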

DON’T: use the same model for the application and the judge

LLM judges prefer outputs that look like what they would have written themselves. If you use gpt-5.4-mini to power your chatbot and gpt-5.4-mini to judge the chatbot’s responses, you get self-preference bias — the judge systematically rates the chatbot higher than it deserves. The mechanism is subtle. Wataoka et al., 2024 — Self-Preference Bias in LLM-as-a-Judge — shows the bias is driven by perplexity, not authorship recognition. The judge model assigns lower perplexity to text that looks like its own distribution, and lower perplexity reads as “more fluent” to the judge. This is why simple workarounds like “don’t tell the judge who wrote the response” don’t fix it — the bias is built into the scoring mechanism. The rule: use a different model for the judge than for the application. Different provider when possible (Anthropic judging OpenAI, or vice versa); at minimum, a different model from the same provider.
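The rule can be enforced as a configuration check before an evaluator is deployed. This is a minimal sketch: the `PROVIDER_OF` table and the `judge_independence` helper are hypothetical, and the table has to be maintained by hand for the models you actually use.

```python
# Hypothetical lookup from model name to provider; extend it for your models.
PROVIDER_OF = {
    "gpt-5.4-mini": "openai",
    "claude-haiku-4-5": "anthropic",
}


def judge_independence(
    app_model: str, judge_model: str, provider_of: dict[str, str]
) -> str:
    """Classify how independent the judge is from the application model."""
    if app_model == judge_model:
        return "same_model"  # self-preference bias is likely
    if provider_of[app_model] == provider_of[judge_model]:
        return "same_provider"  # the minimum acceptable setup
    return "different_provider"  # the preferred setup


status = judge_independence("gpt-5.4-mini", "claude-haiku-4-5", PROVIDER_OF)
print(status)
```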

DON’T: use a single underspecified score for a complex task

The classic failure: “rate this response from 1 to 5”. It looks fine, it produces numbers, and the numbers are useless. The problem is that LLMs can’t anchor relative scales across independent calls. Each evaluation is a fresh inference; the judge has no memory of how it scored the last response, so it can’t reason that “this one is better than the previous one, so it should get a 4”. Numeric ranges only work for properties the model can score in isolation against an absolute reference. Two ways out:
  • Use categorical labels (correct / incorrect, relevant / partially_relevant / irrelevant) and map them to scores via --classification-choices. The model picks a category; the platform handles the numeric mapping.
  • Use multi-dimensional rubrics (per the LLM-Rubric paper above). Each dimension is a focused judgment; the composite is computed mechanically.
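The idea behind the categorical approach can be sketched as a lookup from label to score. This only illustrates the concept; it is not how the platform implements `--classification-choices`, and the `CHOICES` values and the `label_to_score` helper are assumptions.

```python
# Hypothetical label-to-score mapping for a relevance evaluator.
CHOICES = {"relevant": 1.0, "partially_relevant": 0.5, "irrelevant": 0.0}


def label_to_score(raw_label: str, choices: dict[str, float] = CHOICES) -> float:
    """Map the judge's categorical output to a number, rejecting anything
    that is not one of the declared labels."""
    label = raw_label.strip().lower()
    if label not in choices:
        raise ValueError(
            f"unexpected label {raw_label!r}; expected one of {sorted(choices)}"
        )
    return choices[label]


print(label_to_score(" Relevant\n"))
```

Rejecting undeclared labels matters: a judge that drifts back to emitting "4" or "good" is caught immediately instead of being silently scored.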

DON’T: set a large numerical scale

A specific case of the previous rule. 1–100 scales are worse than 1–10, which are worse than 1–5, which are worse than 1–3. LLM judges are poorly calibrated on fine-grained scales; asking them to consistently distinguish a 67 from a 73 doesn’t work. If you need a granular score, get it by combining multiple binary judgments, not by asking for a single graded one. Five binary sub-criteria yield 2^5 = 32 distinct pass/fail patterns (six distinct totals if you simply count passes, up to 32 if you weight the criteria), which is enough granularity for most analysis without asking the model to make numeric distinctions it can’t make reliably.
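The arithmetic is easy to check. Five binary judgments give 2^5 = 32 pass/fail patterns; how many distinct composite values you get depends on how you combine them. The criterion names and the weights below are hypothetical.

```python
from itertools import product

# Hypothetical binary sub-criteria.
CRITERIA = [
    "factually_correct",
    "on_topic",
    "well_formatted",
    "cites_context",
    "answers_question",
]

# Every possible combination of five pass (1) / fail (0) verdicts.
patterns = list(product([0, 1], repeat=len(CRITERIA)))

# Unweighted composite: count the passes.
pass_counts = sorted({sum(p) for p in patterns})

# Weighted composite: distinct power-of-two weights keep every pattern distinct.
weights = [1, 2, 4, 8, 16]
weighted = {sum(w * bit for w, bit in zip(weights, p)) for p in patterns}

print(len(patterns), pass_counts, len(weighted))
```

A plain pass count is usually the easier composite to explain; the full pattern is still available when you need to see which criteria failed.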

DON’T: make evaluation prompts too long or complex

Eval prompts compound two costs: token cost (long prompts are expensive at production volume) and quality cost (long prompts dilute the judge’s focus). When in doubt, trim:
  • Cut examples that are illustrative but not essential.
  • Cut hedging language (“please carefully consider all aspects of…”).
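  • Cut long lists of generic prohibitions (“don’t be biased, don’t favor long responses, don’t…”). They inflate the prompt and are easily ignored; keep at most one targeted instruction for a bias you have actually measured, such as the length instruction described earlier.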
A reasonable target: an eval prompt that fits on one screen. If it doesn’t, you’re probably trying to do too much in one evaluator — break it into sub-criteria.

DON’T: start with an expensive reasoning model

For evaluators, start with a cheap, fast model. Eval tasks are bounded by an explicit rubric; they don’t need a reasoning model’s planning capabilities. gpt-5.4-mini, claude-haiku-4-5, or equivalent budget models are almost always the right starting point. Reasons:
  • Eval prompts are constrained. The judge isn’t deciding what to do — it’s applying a fixed rubric you wrote.
  • Reasoning models are slow. At production volumes, evaluator latency matters.
  • Reasoning models are expensive. A budget model at 100% sampling often beats a reasoning model at 10% sampling for the same total cost.
Once the evaluator is working, you can A/B the budget judge against a reasoning judge on your golden dataset. In most cases, the budget judge holds up. When it doesn’t, you have a specific signal to act on.
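The sampling trade-off is simple arithmetic. In the sketch below, the trace volume and both prices are invented for illustration and are not real pricing for any model; substitute your own numbers.

```python
def monthly_cost(
    traces: int, sampling_rate: float, usd_per_1000_evals: float
) -> tuple[int, float]:
    """Return (number of traces evaluated, total cost in USD)."""
    evaluated = round(traces * sampling_rate)
    return evaluated, evaluated / 1000 * usd_per_1000_evals


TRACES_PER_MONTH = 100_000  # hypothetical volume

# Hypothetical prices: the reasoning judge costs 10x more per evaluation.
budget = monthly_cost(TRACES_PER_MONTH, sampling_rate=1.0, usd_per_1000_evals=1.00)
reasoning = monthly_cost(TRACES_PER_MONTH, sampling_rate=0.10, usd_per_1000_evals=10.00)

print("budget judge:   ", budget)
print("reasoning judge:", reasoning)
```

Under these assumed prices both setups cost the same, but the budget judge scores every trace while the reasoning judge scores one in ten, so rare failures are far more likely to be seen.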

Further reading

Paper | What it argues
--- | ---
Liu et al., 2023, G-Eval (EMNLP) | Chain-of-thought + form-filling improves human alignment. Documents LLM-bias-toward-LLM-output.
Zheng et al., 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS) | The foundational LLM-as-a-judge paper. Names verbosity, position, and self-enhancement biases.
Hashemi et al., 2024, LLM-Rubric (ACL) | Multidimensional rubrics with calibration outperform single-score judges.
Gu et al., 2024, A Survey on LLM-as-a-Judge | Taxonomy of LLM-as-a-judge systems, biases, and benchmarks.
Chiang & Lee, 2023, Can Large Language Models Be an Alternative to Human Evaluations? (ACL) | LLM judges can substitute for humans in many settings, but are prompt-sensitive, so validation is essential.
Wataoka et al., 2024, Self-Preference Bias in LLM-as-a-Judge | Self-preference bias is a perplexity artifact, not authorship recognition.

Next step

Following the rules makes a good evaluator. To make it better over time — and to know whether it’s actually working — you need the validation and improvement cycle:

Next: Validating and Improving Evaluators