An LLM jury runs the same judgment through several LLMs, typically from different providers, and combines their verdicts. Each juror has a trust weight, the final score is the weighted average of the per-juror scores, and the per-juror labels (plus any errors) land in the explanation so you can audit who voted what.
The example below is reference-free: the jurors score the answer against the question alone, with no ground-truth label to compare against. This is the right pattern when you don’t have labeled data — production traces, open-ended generation, or any task where the “correct” answer isn’t a fixed string you can store in a dataset.
Reference-free vs. reference-based. An evaluator is reference-based when it compares the output to a known-good answer stored alongside the example. That label is what makes a dataset a golden dataset, and metrics like JSON Distance or the Pairwise Evaluator rely on it. Reference-free evaluators skip that comparison and judge the output against the input alone (or against rubric criteria), which is what you have to do when no golden answer exists. To convert this example to reference-based, change the prompt to compare output against reference and bind the reference parameter in the input-mapping panel.
When to use
- One judge’s biases or self-preference are skewing your scores and you want a more robust verdict.
- The output is high-stakes and the extra cost of N model calls is worth a more reliable score.
- You’re benchmarking judge models against each other — the per-juror breakdown shows their agreement rate over the dataset.
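For the benchmarking case in the last bullet, the agreement rate can be computed directly from the per-juror labels. The following is a minimal sketch, not part of the original example: `pairwise_agreement` and the shape of `labels_by_juror` (juror name mapped to one label per dataset example, in the same order) are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_agreement(labels_by_juror: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Fraction of examples on which each pair of jurors returned the same label."""
    rates: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(labels_by_juror), 2):
        pairs = list(zip(labels_by_juror[a], labels_by_juror[b]))
        # An empty dataset has no defined agreement rate.
        rates[(a, b)] = sum(x == y for x, y in pairs) / len(pairs) if pairs else float("nan")
    return rates


# Hypothetical labels for a four-example dataset.
rates = pairwise_agreement({
    "openai": ["correct", "correct", "incorrect", "correct"],
    "anthropic": ["correct", "incorrect", "incorrect", "correct"],
    "google": ["correct", "correct", "incorrect", "incorrect"],
})
```

A pair with a consistently low rate is a hint that one of the two judges applies the rubric differently and may deserve a lower trust weight.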
The example uses arize-phoenix-evals (Python) or @arizeai/phoenix-evals (TypeScript) to put every provider behind a uniform ClassificationEvaluator interface, so adding a juror is a one-liner.
Every juror sees the same prompt; their verdicts and weights combine into one score, and the per-juror labels (plus any errors) land in the explanation.
Code
- Python
- TypeScript
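The aggregation logic can be sketched in Python as follows. This is a provider-agnostic illustration rather than the page's original example: `Juror`, `run_jury`, `CHOICES`, and the stub judges are hypothetical names, and each `judge` callable stands in for a real classification call to one provider. The sketch assumes that a juror that raises (or returns an unknown label) is left out of the average, that weights are renormalized over the jurors that did answer, and that the score is `None` when nobody answered.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Label -> numeric score, shared by all jurors (assumed binary rubric).
CHOICES: Dict[str, float] = {"correct": 1.0, "incorrect": 0.0}


@dataclass
class Juror:
    name: str
    weight: float
    # Hypothetical stand-in for an LLM classification call: (input, output) -> label.
    judge: Callable[[str, str], str]


def run_jury(jurors: List[Juror], input: str, output: str) -> dict:
    """Run every juror independently, then return the weighted-average score."""

    def vote(juror: Juror) -> Tuple[Juror, str, str]:
        try:
            return juror, juror.judge(input, output), ""
        except Exception as exc:  # one failing provider should not sink the jury
            return juror, "", f"{type(exc).__name__}: {exc}"

    # Jurors run in parallel and never see each other's verdicts.
    with ThreadPoolExecutor(max_workers=max(len(jurors), 1)) as pool:
        results = list(pool.map(vote, jurors))

    weighted_sum, total_weight, lines = 0.0, 0.0, []
    for juror, label, error in results:
        if error or label not in CHOICES:
            lines.append(f"{juror.name}: ERROR {error or 'unknown label ' + repr(label)}")
            continue
        weighted_sum += juror.weight * CHOICES[label]
        total_weight += juror.weight
        lines.append(f"{juror.name} (w={juror.weight}): {label}")

    score = weighted_sum / total_weight if total_weight else None
    return {"score": score, "explanation": "; ".join(lines)}


def failing_judge(input: str, output: str) -> str:
    raise TimeoutError("provider timed out")


# Stub jurors: two answer, one fails.
jury = [
    Juror("openai", 0.5, lambda i, o: "correct"),
    Juror("anthropic", 0.3, lambda i, o: "incorrect"),
    Juror("google", 0.2, failing_judge),
]
result = run_jury(jury, "What is 2 + 2?", "4")
```

With these stubs the failed juror is excluded, so the score is 0.5 / (0.5 + 0.3) = 0.625, and the explanation lists each juror's label or error in jury order.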
Input mapping
| Parameter | Bind to |
|---|---|
| input | The question (or prompt/task) the model was asked to answer, usually input. |
| output | The model output to grade, usually output. |
No reference is required: this is a reference-free judgment. If you do have labeled data and want to use it, add a reference parameter to the function signature, update the prompt to incorporate it, and bind it in the input-mapping panel.
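As a rough illustration of that conversion, the sketch below builds the juror prompt with an optional reference. `build_prompt` and the prompt wording are hypothetical and are not taken from the page's example; the only point is that the reference-based variant adds one parameter and one comparison instruction.

```python
from typing import Optional


def build_prompt(input: str, output: str, reference: Optional[str] = None) -> str:
    """Build the juror prompt; include the reference only when one is bound."""
    parts = [
        "You are grading an answer to a question.",
        f"Question: {input}",
        f"Answer: {output}",
    ]
    if reference is not None:
        # Reference-based: grade against the known-good answer.
        parts.append(f"Reference answer: {reference}")
        parts.append("Label the answer 'correct' if it agrees with the reference, otherwise 'incorrect'.")
    else:
        # Reference-free: grade against the question alone.
        parts.append("Label the answer 'correct' if it fully answers the question, otherwise 'incorrect'.")
    return "\n".join(parts)


reference_free = build_prompt("What is the capital of France?", "Paris")
reference_based = build_prompt("What is the capital of France?", "Paris", reference="Paris")
```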
Output configuration
Continuous score in the range 0.0 to 1.0 (matches the weighted average of the configured choice scores). Optimization direction: maximize.
Runtime requirements
| Setting | Value |
|---|---|
| Sandbox | A hosted backend that matches your language. Python: E2B, Daytona — Python, Vercel Sandbox — Python, or Modal. TypeScript: Daytona — TypeScript or Vercel Sandbox — TypeScript (the local Deno sandbox is --no-npm and cannot install npm packages). |
| Dependencies | Python: arize-phoenix-evals, openai, anthropic, google-genai. TypeScript: @arizeai/phoenix-evals, ai, @ai-sdk/openai, @ai-sdk/anthropic, @ai-sdk/google. |
| Internet access | Required — the sandbox must reach api.openai.com, api.anthropic.com, and generativelanguage.googleapis.com. |
| Environment variables | OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY — set each as a secret reference to a key in Settings → Secrets. Drop a juror entirely if you don’t want to provision that provider’s key. |
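One way to drop jurors whose key is not provisioned is to check the environment before building the jury. This is a minimal sketch under that assumption: `REQUIRED_KEYS` and `available_providers` are hypothetical helpers, and the variable names are the ones listed in the table above.

```python
import os
from typing import Dict, List, Mapping

# Provider -> environment variable, as listed in the runtime requirements table.
REQUIRED_KEYS: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_GENERATIVE_AI_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is set and non-empty."""
    return [provider for provider, var in REQUIRED_KEYS.items() if env.get(var)]


# Example with an explicit mapping instead of the real environment.
providers = available_providers({"OPENAI_API_KEY": "sk-example", "ANTHROPIC_API_KEY": ""})
```

Only the providers returned here would be turned into jurors, so a missing key shrinks the jury instead of producing an error on every run.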
Variants
Weighted majority vote (categorical jury)
Instead of averaging numeric scores, count votes per label and return whichever label wins by weight. Use this when downstream consumers want a discrete verdict (correct / incorrect) rather than a continuous score. The sketch below replaces the score-averaging block; the per-juror collection loop stays the same:
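One way the replacement block might look is sketched here. `weighted_majority` and the `(label, weight)` input shape are hypothetical. The sketch also makes two assumptions the text does not specify: ties go to the alphabetically first label, and the returned score is the share of total weight behind the winning label.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_majority(votes: List[Tuple[str, float]]) -> dict:
    """votes: (label, juror_weight) pairs from the jurors that answered."""
    tally: Dict[str, float] = defaultdict(float)
    for label, weight in votes:
        tally[label] += weight

    total = sum(tally.values())
    if total == 0:
        return {"label": None, "score": None, "explanation": "no weighted verdicts"}

    # Assumption: ties go to the alphabetically first label, for determinism.
    winner = min(tally, key=lambda label: (-tally[label], label))
    breakdown = ", ".join(f"{label}={weight:.2f}" for label, weight in sorted(tally.items()))
    return {
        "label": winner,
        "score": tally[winner] / total,  # share of total weight behind the winner
        "explanation": breakdown,
    }


verdict = weighted_majority([("correct", 0.5), ("incorrect", 0.3), ("incorrect", 0.3)])
```

Here two lower-weight jurors outvote one higher-weight juror: incorrect carries 0.6 of the weight against 0.5 for correct.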
Agreement filter (high-confidence subset)
For training-data filtering, return a non-zero score only when all jurors agree; otherwise drop the example. The score collapses to 0 or 1, and the label becomes agreement / disagreement.
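A minimal sketch of such a filter follows. `agreement_filter` is a hypothetical name, and the handling of failed jurors is an assumption: a juror that errored (represented as `None`) is treated as breaking unanimity.

```python
from typing import List, Optional


def agreement_filter(labels: List[Optional[str]]) -> dict:
    """labels: one entry per juror; None marks a juror that errored."""
    # Assumption: an errored juror breaks unanimity, so only a complete,
    # unanimous panel keeps the example.
    complete = bool(labels) and all(label is not None for label in labels)
    unanimous = complete and len(set(labels)) == 1
    votes = ", ".join(label if label is not None else "error" for label in labels)
    return {
        "score": 1.0 if unanimous else 0.0,
        "label": "agreement" if unanimous else "disagreement",
        "explanation": f"votes: {votes}",
    }


keep = agreement_filter(["correct", "correct", "correct"])
drop = agreement_filter(["correct", "incorrect", "correct"])
```

Downstream, examples with a score of 0 would be filtered out of the training set, and the explanation shows which juror broke the consensus.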
Related
- Pairwise Evaluator: a single LLM judging output head-to-head against a reference baseline (blinded to avoid position bias).
- Composite Evaluator: combine different axes of judgment (from possibly different judges) into one score.
Further reading
- Who can we trust? LLM-as-a-jury for Comparative Assessment — Qian et al. Proposes BT-sigma, jointly inferring item rankings and judge reliability from pairwise comparisons so jurors with different trust levels can be aggregated without weighting them equally. Useful background when tuning the per-juror weights in the example above.
- 12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation — Ersoz. Finds anchoring is the dominant failure mode of LLMs in simulated jury deliberation — supporting the design choice in this page of running jurors independently in parallel rather than letting them see each other’s verdicts.

