Agent-as-a-Judge

Agent-as-a-Judge is available in closed Enterprise beta. Contact your Arize account team for access.

Agent-as-a-Judge runs evaluations through an agent harness rather than a single LLM API call. Claude Code is supported today; Cursor, Codex, Hermes, and OpenClaw are coming soon. You describe what to score in plain language; the agent pulls span and trace data from your project at run time, applies your criteria, and writes results back as eval columns, the same way LLM-as-a-judge and code evaluators do.

Use it when a fixed prompt and column mapping are too rigid for the judgment you need.

Why use it

Aspect	LLM-as-a-judge	Agent-as-a-Judge
How it scores	One judge prompt per span/trace; variables mapped to columns	Agent explores exported trace data and scores from your instructions
Setup	Template + column mappings	Scoring instructions only—no column mapping required
Best for	High-volume, repeatable checks with stable inputs	Nuanced criteria, multi-field reasoning, or evals that benefit from reading context across a trace

Agent-as-a-Judge is for subjective or complex quality checks where you want an agent to interpret production data—not just fill a template. Examples:

Relevance or helpfulness when the right answer depends on full trace context
Agent trajectory quality (tool choice, recovery, multi-step reasoning)
Custom rubrics that are easier to describe in prose than to wire into {variable} mappings

For deterministic rules (JSON shape, regex, keyword checks), use a code evaluator. For simple, high-throughput evals use LLM-as-a-judge. Many teams use all three on the same project.
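To illustrate the deterministic end of that spectrum, here is a minimal Python sketch of rule-based checks (JSON shape, keyword, regex). The function name, the order-ID pattern, the banned phrase, and the returned label/score shape are all hypothetical; this is not the Arize code evaluator interface, only an example of logic that needs no judge model.

```python
import json
import re

# Hypothetical pattern: an order ID such as "ORD-12345".
ORDER_ID_PATTERN = re.compile(r"\bORD-\d{5}\b")
# Hypothetical phrases that should never appear in an answer.
BANNED_KEYWORDS = ("as an ai language model",)


def check_output(output_text: str) -> dict:
    """Run deterministic checks on one span output and return a label and score."""
    # JSON shape: the output must parse to an object with an "answer" string.
    try:
        payload = json.loads(output_text)
    except json.JSONDecodeError:
        return {"label": "invalid_json", "score": 0.0}
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return {"label": "wrong_shape", "score": 0.0}

    answer = payload["answer"]
    # Keyword check: case-insensitive match against banned phrases.
    if any(keyword in answer.lower() for keyword in BANNED_KEYWORDS):
        return {"label": "banned_phrase", "score": 0.0}
    # Regex check: the answer must cite an order ID.
    if not ORDER_ID_PATTERN.search(answer):
        return {"label": "missing_order_id", "score": 0.0}
    return {"label": "pass", "score": 1.0}
```

Checks like these give the same result on every run, which is why they suit a code evaluator rather than a judge model or an agent.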

How it works

1. Configure the evaluator in the Evaluator Hub: Evaluators → Create → Agent-as-a-Judge. Select a harness (Claude Code is supported today; Cursor, Codex, Hermes, and OpenClaw are coming soon), pick an Anthropic model (or Auto), then write scoring instructions in plain language. Optional placeholders like {attributes.output.value} are filled from span data. Optionally define fixed labels or let the harness decide each run.

2. Attach to an online eval task on an LLM project (date range, query filter, and sampling rate), using the same flow as Run online evals on traces.

3. On each run, the platform starts the selected harness. The harness reads exported spans for the task window, scores them from your instructions, and publishes eval.<name>.* columns on the spans.

4. View results on traces, in dashboards, and in task run history. See View eval results.

The harness gets read access to traces on the bound project automatically. Add the optional Arize skill only if you need broader API access in the harness.
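The placeholder filling and the eval.<name>.* output described above can be pictured with a short Python sketch. It assumes a span is a flat dictionary keyed by dotted attribute paths, and it assumes the published columns use the suffixes label, score, and explanation; neither detail is stated on this page, and both function names are hypothetical.

```python
import re

# Matches placeholders such as {attributes.output.value}.
PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def fill_placeholders(instructions: str, span: dict) -> str:
    """Replace {dotted.path} placeholders with values from a flat span record."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value.
        return str(span[key]) if key in span else match.group(0)

    return PLACEHOLDER.sub(substitute, instructions)


def eval_columns(name: str, label: str, score: float, explanation: str) -> dict:
    """Build eval.<name>.* columns; the three suffixes are an assumption."""
    return {
        f"eval.{name}.label": label,
        f"eval.{name}.score": score,
        f"eval.{name}.explanation": explanation,
    }


# Hypothetical span record and instructions.
span = {
    "attributes.input.value": "Where is my order?",
    "attributes.output.value": "It shipped yesterday.",
}
prompt = fill_placeholders(
    "Is {attributes.output.value} relevant to {attributes.input.value}?", span
)
print(prompt)
print(eval_columns("relevance", "relevant", 1.0, "Answers the question directly."))
```

Placeholders are optional: without them, the harness still reads the exported spans and applies the instructions as written.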

Create an Agent-as-a-Judge evaluator

Open Evaluator Hub

Go to Evaluators in the space sidebar, then Create and choose Agent-as-a-Judge.

Select harness

Choose a harness. Claude Code is supported today; Cursor, Codex, Hermes, and OpenClaw are coming soon.

Select model

Pick an Anthropic AI integration and model, or Auto.

Write scoring instructions

Describe what good and bad look like, for example whether the assistant’s response is relevant to the user input given the full trace. The agent reads traces at run time; you do not map template variables to columns upfront.

Configure labels (optional)

Leave the "Let agent decide labels" option on for open-ended rubrics, or turn it off to define fixed labels and scores (for example, relevant / irrelevant).

Save to Evaluator Hub

The evaluator is versioned like LLM and code evaluators—reuse it across tasks.
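The difference between the two label modes in the Configure labels step can be sketched in Python. This is only an illustration: the function name, the score values, and the choice to raise an error on an unexpected label are assumptions, not documented platform behavior.

```python
# Hypothetical fixed labels and their scores.
FIXED_LABELS = {"relevant": 1.0, "irrelevant": 0.0}


def resolve_label(label, fixed_labels=None):
    """Return (label, score) for one judged span.

    With fixed labels, only labels from the configured set are accepted.
    Without them, the agent's label is passed through with no preset score.
    """
    if fixed_labels is None:
        # Open-ended mode: accept whatever label the agent produced.
        return label, None
    if label not in fixed_labels:
        raise ValueError(f"unexpected label {label!r}; expected one of {sorted(fixed_labels)}")
    return label, fixed_labels[label]


print(resolve_label("relevant", FIXED_LABELS))
print(resolve_label("partially relevant"))
```

Fixed labels keep results comparable across runs and easy to aggregate in dashboards; open-ended labels trade that consistency for flexibility.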

Run on production traces

After saving the evaluator, create or edit an online eval task on your LLM project and add the agent evaluator. Set project, time range, and sampling; these are the same controls as template eval tasks. Each task run provisions the selected harness, exports spans for that window up to the task’s span limit, and runs the harness on them with your scoring instructions. Cancel a run from task history to tear down the harness. Results appear as eval attributes on spans. Filter and monitor them in Tracing and Dashboards.
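How the time range, sampling rate, and span limit interact can be pictured with a minimal Python sketch. The platform's actual selection logic is not described here, so the function name, the span record shape, and the order of operations (window, then sample, then cap) are assumptions.

```python
import random
from datetime import datetime, timedelta, timezone


def select_spans(spans, start, end, sampling_rate, span_limit, seed=0):
    """Pick spans inside [start, end), sample them, and cap the count."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    in_window = [s for s in spans if start <= s["start_time"] < end]
    # Keep each span with probability sampling_rate.
    sampled = [s for s in in_window if rng.random() < sampling_rate]
    return sampled[:span_limit]


# Ten hypothetical spans, one per minute.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
spans = [{"span_id": i, "start_time": t0 + timedelta(minutes=i)} for i in range(10)]

# The window covers the first five minutes, a rate of 1.0 keeps every span
# in the window, and the limit caps the export at three spans.
selected = select_spans(spans, t0, t0 + timedelta(minutes=5), 1.0, 3)
print([s["span_id"] for s in selected])
```

Lowering the sampling rate or the span limit reduces how much data each harness run has to read.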
