Harness as a Judge is available on Enterprise plans.
Harness as a Judge runs evaluations inside a Claude Code sandbox, not as a single LLM API call. You describe what to score in plain language; the agent pulls span and trace data from your project at run time, applies your criteria, and writes results back as eval columns—same as LLM-as-a-judge and code evaluators. Use it when a fixed prompt and column mapping are too rigid for the judgment you need.

Why use it

Aspect | LLM-as-a-judge | Harness as a Judge
How it scores | One judge prompt per span/trace; variables mapped to columns | Agent in a sandbox explores exported trace data and scores from your instructions
Setup | Template + column mappings | Scoring instructions only; no column mapping required
Best for | High-volume, repeatable checks with stable inputs | Nuanced criteria, multi-field reasoning, or evals that benefit from reading context across a trace
Harness as a Judge is for subjective or complex quality checks where you want an agent to interpret production data—not just fill a template. Examples:
  • Relevance or helpfulness when the right answer depends on full trace context
  • Agent trajectory quality (tool choice, recovery, multi-step reasoning)
  • Custom rubrics that are easier to describe in prose than to wire into {variable} mappings
For deterministic rules (JSON shape, regex, keyword checks), use a code evaluator. For simple, high-throughput evals, use LLM-as-a-judge. Many teams use all three on the same project.
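To make the deterministic end of that spectrum concrete, the three rule types named above can be written as ordinary functions. This is a minimal sketch in plain Python. The function names are hypothetical and this is not the platform's code evaluator interface, only an illustration of the kind of check that needs no agent or LLM.

```python
import json
import re

# Hypothetical helper names for illustration; not the platform's code evaluator API.


def has_json_shape(output: str, required_keys: set[str]) -> bool:
    """Return True if `output` parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Only JSON objects (dicts) can satisfy a key requirement.
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def matches_pattern(output: str, pattern: str) -> bool:
    """Return True if the regular expression matches anywhere in `output`."""
    return re.search(pattern, output) is not None


def contains_keyword(output: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in `output`, ignoring case."""
    lowered = output.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

Checks like these give the same answer on every run, which is why they fit a code evaluator better than an agent that interprets instructions.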

How it works

Configure the evaluator in the Evaluator Hub: Evaluators → Create → Harness Evaluator. Select the harness (Claude Code today), pick an Anthropic model (or Auto), then write scoring instructions in plain language. Optional placeholders like {attributes.output.value} are filled from span data. Optionally define fixed labels or let the agent decide each run.
Attach the evaluator to an online eval task on an LLM project and set the date range, query filter, and sampling rate. This is the same flow as Run online evals on traces.
On each run, the platform starts a sandbox for that harness. The agent reads exported spans for the task window, scores them from your instructions, and publishes eval.<name>.* columns on the spans.
View results on traces, in dashboards, and in task run history. See View eval results.
The agent gets read access to traces on the bound project automatically. Add the optional Arize skill only if you need broader API access in the sandbox.
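Placeholders such as {attributes.output.value} read like dotted paths into a span's data. The sketch below shows one way that kind of substitution could work. It assumes a span is available as a nested dictionary, which may not match the exported format, and the function names are hypothetical rather than anything the platform exposes.

```python
import re
from typing import Any

# Matches {dotted.path} placeholders made of letters, digits, underscores, and dots.
PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def lookup(span: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current: Any = span
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def fill_placeholders(instructions: str, span: dict[str, Any]) -> str:
    """Replace each {dotted.path} with the matching span value, if one exists."""

    def replace(match: re.Match) -> str:
        value = lookup(span, match.group(1))
        # Unresolved placeholders are left exactly as written (an assumption of this sketch).
        return match.group(0) if value is None else str(value)

    return PLACEHOLDER.sub(replace, instructions)


# Hypothetical span, assumed to be a nested dict for this illustration.
span = {"attributes": {"output": {"value": "Paris"}}}
filled = fill_placeholders("Is the answer {attributes.output.value} relevant?", span)
```

Leaving unresolved placeholders untouched is a design choice of this sketch; how the platform treats a placeholder with no matching span field is not stated here.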

Create a harness evaluator

1

Open Evaluator Hub

Go to Evaluators in the space sidebar, then Create and choose Harness Evaluator.
2

Select harness

Choose Claude Code as the evaluation harness (additional harnesses are coming soon).
3

Select model

Pick an Anthropic AI integration and model, or Auto.
4

Write scoring instructions

Describe what good and bad look like, for example whether the assistant’s response is relevant to the user input given the full trace. The agent reads traces at run time; you do not map template variables to columns upfront.
5

Configure labels (optional)

Leave "Let agent decide labels" on for open-ended rubrics, or turn it off to define fixed labels and scores (for example, relevant / irrelevant).
6

Save to Evaluator Hub

The evaluator is versioned like LLM and code evaluators—reuse it across tasks.
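The label choice in step 5 amounts to two modes: either the agent picks labels freely, or every result must use one of a fixed set of labels with a predefined score. A rough sketch of that distinction follows. The class, the field names, and the 1.0 / 0.0 scores are all assumptions made for illustration, not the evaluator's stored configuration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class LabelConfig:
    """Hypothetical stand-in for the label settings chosen in step 5."""

    let_agent_decide: bool = True
    # Maps each allowed label to its score; only used when let_agent_decide is False.
    fixed_labels: dict[str, float] = field(default_factory=dict)


def score_for(config: LabelConfig, label: str) -> Optional[float]:
    """Return the predefined score for a label, or None when the agent decides freely."""
    if config.let_agent_decide:
        # Open-ended rubric: any label is accepted and no fixed score applies.
        return None
    if label not in config.fixed_labels:
        raise ValueError(f"Label {label!r} is not one of {sorted(config.fixed_labels)}")
    return config.fixed_labels[label]


# Assumed scores for the relevant / irrelevant example.
fixed = LabelConfig(let_agent_decide=False, fixed_labels={"relevant": 1.0, "irrelevant": 0.0})
open_ended = LabelConfig()
```

Fixed labels keep results comparable across runs, while the open-ended mode suits rubrics whose categories are not known upfront.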

Run on production traces

After saving the evaluator, create or edit an online eval task on your LLM project and add the harness evaluator. Set project, time range, and sampling—the same controls as template eval tasks. Each task run provisions a sandbox, exports up to the task’s span limit for that window, and runs Claude Code with your scoring instructions. Cancel a run from task history to tear down the sandbox. Results appear as eval attributes on spans. Filter and monitor them in Tracing and Dashboards.
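The run controls described above (time range, sampling, span limit) and the resulting eval attributes can be pictured with a small sketch. Everything here is illustrative: the function names, the start_time field, the per-span random sampling, and the label / score / explanation suffixes are assumptions, since only the eval.<name>.* prefix is stated.

```python
import random
from datetime import datetime
from typing import Any, Optional


def select_spans(
    spans: list[dict[str, Any]],
    start: datetime,
    end: datetime,
    sampling_rate: float,
    span_limit: int,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Keep spans inside [start, end), sample them at the given rate, then cap the count."""
    rng = random.Random(seed)
    in_window = [s for s in spans if start <= s["start_time"] < end]
    # rng.random() is in [0.0, 1.0), so a rate of 1.0 keeps everything and 0.0 keeps nothing.
    sampled = [s for s in in_window if rng.random() < sampling_rate]
    return sampled[:span_limit]


def eval_columns(name: str, label: str, score: float, explanation: str) -> dict[str, Any]:
    """Build eval attributes under the eval.<name>. prefix (suffixes are assumed)."""
    return {
        f"eval.{name}.label": label,
        f"eval.{name}.score": score,
        f"eval.{name}.explanation": explanation,
    }


# Hypothetical spans, one per hour on a single day.
spans = [{"id": i, "start_time": datetime(2024, 1, 1, i)} for i in range(6)]
picked = select_spans(spans, datetime(2024, 1, 1, 1), datetime(2024, 1, 1, 5), 1.0, 3)
```

Applying the limit after sampling means a low sampling rate can return fewer spans than the limit allows; the platform's actual ordering of these steps is not specified here.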