Harness as a Judge is available on Enterprise plans.
Why use it
| | LLM-as-a-judge | Harness as a Judge |
|---|---|---|
| How it scores | One judge prompt per span/trace; variables mapped to columns | Agent in a sandbox explores exported trace data and scores from your instructions |
| Setup | Template + column mappings | Scoring instructions only—no column mapping required |
| Best for | High-volume, repeatable checks with stable inputs | Nuanced criteria, multi-field reasoning, or evals that benefit from reading context across a trace |
- Relevance or helpfulness when the right answer depends on full trace context
- Agent trajectory quality (tool choice, recovery, multi-step reasoning)
- Custom rubrics that are easier to describe in prose than to wire into {variable} mappings
How it works
Configure the evaluator in the Evaluator Hub (Evaluators → Create → Harness Evaluator). Select the harness (Claude Code today), pick an Anthropic model (or Auto), then write scoring instructions in plain language. Optional placeholders like {attributes.output.value} are filled from span data. You can define fixed labels or let the agent decide labels on each run.
Attach the evaluator to an online eval task on an LLM project and set the date range, query filter, and sampling rate. The flow is the same as in Run online evals on traces.
On each run, the platform starts a sandbox for that harness. The agent reads exported spans for the task window, scores them from your instructions, and publishes eval.<name>.* columns on the spans.
View results on traces, in dashboards, and in task run history. See View eval results.
The agent gets read access to traces on the bound project automatically. Add the optional Arize skill only if you need broader API access in the sandbox.
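To illustrate how a placeholder such as {attributes.output.value} could be filled from span data, here is a minimal Python sketch. It is not the platform's implementation: the span dictionary, the `lookup` and `fill_placeholders` helpers, and the choice to leave unknown placeholders untouched are all assumptions made for illustration.

```python
import re

# Hypothetical span record; only the attribute path is taken from the docs above.
span = {"attributes": {"output": {"value": "Paris is the capital of France."}}}

# Matches {dotted.path} placeholders such as {attributes.output.value}.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][\w.]*)\}")


def lookup(data, dotted_path):
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def fill_placeholders(instructions, span_data):
    """Replace each placeholder with its span value; leave unknown paths as written."""
    def substitute(match):
        value = lookup(span_data, match.group(1))
        return match.group(0) if value is None else str(value)

    return PLACEHOLDER.sub(substitute, instructions)


instructions = "Judge whether this answer is relevant: {attributes.output.value}"
print(fill_placeholders(instructions, span))
```

In this sketch a placeholder whose path is absent from the span stays in the text unchanged. How the product handles missing attributes is not stated here, so treat that behavior as an assumption.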
Create a harness evaluator
Select model
Pick an Anthropic AI integration and model, or Auto.
Write scoring instructions
Describe what good and bad look like, for example whether the assistant’s response is relevant to the user input given the full trace. The agent reads traces at run time; you do not map template variables to columns upfront.
Configure labels (optional)
Leave Let agent decide labels on for open-ended rubrics, or turn it off to define fixed labels and scores (for example, relevant / irrelevant).
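As a rough picture of what a fixed label set amounts to, the Python sketch below pairs each allowed label with a score and rejects anything outside the set. The `FIXED_LABELS` mapping, the score values, and the `to_result` helper are hypothetical; the product's own configuration format and validation are not described here.

```python
from dataclasses import dataclass

# Hypothetical fixed label set: each allowed label maps to a numeric score.
FIXED_LABELS = {"relevant": 1.0, "irrelevant": 0.0}


@dataclass
class EvalResult:
    label: str
    score: float


def to_result(raw_label, labels=FIXED_LABELS):
    """Normalize a label and attach its score; reject labels outside the fixed set."""
    normalized = raw_label.strip().lower()
    if normalized not in labels:
        raise ValueError(f"label {raw_label!r} is not one of {sorted(labels)}")
    return EvalResult(label=normalized, score=labels[normalized])


print(to_result(" Relevant "))
```

Constraining labels this way keeps results comparable across runs, which matters when scores feed dashboards. Open-ended rubrics give up that consistency in exchange for letting the agent choose labels that fit each trace.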