View trace and span level results

Evaluation results attach directly to your spans. Open any trace in the Tracing view and use the evaluation panel on each span to inspect labels, scores, and explanations. Results also appear at trace and session scope where configured.
Screenshot: Playground Traces with summary charts for traffic, span latency, tokens and cost, and a custom eval metric, plus a traces table showing Span Evaluations tags per row alongside latency and token columns.
Screenshot: Trace detail with the span tree and the Evaluations tab open for a ChatCompletion span, showing span-scoped eval rows with name, label, score, and explanation for qa and hallucination.

View results on experiments

For evals you run on a dataset experiment, open the experiment table on the dataset and use View Eval Trace from a row to open the evaluated trace. Use Compare Experiments to see examples, eval labels, and judge explanations side by side. The Playground also shows per-row annotation labels, model output, an aggregate average Human v AI alignment score, and per-row Human v AI alignment tags. See Run offline evals on experiments for screenshots and full UI detail.
Screenshot: Playground with a tone evaluator prompt and a results table showing annotation labels, model output, an aggregate Avg Human v AI align score, and per-row Human v AI align tags for aligned or not aligned.
Screenshot: Compare Experiments table tab for a dataset showing example rows, an experiment output column, Evals tags for correct and incorrect, and a popover with score, label, and explanation for an evaluator.

Configure dashboards

Aggregate eval trends alongside latency, errors, and usage in Dashboards. Add widgets that query evaluation labels and scores to monitor quality over time. Custom SQL metrics can also incorporate evaluation columns. See Custom metric examples.
Screenshot: Dashboard widget editor for an Eval Result bar chart counting eval labels from a tracing project, with Data settings for project, eval attribute, and filters.

Debug

Logs

The Task Logs page shows your task configuration, including which evaluators and datasource are attached, alongside a run history with timestamp, status, and trigger for each run. From any row you can view the evals or jump directly to the trace. To get there: open the Evaluators page, select the Running Tasks tab, and open any task.
Screenshot: Evaluators Running Eval Tasks with the Task Logs side panel showing evaluators and datasource, a run history chart, and per-run status with View Evals and View Trace actions.

View eval traces

From the task logs, click View Trace on a run to jump directly to the spans evaluated in that run, with the same date range and filters applied. If the task ran with a sampling rate below 100% or with span caps, not every span will have evaluation results attached. On a dataset Experiments tab, open the row menu on a run and choose View Eval Trace to open the evaluated trace, or View Logs for that experiment’s run output.
Screenshot: Dataset Experiments tab with summary metrics and experiment rows, with the row menu open showing View Eval Trace and View Logs.

Track evaluation cost

Evaluations, especially LLM-as-a-judge runs at scale, consume tokens and incur model spend. Use Arize cost tracking and project metrics to reason about evaluation cost alongside your application's LLM cost.
Note: Playground runs display evaluation cost inline. For production tasks, configure cost tracking as described below.
Screenshot: Playground experiment results table with average cost per row and a Total Cost popover showing aggregate output spend for the eval run.

Configure cost tracking

Arize AX tracks model usage from traces using token fields and your pricing configuration. Set this up before you rely on cost dashboards; cost tracking is not retroactive. See Cost tracking for how the lookup works, which token types are supported, and how to configure default or custom pricing.

Estimate evaluation cost with custom metrics

If you need a ballpark for judge spend, you can combine trace token counts with an estimate of your evaluation prompt size and judge output length. The custom metric examples page includes an Evaluation Cost Estimate SQL pattern you can adapt to your templates.
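The estimate itself is simple arithmetic over token counts and prices. A minimal Python sketch of the same idea (all token counts and per-1K prices below are illustrative assumptions, not Arize values):

```python
def estimate_eval_cost(
    spans_evaluated: int,
    avg_span_tokens: int,
    template_overhead_tokens: int,
    avg_judge_output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Ballpark LLM-as-a-judge spend: each evaluated span sends the span
    content plus the evaluation template as judge input, and the judge
    emits a short label/explanation as output."""
    input_tokens = spans_evaluated * (avg_span_tokens + template_overhead_tokens)
    output_tokens = spans_evaluated * avg_judge_output_tokens
    return (
        (input_tokens / 1000) * input_price_per_1k
        + (output_tokens / 1000) * output_price_per_1k
    )

# Example: 10,000 spans, ~600 span tokens plus ~250 template tokens in,
# ~80 judge tokens out, at hypothetical $0.005 / $0.015 per 1K tokens.
cost = estimate_eval_cost(10_000, 600, 250, 80, 0.005, 0.015)
print(f"${cost:,.2f}")  # → $54.50
```

Swap in your own template sizes and model pricing; the structure mirrors the SQL pattern of multiplying token volume by per-token rates.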

Reduce evaluation spend

  • Lower the sampling rate on online evaluation tasks.
  • Prefer code evaluators for objective checks when they are sufficient. See Create evaluators.
  • Reuse a single well-versioned judge in the Evaluator Hub instead of duplicating prompts across tasks.
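The first two bullets can be sketched as a per-span evaluation plan (a hypothetical helper for illustration, not an Arize API): every span gets the cheap objective code check, and only a sampled fraction is queued for the LLM judge.

```python
import random

def plan_span_evals(spans, sampling_rate=0.10, rng=None):
    """Sketch of sampling plus code-first evaluation: run the cheap
    code check on every span, but only send a sampled fraction of
    spans to the LLM judge."""
    rng = rng or random.Random()
    return [
        {
            "span": span,
            "code_check": True,                         # objective check: run on everything
            "llm_judge": rng.random() < sampling_rate,  # judge call only for sampled spans
        }
        for span in spans
    ]

# At a 10% sampling rate, roughly one span in ten reaches the judge,
# so judge spend scales down roughly linearly with the rate.
plan = plan_span_evals(range(1_000), sampling_rate=0.10, rng=random.Random(42))
print(sum(p["llm_judge"] for p in plan))
```

In Arize itself the sampling rate is set on the online evaluation task; the sketch just shows why lowering it cuts judge spend proportionally while the code check still covers every span.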