Agent evaluation metrics: how to measure whether an agent works

Learn how to choose agent evaluation metrics by agent type, including quality, cost, safety, behavior, and performance KPIs.

Your agent succeeded if it completed the task correctly, safely, and within the constraints of the workflow, and your metric should reflect that when you’re building evaluations.

That sounds obvious until you look at most agent dashboards. They track latency, tokens, traces, and tool calls.

These are useful signals, but not proof in and of themselves that the agent did the job.

A useful agent metric should help answer one of three questions: did the agent complete the task, why did it fail, or what should we change next?

As agents move into production workflows, agent evaluation is becoming a continuous engineering practice.

That practice needs more than generic system metrics. It needs job-specific metrics: resolution for support, test pass rate for code, and citation accuracy for research.

From there, teams can add shared signals across the full agent path: step quality, cost, safety, latency, and behavior.

Agent evaluation metrics should start with the job

Agent metrics break when teams start with generic scores.

Start with the work the agent performs and then ask what a successful run should look like.

That job depends on what the workflow owner wants automated, and where they feel confident letting the agent act.

For each workflow, define:

  • What counts as success?
  • What failure modes matter most?
  • What evidence proves success or failure?
  • What action will the team take if the metric moves?

Some examples:

  • A support owner may want the agent to resolve password resets, but escalate billing disputes.
  • An engineering lead may want a coding agent to fix tests, but leave architecture changes for review.
  • A research team may want citation-backed summaries, but block unsupported claims.
  • An operations team may allow read-only lookups, but require approval before writing actions.

The metric should match the task, the user, and the cost of failure. Otherwise, the dashboard creates false confidence.

Agent quality metrics: correctness, grounding, and task fit

Different agents need different evaluation KPIs because they create value in different ways.

“Quality” should not be treated as a single score. For agents, quality usually combines task success, factual correctness, instruction following, tool-use correctness, and whether the final answer was useful to the user.

The primary metric should prove the agent completed its main job. Supporting metrics should catch the failure modes around it.

Examples by agent type:

  • Support agent: primary metric is resolution rate; supporting metrics include escalation rate, CSAT, and first-contact resolution.
  • Coding agent: primary metric is test pass rate; supporting metrics include build success, review acceptance, and regression rate.
  • Research agent: primary metric is citation accuracy; supporting metrics include source relevance, claim support, and freshness.
  • Sales agent: primary metric is qualified action completion; supporting metrics include CRM accuracy, handoff quality, and response latency.
  • Operations agent: primary metric is task completion rate; supporting metrics include tool success, retry rate, and approval rate.

Start with the metric that proves the agent did its job, and then add the metrics that explain failure modes. For example, resolution rate may tell you a support agent is underperforming while tool-choice accuracy, retrieval quality, or escalation reason may tell you why.
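To make that concrete, here is a minimal sketch in plain Python (with hypothetical run records) that computes a primary metric such as resolution rate alongside a supporting breakdown of failure reasons:

```python
from collections import Counter

# Hypothetical run records; in practice these come from your trace store.
runs = [
    {"resolved": True,  "failure_reason": None},
    {"resolved": False, "failure_reason": "wrong_tool"},
    {"resolved": False, "failure_reason": "retrieval_miss"},
    {"resolved": True,  "failure_reason": None},
    {"resolved": False, "failure_reason": "retrieval_miss"},
]

# Primary metric: did the agent do its job?
resolution_rate = sum(r["resolved"] for r in runs) / len(runs)

# Supporting metric: why did the failed runs fail?
failure_reasons = Counter(r["failure_reason"] for r in runs if not r["resolved"])

print(f"Resolution rate: {resolution_rate:.0%}")            # 40% for this sample
print(f"Failure reasons: {failure_reasons.most_common()}")  # what to fix first
```

The primary number tells you whether the agent is working; the failure-reason breakdown tells you where to look next.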

Metrics that create false confidence

Some metrics are useful only as supporting signals:

  • Total tool calls: high or low is not inherently good.
  • Average latency: hides slow tail behavior; use P95 or P99.
  • Token volume: useful only when tied to task success.
  • LLM judge score: useful only when calibrated against labeled examples (see the calibration sketch below).
  • User thumbs-up/down: helpful, but often sparse and biased.

These metrics become more useful when connected to outcomes, traces, and failure categories.
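An LLM judge score, for example, only earns trust once it is checked against a small set of human-labeled runs. A minimal calibration sketch, assuming you already have paired judge and human labels for the same examples:

```python
# Hypothetical paired labels for the same runs; the human labels come
# from a small, curated review set.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "fail"]

# Agreement rate between the judge and the human reviewers.
agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)

# The disagreements are the examples worth reviewing first.
disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]

print(f"Judge/human agreement: {agreement:.0%}")   # 67% here: not yet trustworthy on its own
print(f"Disagreeing examples to review: {disagreements}")
```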

Cost metrics for agent efficiency

Cost metrics show whether the agent completed the task at a price the workflow can support.

These metrics are starting points. Each can become more specific as the workflow gets more traffic and the team learns where cost actually comes from. (We’ll follow the same pattern for other metrics as well.)

Cost per resolution

Cost per resolution measures the full price of a completed task across the agent path.

For a support agent, that includes prompts, retrieved context, tool calls, retries, judge passes, handoffs, and human review.

That full-chain view is where agent evaluation connects to AI ROI and workflow evidence. The useful question is whether spending produced a completed workflow.

A practical version is simple: divide total agent cost by successful resolutions, completed workflows, approved outputs, or other task-level outcomes.
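For illustration, a minimal sketch of that calculation in plain Python, assuming each run record already carries its total cost and whether it resolved the task:

```python
# Hypothetical per-run records; cost should include prompts, retrieval,
# tool calls, retries, judge passes, and review, not just the final LLM call.
runs = [
    {"cost_usd": 0.042, "resolved": True,  "model_route": "small"},
    {"cost_usd": 0.310, "resolved": True,  "model_route": "large"},
    {"cost_usd": 0.095, "resolved": False, "model_route": "large"},
    {"cost_usd": 0.038, "resolved": True,  "model_route": "small"},
]

total_cost = sum(r["cost_usd"] for r in runs)
resolutions = sum(r["resolved"] for r in runs)

# Divide total spend by completed tasks, not by requests.
cost_per_resolution = total_cost / resolutions if resolutions else float("inf")

print(f"Cost per resolution: ${cost_per_resolution:.3f}")
```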

From there, teams can break the metric down by:

  • model route
  • retry cost
  • judge cost
  • tool-call cost
  • review cost
  • and escalation cost.

Tokens per task

Tokens per task shows how much context the agent needed to complete one task.

Long prompts, oversized retrieval results, hidden system prompts, repeated attempts, and provider-side overhead can turn small requests into expensive runs.

This is where span-level token tracking across LLM calls and traces helps. Agent cost often comes from several calls across planning, retrieval, tool use, judging, and final response generation.

Teams can split this metric further into:

  • prompt tokens
  • completion tokens
  • cached tokens
  • new tokens
  • tokens by model route
  • and tokens by workflow step.
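A minimal sketch of that aggregation, assuming each LLM span in a trace reports its own prompt and completion token counts:

```python
from collections import defaultdict

# Hypothetical LLM spans, several per trace: planning, retrieval-augmented
# answer, judge pass, and so on.
llm_spans = [
    {"trace_id": "t1", "step": "plan",   "prompt_tokens": 800,  "completion_tokens": 120},
    {"trace_id": "t1", "step": "answer", "prompt_tokens": 3200, "completion_tokens": 450},
    {"trace_id": "t1", "step": "judge",  "prompt_tokens": 900,  "completion_tokens": 40},
    {"trace_id": "t2", "step": "answer", "prompt_tokens": 1100, "completion_tokens": 300},
]

# Sum tokens across every span that belongs to the same task.
tokens_per_task = defaultdict(lambda: {"prompt": 0, "completion": 0})
for span in llm_spans:
    totals = tokens_per_task[span["trace_id"]]
    totals["prompt"] += span["prompt_tokens"]
    totals["completion"] += span["completion_tokens"]

for trace_id, totals in tokens_per_task.items():
    print(trace_id, totals["prompt"] + totals["completion"], "total tokens", totals)
```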

Latency P95

Latency P95 shows the slow edge of the workflow. These are the runs users remember.

Slow runs often come from long context, repeated tool calls, retry loops, weak routing, slow providers, and backed-up infrastructure.

For high-volume agents, this can branch into:

  • queue depth
  • batching efficiency
  • GPU utilization
  • warm capacity
  • throughput
  • timeout rate
  • and provider response time.

Track P95 beside cost so teams can find agent paths that are slow, expensive, or wasteful.
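A minimal sketch of that side-by-side view, assuming per-run latency and cost are already recorded for one workflow:

```python
import statistics

# Hypothetical per-run measurements for one workflow.
latencies_s = [1.8, 2.1, 2.4, 2.2, 9.7, 2.0, 2.3, 11.4, 2.5, 2.1]
costs_usd   = [0.04, 0.05, 0.04, 0.05, 0.21, 0.04, 0.05, 0.26, 0.05, 0.04]

# quantiles(n=100) returns the 1st..99th percentile cut points; index 94 is P95.
p95_latency = statistics.quantiles(latencies_s, n=100)[94]
mean_latency = statistics.mean(latencies_s)

print(f"Mean latency: {mean_latency:.1f}s")                  # hides the slow tail
print(f"P95 latency:  {p95_latency:.1f}s")                   # the runs users remember
print(f"Mean cost per run: ${statistics.mean(costs_usd):.3f}")
```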

Safety metrics for agent risk

Safety metrics show whether the agent handles risk before it reaches the user, system, or business process.

The first thing to measure is the agent’s permission boundary. A read-only agent and a write-capable agent carry very different risks.

From there, safety metrics should track where the agent stops, escalates, blocks output, or avoids unsafe tool use.

Refusals and blocked outputs

Refusal rate shows how often the agent declines requests outside the allowed workflow.

Track harmful requests, restricted content, sensitive-data exposure attempts, unsupported actions, and blocked write requests.

As the workflow matures, this can expand into:

  • refusal reason
  • policy category
  • user intent
  • false refusal rate
  • blocked output type

The goal is to see whether the agent blocks risky requests while still completing safe ones.

Escalation rate

Escalation rate shows when the agent sends the workflow to a human.

Escalation looks different depending on the workflow:

  • For support agents, that may include billing disputes, account access issues, frustrated customers, or missing context.
  • For internal agents, escalation may mean approval before write actions, legal review, security review, or manager sign-off.

This can expand into escalation reasons, escalation timing, human takeover rate, and resolution after escalation.

Teams should separate good escalations from bad escalations. Escalating a high-risk billing dispute may be correct. Escalating a simple password reset because retrieval failed is a system problem.
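One way to make that separation measurable is to compare each escalation's reason against the reasons the workflow owner considers escalation-worthy. A minimal sketch, with hypothetical reason labels:

```python
# Reasons the workflow owner has decided should go to a human.
ESCALATION_WORTHY = {"billing_dispute", "account_access", "legal_review"}

# Hypothetical escalated runs, each tagged with a reason.
escalations = [
    {"trace_id": "t1", "reason": "billing_dispute"},
    {"trace_id": "t2", "reason": "retrieval_failed"},   # system problem, not a good escalation
    {"trace_id": "t3", "reason": "account_access"},
    {"trace_id": "t4", "reason": "retrieval_failed"},
]

good = [e for e in escalations if e["reason"] in ESCALATION_WORTHY]
bad  = [e for e in escalations if e["reason"] not in ESCALATION_WORTHY]

print(f"Good escalations: {len(good)} of {len(escalations)}")
print(f"Escalations caused by system problems: {[e['trace_id'] for e in bad]}")
```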

Unsafe tool actions

Unsafe tool-action metrics show whether the agent tried to use tools in risky or incorrect ways.

This matters when agents can update records, issue refunds, send messages, change permissions, delete data, or trigger external workflows.

Track the specific risky actions the agent attempts: unapproved record updates, permission changes, data deletions, and write actions that were blocked or required approval.

This metric becomes more important as agents move from answering questions to changing system state.

Behavior metrics for agent paths

Behavior metrics show how the agent moves through a workflow during normal use. Safety metrics catch boundary failures; behavior metrics catch the smaller changes teams deal with every day: longer answers, extra retries, weaker routing, repeated retrieval, and shifting completion paths.

These shifts are hard to debug in a multi-vendor stack. The agent may depend on one model provider, a separate retriever, another tool layer, and shared infrastructure underneath it.

Catching behavioral changes in agent workflows

A behavior change can come from many places: a model update, a reasoning change, a routing rule, a retriever update, a provider-side issue, or cluster allocation.

Behavior metrics show what changed inside the agent path before the final answer becomes obviously worse.

Useful behavior metrics, and what each one shows:

  • Steps per task: how much work the agent needs to finish the workflow
  • Retry rate: where the agent gets stuck or repeats the same step
  • Retrieval repetition: whether the agent keeps searching for the same context
  • Tool-use frequency: whether the agent depends on tools more than expected
  • Clarification rate: whether the agent has enough context to act
  • Answer length: whether responses are drifting longer, shorter, or less direct
  • Completion path: which route the agent usually takes to finish the task

Extra steps or higher numbers are not automatically bad. The problem is unexplained change. If the same workflow suddenly takes more steps, costs more, or asks more clarifying questions, behavior metrics tell the team where to look.
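A minimal sketch of how a team might surface that kind of unexplained change, assuming each run of the same workflow is tagged with a week and a step count:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical runs of one workflow, grouped by week.
runs = [
    {"week": "2025-W01", "steps": 4}, {"week": "2025-W01", "steps": 5},
    {"week": "2025-W01", "steps": 4}, {"week": "2025-W02", "steps": 7},
    {"week": "2025-W02", "steps": 8}, {"week": "2025-W02", "steps": 6},
]

steps_by_week = defaultdict(list)
for run in runs:
    steps_by_week[run["week"]].append(run["steps"])

previous_avg = None
for week in sorted(steps_by_week):
    avg = mean(steps_by_week[week])
    # Flag weeks where steps per task jump well above the previous week.
    flag = " <-- investigate" if previous_avg and avg > previous_avg * 1.3 else ""
    print(f"{week}: {avg:.1f} steps per task{flag}")
    previous_avg = avg
```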

How to build an agent evaluation dashboard

An agent evaluation dashboard should help teams decide what to fix next, not collect every metric the system can emit.

Start with two metrics

Start with one outcome metric and one failure metric. A dashboard with twenty charts on day one usually creates noise.

The outcome metric should prove the agent did the job. That could be resolution rate, test pass rate, citation accuracy, or task completion.

The failure metric should point to the most expensive, frequent, or risky way the workflow breaks. That could be escalation rate, retry rate, unsafe tool attempts, or cost per completed task.

Build the evaluation MVP

Arize AX Demo (2025): One place for development, observability, and evaluation.

Once those metrics are defined, the next step is making them real inside the workflow.

In Arize AX, teams can attach evaluation results to spans, traces, and sessions, then inspect labels, scores, explanations, latency, tokens, and cost in the same trace view.

That means the outcome metric can become a trace-level score, while the failure metric can become a span-level or path-level signal.

A useful MVP looks like this (a minimal instrumentation sketch follows the list):

  • Instrument the agent trace.
  • Log the outcome metric.
  • Score the failure metric.
  • Connect scores to examples.
  • Review regressions weekly.
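Here is a minimal sketch of the first two steps using the OpenTelemetry Python API. The attribute names and the run_support_agent function are assumptions for illustration, and the sketch presumes a tracer provider and exporter are already configured for whatever backend receives your spans, Arize AX or otherwise.

```python
from opentelemetry import trace

# Assumes a tracer provider and exporter are already configured elsewhere.
tracer = trace.get_tracer("agent.evaluation")

def run_support_agent(ticket: str) -> dict:
    """Hypothetical agent entry point; replace with your own agent call."""
    return {"answer": "Password reset link sent.", "resolved": True, "escalated": False}

with tracer.start_as_current_span("agent_run") as span:
    result = run_support_agent("I can't log in to my account")

    # Outcome metric: did the agent do the job?
    span.set_attribute("eval.resolved", result["resolved"])

    # Failure metric: the most expensive or risky way this workflow breaks.
    span.set_attribute("eval.escalated", result["escalated"])
```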

For agent workflows, Arize’s agent evaluation concepts map well here because they separate what the agent knows, what actions it can take, and what path it follows.

For teams already running evals continuously, Arize also supports viewing eval results and costs so the dashboard tracks evaluation spend alongside application behavior.

The dashboard should turn agent metrics into the next operating decision: what to fix, what to ship, and what to watch.

Update the dashboard as the agent improves

Fixed metrics fail when the agent outgrows the original failure mode.

Early agents may need basic correctness checks. Better agents usually need tighter cost, behavior, safety, and edge-case metrics.

The dashboard should change when the bottleneck changes. A support agent may start with resolution rate, then may need escalation quality once basic answers improve.

The best dashboard keeps the metric set small, but updates it as production failures change.

How Arize helps teams evaluate agent performance

Arize helps teams connect agent evaluation metrics to the traces, spans, datasets, and experiments behind them.

A metric is only useful when the team can inspect the run behind it. If resolution drops, cost rises, or behavior shifts, teams need the trace that shows what actually happened.

Connect traces, evals, datasets, and experiments

Agent teams often work across a scattered stack.

One team owns the model route. Another owns retrieval. Another owns tools, deployment, or evaluation.

That makes agent metrics hard to use as a shared standard.

The fix is to connect the evaluation loop: datasets, experiments, evaluators, prompts, and production examples.

With datasets and experiments in Arize, teams can test prompt or model changes against curated examples, score outputs with evaluators, and compare runs over time.

That gives internal teams and customer-facing teams the same evaluation record. The metric is no longer floating by itself. It belongs to an example, a run, and a change the team can review.

Evaluate the full agent path

Agent metrics need to map to the actual steps inside the run.

Arize traces break an agent request into spans, so teams can inspect the work behind each outcome.

Useful span types, and what teams can evaluate at each one:

  • LLM span: answer quality, token use, latency, model route
  • Retriever span: source relevance, missing context, repeated search
  • Tool span: tool choice, argument quality, success or failure
  • Guardrail span: blocked outputs, refusals, policy checks
  • Reranker span: ranking quality, source order, retrieval drift
  • Agent span: full path, handoffs, retries, behavior drift

This makes cost, safety, quality, and behavior metrics easier to debug because each score can point back to the span or path that produced it.

Make metrics usable across teams

Agent evaluation creates the most value when the whole team can read, question, and act on the results.

Support, operations, product, finance, and GTM teams may all need AI performance numbers, but most should not need to write queries or inspect raw traces to get them.

Engineers still need the deeper view:

  • raw traces
  • span details
  • JSON payloads
  • model inputs
  • tool outputs
  • evaluator logs

Other teams usually need a cleaner operating view:

  • examples
  • labels
  • scores
  • explanations
  • dashboards
  • failure patterns

That is why the same metric needs multiple surfaces. A behavior metric can appear as a visual path for review, a structured trace for debugging, and a dashboard trend for reporting.

Agent Trajectory views help here because teams can inspect agent execution as a visual map of paths, handoffs, loops, and failure points.

That makes evaluation easier to discuss. The team can point to the path that changed, the step that repeated, or the handoff that failed.

Use agents to move faster and inspect evaluation results

As evaluation data grows, teams need faster ways to find failed traces, inspect span details, create datasets, and compare experiment results. Our AI engineering agent Alyx helps teams search and work across traces, prompts, datasets, evals, and experiments without starting from raw filters every time.

Alyx helps across the evaluation workflow:

  • Find failed traces
  • Inspect span details
  • Build custom evals
  • Create datasets from spans
  • Compare experiment results
  • Search traces in natural language

FAQ: Agent evaluation metrics

How do you choose metrics for different agent types?

Start with the workflow owner’s automation goal.

  • If the workflow owner wants the agent to resolve simple support cases, measure resolution rate and escalation quality.
  • If the goal is code repair, measure test pass rate, build success, and review acceptance.
  • If the goal is research, measure citation accuracy, source relevance, and claim support.

The metric should match the task, the user, and the cost of failure.

How do you measure agent quality?

Agent quality should be measured with both outcome metrics and step-level metrics.

Outcome metrics show whether the user succeeded.

Step-level metrics show whether the agent retrieved the right context, followed instructions, called tools correctly, and completed the workflow cleanly.

For example, a research agent may produce a fluent answer, but still fail if the citation does not support the claim. A support agent may cite the right policy, but still fail if the customer needs to escalate.

What is the difference between agent evaluation and LLM evaluation?

LLM evaluation usually focuses on the quality of a model response. Agent evaluation covers the full workflow: planning, retrieval, tool calls, guardrails, handoffs, retries, final output, cost, and latency. An agent can produce a fluent answer and still fail if it used the wrong tool, skipped an approval step, or relied on unsupported context.

How do you evaluate agent tool use?

Evaluate tool use by checking whether the agent chose the right tool, passed the right arguments, followed permission boundaries, and handled tool failures correctly. Useful metrics include tool-use accuracy, tool success rate, wrong-tool attempts, unsafe tool actions, retry rate, and approval rate.
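A minimal sketch of a tool-choice and argument check, assuming each run records the tool call the agent made and the call a reviewer expected:

```python
# Hypothetical runs: what the agent called versus what a reviewer expected.
runs = [
    {"called": ("lookup_order", {"order_id": "A1"}), "expected": ("lookup_order", {"order_id": "A1"})},
    {"called": ("issue_refund", {"order_id": "B2"}), "expected": ("lookup_order", {"order_id": "B2"})},
    {"called": ("lookup_order", {"order_id": "C3"}), "expected": ("lookup_order", {"order_id": "C9"})},
]

# Tool-choice accuracy: did the agent pick the right tool at all?
right_tool = sum(r["called"][0] == r["expected"][0] for r in runs)

# Tool-call accuracy: right tool and right arguments.
exact_match = sum(r["called"] == r["expected"] for r in runs)

print(f"Tool-choice accuracy: {right_tool / len(runs):.0%}")                   # 67% here
print(f"Tool-call accuracy (tool + arguments): {exact_match / len(runs):.0%}")  # 33% here
```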

What should an agent evaluation dashboard include?

An agent evaluation dashboard should start small.

Begin with one outcome metric and one failure metric. The outcome metric proves the agent completed the job. The failure metric shows where the workflow is breaking.

A simple dashboard can track outcome, quality, cost, safety, and behavior. The dashboard should stay small enough for teams to act on it.

Why do agent evaluation metrics change over time?

Fixed metrics fail because agents, users, and workflows change.

Early agent metrics may focus on basic correctness. Once that improves, the bottleneck may move to latency, cost, escalation quality, behavior drift, or edge-case handling.

The metric set should evolve as production failures change.