Agents are harder to evaluate than chatbots. A single LLM call you can score on factuality; an agent loop can take ten steps to get to an answer, and any one of them could be subtly wrong while the final output looks fine. Worse, agents can take inefficient paths and still produce correct results — so coarse “did it work?” metrics miss most of what matters. The fix is to evaluate the agent component by component. Agents share a common architecture; each piece asks a different evaluation question. This page covers the architecture and the questions.

What is an agent, for evaluation purposes

An agent is characterized by three things:
  • What it knows about the world — its memory and the context available to it.
  • The set of actions it can perform — its skills, tools, and APIs.
  • The pathway it took — the sequence of decisions, calls, and intermediate steps.
To evaluate an agent, you evaluate each of these. The standard decomposition has five components, each with a corresponding evaluation question.

Router

The first decision the agent makes is “what should I do?” The router takes the user’s input plus the conversation memory and does one of three things: picks a skill, picks a tool, or responds directly. Some routers are LLMs deciding from prompt context; others are rule-based. The evaluation questions:
  • Given an input, did the router choose the right skill?
  • Did it extract the right parameter values?
  • Did it correctly decide that no skill should be called (and respond directly), when that was the right answer?
A well-designed router evaluator covers several scenarios:
  • Missing context, short context, and long context.
  • Cases where no function should be called, one should be called, or multiple should be called.
  • Vague or opaque parameters in the query vs. very specific parameters.
  • Single-turn vs. multi-turn conversation pathways.
The router is usually a span-level evaluation — one routing decision per span — and the question is shaped like “was this tool call correct?”
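The router questions above can be sketched as a comparison between a labeled expectation and the routing decision recorded on a span. This is a minimal illustration, not a reference implementation: the `ToolCall` shape, the `evaluate_router` name, and the example tool names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One routing decision; name=None means the router responded directly."""
    name: Optional[str]
    args: dict[str, Any] = field(default_factory=dict)


def evaluate_router(expected: ToolCall, actual: ToolCall) -> dict[str, bool]:
    """Compare one routing span against its labeled expectation."""
    skill_correct = expected.name == actual.name
    # Parameter values are only meaningful if the right skill was picked.
    params_correct = skill_correct and expected.args == actual.args
    return {"skill_correct": skill_correct, "params_correct": params_correct}


# Hypothetical labeled cases: (expected, actual) pairs.
cases = [
    (ToolCall("product_search", {"query": "blue sweater"}),
     ToolCall("product_search", {"query": "blue sweater"})),
    (ToolCall("product_search", {"query": "blue sweater"}),
     ToolCall("product_search", {"query": "sweater"})),   # wrong parameter value
    (ToolCall(None), ToolCall("checkout")),                # no tool should be called
]
results = [evaluate_router(expected, actual) for expected, actual in cases]
```

Exact-match comparison on arguments is the strictest option; for vague queries, a looser comparison (or an LLM judge) on the parameter values may be more appropriate.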

Planner

For tasks with many steps, the router-per-step pattern breaks down. A planner generates the full list of steps up front, then executes them. This avoids the router architecture’s failure modes of either short-circuiting (not calling enough tools) or getting stuck in loops. The evaluation questions for a planner:
  • Does the plan include only skills that are valid?
  • Is the plan a reasonable length (not stuck in a loop, not skipping steps)?
  • Are the available skills sufficient to accomplish the task?
  • Will the plan accomplish the task?
  • Is the plan the shortest one that accomplishes the task?
Plans can be evaluated by heuristic (counting steps, checking skill validity) or by LLM-as-a-judge (asking whether the plan will work). A common pattern: if the LLM judge says the plan is bad, regenerate up to a retry limit before executing. Planner evaluators are typically span-level — one plan per span — but the question is more structural than a router eval.
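As a rough sketch of the heuristic side and the regenerate-on-failure pattern: the function names, the step limits, and the representation of a plan as a list of skill names are assumptions for illustration. The `is_acceptable` callback is where an LLM judge would plug in; here it can simply wrap the heuristic check.

```python
from collections import Counter
from typing import Callable, Optional


def check_plan(plan: list[str], valid_skills: set[str],
               max_steps: int = 10, max_repeats: int = 2) -> list[str]:
    """Return heuristic problems found in a plan; an empty list means none were found."""
    problems = []
    if not plan:
        problems.append("plan is empty")
    unknown = sorted({step for step in plan if step not in valid_skills})
    if unknown:
        problems.append(f"unknown skills: {unknown}")
    if len(plan) > max_steps:
        problems.append(f"plan has {len(plan)} steps (limit {max_steps})")
    # A skill repeated many times is treated as a sign of a possible loop.
    repeated = sorted(s for s, n in Counter(plan).items() if n > max_repeats)
    if repeated:
        problems.append(f"possible loop, repeated skills: {repeated}")
    return problems


def plan_with_retries(generate_plan: Callable[[], list[str]],
                      is_acceptable: Callable[[list[str]], bool],
                      max_attempts: int = 3) -> Optional[list[str]]:
    """Regenerate until a plan passes the check or the attempt limit is reached."""
    for _ in range(max_attempts):
        plan = generate_plan()
        if is_acceptable(plan):
            return plan
    return None  # the caller decides what to do when no plan passes
```

These checks cover skill validity and plan length only. Whether the plan will actually accomplish the task, or is the shortest one that does, still needs a judge or a labeled reference plan.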

Skills

Skills are the individual logic blocks the agent can call — an SQL query, a RAG retriever, an image generator, an API call. They’re how the agent gets information and takes action in the world. For evaluating skills, assume the skill was called correctly. The router’s job is to choose; the skill’s job is to execute well given correct inputs. So skill evaluation is “given this input, was the output any good?” Common skill evaluators:
  • Retrieval relevance — did the RAG retriever return useful documents for this query?
  • QA correctness — given the retrieved context, is the answer right?
  • Hallucination — does the answer claim things the retrieved context doesn’t support?
  • User frustration — does the user appear unhappy with the skill’s output?
For canonical templates, see LLM-as-a-judge evaluators. Skill evaluators are usually span-level — one skill call per span. They’re often the densest layer of evaluators in a production agent system because there are usually many skills, and each one fails in its own way.
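The evaluators listed above are usually LLM-judged. Where labeled data exists, a skill can also be scored with a plain metric. The sketch below assumes a hypothetical setup in which each query has a hand-labeled set of relevant document IDs, and scores a retriever call with precision at k. It is one possible heuristic, not one of the canonical templates.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k, so returning fewer than k documents is penalized.
    return hits / k


# Hypothetical span: the retriever returned d1, d7, d3; labels say d1, d3, d9 are relevant.
score = precision_at_k(["d1", "d7", "d3"], relevant_ids={"d1", "d3", "d9"}, k=3)
```

Here two of the three returned documents are labeled relevant, so the score is 2/3.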

Memory

Memory is the shared state the agent reads from and writes to between steps. It holds retrieved context, configuration variables, previous execution steps, and intermediate results. The LLM at each step gets the relevant memory in its context window. Memory evaluators ask about the shape of the agent’s path, not any individual step:
  • Did the agent stay on the right pathway, or did it veer off?
  • Did it get stuck in an infinite loop?
  • Did it choose the right sequence of steps?
The standard metric is agent convergence, which measures length_of_optimal_path / length_of_actual_path for similar queries. Because the actual path can never be shorter than the optimal one, the ratio falls between 0 and 1: a value near 1 means the agent is efficient, and lower values mean it is wandering. See Agent trajectory evaluations for the canonical version. Memory evaluators are typically trace-level, since they need to see the whole sequence of spans to judge the path, and sometimes session-level when the agent’s behavior across conversation turns matters.
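The convergence ratio can be computed over a batch of runs as sketched below. The function name is hypothetical, and so is one design choice: when no reference length is supplied, the shortest observed run is used as a stand-in for the optimal path. Substitute a known optimal length if you have one.

```python
from typing import Optional


def convergence(path_lengths: list[int], optimal_length: Optional[int] = None) -> float:
    """Average of optimal_length / actual_length over a set of runs."""
    if not path_lengths or any(n <= 0 for n in path_lengths):
        raise ValueError("path_lengths must be non-empty and positive")
    # Assumption: without a reference, treat the shortest observed run as optimal.
    optimal = min(path_lengths) if optimal_length is None else optimal_length
    return sum(optimal / n for n in path_lengths) / len(path_lengths)


# Three runs of similar queries that took 4, 4, and 8 steps.
score = convergence([4, 4, 8])  # (1 + 1 + 0.5) / 3
```

A score of 1.0 means every run matched the optimal length; the 8-step run above pulls the average down to roughly 0.83.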

Reflection

Reflection is the optional last step: before calling a task done, the agent reviews its own output and decides whether to accept it or retry. This is essentially evaluation at runtime — the agent evaluates itself before the user sees the answer. The evaluation questions for reflection:
  • Did the agent’s self-review correctly identify a flawed answer?
  • When it decided to retry, did the retry actually improve things?
  • When it decided not to retry, was the output actually good?
Reflection evaluators are span-level (one reflection per span) and often pairwise: they compare the pre-reflection and post-reflection outputs to see whether the loop helped. Anthropic’s Building Effective Agents writeup has a deeper treatment of reflection as a pattern.
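A pairwise reflection check might look like the following sketch. It assumes some external evaluator has already scored the draft and the final output on a 0 to 1 scale; the `ReflectionSpan` fields, the function name, and the 0.5 pass threshold are illustrative choices, not part of any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReflectionSpan:
    retried: bool       # did the agent decide to retry after self-review?
    pre_score: float    # quality of the draft, from an external evaluator
    post_score: float   # quality of the final output


def evaluate_reflection(span: ReflectionSpan,
                        pass_threshold: float = 0.5) -> dict[str, Optional[bool]]:
    """Score one reflection span: was the retry decision right, and did a retry help?"""
    draft_was_flawed = span.pre_score < pass_threshold
    return {
        # Correct when the agent retried a flawed draft or accepted a good one.
        "decision_correct": span.retried == draft_was_flawed,
        # Only defined when a retry actually happened.
        "retry_improved": span.post_score > span.pre_score if span.retried else None,
    }
```

The two keys map onto the questions above: `decision_correct` covers both “caught a flawed answer” and “accepted a good one”, while `retry_improved` covers whether the retry was worth it.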

Putting it together

A worked example. An e-commerce agent that helps users find and purchase products has all five components and would typically be evaluated like this:
Component | Eval level | Eval template (example)
Router | Span | Did the router pick the right tool: product search, cart-add, or checkout?
Planner | Span | Is the plan a reasonable shopping flow?
Skills | Span | Did the product-search return relevant items? Did the cart-add use the right SKU?
Memory | Trace | Did the agent take an efficient path from “find me a blue sweater” to “purchased”?
Reflection | Span | When the agent re-read its own response, did it correctly catch the case where the wrong SKU was added?
Most production agents end up running 5–10 evaluators across these five layers. They’re not redundant — each catches a different failure mode. Routers can pick the wrong skill; skills can fail with the right input; the agent can pick the right skills in the wrong order. A single high-level “did it work?” eval misses all of this.
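To make the span-level versus trace-level split concrete, here is a small sketch of a runner that applies span evaluators to matching spans and trace evaluators to the whole sequence. The span dictionaries, the `kind` field, and the evaluator names are all hypothetical; real tracing systems have their own span schemas.

```python
from typing import Callable

Span = dict  # hypothetical span shape, e.g. {"kind": "router", ...}


def run_evaluators(trace: list[Span],
                   span_evals: dict[str, Callable[[Span], bool]],
                   trace_evals: dict[str, Callable[[list[Span]], bool]]) -> dict[str, list[bool]]:
    """Apply span-level evaluators per span and trace-level evaluators once per trace."""
    results: dict[str, list[bool]] = {}
    for span in trace:
        evaluator = span_evals.get(span["kind"])
        if evaluator is not None:
            results.setdefault(span["kind"], []).append(evaluator(span))
    for name, evaluator in trace_evals.items():
        results[name] = [evaluator(trace)]
    return results


trace = [
    {"kind": "router", "tool": "product_search", "expected_tool": "product_search"},
    {"kind": "skill", "returned_items": 3},
    {"kind": "router", "tool": "checkout", "expected_tool": "cart_add"},
]
results = run_evaluators(
    trace,
    span_evals={
        "router": lambda s: s["tool"] == s["expected_tool"],
        "skill": lambda s: s["returned_items"] > 0,
    },
    trace_evals={"path_length_ok": lambda t: len(t) <= 5},
)
```

In this toy trace the second routing decision fails while the skill call and the overall path length pass, which is the kind of failure a single end-to-end check would not localize.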

Further reading

Arize AI Agents: Broader Arize content on the agent eval problem space.
Building Effective Agents (Anthropic): The reference writeup on reflection and other agent design patterns.
AgenticRAG Survey: A survey of the academic literature on agent + RAG.
Agent Overview: Chip Huyen’s blog post on the agent landscape.
Gorilla Leaderboard: A benchmark for evaluating tool-using models.

Next step

You now know the four evaluator types, the three levels, the two filter slots, and the architectural decomposition for agents. The next page is about how to make any of them good — the catalog of best practices and the citations behind them:

Next: Evaluator Best Practices