What is an agent, for evaluation purposes
An agent is characterized by three things:
- What it knows about the world: its memory and the context available to it.
- The set of actions it can perform: its skills, tools, and APIs.
- The pathway it took: the sequence of decisions, calls, and intermediate steps.
Router
The first decision the agent makes is “what should I do?” The router takes the user’s input plus the conversation memory and either picks a skill, picks a tool, or responds directly. Some routers are LLMs that decide from prompt context; others are rule-based. The evaluation questions:
- Given an input, did the router choose the right skill?
- Did it extract the right parameter values?
- Did it correctly decide that no skill should be called (and respond directly), when that was the right answer?
A router test set should cover these variations:
- Missing context, short context, and long context.
- Cases where no function should be called, one should be called, or multiple should be called.
- Vague or opaque parameters in the query vs. very specific parameters.
- Single-turn vs. multi-turn conversation pathways.
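
The router questions above can be turned into a small scoring harness. The following is a minimal sketch, assuming each test case is labeled with the expected skill (or `None` when the router should respond directly) and the expected parameters. The `RouterCase` and `RouterDecision` schemas, the skill names, and the exact-match comparison are all illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RouterCase:
    """One labeled test case (hypothetical schema)."""
    query: str
    expected_skill: Optional[str]  # None means "respond directly, call nothing"
    expected_params: dict = field(default_factory=dict)


@dataclass
class RouterDecision:
    """What the router actually chose (hypothetical schema)."""
    skill: Optional[str]
    params: dict = field(default_factory=dict)


def score_router(case: RouterCase, decision: RouterDecision) -> dict:
    skill_ok = decision.skill == case.expected_skill
    # Parameters are only credited when the right skill was picked.
    params_ok = skill_ok and decision.params == case.expected_params
    return {"skill_correct": skill_ok, "params_correct": params_ok}


def summarize(results: list) -> dict:
    n = len(results)
    if n == 0:
        return {"skill_accuracy": 0.0, "param_accuracy": 0.0}
    return {
        "skill_accuracy": sum(r["skill_correct"] for r in results) / n,
        "param_accuracy": sum(r["params_correct"] for r in results) / n,
    }


# Illustrative cases: a specific query, a no-skill query, and a vague query.
cases_and_decisions = [
    (RouterCase("What's the status of order 1234?", "order_lookup", {"order_id": "1234"}),
     RouterDecision("order_lookup", {"order_id": "1234"})),
    (RouterCase("Thanks, that's all!", None),
     RouterDecision(None)),
    (RouterCase("Find me a blue sweater", "product_search", {"color": "blue", "item": "sweater"}),
     RouterDecision("product_search", {"item": "sweater"})),  # dropped a parameter
]

summary = summarize([score_router(c, d) for c, d in cases_and_decisions])
print(summary)
```

Exact-match on parameters is the strictest choice; a looser comparison (for example, normalizing case or allowing extra optional parameters) may fit some routers better.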
Planner
For tasks with many steps, the router-per-step pattern breaks down. A planner generates the full list of steps up front, then executes them. This avoids the router architecture’s failure modes of either short-circuiting (not calling enough tools) or getting stuck in loops. The evaluation questions for a planner:
- Does the plan include only skills that are valid?
- Is the plan a reasonable length (not stuck in a loop, not skipping steps)?
- Are the available skills sufficient to accomplish the task?
- Will the plan accomplish the task?
- Is the plan the shortest one that accomplishes the task?
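
The first two planner questions (valid skills, reasonable length) are structural and can be checked without a model. The sketch below assumes a plan is simply a list of skill names; the `check_plan` helper, the skill names, and the `max_steps` and `max_repeats` thresholds are hypothetical choices for illustration. Whether a plan will accomplish the task, or is the shortest one that does, usually needs a human label or an LLM judge and is not covered here.

```python
from collections import Counter


def check_plan(plan, valid_skills, max_steps=10, max_repeats=2):
    """Return a list of structural issues found in a plan (empty if none)."""
    issues = []

    # Does the plan include only skills that are valid?
    unknown = [step for step in plan if step not in valid_skills]
    if unknown:
        issues.append(f"unknown skills: {sorted(set(unknown))}")

    # Is the plan a reasonable length?
    if not plan:
        issues.append("plan is empty")
    if len(plan) > max_steps:
        issues.append(f"plan has {len(plan)} steps (limit {max_steps})")

    # A step repeated many times is a cheap signal for a possible loop.
    repeated = [step for step, count in Counter(plan).items() if count > max_repeats]
    if repeated:
        issues.append(f"possible loop, repeated steps: {sorted(repeated)}")

    return issues


# Hypothetical skill registry for an e-commerce agent.
VALID_SKILLS = {"product_search", "cart_add", "checkout"}

print(check_plan(["product_search", "cart_add", "checkout"], VALID_SKILLS))
print(check_plan(["product_search", "teleport"], VALID_SKILLS))
print(check_plan(["product_search"] * 4, VALID_SKILLS))
```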
Skills
Skills are the individual logic blocks the agent can call, such as an SQL query, a RAG retriever, an image generator, or an API call. They’re how the agent gets information and takes action in the world. For evaluating skills, assume the skill was called correctly. The router’s job is to choose; the skill’s job is to execute well given correct inputs. So skill evaluation is “given this input, was the output any good?” Common skill evaluators:
- Retrieval relevance: did the RAG retriever return useful documents for this query?
- QA correctness: given the retrieved context, is the answer right?
- Hallucination: does the answer claim things the retrieved context doesn’t support?
- User frustration: does the user appear unhappy with the skill’s output?
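
As one concrete instance of a skill evaluator, retrieval relevance can be scored with precision at k when relevance labels are available. This is a minimal sketch under that assumption; the document IDs and the `precision_at_k` helper are hypothetical, and in practice relevance is often judged per document by an LLM or a human rather than looked up from a fixed label set.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top-k retrieved documents that are labeled relevant.

    Assumption: divides by the number of documents actually returned in the
    top k, so a retriever that returns fewer than k results is not penalized.
    """
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


# Hypothetical retriever output (ranked) and hand-labeled relevant documents.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d4"}

score = precision_at_k(retrieved, relevant, k=3)
print(score)
```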
Memory
Memory is the shared state the agent reads from and writes to between steps. It holds retrieved context, configuration variables, previous execution steps, and intermediate results. The LLM at each step gets the relevant memory in its context window. Memory evaluators ask about the shape of the agent’s path, not any individual step:
- Did the agent stay on the right pathway, or did it veer off?
- Did it get stuck in an infinite loop?
- Did it choose the right sequence of steps?
A useful metric here is convergence: `length_of_optimal_path / length_of_actual_path`, measured across similar queries. Lower convergence means the agent is wandering; higher means it’s efficient. See Agent trajectory evaluations for the canonical version.
Memory evaluators are typically trace-level — they need to see the whole sequence of spans to judge the path — and sometimes session-level when the agent’s behavior across conversation turns matters.
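
The ratio of optimal to actual path length can be computed directly from trace data. The sketch below assumes path length is the number of steps in a trace and that the optimal length for a group of similar queries is already known (for example, the shortest successful run observed). Averaging the per-run ratios is also an assumption made here for illustration, and the `convergence` function name is hypothetical.

```python
def convergence(optimal_length, actual_lengths):
    """Mean of optimal_length / actual_length over runs of similar queries.

    A value of 1.0 means every run took the optimal path; values closer
    to 0 mean the agent took many extra steps.
    """
    if optimal_length <= 0:
        raise ValueError("optimal_length must be positive")
    if not actual_lengths or any(n <= 0 for n in actual_lengths):
        raise ValueError("actual_lengths must be non-empty and positive")
    ratios = [optimal_length / n for n in actual_lengths]
    return sum(ratios) / len(ratios)


# Three runs of a similar query; the best known path takes 3 steps.
score = convergence(optimal_length=3, actual_lengths=[3, 4, 6])
print(score)
```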
Reflection
Reflection is the optional last step: before calling a task done, the agent reviews its own output and decides whether to accept it or retry. This is essentially evaluation at runtime: the agent evaluates itself before the user sees the answer. The evaluation questions for reflection:
- Did the agent’s self-review correctly identify a flawed answer?
- When it decided to retry, did the retry actually improve things?
- When it decided not to retry, was the output actually good?
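
Each of the three reflection questions maps to a rate that can be computed from labeled runs. This is a minimal sketch, assuming every run has a ground-truth label for whether the first answer was flawed and a quality score before and after any retry. The `ReflectionRecord` schema, the metric names, and the scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReflectionRecord:
    """One agent run with labels (hypothetical schema)."""
    first_answer_flawed: bool            # ground-truth label
    agent_retried: bool                  # the agent's own reflection decision
    score_before: float                  # quality of the first answer
    score_after: Optional[float] = None  # quality after retry, if any


def _rate(numerator, denominator):
    # None signals "no examples of this kind", which differs from a rate of 0.
    return numerator / denominator if denominator else None


def reflection_metrics(records):
    flawed = [r for r in records if r.first_answer_flawed]
    retried = [r for r in records if r.agent_retried]
    accepted = [r for r in records if not r.agent_retried]
    improved = sum(
        1 for r in retried
        if r.score_after is not None and r.score_after > r.score_before
    )
    return {
        # Did self-review catch flawed answers?
        "flaw_catch_rate": _rate(sum(r.agent_retried for r in flawed), len(flawed)),
        # When it retried, did the retry improve things?
        "retry_improvement_rate": _rate(improved, len(retried)),
        # When it did not retry, was the output actually good?
        "accept_precision": _rate(
            sum(not r.first_answer_flawed for r in accepted), len(accepted)
        ),
    }


records = [
    ReflectionRecord(True, True, 0.4, 0.9),    # caught a flaw, retry helped
    ReflectionRecord(True, False, 0.3),        # missed a flaw
    ReflectionRecord(False, False, 0.8),       # correctly accepted
    ReflectionRecord(False, True, 0.8, 0.7),   # needless retry, got worse
]
metrics = reflection_metrics(records)
print(metrics)
```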
Putting it together
A worked example: an e-commerce agent that helps users find and purchase products has all five components and would typically be evaluated like this:

| Component | Eval level | Eval template (example) |
|---|---|---|
| Router | Span | Did the router pick the right tool — product search, cart-add, checkout? |
| Planner | Span | Is the plan a reasonable shopping flow? |
| Skills | Span | Did the product-search return relevant items? Did the cart-add use the right SKU? |
| Memory | Trace | Did the agent take an efficient path from “find me a blue sweater” to “purchased”? |
| Reflection | Span | When the agent re-read its own response, did it correctly catch the case where the wrong SKU was added? |
Further reading
| Resource | Description |
|---|---|
| Arize AI Agents | Broader Arize content on the agent eval problem space. |
| Building Effective Agents (Anthropic) | The reference writeup on reflection and other agent design patterns. |
| AgenticRAG Survey | A survey of the academic literature on agent + RAG. |
| Agent Overview | Chip Huyen’s blog post on the agent landscape. |
| Gorilla Leaderboard | A benchmark for evaluating tool-using models. |