Agent Evaluation
Router
Given a query, the first thing an agent needs to figure out is what it should do. One approach is to ask every time: use the LLM to decide how to act based on its inputs and memory. This “router” architecture is quite flexible; sometimes it is built with rules instead. Router evaluations should cover scenarios such as:
- Missing context, short context, and long context
- No functions should be called, one function should be called, or multiple functions should be called
- Vague or opaque parameters in the query vs. very specific parameters
- Single turn vs. multi-turn conversation pathways
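A rule-based router can be sketched in a few lines. This is a minimal illustration, not a production design: the function names and keyword triggers below are hypothetical, and an LLM-based router would replace the keyword matching with a model call.

```python
# Hypothetical rule-based router sketch. An LLM-based router would make
# this decision with a model call instead of keyword matching.

def route(query: str, available_functions: dict) -> list:
    """Return the names of the functions to call for a query.

    The result may be empty (no function call), a single name,
    or several names (multiple function calls).
    """
    selected = []
    for name, keywords in available_functions.items():
        # Trigger a function when any of its keywords appears in the query.
        if any(kw in query.lower() for kw in keywords):
            selected.append(name)
    return selected

# Example function registry (hypothetical names and triggers).
functions = {
    "get_weather": ["weather", "temperature"],
    "search_docs": ["docs", "documentation"],
}
```

Evaluating a router like this means checking the zero-, one-, and multi-function cases above, for example `route("What is the weather today?", functions)` should return only `["get_weather"]`.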
Planner
For complex tasks with many steps, it can be better to generate the full list of steps first and evaluate it, instead of generating one step at a time. The router architecture can either short-circuit and not call enough tools, or get stuck in loops, generating hundreds of steps and wasting time and energy. Useful plan evaluations include:
- Does the plan include only skills that are valid?
- Is the plan fewer than X steps?
- Are Z skills sufficient to accomplish this task?
- Will Y plan accomplish this task given Z skills?
- Is this the shortest plan to accomplish this task?
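The first two checks above are cheap to run programmatically before involving an LLM judge. A minimal sketch, assuming a plan is represented as a list of step dicts with a `"skill"` key (a hypothetical schema for illustration):

```python
# Sketch of programmatic plan checks. The plan schema here is an
# assumption: a list of dicts like {"skill": "search"}.

def validate_plan(plan: list, valid_skills: set, max_steps: int = 10) -> dict:
    """Return pass/fail results for the basic plan evaluations."""
    return {
        # Does the plan include only skills that are valid?
        "only_valid_skills": all(step["skill"] in valid_skills for step in plan),
        # Is the plan fewer than max_steps steps?
        "within_step_budget": len(plan) <= max_steps,
    }
```

Whether the plan will actually accomplish the task, or is the shortest plan that does, usually requires an LLM judge; checks like these simply gate the cheap failures first.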
Skills

Each skill an agent calls on (retrieval, question answering, and so on) can be evaluated on its own. Common skill evaluations include:
- Retrieval Relevance
- QA Correctness
- Hallucination
- User frustration
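As one illustration of the shape a skill evaluation takes, here is a toy proxy for retrieval relevance based on token overlap between the query and a retrieved document. Real evaluations typically use an LLM judge or embedding similarity; this sketch only shows the input/output contract of such an eval.

```python
# Toy retrieval-relevance score: fraction of query tokens that appear
# in the retrieved document. A stand-in for an LLM judge or embedding
# similarity, shown only to illustrate the eval's shape.

def retrieval_relevance(query: str, document: str) -> float:
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)
```

A score of 1.0 means every query token was covered by the document; 0.0 means none were.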
Memory

Evaluating how an agent uses its memory of prior steps along a pathway means asking questions such as:
- Did the agent go off the rails and onto the wrong pathway?
- Does it get stuck in an infinite loop?
- Given the whole agent pathway for a single action, does it choose the right sequence of steps?

Convergence can be scored as the length of the optimal path divided by the average path length for similar queries. See our Agent convergence evaluation template for a specific example.
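The convergence ratio is straightforward to compute once you have path lengths from repeated runs. A minimal sketch, assuming path length is simply a step count:

```python
# Convergence score: optimal path length / average path length across
# runs of similar queries. Closer to 1.0 means the agent reliably takes
# a near-optimal path; lower values mean wandering or looping.

def convergence_score(optimal_len: int, run_lengths: list) -> float:
    average_len = sum(run_lengths) / len(run_lengths)
    return optimal_len / average_len
```

For example, if the optimal path is 4 steps and three runs took 4, 4, and 8 steps, the score is 4 / (16 / 3) = 0.75.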
Reflection
After a task is completed, a plan is created, or an answer is generated, it can be helpful to have the agent reflect on the final output and whether it accomplished the task before calling it done. If it did not, retry. This is like running evaluations at runtime instead of after the fact, which can improve the quality of your agents. See our Agent Reflection evaluation template for a more specific example.
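The generate-reflect-retry loop can be sketched as follows. Here `generate` and `reflect` are stand-ins for LLM calls (hypothetical callables supplied by the caller), and `max_retries` caps runtime cost:

```python
# Sketch of a reflection loop. `generate` and `reflect` are stand-ins
# for LLM calls: generate(task) -> answer, reflect(task, answer) -> bool
# (True when the reflection judges the task accomplished).

def run_with_reflection(task, generate, reflect, max_retries: int = 2):
    answer = generate(task)
    for _ in range(max_retries):
        if reflect(task, answer):
            return answer        # reflection accepted the output
        answer = generate(task)  # otherwise, retry
    return answer                # give up after max_retries attempts
```

Capping retries matters for the same reason planner evaluation does: an unchecked retry loop can waste as much time and energy as a looping router.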