Agent Evaluation

Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. For one, agents can take inefficient paths and still get to the right solution. How do you know if they took an optimal path? For another, bad responses upstream can lead to strange responses downstream. How do you pinpoint where a problem originated?

This page will walk you through a framework for navigating these pitfalls.

How do you evaluate an AI Agent?

An agent is characterized by what it knows about the world, the set of actions it can take, and the path it followed to arrive at its response. To evaluate an agent, we must evaluate each of these components.

We've built evaluation templates for every step.

You can evaluate the individual skills and responses using standard LLM evaluation strategies, such as Retrieval Evaluation, Classification with LLM Judges, Hallucination, or Q&A Correctness.

Read more to see the breakdown of each component.

How to Evaluate an Agent Router

Routers are one of the most common components of agents. While not every agent has a specific router node, function, or step, all agents have some method that chooses the next step to take. Routers and routing logic can be powered by intent classifiers, rules-based code, or most often, LLMs that use function calling.

A common router architecture

To evaluate a router or router logic, you need to check:

  1. Whether the router chose the correct next step to take, or function to call.

  2. Whether the router extracted the correct parameters to pass on to that next step.

  3. Whether the router properly handles edge cases, such as missing context, missing parameters, or cases where multiple functions should be called concurrently.
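To make this concrete, here is a minimal sketch of an LLM-powered router using OpenAI-style function calling. The flight_search tool schema is hypothetical, for illustration only:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool schema for a travel-agent router
tools = [{
    "type": "function",
    "function": {
        "name": "flight_search",
        "description": "Search for flights between two cities on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string"},
                "departure_city": {"type": "string"},
                "destination_city": {"type": "string"},
            },
            "required": ["date", "departure_city", "destination_city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Help me find a flight from SF on 5/15"}],
    tools=tools,
)

# The router's "decision" is the tool call the model emits; this is what
# the three checks above are evaluated against.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```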

Evaluating a router

Example of Router Evaluation:

Take a travel agent router for example.

  1. User Input: Help me find a flight from SF on 5/15

  2. Router function call: flight-search(date="5/15", departure_city="SF", destination_city="")

Two evals apply to this router call:

  • Function choice: did the router correctly select the flight-search function?

  • Parameter extraction: did it correctly pull date and departure_city from the query, and leave destination_city empty rather than inventing a value the user never gave?

See our Agent Function Calling evaluation template for an implementation example.
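The first two checks can often be scored in plain code against a golden dataset. A minimal sketch, assuming you have logged the router's chosen function name and its JSON arguments (the expected values below are illustrative):

```python
import json

def evaluate_router_call(actual_name: str, actual_args_json: str,
                         expected_name: str, expected_args: dict) -> dict:
    """Score one router decision on function choice and parameter extraction."""
    actual_args = json.loads(actual_args_json)
    return {
        "function_choice": actual_name == expected_name,
        "parameter_extraction": actual_args == expected_args,
    }

# Golden example for the travel-agent query above
print(evaluate_router_call(
    "flight-search",
    '{"date": "5/15", "departure_city": "SF", "destination_city": ""}',
    expected_name="flight-search",
    expected_args={"date": "5/15", "departure_city": "SF", "destination_city": ""},
))
# {'function_choice': True, 'parameter_extraction': True}
```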

How to Evaluate Agent Planning

For more complex agents, it may be necessary to have the agent plan out its intended path ahead of time. This approach can help avoid unnecessary tool calls and endless loops in which the agent bounces between the same steps.

For agents that use this approach, a common evaluation metric is the quality of the plan the agent generates. This "quality" metric can take the form of a single overall evaluation or a set of smaller ones; either way, it should answer:

  1. Does the plan include only valid skills?

  2. Are the available skills sufficient to accomplish the task?

  3. Will the proposed plan accomplish the task given those skills?

  4. Is this the shortest plan that accomplishes the task?

Given the more qualitative nature of these evaluations, they are usually performed by an LLM Judge.
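As a rough sketch, a plan-quality judge might pose those four questions directly to an LLM. The prompt and model choice here are illustrative, not a prescribed template:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating the plan an AI agent generated for a task.

Available skills: {skills}
Task: {task}
Proposed plan: {plan}

Answer yes or no to each question, then give an overall label of "good" or "bad":
1. Does the plan include only valid skills?
2. Are the available skills sufficient to accomplish the task?
3. Will the proposed plan accomplish the task given those skills?
4. Is this the shortest plan that accomplishes the task?"""

def judge_plan(task: str, plan: str, skills: list[str]) -> str:
    """Ask an LLM judge to grade a plan against the four questions above."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            skills=", ".join(skills), task=task, plan=plan)}],
    )
    return response.choices[0].message.content
```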

See our Agent Planning evaluation template for a specific example.

How to Evaluate Agent Skills

An example agent skill

Skills are the individual logic blocks, workflows, or chains that an agent can call on: for example, a RAG retriever skill, or a skill to call a specific API. Skills may be written and defined by the agent's designer; increasingly, however, skills are outside services connected to via protocols like Anthropic's MCP.

You can evaluate skills using standard LLM or code evaluations. Since you are evaluating the router separately, you can evaluate skills "in a vacuum": assume the skill was chosen correctly and its parameters were properly defined, and focus on whether the skill itself performed correctly.

Common skill evals include retrieval relevance, hallucination, and Q&A correctness. Skills can be evaluated by LLM judges, by comparison against ground truth, or in code, depending on the skill.
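Because the router is evaluated separately, a skill eval can simply invoke the skill directly with known-good parameters and compare against ground truth. A minimal code-based sketch; the flight-search skill and test case are hypothetical:

```python
def eval_skill_in_vacuum(skill_fn, test_cases: list[dict]) -> list[dict]:
    """Bypass the router: call the skill with known-good parameters
    and compare its output to ground truth."""
    results = []
    for case in test_cases:
        output = skill_fn(**case["params"])
        results.append({"params": case["params"], "passed": output == case["expected"]})
    return results

def flight_search(date: str, departure_city: str, destination_city: str) -> list[str]:
    """Stub standing in for a real flight-search skill."""
    return ["UA 123", "DL 456"]

# Hypothetical golden test case
test_cases = [{
    "params": {"date": "5/15", "departure_city": "SF", "destination_city": "NYC"},
    "expected": ["UA 123", "DL 456"],
}]

print(eval_skill_in_vacuum(flight_search, test_cases))
# [{'params': {...}, 'passed': True}]
```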

How to Evaluate an Agent's Path

The steps an agent has taken, stored as messages

Agent memory is used to store state between the different components of an agent. You may store retrieved context, config variables, or any other information in agent memory. However, the most common information stored in agent memory is a log of the previous steps the agent has taken, typically formatted as LLM messages.

These messages form the best data to evaluate the agent's path.

You may be wondering why an agent's path matters. Why not just look at the final output to evaluate the agent's performance? The answer is efficiency. Could the agent have gotten the same answer with half as many LLM calls? When each step increases your cost of operation, path efficiency matters!

The main questions that path evaluations try to answer are:

  • Did the agent go off the rails and onto the wrong pathway?

  • Does it get stuck in an infinite loop? (A simple loop-detection sketch follows this list.)

  • Given the whole pathway taken for a single action, did it choose the right sequence of steps?
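Loop detection in particular is often checkable in plain code. A minimal sketch that scans the step log (extracted from the message history) for a short pattern repeating at the tail:

```python
def detect_loop(steps: list[str], window: int = 2, max_repeats: int = 3) -> bool:
    """Return True if the last `window` steps repeat at least `max_repeats`
    times in a row, e.g. [..., "search", "summarize", "search", "summarize"]."""
    if len(steps) < window * max_repeats:
        return False
    pattern = steps[-window:]
    repeats, i = 0, len(steps) - window
    while i >= 0 and steps[i:i + window] == pattern:
        repeats += 1
        i -= window
    return repeats >= max_repeats

print(detect_loop(["plan", "search", "summarize", "search", "summarize",
                   "search", "summarize"]))  # True
```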

One type of path evaluation is measuring agent convergence: a numerical score equal to the length of the optimal path divided by the average path length across runs of similar queries. It ranges from 0 to 1, where 1 means every run took the optimal path.
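A minimal sketch of that convergence calculation:

```python
def convergence_score(optimal_len: int, run_lens: list[int]) -> float:
    """Convergence = optimal path length / average path length across runs
    of similar queries. 1.0 means every run took the optimal path."""
    return optimal_len / (sum(run_lens) / len(run_lens))

# e.g. the optimal path is 3 steps; three observed runs took 3, 5, and 4 steps
print(convergence_score(3, [3, 5, 4]))  # 0.75
```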

See our Agent Convergence evaluation template for a specific example.

How to Evaluate Agent Reflection

Reflection allows you to evaluate your agents at runtime to enhance their quality. Before declaring a task complete, a plan devised, or an answer generated, ask the agent to reflect on the outcome. If the task isn't accomplished to the standard you want, retry.
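A minimal sketch of a runtime reflection loop, using a second LLM call as the reflection step (the model and prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def generate_with_reflection(task: str, max_retries: int = 3) -> str:
    """Generate an answer, ask the model to reflect on it, and retry on failure."""
    feedback = ""
    answer = ""
    for _ in range(max_retries):
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"{task}\n{feedback}"}],
        ).choices[0].message.content
        # Reflection step: a second call judges whether the task is done
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                f"Task: {task}\nAnswer: {answer}\n\n"
                "Does this answer fully accomplish the task? "
                "Reply PASS or FAIL, followed by a one-sentence reason."}],
        ).choices[0].message.content
        if verdict.strip().startswith("PASS"):
            break
        feedback = f"A previous attempt failed review: {verdict}\nPlease try again."
    return answer
```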

See our Agent Reflection evaluation template for a specific example.

Putting it all Together

Through a combination of the evaluations above, you can get a far more accurate picture of how your agent is performing.

For an example of using these evals in combination, see Evaluating an Agent. You can also review our agent evaluation guide.
