Agents in the Wild
Introduction
2025 is the year of agents. Companies all over the world are experimenting with chatbot agents, tools like Anthropic's Computer Use and OpenAI's Operator have turned heads by connecting agents to outside websites, and frameworks like LangGraph, CrewAI, and LlamaIndex Workflows are helping developers around the world build structured agents.
However, despite their popularity, agents have yet to make a strong splash outside of the AI ecosystem. Very few agents have taken off among either consumer or enterprise users.
With this in mind, we decided it was time to help teams navigate the new frameworks and new agent directions. What tools are available, and which should you use to build your next application? How can you evaluate and improve your agent?
Our other motivation behind this series is that our team built our own complex agent to act as a copilot within our Arize platform. We took a TON of learnings away from this process, and now feel more qualified to offer our opinion on the current state of AI agents.
In this handbook, we’ll deep dive on each of these topics and more. We’re hopeful that this resource can help arm AI engineers everywhere to build the best agents possible.
What are AI Agents?
Before we can turn to the advanced questions like evaluating agents and comparing frameworks, let’s examine the current state of agent architectures.
To align ourselves before we jump in, it helps to define what we mean by an agent. LLM-based agents are software systems that string together multiple processing steps, including calls to LLMs, in order to achieve a desired end result. Agents typically have some amount of conditional logic or decision-making capabilities, as well as a working memory they can access between steps.
In this first post, we’ll deep dive into how agents are built today, the current problems with modern agents, and some initial solutions.
The Failure of ReAct Agents
Let's be honest: the idea of an agent isn't new. Countless agents launched on AI Twitter over the last year claiming amazing feats of intelligence. This first generation consisted mainly of ReAct (reason, act) agents. They were designed to abstract away as much as possible, and promised a wide set of outcomes.
Unfortunately, this first generation of agent architectures really struggled. Their heavy abstraction made them hard to use, and despite their lofty promises, they turned out not to do much of anything.
In reaction to this, many people began to rethink how agents should be structured. In the past year we’ve seen great advances, now leading us into the next generation of agents.

What Separates This Second Generation of Agents?
This new generation of agents is built on the principle of defining the possible paths an agent can take in a much more rigid fashion, instead of the open-ended nature of ReAct. Whether agents use a framework or not, we have seen a trend towards smaller solution spaces – aka a reduction in the possible things each agent can do. A smaller solution space means an easier-to-define agent, which often leads to a more powerful agent.
This second generation covers many different types of agents. However, it's worth noting that most of the agents or assistants we see today are written in code without frameworks, have an LLM router stage, and process data in iterative loops.

Why Do You Need An AI Agent?
Agents cover a broad range of systems. There is so much hype about what is “agentic” these days. How can you decide whether you actually need an agent? What types of applications even require an agent?
For us, it boils down to three criteria:
- Does your application follow an iterative flow based on incoming data?
- Does your application need to adapt and follow different flows based on previously taken actions or feedback along the way?
- Is there a state space of actions that can be taken, where the space can be traversed in a variety of ways rather than being restricted to linear pathways?
Common Issues with Agents Today
Agents powered by LLMs hold incredible potential, yet they often struggle with several common pitfalls that limit their effectiveness.
One major challenge is long-term planning. While these models are powerful, they often falter when it comes to breaking down complex tasks into manageable steps. The process of task decomposition can be daunting, leading agents to get stuck in loops, unable to progress beyond certain points. This lack of foresight means they can lose track of the overall goal, requiring manual correction to get back on course.
Another issue is the vastness of the solution space. The sheer number of possible actions and paths an agent can take makes it difficult to consistently achieve reliable outcomes. As a result, the results can be inconsistent, leading to unpredictability in performance. This inconsistency not only undermines the reliability of agents but also makes them expensive to run, as more resources are required to achieve the desired results.
It is worth mentioning that current agent trends have pushed the market further towards constrained agents that can only choose from a set of possible actions, effectively limiting the solution space.
Finally, agents often struggle with malformed tool calls. When faced with poorly structured or incorrect inputs, they can't easily recover, leading to failures that require human intervention. These challenges highlight the need for more robust solutions that can handle the complexities of real-world applications while delivering reliable, cost-effective results.
Addressing these Challenges
One of the most effective strategies is to map the solution space beforehand. By thoroughly understanding and defining the range of possible actions and outcomes, you can reduce the ambiguity that agents often struggle with. This preemptive mapping helps guide the agent’s decision-making process, narrowing down the solution space to more manageable and relevant options.
Incorporating domain and business heuristics into the agent’s guidance system is another powerful approach. By embedding specific rules and guidelines related to the business or application, you can provide the agent with the context it needs to make better decisions. These heuristics act as a compass, steering the agent away from less relevant paths and towards more appropriate solutions, which is particularly useful in complex or specialized domains.
Being explicit about action intentions is also crucial. Clearly defining what each action is intended to accomplish ensures that the agent follows a logical and coherent path. This explicitness helps prevent the agent from getting stuck in loops or deviating from its goals, leading to more consistent and reliable performance. Modern frameworks for agent development luckily encourage this type of strict definition.
Creating a repeatable process is another key strategy. By standardizing the steps and methodologies that agents follow, you can ensure that they perform tasks consistently across different scenarios. This repeatability not only enhances reliability but also makes it easier to identify and correct errors when they occur.
Finally, orchestrating with code and more reliable methods rather than relying solely on LLM planning can dramatically improve agent performance. This involves swapping your LLM router for a code-based router where possible. By using code-based orchestration, you can implement more deterministic and controllable processes, reducing the unpredictability that often comes with LLM-based planning.
What Makes Up An Agent?
Many agents have a node or component we call a router, which decides which step the agent should take next. In our assistant we have multiple fairly complex router nodes. The term router normally refers to an LLM or classifier making an intent decision about which path to take. An agent may return to this router continuously as it progresses through its execution, each time bringing some updated information. The router takes that information, combines it with its existing knowledge of the possible next steps, and chooses the next action to take.
The router itself is sometimes powered by a call to an LLM. Most popular LLMs at this point support function calling, where they can choose a component to call from a JSON dictionary of function definitions. This ability makes the routing step easy to initially set up. As we'll see later, however, the router is often the step that needs the most improvement in an agent, so this ease of setup can belie the complexity under the surface.
Each action an agent can take is typically represented by a component. Components are blocks of code that accomplish a specific small task. They could make a single LLM call or multiple LLM calls, make an internal API call, or just run some sort of application code. These go by different names in different frameworks. In LangGraph, these are nodes. In LlamaIndex Workflows, they're known as steps. Once a component completes its work, it may return to the router, or move on to other decision components.
Depending on the complexity of your agent, it can be helpful to group components together as execution branches or skills. Say you have a customer service chatbot agent. One of the things this agent can do is check the shipping status of an order. To do that, the agent needs to extract an order ID from the user's query, construct an API call to a backend system, make that call, parse the results, and generate a response. Each of those steps may be a component, and they can be grouped into a "Check shipping status" skill.
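To make this concrete, here's a minimal sketch of that skill in plain Python. The backend client (with its get_order_status method), the ORD-#### order ID format, and the generic llm() completion helper are all hypothetical:

import re

def extract_order_id(query: str) -> str | None:
    # Component 1: pull an order ID (hypothetical ORD-#### format) from the query.
    match = re.search(r"\bORD-\d+\b", query)
    return match.group(0) if match else None

def check_shipping_status(query: str, backend, llm) -> str:
    order_id = extract_order_id(query)
    if order_id is None:
        return "Could you share your order ID?"
    # Components 2-4: construct the API call, make it, and parse the result.
    status = backend.get_order_status(order_id)
    # Component 5: generate a natural-language response for the user.
    return llm(f"Write a friendly shipping update based on this status: {status}")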
Finally, many agents will track a shared state or memory as they execute. This allows agents to more easily pass context between various components.
AI Agent Architecture
There are some common patterns we see across agent deployments today. We'll walk through an overview of those architectures in the following pieces, but the examples below are probably the most common.
In its simplest form, an agent or assistant might be defined with just an LLM router and a tool call. We call this first example a single router with functions. We have a single router, which could be an LLM call, a classifier call, or just plain code, that directs and orchestrates which function to call. The idea is that the router can decide which tool or function call to invoke based on input from the system. The name comes from the fact that this architecture uses only one router.
A slightly more complicated assistant we see is a single router with skills. In this case, rather than making a simple tool or function call, the router can invoke a more complex workflow or skill set that might include many components and form an overall deeper set of chained actions. These components (LLM, API, tooling, RAG, and code calls) can be looped and chained to form a skill.
This is probably the most common architecture we see from advanced LLM application teams in production today.

The general architecture gets more complicated by mixing branches of LLM calls with tools and state. In this next case, the router decides which of its skills to call to answer the user's question. It may update the shared state based on this question as well. Each skill may also access the shared state, and could involve one or more LLM calls of its own to retrieve a response for the user.
This is still generally straightforward, however, agents are usually far more complex. As agents become more complicated, you start to see frameworks built to try and reduce that complexity.
Routers
An agent router serves as the decision-making layer that manages how user requests are routed to the correct function, service, or action within a system. This component is particularly vital in large-scale conversational systems where multiple intents, services, and actions are involved.
Not all agents use routers, however. Some frameworks, like LangGraph or OpenAI Swarm, instead spread the job of routing across nodes within an agent.

When to Implement an Agent Router
Agent routers prove particularly valuable in several scenarios:
- Systems with multiple service integrations, including various APIs, databases, or microservices
- Applications handling diverse types of user input, especially in NLP-based systems
- Architectures requiring modular, scalable design patterns
- Systems needing sophisticated error handling and fallback mechanisms
Generally speaking, more complex and/or non-deterministic agents benefit from routers.
Implementation Approaches
Routers typically use one of three different techniques to handle their core routing function:
Function Calling with LLMs
This approach uses an existing LLM to choose between a set of available functions, each representing a skill or branch in the agent. Most modern LLMs are now equipped with function calling capabilities, and as a result you'll often see agents using GPT-4o, Claude 3.5 Sonnet, or Llama models as routers. For example, here is a tool definition in the OpenAI function calling format:
{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": false
        },
        "strict": true
    }
}
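Hooked up to a model, the routing step itself is just one API call. Here's a minimal sketch using the OpenAI Python SDK, assuming the definition above is stored in a get_weather_tool variable:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Bogotá?"}],
    tools=[get_weather_tool],  # the JSON definition shown above
)

# The model returns which function to call and with which arguments.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # "get_weather"
print(tool_call.function.arguments)  # '{"location": "Bogotá, Colombia"}'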
Advantages of this approach:
- Dynamic and flexible processing of complex user inputs
- Minimal routing logic requirements
Some challenges of function calling:
- Higher latency due to real-time LLM processing
- Resource-intensive operations
- Limited control over granular routing logic
- Complexity in implementing fallback strategies
Function calling routers are the most flexible routing option; however, they are also the hardest to control. Introducing a function calling router means introducing another stochastic LLM call that you need to manage. Depending on your agent, the extra flexibility this method adds may be worth it, but if your agent's routing can be handled using one of the methods below, you'll reduce your overall testing burden.
Intent-Based Routing
Intent routers identify user intentions from queries and map them to predefined functions or services. This approach is prevalent in chatbots and virtual assistants where user queries must be categorized into distinct intents.
This approach works well when your agent has a somewhat limited set of capabilities, each of which can be mapped to a distinct intent. However, this technique can struggle with more nuanced intents.
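As one illustration, intent routing can be implemented with embedding similarity against example utterances. A minimal sketch, assuming a hypothetical embed() helper that returns normalized vectors and an illustrative 0.8 fallback threshold:

import numpy as np

INTENT_EXAMPLES = {
    "check_shipping": "Where is my order?",
    "process_return": "I want to return this item.",
    "product_question": "Does this jacket come in a medium?",
}

def route_intent(query: str, embed) -> str:
    query_vec = embed(query)
    # Score the query against one example utterance per intent.
    scores = {
        intent: float(np.dot(query_vec, embed(example)))
        for intent, example in INTENT_EXAMPLES.items()
    }
    best_intent, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Route ambiguous or out-of-scope queries to a fallback handler.
    return best_intent if best_score > 0.8 else "unknown"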

Advantages:
- Clear structural separation between user input and backend processes
- Straightforward debugging and scaling capabilities
- Easy extension of routing logic for new intents
Challenges:
- Limited flexibility with ambiguous queries
- Difficulty handling requests outside predefined categories
Pure Code Routing
For simpler systems, implementing routing logic directly in the application code without external AI or NLP models can be effective. This approach involves hardcoded routing decisions based on predetermined patterns or rules.
This approach is obviously more limited by the needs of your agent; however, if you are working with an agent whose routing can be hard-coded in this way, pure code routing is recommended.
def classify_intent(user_input):
    # Basic rule-based classification
    if "SELECT" in user_input.upper() or "FROM" in user_input.upper():
        return "execute_sql"
    elif "analyze" in user_input.lower() or "summarize" in user_input.lower():
        return "analyze_data"
    return "unknown"
Advantages:
- Superior performance and efficiency
- Complete control over routing logic
- Optimization capabilities for specific use cases
Challenges:
- Limited flexibility
- Scaling difficulties
- Significant rework required for system modifications
Best Practices for Implementation
- Scope Management: Maintain focused and limited scope for router components. Breaking complex tasks into smaller, manageable skills improves execution accuracy and system reliability. This modular approach ensures that each component has a clear, single responsibility.
- Clear Guidelines: Develop comprehensive function calling guidelines and explicit router definitions. Well-documented tool descriptions significantly enhance function call accuracy and system maintainability. This documentation serves as a crucial reference for both development and maintenance phases.
- Performance Monitoring: Implement robust monitoring solutions to track router performance and system behavior. Tools like Arize Phoenix can provide detailed visibility into application operations, helping identify optimization opportunities and potential issues before they impact users.
Making the Right Choice
The selection of an agent routing approach should be guided by several key factors:
- System complexity requirements
- Scalability needs
- Performance constraints
- Maintenance considerations
Whether implementing function calling with LLMs, intent routing, or pure code solutions, ensure your routing logic maintains modularity, scalability, and ease of maintenance. Consider starting with a simpler approach and evolving the system based on actual usage patterns and performance metrics.
An effective agent router is fundamental to building robust AI systems that can handle diverse user requests efficiently. By carefully considering implementation approaches and following established best practices, you can create a routing system that balances flexibility, performance, and maintainability. Regular monitoring and iterative improvements will ensure your agent router continues to meet your system’s evolving needs.
Skills
Agent skills are the fundamental building blocks of an agent’s functionality. These are discrete capabilities that an agent’s router activates based on user intent, allowing the agent to fulfill a wide range of tasks. By compartmentalizing functionality into specific skills, agents can execute complex workflows in a modular, efficient, and maintainable way.
What Are Skills?
Skills are self-contained operations or tasks that an agent can perform. Each skill is designed to address a specific type of user intent and operates as a reusable, independent unit. Think of skills as tools in a toolbox—each tool serves a distinct purpose, and the router selects the right tool for the job.

What Can Skills Be Made Of?
Skills are not limited to a single type of operation. Instead, they can consist of a variety of components:
- Other LLM Calls: Skills may involve chaining language model outputs to refine responses, summarize information, or generate content.
- API Calls: Skills can interact with external APIs to retrieve or send data. For example, a weather query skill might call a weather API to fetch real-time conditions.
- Pure Application Code: Some skills are implemented directly in code to handle specific business logic or computational tasks without relying on external services.
- Database Queries: Skills can connect to structured data sources, running SQL queries or interfacing with NoSQL databases to fetch relevant information.
- Multi-Step Workflows: A single skill may involve orchestrating multiple subtasks, such as combining data retrieval, LLM processing, and validation logic.
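Because the router treats skills as interchangeable units, many teams give every skill a shared interface. Here's a minimal sketch of what that could look like; the names and structure are illustrative, not taken from any particular framework:

from typing import Any, Protocol

class Skill(Protocol):
    name: str
    description: str  # what the router reads to decide when to invoke this skill

    def run(self, query: str, state: dict[str, Any]) -> str: ...

class ShippingStatusSkill:
    name = "check_shipping_status"
    description = "Look up the shipping status of a customer's order."

    def run(self, query: str, state: dict[str, Any]) -> str:
        # Internally this might chain an API call and an LLM call;
        # the router only ever sees the final string response.
        order_id = state.get("order_id", "unknown")
        return f"Order {order_id} is in transit."  # placeholder logic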
Common Types of Skills
Below are some commonly implemented agent skills across different use cases:
- Retrieval-Augmented Generation (RAG): Integrates with knowledge bases or document stores to retrieve contextually relevant data and enhance LLM outputs.
- API Interaction: Fetches external information (e.g., weather, stock prices, shipping data) or triggers external actions (e.g., sending emails, creating tickets).
- Code Generation and Execution: Produces and executes code snippets dynamically, often for automating repetitive tasks or data processing.
- Database Querying: Enables agents to interact with structured data, performing queries to retrieve or update records.
- Reflection and Iteration: Allows agents to review their prior responses, analyze them for errors or improvements, and generate updated outputs.
- Data Analysis: Processes datasets to generate summaries, visualizations, or statistical insights.
- Workflow Automation: Manages multi-step processes by coordinating several sub-skills to achieve an overarching goal.
How Skills Fit Into Agent Architectures
In agent architectures, skills operate as discrete, closed units. This modularity ensures that skills can be developed, tested, and maintained independently.
Here’s how they fit into the broader system:
- Interaction with Memory: Skills may read from or write to an agent’s memory. For example, a conversation history skill might log user interactions, while a summarization skill might use memory to generate context-aware responses.
- Interaction with State: Skills often update or depend on the agent’s state, which reflects the current session’s context or task-specific variables.
- Role of the Router: The router serves as the decision-making component, analyzing user intent and determining which skill to invoke. Importantly, routers treat skills as black boxes—they don’t need to understand the internal workings of a skill, only what it does and when to call it.
Best Practices for Developing Agent Skills
- Design for Modularity: Ensure each skill performs a single, well-defined function. This makes skills easier to debug, test, and reuse across different agents.
- Follow Clear Interfaces: Define input and output standards for each skill. Consistency across skills allows the router to call them seamlessly.
- Incorporate Error Handling: Skills should gracefully handle failures, such as API timeouts or invalid inputs, and provide useful error messages to the router.
- Optimize for Performance: Avoid unnecessary complexity in skills. For instance, cache frequently accessed data to reduce redundant calls.
- Document Skills Thoroughly: Provide clear documentation for each skill, including its purpose, inputs, outputs, and dependencies. This ensures easy onboarding for developers and smooth integration into agent workflows.
- Test in Isolation: Validate each skill independently before integrating it into an agent. Unit tests can help ensure reliability and catch edge cases early.
- Enable Observability: Incorporate logging and monitoring into skills to track their performance and debug issues effectively.
By following these principles, you can create robust, adaptable agent skills that enhance the overall capabilities of your agents and make them more maintainable over time.
Memory and State
The term “memory” is often used in discussions about LLM applications, but it’s an abstraction that can mean different things to different people. Broadly, memory refers to any mechanism by which an LLM application stores and retrieves information for future use. This can encompass state in its many forms:
- Persisted state: Data stored in external databases or other durable storage systems
- In-application state: Information retained only during the active session, which disappears when the application restarts.
What Is State in LLM Applications?
State is the mechanism by which a system retains and uses information over time, connecting past interactions to current behavior. In software engineering, state can be categorized into two main types:
- Stateful systems retain context across interactions.
- Stateless systems treat each interaction independently, with no memory of prior exchanges.
LLMs themselves are stateless. They process each query as a standalone task, generating outputs based solely on the current input. This simplifies model architecture but poses challenges for LLM applications that require context continuity – especially agents.
For example, in a chatbot, messages within the same session may appear stateful, as they influence ongoing interactions. However, once the session ends, the application reverts to a stateless design, if there is no information persisted across sessions.
Agents, on the other hand, are typically stateful. Each action an agent takes is informed by previous actions, responses, or user messages.
Why Managing Memory & State is Essential
Applications require state to deliver consistent, coherent, and efficient user experiences. Without it:
- Users are forced to repeat information
- Applications can’t adapt to ongoing context
- Processing redundant information increases costs
Managing state requires balancing the information you need to keep against the information you don't for your particular application.
LLM State Management Considerations
If you’ve built your own agent, you’ll have full control over what is stored in your state. Here are a few useful factors to consider when establishing your agent’s state.
1. How long should the information be retained?
The context in which state should be carried will affect the design of your state management system. Some illustrative examples (a sketch of these scopes in code follows this list):
- End User Examples:
  - State Across Messages: Maintains continuity during an ongoing session but resets once the session ends. Useful for chatbots, iterative problem-solving, or workflows.
  - State Across Sessions: Persists information across sessions, enabling personalization or long-term task tracking. Essential for applications like virtual assistants or collaborative tools.
- Agent Examples:
  - State Across Tool Calls: Does each tool execution need to carry over context, or can the tools operate independently?
  - State Across Multiple Agents: In multi-agent systems, should agents share state, or should they operate with isolated knowledge?
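Putting a few of these scopes together, here's a minimal sketch of a single-session agent state object; the fields are illustrative:

from dataclasses import dataclass, field

@dataclass
class AgentState:
    # State across messages: resets when the session ends.
    messages: list[dict] = field(default_factory=list)
    # State across tool calls: intermediate results later steps may need.
    tool_results: dict[str, str] = field(default_factory=dict)
    # State across sessions would live outside this object entirely,
    # e.g., in a database keyed by this user ID.
    user_id: str = ""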
2. What is the available context window in your LLM?
LLMs have a fixed context window, limiting how much information they can process at once. As the input grows, performance can degrade, costs increase, and hallucinations become more frequent. State management strategies must account for these limitations to ensure relevance without exceeding token context windows.
3. Costs
Stateful designs often improve user experience and application performance but come with increased costs. Persisted data and larger prompts increase storage and processing demands. Balancing the cost of maintaining state against its benefits is critical for sustainable applications.
4. LLM Application Performance
Depending on the outcome you are trying to produce, some teams will need more complex state management to achieve the results they desire, while others can implement something simpler and still meet their requirements. Some can go with a more brute-force approach, while others will need to refine their state system.
Storing Conversation History
Conversation history is the most common information stored as state. Indeed, if you're using an LLM function calling router, you may be required to pass in previous messages to even get a response.
The most basic approach to storing conversation history involves including all past messages—user inputs, model outputs, tool responses—in subsequent prompts. While straightforward and intuitive, it can quickly run into problems:
- Degraded performance: Large prompts reduce model quality.
- High costs: Token usage grows quadratically with conversation length.
- Context limits: Long histories can exceed the model’s capacity.
This approach works for short interactions but scales poorly in more complex applications. A more nuanced approach is to use a sliding window.
A sliding window retains only the most recent messages, discarding older ones to maintain a fixed context size.
- Benefits: Keeps context relevant, stays within model limits, and controls token costs.
- Drawbacks: Risks losing important earlier details and struggles with long-term dependencies.
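A sliding window can be as simple as list slicing. A minimal sketch, assuming the first message is a system prompt worth pinning:

def sliding_window(messages: list[dict], max_messages: int = 10) -> list[dict]:
    # Pin the system prompt, then keep only the most recent turns.
    system_prompt, history = messages[0], messages[1:]
    return [system_prompt] + history[-max_messages:]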
Sliding windows balance efficiency and relevance but may require supplementation for applications needing deeper context. That supplementation typically comes in the form of summarization.
Summarization condenses past interactions into a compact form. This reduces token usage but has limitations:
- Loss of detail: Important nuances may be omitted.
- Uncertainty: Summaries may not include information needed in future interactions.
Summarization works best as part of a hybrid strategy rather than a standalone solution.
Combining Strategies
Blending approaches—such as combining recent message windows with past summaries—offers a practical balance. This ensures recent context is preserved while retaining essential historical information.
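Here's a minimal sketch of that hybrid, assuming a generic llm() completion helper used to summarize messages that fall out of the recent window:

def build_context(messages: list[dict], llm, window: int = 10) -> list[dict]:
    recent, older = messages[-window:], messages[:-window]
    if not older:
        return recent
    # Compress everything outside the window into a single summary message.
    summary = llm(
        "Summarize this conversation so far, keeping key facts and decisions:\n"
        + "\n".join(m["content"] for m in older)
    )
    return [{"role": "system", "content": f"Conversation summary: {summary}"}] + recent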
These foundational techniques provide a starting point for managing state effectively, tailored to the specific needs of your application.
Storage Types
Choosing the right storage type depends on the persistence requirements, cost, and latency of your application. Here are two common examples:
- Ephemeral conversation history:
- Temporarily stored in application memory, used during a single session. For example, a shopping assistant might track interactions in real time without needing to persist data for future sessions.
- Persistent conversation history:
- Stored in a durable database or vector store, enabling access to prior context across sessions. This is essential for applications like personal assistants that need long-term memory for personalization.
The persistence storage medium, whether a database, blob storage, knowledge graph, or retrieval-augmented generation (RAG) system, should align with your application's complexity, latency requirements, and cost constraints.
Collaboration / Organization Approaches
In complex problem-solving scenarios, deploying multiple AI agents can enhance efficiency and effectiveness. This section delves into multi-agent frameworks, common collaboration strategies, and guidelines for choosing between multi-agent and single-agent approaches.
Multi-Agent Frameworks
Multi-agent frameworks provide the infrastructure for developing, deploying, and managing systems where multiple agents interact. These frameworks offer tools and protocols to facilitate communication, coordination, and collaboration among agents. Notable frameworks include:
- CrewAI: A popular framework that allows developers to build multi-agent automations from scratch, with tools for both coding and no-code environments.
- AutoGen: An open-source framework, similar to CrewAI, that allows individual agents to be defined and structured into different collaborative workflows.
- OpenAI Swarm: An open-source, experimental framework designed for lightweight multi-agent orchestration, supporting seamless handoff of tasks between agents.
Common Agent Collaboration Approaches
Effective collaboration among agents is crucial for achieving complex objectives. Several organizational strategies are commonly employed:
Two-Agent Chat
In this approach, two agents collaborate in a paired conversation, continuously exchanging information and refining their outputs based on each other’s responses. This structure is effective when two specialized areas of expertise need to be integrated dynamically.

- Example: In a legal document review system, one agent specializes in summarizing contracts, while the second agent evaluates legal risks. The summarization agent first condenses the text, then the legal risk agent assesses and highlights potential issues. The two agents iterate on refining the final output based on their combined insights.
- Benefits: Provides a balanced, iterative approach where each agent complements the other. Works well for scenarios requiring deep interaction between two areas of expertise.
- Challenges: Can become inefficient if the agents get stuck in excessive back-and-forth refinements without a clear stopping criterion. Ensuring that each agent contributes meaningful value to the other’s work is crucial.
This setup is particularly useful when one agent’s task directly benefits from iterative refinement by another agent, such as creative writing (idea generation + editing) or code generation (writing + debugging).
Group Chat
In this approach, all agents communicate in a shared environment, exchanging information and coordinating actions collectively. This method is particularly effective when tasks require ongoing collaboration and shared context among agents.
- Example: In a customer support scenario, agents specialized in different areas (e.g., technical issues, billing, product inquiries) operate in a shared “chat room.” A query from a customer is visible to all agents, who can collaborate in real time to resolve complex questions without missing critical details.
- Benefits: High transparency and fluid communication; all agents are aware of the broader context, which helps avoid redundant work.
- Challenges: As the number of agents grows, communication noise can increase, making it harder to maintain focus.
Managed Group Chats
A designated agent acts as a “manager” or coordinator within the group chat. This agent oversees the conversation, directing queries to the most appropriate agents and synthesizing responses. The manager ensures efficiency and avoids redundant efforts.

- Example: In a database chat application, a virtual “manager” agent might coordinate between specialized agents for data querying and analysis. The manager ensures that each agent contributes only when needed and consolidates their inputs into a cohesive response for the user.
- Benefits: Reduces communication noise and enhances coordination. Ensures the right agent handles the right task.
- Challenges: The manager agent becomes a critical component and can introduce a bottleneck if poorly designed.
Sequential
Agents operate in a predefined order, with the output of one agent serving as the input for the next. This ensures a structured flow of information and avoids conflicting actions.

- Example: Consider a social media content system for distributing blog posts (a sketch follows this list). The first agent proofreads and edits a supplied post, a second agent writes social media copy to promote the piece, and a third agent handles the API calls to publish the result on different platforms.
- Benefits: Well-suited for workflows where steps must follow a logical progression.
- Challenges: Errors made by an upstream agent can cascade downstream, so ensuring quality control at each step is critical.
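A minimal sketch of that example pipeline, assuming a generic llm() completion helper and a hypothetical publish() function wrapping the platform APIs:

def run_content_pipeline(draft: str, llm, publish) -> None:
    # Agent 1: proofread and edit the supplied blog post.
    edited = llm(f"Proofread and edit this blog post:\n{draft}")
    # Agent 2: write social media copy promoting the piece.
    social_copy = llm(f"Write a short social post promoting this article:\n{edited}")
    # Agent 3: publish the result to each platform via API calls.
    for platform in ["twitter", "linkedin"]:
        publish(platform, social_copy)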
Hierarchical
Agents are organized in a tiered structure, with higher-level agents supervising and delegating tasks to subordinate agents. This structure is particularly effective for solving complex, multi-faceted problems requiring different levels of abstraction.

- Example: In an e-commerce personalization system, a top-level “strategy agent” decides the overall marketing approach. It delegates tasks like creating personalized product recommendations, generating discount codes, or crafting marketing copy to specialized lower-level agents.
- Benefits: Allows for clear division of labor and decision-making across different levels of expertise. Higher-level agents can maintain a broad perspective while delegating specialized tasks.
- Challenges: Requires careful coordination to ensure subordinate agents’ outputs align with high-level goals. Miscommunication or misalignment at any level can impact results.
When to Use a Multi-Agent Approach Over a Single-Agent Approach
Choosing between a multi-agent and a single-agent approach depends on the complexity and nature of the task:
- Complexity: Tasks involving multiple facets or requiring diverse expertise benefit from specialized agents handling different components.
- Scalability: Multi-agent systems can distribute workloads, enhancing performance in large-scale applications.
- Adaptability: In dynamic environments, multiple agents can independently adapt to changes, maintaining overall system robustness.
- Redundancy and Reliability: Multiple agents can provide redundancy, increasing fault tolerance and reliability.
Conversely, for straightforward tasks with limited scope, a single-agent approach may suffice, offering simplicity and reduced resource consumption.
AI Agent Development Frameworks
Benefits and Drawbacks of Agent Frameworks
There are many popular agent frameworks, with new entrants launching every week. These frameworks usually abstract away some of the more complex aspects of building an agent, including state, routing, and skill descriptions.
Each framework is unique in some way, but comes with similar benefits:
- Ease of set up for basic use cases
- Documented examples – especially with popular frameworks, it's often easier to find example projects using a framework than ones using none.
- Connection to existing orchestration libraries – popular orchestration libraries like LangChain and LlamaIndex have their own agent frameworks. If you're already using one of these libraries, extending to use their agent modules is a natural approach.
The main drawback to using a framework is that, once your agent gets complicated enough, you’ll begin fighting with the framework itself to do what you want.
Should you use a framework to develop your agent?
Regardless of the framework you use, the additional structure provided by these tools can be extremely helpful, especially when building out agent applications. The question of whether these frameworks remain beneficial when creating larger, more complicated applications is a bit more interesting.
We have a fairly strong opinion in this area because we’ve built an assistant ourselves. Our assistant uses a complex, multi-layer router architecture with branches and steps that echo some of the abstractions of the current frameworks. We started building our assistant long before most frameworks were stable and it exists at a complexity level beyond anything that we’ve seen built with the current frameworks. We constantly ask ourselves, if we were starting from scratch, would we use the current framework abstractions? Are they up to the task?
The current answer from us is not yet. There is just too much complexity in the overall system that doesn't lend itself to a Pregel-based architecture. If you squint at our assistant, you can map it to nodes and edges, but the software abstraction would get in our way. As it stands, our team tends to prefer code over frameworks.
We do, however, see the value in the agent framework approach. Namely, it forces an architecture that follows some best practices and comes with good tooling. We also think these frameworks are constantly getting better, expanding where they are useful and what you can do with them. It is very likely that our answer will change in the near future as they improve.
Evaluating AI Agents
Building a good agent is hard.
Setting up a basic agent, on the other hand, is straightforward – especially if you use a framework and a common architecture. However, the difficulty lies in taking that basic agent and turning it into a robust, production-ready tool. This is also where much of the value in your agent comes from.
Evaluation is one of the main tools that will help you transform your agent from a simple demo project into a production tool. Using a thoughtful and structured approach to evaluation is one of the easiest ways to streamline this otherwise challenging process.
Agent development is cyclical, not linear
Building any LLM application involves some amount of cyclical iteration, and agents are no different. It is impossible to anticipate all of the queries your agent will receive, or all of the possible outputs of your models. Only by properly monitoring your production system and integrating the data generated by it into your development processes can you create a truly robust system.
That cycle of iteration typically involves:
- Creating an initial set of representative test cases
- Breaking down your agent into individual steps (e.g. router, skill 1, skill 2, etc.)
- Creating evaluators for each step
- Experimenting with different iterations of your agent, maximizing your eval scores
- Monitoring your agent in production
- Revising your test cases, steps, and evaluators based on production data
- Repeating steps 4-6

Building a set of test cases
Having a standard set of test cases allows us to test changes to our agent and avoid unexpected regressions, and it provides a benchmark for evaluation.
This set doesn’t need to be long but should be comprehensive. For example, if you’re building a chatbot agent for your website, your test cases should include all types of queries the agent supports. This might involve queries that trigger each of your functions or skills, some general information queries, and a set of off-topic queries that your agent should not answer.
Your test cases should cover all the paths your agent can take. You don’t need thousands of test cases if they are just rephrasings of other cases.
Test cases will evolve over time as you discover new types of inputs. Don’t be afraid to add to them, even if doing so makes some evaluation results less comparable to historical runs.
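As an illustration, a test case set for a retail support agent might start as small as this; the queries and expected routes are hypothetical:

TEST_CASES = [
    # One case per skill or function the agent supports.
    {"query": "Where is order ORD-1234?", "expected_skill": "check_shipping_status"},
    {"query": "I want to return my shoes", "expected_skill": "process_return"},
    # General information queries.
    {"query": "What's your return policy?", "expected_skill": "general_info"},
    # Off-topic queries the agent should decline to answer.
    {"query": "Write my homework essay for me", "expected_skill": "unknown"},
]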
Choosing which steps of your agent to evaluate
Next, we split our agent into manageable steps that we want to evaluate individually. These steps should be fairly granular, encompassing individual operations.
Each of your agent’s functions, skills, or execution branches should have some form of evaluation to benchmark their performance. You can get as granular as you’d like. For example, you could evaluate the retrieval step of a RAG skill or the response of an internal API call.
Beyond the skills, it’s critical to evaluate the router on a few axes, which we’ll touch on below. If you’re using a router, this is often where the biggest performance gains can be achieved.
As you can see, the list of "pieces" to evaluate can grow quickly, but that's not necessarily a bad thing. We recommend starting with many evaluations and trimming down over time, especially if you're new to agent development.
Evaluating Agent Skills
With our steps defined, we can now build evaluators for each one. Many frameworks, including Arize’s Phoenix library, can help with this. You can also code your own evaluations, which can be as simple as a string comparison depending on the type of evaluation.
Generally, evaluation can be performed by either comparing outputs to expected outputs or using a separate LLM as a judge. The former approach is great if you have expected outputs to compare to, as it is a deterministic approach. LLM-as-a-judge is helpful when there is no ground truth or when you’re aiming for more qualitative evaluation.
Evaluating the skill steps of an agent is similar to evaluating those skills outside of the agent. If your agent has a RAG skill, for example, you would still evaluate both the retrieval and response generation steps, calculating metrics like document relevance and hallucinations in the response.
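For a flavor of both approaches, here's a minimal sketch, assuming a generic llm() completion helper for the judge; the prompt wording is illustrative:

def exact_match_eval(output: str, expected: str) -> bool:
    # Deterministic check against a ground-truth answer.
    return output.strip().lower() == expected.strip().lower()

def hallucination_eval(question: str, context: str, output: str, llm) -> str:
    # LLM-as-a-judge: useful when there's no ground truth to compare against.
    return llm(
        "Given the question and reference context, label the answer as "
        "'factual' or 'hallucinated'.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {output}"
    )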
Common Skill Evaluations
For RAG Skills:
- Retrieval Relevance
- QA Correctness
- Hallucination
- Reference / Citation
For Code-Gen Skills:
- Code readability
- Code correctness
For API Skills:
- Code-based integration tests and unit tests
For All skills:
- Comparison against ground truth data
For more on skill evaluations, see our LLM Evaluation handbook.
Evaluating a Router
Beyond skills, agent evaluation becomes more agent-specific.
In addition to evaluating the agent’s skills, you need to evaluate the router and the path the agent takes.
The router should be evaluated on two axes: first, its ability to choose the right skill or function for a given input; second, its ability to extract the right parameters from the input to populate the function call.
Choosing the right skill is perhaps the most important task and one of the most difficult. This is where your router prompt (if you have one) will be put to the test. Low scores at this stage usually stem from a poor router prompt or unclear function descriptions, both of which are challenging to improve.

Extracting the right parameters is also tricky, especially when parameters overlap. Consider adding some curveballs into your test cases, like a user asking for an order status while providing a shipping tracking number, to stress-test your agent.
Arize provides built-in evaluators to measure tool call accuracy using an LLM as a judge, which can assist at this stage.
Evaluating your Agent’s Path
Lastly, evaluate the path the agent takes during execution. Does it repeat steps? Get stuck in loops? Return to the router unnecessarily? These “path errors” can cause the worst bugs in agents.
The simplest way to evaluate path is by adding an iteration counter as an evaluation. Tracking the number of steps it takes for the agent to complete different types of queries can provide a useful statistic.
This can then be extended to measure convergence.
Convergence
Convergence in the context of agents refers to how often your agent takes the optimal path for a given query. Is your agent “converging” towards a critical path for a given query? Convergence allows you to measure this.
To calculate this numeric score:
- Run your agent on a set of similar queries
- Record the number of steps taken in each run, and note the minimum number of steps across all runs
- Calculate the convergence score: convergence = (1/N) × Σ over runs (minimum steps for this query type / steps taken in the run), where N is the number of runs
This will give you a numeric 0-1 value of how often your agent is taking the optimal path for that type of query, and by how much it diverges when it takes a suboptimal path.
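In code, the calculation is only a few lines; a minimal sketch:

def convergence_score(steps_per_run: list[int]) -> float:
    # The optimal path length is approximated by the shortest observed run.
    optimal = min(steps_per_run)
    return sum(optimal / steps for steps in steps_per_run) / len(steps_per_run)

# Example: four runs of the same query type took 4, 4, 6, and 8 steps.
print(convergence_score([4, 4, 6, 8]))  # ≈ 0.79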
It’s important to note however that because the optimal path is calculated by the shortest run of your agent, this technique will miss cases where every run of your agent takes a suboptimal path.
Experimenting and Iterating
With your evaluators and test cases defined, you’re ready to modify your agent. After each major modification, run your test cases through the agent, then run each of your evaluators on the output or traces. You can track this process manually or use Arize to track experiments and evaluation results.
While there’s no one-size-fits-all approach to improving your agent, this evaluation framework gives you much greater visibility into your agent’s behavior and performance. It will also give you peace of mind that changes haven’t inadvertently broken other parts of your application.
Congratulations, you’ve reached the end of Arize’s Agent Handbook! 🎉
You’ve learned what agents are, how they’re made, different agent architectures and organization structures, and how to evaluate your agents.
If you have any questions, or just want to share what you’re working on, come join our team in Slack.
Now, go forth and build!