AutoGen is an open-source framework by Microsoft for building multi-agent workflows. The AutoGen agent framework provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.
Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.
UserProxyAgent: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.
AssistantAgent: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.
GroupChat: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.
GroupChatManager: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.
Tool Integration: Agents can use external tools (e.g. Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.
Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
Agent Roles: Poorly defined responsibilities can cause overlap or miscommunication, especially in multi-agent workflows.
Termination Conditions: GroupChat may continue even after a logical end, as UserProxyAgent can exhaust all allowed turns before stopping unless termination is explicitly triggered.
Human-in-the-Loop: Fully autonomous mode may miss important judgment calls without user oversight.
State Management: Excessive context can exceed token limits, while insufficient context breaks coherence.
Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.
AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent focuses on a specialized task, while a UserProxyAgent manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.
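A minimal sketch of such a chain, assuming the classic autogen (v0.2-style) API; the model name, industry, and company are illustrative:

import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # illustrative model/config

trend_agent = autogen.AssistantAgent(
    name="trend_analyst",
    system_message="Identify current market trends for the given industry.",
    llm_config=llm_config,
)
evaluator_agent = autogen.AssistantAgent(
    name="company_evaluator",
    system_message="Evaluate the company's strengths in light of the supplied market trends.",
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy", human_input_mode="NEVER", code_execution_config=False
)

# Step 1: identify trends, then feed the summary into step 2.
step1 = user_proxy.initiate_chat(trend_agent, message="Industry: wearable fitness devices", max_turns=1)
step2 = user_proxy.initiate_chat(
    evaluator_agent,
    message=f"Market trends:\n{step1.summary}\n\nEvaluate Acme Corp against these trends.",
    max_turns=1,
)
print(step2.summary)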
Notebook: Market Analysis Prompt Chaining Agent. The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.
How to evaluate: Ensure outputs are moved into inputs for the next step and logically build across steps (e.g., do identified trends inform the company evaluation?)
Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis
Track total latency and token counts to see which steps cause inefficiencies
Ensure there are no redundant outputs or hallucinations in multi-step reasoning
Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.
AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.
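One way to sketch this with the classic autogen (v0.2-style) API; the category labels and model are illustrative assumptions:

import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # illustrative

router = autogen.AssistantAgent(
    name="router",
    system_message="Classify the query as exactly one of: BILLING, TECH_SUPPORT, PRODUCT_INFO. Reply with the label only.",
    llm_config=llm_config,
)
specialists = {
    "BILLING": autogen.AssistantAgent(name="billing_agent", system_message="Resolve billing questions.", llm_config=llm_config),
    "TECH_SUPPORT": autogen.AssistantAgent(name="tech_agent", system_message="Troubleshoot technical issues.", llm_config=llm_config),
    "PRODUCT_INFO": autogen.AssistantAgent(name="product_agent", system_message="Answer product questions.", llm_config=llm_config),
}
user_proxy = autogen.UserProxyAgent(name="user_proxy", human_input_mode="NEVER", code_execution_config=False)

query = "I was charged twice this month."
label = user_proxy.initiate_chat(router, message=query, max_turns=1).summary.strip().upper()
# Fall back to product info if the router emits an unexpected label.
specialist = specialists.get(label, specialists["PRODUCT_INFO"])
answer = user_proxy.initiate_chat(specialist, message=query, max_turns=1)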
Notebook: Customer Service Routing Agent
We will build an intelligent customer service system, designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent.
How to evaluate: Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)
Confirm that each query is routed to the appropriate specialized AssistantAgent without ambiguity or misdirection
Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately
Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents
The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished and accurate final result.
AutoGen's GroupChat architecture is well suited to this pattern because it manages the conversational turns between the generator and evaluator agents. The GroupChatManager facilitates the dialogue, allowing the agents to exchange evolving outputs and feedback.
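A rough shape of that loop with the classic autogen GroupChat API; the approval keyword and round limit are assumptions:

import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # illustrative

generator = autogen.AssistantAgent(
    name="Code_Generator",
    system_message="Write Python code for the stated requirements. Revise when given feedback.",
    llm_config=llm_config,
)
reviewer = autogen.AssistantAgent(
    name="Code_Reviewer",
    system_message="Review the code for correctness, style, and documentation. Reply APPROVED when it meets the bar.",
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
    # Stop the chat once the reviewer signals approval.
    is_termination_msg=lambda msg: "APPROVED" in (msg.get("content") or ""),
)

groupchat = autogen.GroupChat(agents=[user_proxy, generator, reviewer], messages=[], max_round=8)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user_proxy.initiate_chat(manager, message="Write a function that parses ISO-8601 dates.")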
Notebook: Code Generator with Evaluation Loop
We'll use a Code_Generator agent to write Python code from requirements, and a Code_Reviewer agent to assess it for correctness, style, and documentation. This iterative GroupChat process improves code quality through a generation-and-review loop.
How to evaluate: Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)
Confirm that the generator incorporates feedback into meaningful revisions with each iteration
Track the number of iterations required to reach an acceptable or final version to assess efficiency
Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process
Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.
AutoGen simplifies this pattern through a central orchestrator (GroupChatManager) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but contributes only when its specific expertise is required.
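A condensed sketch, assuming the classic autogen API; the specialist agents, model, and trip request are illustrative:

import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # illustrative

flight_agent = autogen.AssistantAgent(name="flight_agent", system_message="Handle flight search questions only.", llm_config=llm_config)
hotel_agent = autogen.AssistantAgent(name="hotel_agent", system_message="Handle hotel and lodging questions only.", llm_config=llm_config)
activities_agent = autogen.AssistantAgent(name="activities_agent", system_message="Suggest local activities only.", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(name="traveler", human_input_mode="NEVER", code_execution_config=False)

# The GroupChatManager picks the next speaker based on the conversation state,
# so only the relevant specialist responds to each subtask.
groupchat = autogen.GroupChat(
    agents=[user_proxy, flight_agent, hotel_agent, activities_agent],
    messages=[],
    max_round=10,
    speaker_selection_method="auto",
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user_proxy.initiate_chat(manager, message="Plan a 3-day trip to Lisbon with a mid-range budget.")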
Notebook: Trip Planner Orchestrator Agent
We will build a dynamic travel planning assistant. A GroupChatManager coordinates specialized agents to adapt to the user's evolving travel needs.
How to evaluate: Ensure the orchestrator activates only relevant agents based on the current context or user need (e.g., flights, hotels, local activities)
Confirm that agents contribute meaningfully and only when their domain expertise is required
Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents
Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically
Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.
AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.
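For example, a sketch using asyncio with autogen's async chat entry point (a_initiate_chat); the component names and model are illustrative:

import asyncio
import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # illustrative

async def draft(component: str) -> str:
    # Separate agent/proxy per task so concurrent chats don't share history.
    marketer = autogen.AssistantAgent(
        name=f"marketer_{component.replace(' ', '_')}",
        system_message="Write the requested product-description component for a smartwatch.",
        llm_config=llm_config,
    )
    proxy = autogen.UserProxyAgent(
        name=f"proxy_{component.replace(' ', '_')}", human_input_mode="NEVER", code_execution_config=False
    )
    result = await proxy.a_initiate_chat(marketer, message=f"Write the {component}.", max_turns=1)
    return result.summary

async def main():
    components = ["features", "value proposition", "target customer", "tagline"]
    drafts = await asyncio.gather(*(draft(c) for c in components))
    print("\n\n".join(drafts))

asyncio.run(main())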
Notebook: Product Description Parallelization Agent. We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, results are synthesized together.
How to evaluate: Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)
Confirm that all outputs are successfully collected and synthesized into a cohesive final product description
Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution
Test with varying product types to assess generality and stability of the parallel workflow
CrewAI is an open-source framework for building and orchestrating collaborative AI agents that act like a team of specialized virtual employees. Built on LangChain, it enables users to define roles, goals, and workflows for each agent, allowing them to work together autonomously on complex tasks with minimal setup.
Agents are autonomous, role-driven entities designed to perform specific functions—like a Researcher, Writer, or Support Rep. They can be richly customized with goals, backstories, verbosity settings, delegation permissions, and access to tools. This flexibility makes agents expressive and task-aware, helping model real-world team dynamics.
Tasks are the atomic units of work in CrewAI. Each task includes a description, expected output, responsible agent, and optional tools. Tasks can be executed solo or collaboratively, and they serve as the bridge between high-level goals and actionable steps.
Tools give agents capabilities beyond language generation—such as browsing the web, fetching documents, or performing calculations. Tools can be native or developer-defined using the BaseTool class, and each must have a clear name and description so agents can invoke them appropriately.
CrewAI supports multiple orchestration strategies:
Sequential: Tasks run in a fixed order—simple and predictable.
Hierarchical: A manager agent or LLM delegates tasks dynamically, enabling top-down workflows.
Consensual (planned): Future support for democratic, collaborative task routing. Each process type shapes how coordination and delegation unfold within a crew.
A crew is a collection of agents and tasks governed by a defined process. It represents a fully operational unit with an execution strategy, internal collaboration logic, and control settings for verbosity and output formatting. Think of it as the operating system for multi-agent workflows.
Pipelines chain multiple crews together, enabling multi-phase workflows where the output of one crew becomes the input to the next. This allows developers to modularize complex applications into reusable, composable segments of logic.
With planning enabled, CrewAI generates a task-by-task strategy before execution using an AgentPlanner. This enriches each task with context and sequencing logic, improving coordination—especially in multi-step or loosely defined workflows.
Agent Roles: Explicit role configuration gives flexibility, but poor design can cause overlap or miscommunication.
State Management: Stateless by default. Developers must implement external state or context passing for continuity across tasks.
Task Planning: Supports sequential and branching workflows, but all logic must be manually defined—no built-in planning.
Tool Usage: Agents support tools via config. No automatic selection; all tool-to-agent mappings are manual.
Termination Logic: No auto-termination handling. Developers must define explicit conditions to break recursive or looping behavior.
Memory: No built-in memory layer. Integration with vector stores or databases must be handled externally.
Prompt chaining decomposes a complex task into a sequence of smaller steps, where each LLM call operates on the output of the previous one. This workflow introduces the ability to add programmatic checks (such as “gates”) between steps, validating intermediate outputs before continuing. The result is higher control, accuracy, and debuggability—at the cost of increased latency.
CrewAI makes it straightforward to build prompt chaining workflows using a sequential process. Each step is modeled as a Task, assigned to a specialized Agent, and executed in order using Process.sequential. You can insert validation logic between tasks or configure agents to flag issues before passing outputs forward.
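A minimal sketch of a two-step chain; the agent roles, goals, and research topic are illustrative:

from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Gather key facts about the assigned topic",
    backstory="A meticulous analyst who collects accurate, sourced information.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short, readable article",
    backstory="A concise technical writer.",
)

research_task = Task(
    description="Research the current state of on-device LLM inference.",
    expected_output="A bullet list of 5 key findings.",
    agent=researcher,
)
writing_task = Task(
    description="Write a 200-word summary based on the research findings.",
    expected_output="A 200-word article.",
    agent=writer,
    context=[research_task],  # receives the previous task's output
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task], process=Process.sequential)
result = crew.kickoff()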
Notebook: Research-to-Content Prompt Chaining Workflow
Routing is a pattern designed to classify incoming requests and dispatch them to the single most appropriate specialist agent or workflow, ensuring each input is handled by a focused, expert-driven routine.
In CrewAI, you implement routing by defining a Router Agent that inspects each input, emits a category label, and then dynamically delegates to downstream agents (or crews) tailored for that category—each equipped with its own tools and prompts. This separation of concerns delivers more accurate, maintainable pipelines.
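One way to sketch this pattern; the categories, crews, and request are illustrative, and the dispatch itself is plain Python around CrewAI:

from crewai import Agent, Task, Crew, Process

def make_crew(role: str, goal: str, description: str) -> Crew:
    agent = Agent(role=role, goal=goal, backstory=f"A specialist {role.lower()}.")
    task = Task(description=description, expected_output="A concise response.", agent=agent)
    return Crew(agents=[agent], tasks=[task], process=Process.sequential)

research_crew = make_crew("Researcher", "Dig up facts", "Research the request: {request}")
content_crew = make_crew("Writer", "Write polished copy", "Write content for the request: {request}")

router = Agent(role="Router", goal="Classify requests as 'research' or 'content'", backstory="A triage specialist.")

def classify(request: str) -> str:
    task = Task(
        description=f"Classify this request with exactly one word, research or content: {request}",
        expected_output="One word: research or content.",
        agent=router,
    )
    return str(Crew(agents=[router], tasks=[task]).kickoff()).strip().lower()

request = "Summarize recent papers on retrieval-augmented generation."
chosen = research_crew if "research" in classify(request) else content_crew
print(chosen.kickoff(inputs={"request": request}))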
Notebook: Research-Content Routing Workflow
Parallelization is a powerful agent workflow where multiple tasks are executed simultaneously, enabling faster and more scalable LLM pipelines. This pattern is particularly effective when tasks are independent and don’t depend on each other’s outputs.
While CrewAI does not enforce true multithreaded execution, it provides a clean and intuitive structure for defining parallel logic through multiple agents and tasks. These can be executed concurrently in terms of logic, and then gathered or synthesized by a downstream agent.
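For instance, independent tasks can be marked async_execution=True and then gathered by a final synthesis task; the topics and word counts are illustrative:

from crewai import Agent, Task, Crew, Process

analyst = Agent(role="Market Analyst", goal="Research assigned topics", backstory="A fast, thorough researcher.")
editor = Agent(role="Editor", goal="Combine findings into one brief", backstory="A synthesizer of research.")

task_a = Task(description="Research trends in wearable devices.", expected_output="5 bullet points.", agent=analyst, async_execution=True)
task_b = Task(description="Research trends in smart home devices.", expected_output="5 bullet points.", agent=analyst, async_execution=True)
synthesis = Task(
    description="Merge the two research outputs into a single 150-word brief.",
    expected_output="A 150-word brief.",
    agent=editor,
    context=[task_a, task_b],  # waits for both async tasks to finish
)

crew = Crew(agents=[analyst, editor], tasks=[task_a, task_b, synthesis], process=Process.sequential)
print(crew.kickoff())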
Notebook: Parallel Research Agent
The Orchestrator-Workers workflow centers around a primary agent—the orchestrator—that dynamically decomposes a complex task into smaller, more manageable subtasks. Rather than relying on a fixed structure or pre-defined subtasks, the orchestrator decides what needs to be done based on the input itself. It then delegates each piece to the most relevant worker agent, often specialized in a particular domain like research, content synthesis, or evaluation.
CrewAI supports this pattern using the Process.hierarchical setup, where the orchestrator (as the manager agent) generates follow-up task specifications at runtime. This enables dynamic delegation and coordination without requiring the workflow to be rigidly structured up front. It's especially useful for use cases like multi-step research, document generation, or problem-solving workflows where the best structure only emerges after understanding the initial query.
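A compact sketch of a hierarchical crew; the manager model identifier is an assumption (depending on your CrewAI version you may pass an LLM object instead), and the brief's topic is illustrative:

from crewai import Agent, Task, Crew, Process

researcher = Agent(role="Researcher", goal="Gather supporting facts", backstory="A domain researcher.")
writer = Agent(role="Writer", goal="Draft clear prose from research", backstory="A technical writer.")

brief = Task(
    description="Produce a short, well-sourced brief on the impact of vector databases on RAG systems.",
    expected_output="A 300-word brief with key facts.",
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[brief],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # the manager delegates subtasks to the agents at runtime
)
print(crew.kickoff())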
Notebook: Research & Writing Delegation Agents
SmolAgents is a lightweight Python library for composing tool-using, task-oriented agents. This guide outlines common agent workflows we've implemented—covering routing, evaluation loops, task orchestration, and parallel execution. For each pattern, we include an overview, a reference notebook, and guidance on how to evaluate agent quality.
While the API is minimal—centered on Agent, Task, and Tool—there are important tradeoffs and design constraints to be aware of.
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
Notebook: The agent first extracts keywords from a resume, then summarizes what those keywords suggest.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do summaries reflect the extracted keywords?)
Check if the intermediate step (e.g. keyword extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated agent, often using lightweight classification.
Notebook: The agent classifies candidate profiles into Software, Product, or Design categories, then hands them off to the appropriate evaluation pipeline.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right department for a given candidate?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors (e.g., ambiguous or mixed-signal profiles)
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
Notebook: An agent writes a candidate rejection email. If the evaluator agent finds the tone or feedback lacking, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the message respectful, clear, and specific?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
Notebook: The orchestrator delegates resume review, culture fit assessment, and decision-making to different agents, then composes a final recommendation.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final recommendation align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components (e.g., technical fit vs. recommendation)
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
Notebook: Candidate reviews are distributed using asyncio, enabling faster batch processing without compromising output quality.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are profiles processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
This guide explains key LangGraph concepts, discusses design considerations, and walks through common architectural patterns like orchestrator-worker, evaluators, and routing. Each pattern includes a brief explanation and links to runnable Python notebooks.
LangGraph allows you to build LLM-powered applications using a graph of steps (called "nodes") and data (called "state"). Here's what you need to know to understand and customize LangGraph workflows:
State: A TypedDict that stores all information passed between nodes. Think of it as the memory of your workflow. Each node can read from and write to the state.
Nodes: Units of computation. Most often these are functions that accept a State input and return a partial update to it. Nodes can do anything: call LLMs, trigger tools, perform calculations, or prompt users.
Edges: Directed connections that define the order in which nodes are called. LangGraph supports linear, conditional, and cyclical edges, which allows for building loops, branches, and recovery flows.
Conditional edge function: A Python function that examines the current state and returns the name of the next node to call. This allows your application to respond dynamically to LLM outputs, tool results, or even human input.
Send API: A way to dynamically launch multiple workers (nodes or subgraphs) in parallel, each with their own state. Often used in orchestrator-worker patterns where the orchestrator doesn't know how many tasks there will be ahead of time.
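Putting those pieces together, a minimal graph with a state, a few nodes, and a conditional edge might look like this sketch (node names and the keyword-based routing logic are illustrative stand-ins for LLM calls):

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def classify(state: State) -> dict:
    # A real node might call an LLM here; this sketch just passes through.
    return {}

def handle_math(state: State) -> dict:
    return {"answer": "math route"}

def handle_general(state: State) -> dict:
    return {"answer": "general route"}

def route(state: State) -> str:
    # Conditional edge: return the name of the next node.
    return "math" if "calculate" in state["question"].lower() else "general"

builder = StateGraph(State)
builder.add_node("classify", classify)
builder.add_node("math", handle_math)
builder.add_node("general", handle_general)
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", route, {"math": "math", "general": "general"})
builder.add_edge("math", END)
builder.add_edge("general", END)

graph = builder.compile()
print(graph.invoke({"question": "Calculate 2 + 2", "answer": ""}))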
LangGraph enables complex multi-agent orchestration using a Supervisor node that decides how to delegate tasks among a team of agents. Each agent can have its own tools, prompt structure, and output format. The Supervisor coordinates routing, manages retries, and ensures loop control.
LangGraph supports built-in persistence using checkpointing. Each execution step saves state to a database (in-memory, SQLite, or Postgres). This allows for:
Multi-turn conversations (memory)
Rewinding to past checkpoints (time travel)
Human-in-the-loop workflows (pause + resume)
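The persistence described above is enabled by compiling the graph with a checkpointer and passing a thread_id at invocation time. A small self-contained sketch, using the in-memory saver and an echo node in place of a real LLM call:

from typing import TypedDict
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    messages: list[str]

def respond(state: ChatState) -> dict:
    # A real node would call an LLM; here we just append a placeholder reply.
    return {"messages": state["messages"] + ["(reply)"]}

builder = StateGraph(ChatState)
builder.add_node("respond", respond)
builder.add_edge(START, "respond")
builder.add_edge("respond", END)

# In-memory checkpointing; SQLite/Postgres savers follow the same pattern.
graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "demo-thread"}}

graph.invoke({"messages": ["hi"]}, config)
graph.invoke({"messages": ["hi again"]}, config)       # resumes the same thread (memory)
past_states = list(graph.get_state_history(config))    # time travel over saved checkpoints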
LangGraph improves on LangChain by supporting more flexible and complex workflows. Here’s what to keep in mind when designing:
A linear sequence of prompt steps, where the output of one becomes the input to the next. This workflow is optimal when the task can be simply broken down into concrete subtasks.
Use case: Multistep reasoning, query rewriting, or building up answers gradually.
Runs multiple LLMs in parallel — either by splitting tasks (sectioning) or getting multiple opinions (voting).
Use case: Combining diverse outputs, evaluating models from different angles, or running safety checks.
With the Send API, LangGraph lets you:
Launch multiple safety evaluators in parallel
Compare multiple generated hypotheses side-by-side
Run multi-agent voting workflows
This improves reliability and reduces bottlenecks in linear pipelines.
Routes an input to the most appropriate follow-up node based on its type or intent.
Use case: Customer support bots, intent classification, or model selection.
LangGraph routers enable domain-specific delegation — e.g., classify an incoming query as "billing", "technical support", or "FAQ", and send it to a specialized sub-agent. Each route can have its own tools, memory, and context. Use structured output with a routing schema to make classification more reliable.
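A sketch of such a router node using structured output; the model name and category labels are assumptions:

from typing import Literal
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Route(BaseModel):
    category: Literal["billing", "technical_support", "faq"]

router_llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(Route)

def route_node(state: dict) -> dict:
    # The structured-output schema forces the LLM to pick exactly one label.
    decision = router_llm.invoke(f"Classify this support query: {state['query']}")
    return {"category": decision.category}

def choose_branch(state: dict) -> str:
    # Used as a conditional edge: the returned name selects the specialist sub-agent node.
    return state["category"]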
One LLM generates content, another LLM evaluates it, and the loop repeats until the evaluation passes. LangGraph allows feedback to modify the state, making each round better than the last.
Use case: Improving code, jokes, summaries, or any generative output with measurable quality.
An orchestrator node dynamically plans subtasks and delegates each to a worker LLM. Results are then combined into a final output.
Use case: Writing research papers, refactoring code, or composing modular documents.
LangGraph’s Send API lets the orchestrator fork off tasks (e.g., subsections of a paper) and gather them into completed_sections. This is especially useful when the number of subtasks isn’t known in advance.
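The fan-out itself looks roughly like this sketch; the state keys and the hard-coded planning/drafting logic stand in for LLM calls, and Send is imported from langgraph.types in recent releases:

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send

class ReportState(TypedDict):
    topic: str
    sections: list[str]
    completed_sections: Annotated[list[str], operator.add]  # workers append here
    report: str

class WorkerState(TypedDict):
    section: str

def orchestrator(state: ReportState) -> dict:
    # A real orchestrator would ask an LLM to plan the sections.
    return {"sections": [f"{state['topic']}: background", f"{state['topic']}: findings"]}

def assign_workers(state: ReportState):
    # Fork one worker per planned section.
    return [Send("worker", {"section": s}) for s in state["sections"]]

def worker(state: WorkerState) -> dict:
    # A real worker would draft the section with an LLM.
    return {"completed_sections": [f"Drafted: {state['section']}"]}

def synthesizer(state: ReportState) -> dict:
    return {"report": "\n".join(state["completed_sections"])}

builder = StateGraph(ReportState)
builder.add_node("orchestrator", orchestrator)
builder.add_node("worker", worker)
builder.add_node("synthesizer", synthesizer)
builder.add_edge(START, "orchestrator")
builder.add_conditional_edges("orchestrator", assign_workers, ["worker"])
builder.add_edge("worker", "synthesizer")
builder.add_edge("synthesizer", END)

result = builder.compile().invoke({"topic": "LLM agents", "sections": [], "completed_sections": [], "report": ""})
print(result["report"])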
You can also incorporate agents like a PDF_Reader or a WebSearcher, and the orchestrator can choose when to route to these workers.
⚠️ Caution: Feedback loops or improper edge handling can cause workers to echo each other or create infinite loops. Use strict conditional routing to avoid this.
API centered on Agent, Task, and Tool
Tools are just Python functions decorated with @tool. There’s no centralized registry or schema enforcement, so developers must define conventions and structure on their own.
Provides flexibility for orchestration
No retry mechanism or built-in workflow engine
Supports evaluator-optimizer loops, routing, and fan-out/fan-in
Agents are composed, not built-in abstractions
Must implement orchestration logic
Multi-Agent support
No built-in support for collaboration structures like voting, planning, or debate.
Token-level streaming is not supported
No state or memory management out of the box. Applications that require persistent state—such as conversations or multi-turn workflows—will need to integrate external storage (e.g., a vector database or key-value store).
There’s no native memory or “trajectory” tracking between agents. Handoffs between tasks are manual. This is workable in small systems, but may require structure in more complex workflows.
Cyclic workflows: LangGraph supports loops, retries, and iterative workflows that would be cumbersome in LangChain.
Debugging complexity: Deep graphs and multi-agent networks can be difficult to trace. Use Arize AX or Phoenix!
Fine-grained control: Customize prompts, tools, state updates, and edge logic for each node.
Token bloat: Cycles and retries can accumulate state and inflate token usage.
Visualize: Graph visualization makes it easier to follow logic flows and complex routing.
Requires upfront design: Graphs must be statically defined before execution. No dynamic graph construction mid-run.
Supports multi-agent coordination: Easily create agent networks with Supervisor and worker roles.
Supervisor misrouting: If not carefully designed, supervisors may loop unnecessarily or reroute outputs to the wrong agent.
Google's GenAI SDK is a framework designed to help you interact with Gemini models and models run through VertexAI. Out of all the frameworks detailed in this guide, GenAI SDK is the closest to a base model SDK. While it does provide helpful functions and concepts to streamline tool calling, structured output, and passing files, it does not approach the level of abstraction of frameworks like CrewAI or Autogen.
In April 2025, Google launched its ADK framework, which is a more comparable agent orchestration framework to the others on this list.
That said, because of the relative simplicity of the GenAI SDK, this guide serves as a good learning tool to show how some of the common agent patterns can be manually implemented.
GenAI SDK uses contents to represent user messages, files, system messages, function calls, and invocation parameters. That creates relatively simple generation calls:
from google import genai

client = genai.Client()  # assumes an API key is configured in the environment

file = client.files.upload(file='a11.txt')
response = client.models.generate_content(
model='gemini-2.0-flash-001',
contents=['Could you summarize this file?', file]
)
print(response.text)
Content objects can also be composed together in a list:
[
    types.UserContent(
        parts=[
            types.Part.from_text(text='What is this image about?'),
            types.Part.from_uri(
                file_uri='gs://generativeai-downloads/images/scones.jpg',
                mime_type='image/jpeg',
            ),
        ]
    )
]
Google GenAI does not include built-in orchestration patterns.
GenAI has no concept of handoffs natively.
State is handled by maintaining previous messages and other data in a list of content objects. This is similar to how other model SDKs like OpenAI and Anthropic handle the concept of state, and stands in contrast to the more sophisticated state management found in agent orchestration frameworks.
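For multi-turn use, the SDK's chat helper maintains that content list for you. A small sketch (the model name and messages are illustrative):

from google import genai

client = genai.Client()  # assumes an API key is configured in the environment

chat = client.chats.create(model='gemini-2.0-flash-001')
chat.send_message('My name is Ada. Remember it.')
second = chat.send_message('What is my name?')  # prior turns are included automatically
print(second.text)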
GenAI does include some convenience features around tool calling. Functions passed via types.GenerateContentConfig are automatically converted into function declarations; the SDK uses each function's docstring to understand its purpose and arguments.
def get_current_weather(location: str) -> str:
"""Returns the current weather.
Args:
location: The city and state, e.g. San Francisco, CA
"""
return 'sunny'
response = client.models.generate_content(
model='gemini-2.0-flash-001',
contents='What is the weather like in Boston?',
config=types.GenerateContentConfig(tools=[get_current_weather]),
)
print(response.text)
GenAI will also automatically call the function and incorporate its return value. This goes a step beyond what similar model SDKs do on other platforms. This behavior can be disabled.
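For example, automatic execution can be turned off so the model only returns the function call for you to run yourself. A sketch reusing the client and weather function defined above:

from google.genai import types

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents='What is the weather like in Boston?',
    config=types.GenerateContentConfig(
        tools=[get_current_weather],
        # Return the function call instead of executing it automatically.
        automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
    ),
)
print(response.candidates[0].content.parts[0].function_call)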
GenAI has no built-in concept of memory.
GenAI has no built-in collaboration strategies. These must be defined manually.
GenAI supports streaming of both text and image responses:
for chunk in client.models.generate_content_stream(
model='gemini-2.0-flash-001', contents='Tell me a story in 300 words.'
):
print(chunk.text, end='')
GenAI is the "simplest" framework in this guide, and is closer to a pure model SDK like the OpenAI SDK than to an agent framework. It does go a few steps beyond these base SDKs, however, notably in tool calling. It is a good option if you're using Gemini models and want more direct control over your agent system.
Content approach streamlines message management
No built-in orchestration capabilities
Supports automatic tool calling
No state or memory management
Allows for all agent patterns, but each must be manually set up
Primarily designed to work with Gemini models
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
Notebook: Research Agent. The agent first researches a topic, then provides an executive summary of its results, and finally recommends future focus directions.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do key points reflect the original research?)
Check if the intermediate step (e.g. key point extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated call, often using lightweight classification.
Notebook: Simple Tool Router. This agent shows a simple example of routing user inputs to different tools.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right tool for a given input?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
Notebook: Story Writing Agent. An agent generates an initial draft of a story, then a critique agent decides whether the quality is high enough. If not, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the story engaging, clear, and well-written?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
Notebook: Travel Planning Agent. The orchestrator delegates planning a trip for a user, and incorporates a user proxy to improve its quality. The orchestrator delegates to specific functions to plan flights, hotels, and provide general travel recommendations.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final output align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
Notebook: Parallel Research Agent. Multiple research topics are examined simultaneously. Once all are complete, the topics are synthesized into a final combined report.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are topics processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
Workflows are the backbone of many successful LLM applications. They define how language models interact with tools, data, and users—often through a sequence of clearly orchestrated steps. Unlike fully autonomous agents, workflows offer structure and predictability, making them a practical choice for many real-world tasks.
In this guide, we share practical workflows using a variety of agent frameworks, including:
Each section highlights how to use these tools effectively—showing what’s possible, where they shine, and where a simpler solution might serve you better. Whether you're orchestrating deterministic workflows or building dynamic agent systems, the goal is to help you choose the right tool for your context and build with confidence.
For a deeper dive into the principles behind agentic systems and when to use them, see Anthropic’s “Building Effective Agents”.
Agent Routing is the process of directing a task, query, or request to the most appropriate agent based on context or capabilities. In multi-agent systems, it helps determine which agent is best suited to handle a specific input based on skills, domain expertise, or available tools. This enables more efficient, accurate, and specialized handling of complex tasks.
Prompt Chaining is the technique of breaking a complex task into multiple steps, where the output of one prompt becomes the input for the next. This allows a system to reason more effectively, maintain context across steps, and handle tasks that would be too difficult to solve in a single prompt. It's often used to simulate multi-step thinking or workflows.
Parallelization is the process of dividing a task into smaller, independent parts that can be executed simultaneously to speed up processing. It’s used to handle multiple inputs, computations, or agent responses at the same time rather than sequentially. This improves efficiency and speed, especially for large-scale or time-sensitive tasks.
An orchestrator is a central controller that manages and coordinates multiple components, agents, or processes to ensure they work together smoothly.
It decides what tasks need to be done, who or what should do them, and in what order. An orchestrator can handle things like scheduling, routing, error handling, and result aggregation. It might also manage prompt chains, route tasks to agents, and oversee parallel execution.
An evaluator assesses the quality or correctness of outputs, such as ranking responses, checking for factual accuracy, or scoring performance against a metric. An optimizer uses that evaluation to improve future outputs, either by fine-tuning models, adjusting parameters, or selecting better strategies. Together, they form a feedback loop that helps a system learn what works and refine itself over time.
OpenAI-Agents is a lightweight Python library for building agentic AI apps. It includes a few abstractions:
Agents, which are LLMs equipped with instructions and tools
Handoffs, which allow agents to delegate to other agents for specific tasks
Guardrails, which enable the inputs to agents to be validated
This guide outlines common agent workflows using this SDK. We will walk through building an investment agent across several use cases.
Model support
First-class support for OpenAI LLMs, and basic support for any LLM via a LiteLLM wrapper. The reasoning-effort parameter lets you trade off reduced latency against increased accuracy.
Structured outputs
First-class support with OpenAI LLMs; support is more limited for LLMs that do not accept json_schema as a parameter.
Tools
Very easy, using the @function_tool decorator. Support for parallel tool calls to reduce latency. Built-in WebSearchTool, ComputerTool, and FileSearchTool.
Agent handoff
Very easy, using the handoffs parameter.
Multimodal support
Voice support, no support for images or video
Guardrails
Enables validation of both inputs and outputs
Retry logic
⚠️ No retry logic, developers must manually handle failure cases
Memory
⚠️ No built-in memory management. Developers must manage their own conversation and user memory.
Code execution
⚠️ No built-in support for executing code
An LLM agent with access to tools to accomplish a task is the most basic flow. This agent answers questions about stocks and uses OpenAI web search to get real time information.
from agents import Agent, Runner, WebSearchTool
agent = Agent(
name="Finance Agent",
instructions="You are a finance agent that can answer questions about stocks. Use web search to retrieve up‑to‑date context. Then, return a brief, concise answer that is one sentence long.",
tools=[WebSearchTool()],
model="gpt-4.1-mini",
)
This agent builds a portfolio of stocks and ETFs using multiple agents linked together:
Search Agent: Searches the web for information on particular stock tickers.
Report Agent: Creates a portfolio of stocks and ETFs that supports the user's investment strategy.
portfolio_agent = Agent(
name="Portfolio Agent",
instructions="You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation.",
model="o4-mini",
output_type=Portfolio,
)
research_agent = Agent(
name="FinancialSearchAgent",
instructions="You are a research assistant specializing in financial topics. Given an investment strategy, use web search to retrieve up‑to‑date context and produce a short summary of stocks that support the investment strategy at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
model="gpt-4.1",
tools=[WebSearchTool()],
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
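The chain itself is then just two Runner calls, with the research output fed into the portfolio prompt. A sketch using the agents defined above; the strategy text is illustrative:

import asyncio
from agents import Runner

async def build_portfolio(strategy: str):
    research = await Runner.run(research_agent, strategy)
    portfolio = await Runner.run(
        portfolio_agent,
        f"Investment strategy: {strategy}\n\nResearch report:\n{research.final_output}",
    )
    return portfolio.final_output  # a Portfolio, since output_type is set

print(asyncio.run(build_portfolio("Dividend growth with moderate risk")))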
This agent researches stocks for you. If we want to research 5 stocks, we can force the agent to run multiple tool calls in parallel, instead of sequentially.
@function_tool
def get_stock_data(ticker_symbol: str) -> dict:
"""
Get stock data for a given ticker symbol.
Args:
ticker_symbol: The ticker symbol of the stock to get data for.
Returns:
A dictionary containing stock data such as price, market cap, and more.
"""
import yfinance as yf
stock = yf.Ticker(ticker_symbol)
return stock.info
research_agent = Agent(
name="FinancialSearchAgent",
instructions=dedent(
"""You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
),
model="gpt-4.1",
tools=[WebSearchTool(), get_stock_data_tool],
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
This agent answers questions about investing using multiple agents. A central router agent chooses which worker to use.
Research Agent: Searches the web for information about stocks and ETFs.
Question Answering Agent: Answers questions about investing like Warren Buffett.
qa_agent = Agent(
name="Investing Q&A Agent",
instructions="You are Warren Buffett. You are answering questions about investing.",
model="gpt-4.1",
)
research_agent = Agent(
name="Financial Search Agent",
instructions="You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
model="gpt-4.1",
tools=[WebSearchTool()],
)
orchestrator_agent = Agent(
name="Routing Agent",
instructions="You are a senior financial analyst. Your task is to handoff to the appropriate agent or tool.",
model="gpt-4.1",
handoffs=[research_agent,qa_agent],
)
When creating LLM outputs, the first generation is often unsatisfactory. You can use an agentic loop to iteratively improve the output: ask an LLM for feedback, then use that feedback to revise.
This agent pattern creates reports and evaluates itself to improve its output.
Report Agent (Generation): Creates a report on a particular stock ticker.
Evaluator Agent (Feedback): Evaluates the report and provides feedback on what to improve.
class EvaluationFeedback(BaseModel):
feedback: str = Field(
description=f"What is missing from the research report on positive and negative catalysts for a particular stock ticker. Catalysts include changes in {CATALYSTS}.")
score: Literal["pass", "needs_improvement", "fail"] = Field(
description="A score on the research report. Pass if the report is complete and contains at least 3 positive and 3 negative catalysts for the right stock ticker, needs_improvement if the report is missing some information, and fail if the report is completely wrong.")
report_agent = Agent(
name="Catalyst Report Agent",
instructions=dedent(
"""You are a research assistant specializing in stock research. Given a stock ticker, generate a report of 3 positive and 3 negative catalysts that could move the stock price in the future in 50 words or less."""
),
model="gpt-4.1",
)
evaluation_agent = Agent(
name="Evaluation Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
),
model="gpt-4.1",
output_type=EvaluationFeedback,
)
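A sketch of the generate-and-evaluate loop around these two agents; the retry limit and the way feedback is stitched into the next prompt are assumptions:

import asyncio
from agents import Runner

async def generate_report(ticker: str, max_rounds: int = 3) -> str:
    prompt = ticker
    report = ""
    for _ in range(max_rounds):
        report = (await Runner.run(report_agent, prompt)).final_output
        evaluation = (await Runner.run(evaluation_agent, report)).final_output
        if evaluation.score == "pass":
            break
        # Feed the feedback back to the generator for the next revision.
        prompt = f"Ticker: {ticker}\nPrevious report:\n{report}\nFeedback:\n{evaluation.feedback}"
    return report

print(asyncio.run(generate_report("NVDA")))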
This is the most advanced pattern in the examples, using orchestrators and workers together. The orchestrator chooses which worker to use for a specific sub-task. The worker attempts to complete the sub-task and return a result. The orchestrator then uses the result to choose the next worker to use until a final result is returned.
In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.
Orchestrator: Chooses which worker to use based on the user's investment strategy.
Research Agent: Searches the web for information about stocks and ETFs that could support the user's investment strategy.
Evaluation Agent: Evaluates the research report and provides feedback on what data is missing.
Portfolio Agent: Creates a portfolio of stocks and ETFs based on the research report.
evaluation_agent = Agent(
name="Evaluation Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
),
model="gpt-4.1",
output_type=EvaluationFeedback,
)
portfolio_agent = Agent(
name="Portfolio Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation."""
),
model="o4-mini",
output_type=Portfolio,
)
research_agent = Agent(
name="FinancialSearchAgent",
instructions=dedent(
"""You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
),
model="gpt-4.1",
tools=[WebSearchTool()],
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
orchestrator_agent = Agent(
name="Routing Agent",
instructions=dedent("""You are a senior financial analyst. You are trying to create a portfolio based on my stated investment strategy. Your task is to handoff to the appropriate agent or tool.
First, handoff to the research_agent to give you a report on stocks and ETFs that could support the user's stated investment strategy.
Then, handoff to the evaluation_agent to give you a score on the research report. If the evaluation_agent returns a needs_improvement or fail, continue using the research_agent to gather more information.
Once the evaluation_agent returns a pass, handoff to the portfolio_agent to create a portfolio."""),
model="gpt-4.1",
handoffs=[
research_agent,
evaluation_agent,
portfolio_agent,
],
)
This uses the following structured outputs.
class PortfolioItem(BaseModel):
ticker: str = Field(description="The ticker of the stock or ETF.")
allocation: float = Field(
description="The percentage allocation of the ticker in the portfolio. The sum of all allocations should be 100."
)
reason: str = Field(description="The reason why this ticker is included in the portfolio.")
class Portfolio(BaseModel):
tickers: list[PortfolioItem] = Field(
description="A list of tickers that could support the user's stated investment strategy."
)
class EvaluationFeedback(BaseModel):
feedback: str = Field(
description="What data is missing in order to create a portfolio of stocks and ETFs based on the user's investment strategy."
)
score: Literal["pass", "needs_improvement", "fail"] = Field(
description="A score on the research report. Pass if you have at least 5 tickers with data that supports the user's investment strategy to create a portfolio, needs_improvement if you do not have enough supporting data, and fail if you have no tickers."
)