Use Phoenix to trace and evaluate AutoGen agents
AutoGen is an open-source framework by Microsoft for building multi-agent workflows. The AutoGen agent framework provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.
Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.
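As a rough sketch, one way to set this up is to start a local Phoenix instance, register a tracer, and instrument the OpenAI client that AutoGen calls under the hood. The project name below is arbitrary, and the openinference-instrumentation-openai package is assumed to be installed:

```python
# Sketch: capture AutoGen's underlying LLM calls in Phoenix.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # launches the local Phoenix UI (http://localhost:6006 by default)

# Register a tracer provider pointed at Phoenix; "autogen-agents" is a placeholder project name.
tracer_provider = register(project_name="autogen-agents")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Any AutoGen agents created after this point will have their OpenAI calls traced.
```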
UserProxyAgent: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.
AssistantAgent: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.
GroupChat: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.
GroupChatManager: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.
Tool Integration: Agents can use external tools (e.g. Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.
Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
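As a minimal sketch of how these pieces fit together (the model name, prompts, and config values are placeholders; the OpenAI API key is read from the environment):

```python
# Sketch: the core AutoGen building blocks in a two-agent setup.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder model; API key from env

assistant = AssistantAgent(
    name="analyst",
    system_message="You analyze questions and answer concisely.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",     # fully automatic; use "ALWAYS" for human-in-the-loop
    code_execution_config=False,  # no local code execution in this sketch
)

# Two-agent chat: the UserProxyAgent initiates and relays messages.
user_proxy.initiate_chat(assistant, message="Summarize the key risks of multi-agent systems.")

# For multi-agent coordination, wrap agents in a GroupChat run by a GroupChatManager.
groupchat = GroupChat(agents=[user_proxy, assistant], messages=[], max_round=6)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
```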
A few design considerations to watch for:
Agent Roles: Poorly defined responsibilities can cause overlap or miscommunication, especially in multi-agent workflows.
Termination Conditions: A GroupChat may continue even after a logical end, as the UserProxyAgent can exhaust all allowed turns before stopping unless termination is explicitly triggered (see the sketch below).
Human-in-the-Loop: Fully autonomous mode may miss important judgment calls without user oversight.
State Management: Excessive context can exceed token limits, while insufficient context breaks coherence.
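To make termination explicit, a common approach is to cap automatic replies and register a termination predicate on the UserProxyAgent. A minimal sketch, assuming agents are prompted to emit a literal TERMINATE marker when finished:

```python
# Sketch: explicit termination controls so chats stop before exhausting all turns.
from autogen import UserProxyAgent

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,  # hard cap on automatic replies
    # Stop as soon as an agent signals completion with a TERMINATE marker
    # (a prompt convention in this sketch, not a built-in behavior).
    is_termination_msg=lambda msg: "TERMINATE" in str(msg.get("content", "")),
    code_execution_config=False,
)
```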
Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.
AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent focuses on a specialized task, while a UserProxyAgent manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.
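A minimal sketch of such a chain, where the first agent's output is passed verbatim into the second prompt (agent names, prompts, and the model config are illustrative placeholders):

```python
# Sketch: a two-step prompt chain where step 1's output feeds step 2.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder config

trend_agent = AssistantAgent(
    name="trend_identifier",
    system_message="Identify three current market trends for the given industry.",
    llm_config=llm_config,
)
evaluation_agent = AssistantAgent(
    name="company_evaluator",
    system_message="Given a list of market trends, evaluate a company's strengths against them.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,  # one reply per step keeps the chain deterministic
    code_execution_config=False,
)

# Step 1: identify trends.
user_proxy.initiate_chat(trend_agent, message="Industry: wearable fitness devices")
trends = user_proxy.last_message(trend_agent)["content"]

# Step 2: feed step 1's output into the evaluation prompt.
user_proxy.initiate_chat(
    evaluation_agent,
    message=f"Trends:\n{trends}\n\nEvaluate Acme Wearables' strengths against these trends.",
)
```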
Notebook: Market Analysis Prompt Chaining Agent
The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.
How to evaluate: Ensure each step's output is passed as input to the next step and that the analysis builds logically across steps (e.g., do identified trends inform the company evaluation?)
Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis
Track total latency and token counts to see which steps cause inefficiencies
Ensure there are no redundant outputs or hallucinations in multi-step reasoning
Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.
AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.
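One minimal way to express this: the router replies with a label, and plain Python dispatches to the matching specialist. The category labels, prompts, and fallback below are illustrative choices, not AutoGen conventions.

```python
# Sketch: a router agent classifies the query, then Python routes it to a specialist.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder config

router = AssistantAgent(
    name="router",
    system_message=(
        "Classify the user query as exactly one of: BILLING, TECH_SUPPORT, PRODUCT_INFO. "
        "Reply with the label only."
    ),
    llm_config=llm_config,
)
specialists = {
    "BILLING": AssistantAgent(name="billing_agent", system_message="Resolve billing questions.", llm_config=llm_config),
    "TECH_SUPPORT": AssistantAgent(name="tech_agent", system_message="Troubleshoot technical issues.", llm_config=llm_config),
    "PRODUCT_INFO": AssistantAgent(name="product_agent", system_message="Answer product questions.", llm_config=llm_config),
}
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

query = "I was charged twice for my subscription last month."
user_proxy.initiate_chat(router, message=query)
label = user_proxy.last_message(router)["content"].strip().upper()

# Route to the specialist; fall back to product info if the label is unrecognized.
specialist = specialists.get(label, specialists["PRODUCT_INFO"])
user_proxy.initiate_chat(specialist, message=query)
```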
Notebook: Customer Service Routing Agent
We will build an intelligent customer service system designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent.
How to evaluate: Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)
Confirm that each query is routed to the appropriate specialized AssistantAgent
without ambiguity or misdirection
Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately
Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents
The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished and accurate final result.
AutoGen's GroupChat architecture is well suited to this pattern because it manages the conversational turns between the generator and evaluator agents. The GroupChatManager facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.
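A minimal sketch of the loop (the APPROVED marker, round limit, and prompts are illustrative choices rather than AutoGen conventions):

```python
# Sketch: generator/evaluator refinement inside a GroupChat.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder config

generator = AssistantAgent(
    name="Code_Generator",
    system_message="Write Python code for the requested task. Revise it whenever the reviewer gives feedback.",
    llm_config=llm_config,
)
evaluator = AssistantAgent(
    name="Code_Reviewer",
    system_message=(
        "Review the latest code for correctness, style, and documentation. "
        "Give specific, actionable feedback, or reply APPROVED if it meets all criteria."
    ),
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(name="user_proxy", human_input_mode="NEVER", code_execution_config=False)

groupchat = GroupChat(
    agents=[user_proxy, generator, evaluator],
    messages=[],
    max_round=8,  # cap the number of refinement turns
)
manager = GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config,
    # End the chat once the reviewer approves (prompt convention above).
    is_termination_msg=lambda msg: "APPROVED" in str(msg.get("content", "")),
)
user_proxy.initiate_chat(manager, message="Write a function that validates email addresses.")
```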
Notebook: Code Generator with Evaluation Loop
We'll use a Code_Generator agent to write Python code from requirements, and a Code_Reviewer agent to assess it for correctness, style, and documentation. This iterative GroupChat process improves code quality through a generation and review loop.
How to evaluate: Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)
Confirm that the generator incorporates feedback into meaningful revisions with each iteration
Track the number of iterations required to reach an acceptable or final version to assess efficiency
Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process
Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.
An agent orchestrator workflow simplifies this pattern through a central orchestrator (GroupChatManager) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but contributes only when its specific expertise is required.
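A minimal sketch (agent roles, prompts, and the round limit are placeholders) where the manager's LLM picks the next speaker based on context:

```python
# Sketch: orchestrator-style GroupChat with automatic speaker selection.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder config

flight_agent = AssistantAgent(name="flight_agent", system_message="You handle flight searches and recommendations.", llm_config=llm_config)
hotel_agent = AssistantAgent(name="hotel_agent", system_message="You handle hotel recommendations and bookings.", llm_config=llm_config)
activities_agent = AssistantAgent(name="activities_agent", system_message="You suggest local activities and itineraries.", llm_config=llm_config)
user_proxy = UserProxyAgent(name="traveler", human_input_mode="NEVER", code_execution_config=False)

groupchat = GroupChat(
    agents=[user_proxy, flight_agent, hotel_agent, activities_agent],
    messages=[],
    max_round=10,
    speaker_selection_method="auto",  # the manager's LLM decides who speaks next
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Plan a 3-day trip to Lisbon in June on a mid-range budget.")
```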
Notebook: Trip Planner Orchestrator Agent
We will build a dynamic travel planning assistant. A GroupChatManager coordinates specialized agents to adapt to the user's evolving travel needs.
How to evaluate: Ensure the orchestrator activates only relevant agents based on the current context or user need (e.g., flights, hotels, local activities)
Confirm that agents contribute meaningfully and only when their domain expertise is required
Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents
Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically
Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.
AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.
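For example, independent chats can be fanned out with the standard library's ThreadPoolExecutor; the product, section names, and prompts below are placeholders:

```python
# Sketch: run independent AutoGen chats concurrently; each task builds its own
# agent pair so no conversation state is shared across threads.
from concurrent.futures import ThreadPoolExecutor
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # placeholder config
SECTIONS = ["features", "value proposition", "target customer", "tagline"]

def generate_section(section: str) -> str:
    marketer = AssistantAgent(
        name=f"marketer_{section.replace(' ', '_')}",
        system_message=f"Write the {section} portion of a smartwatch product description.",
        llm_config=llm_config,
    )
    proxy = UserProxyAgent(
        name="proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=0,
        code_execution_config=False,
    )
    proxy.initiate_chat(marketer, message="Product: Aurora smartwatch with 10-day battery life.")
    return proxy.last_message(marketer)["content"]

# Fan out the four independent calls, then synthesize the results.
with ThreadPoolExecutor(max_workers=len(SECTIONS)) as pool:
    results = dict(zip(SECTIONS, pool.map(generate_section, SECTIONS)))

final_description = "\n\n".join(f"{name.title()}:\n{text}" for name, text in results.items())
```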
Notebook: Product Description Parallelization Agent
We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, results are synthesized together.
How to evaluate: Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)
Confirm that all outputs are successfully collected and synthesized into a cohesive final product description
Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution
Test with varying product types to assess generality and stability of the parallel workflow
SmolAgents is a lightweight Python library for composing tool-using, task-oriented agents. This guide outlines common agent workflows we've implemented—covering routing, evaluation loops, task orchestration, and parallel execution. For each pattern, we include an overview, a reference notebook, and guidance on how to evaluate agent quality.
While the API is minimal, centered on Agent, Task, and Tool, there are important tradeoffs and design constraints to be aware of.
API centered on Agent, Task, and Tool: Tools are just Python functions decorated with @tool. There's no centralized registry or schema enforcement, so developers must define conventions and structure on their own (see the sketch after this list).
Provides flexibility for orchestration: There is no retry mechanism or built-in workflow engine.
Supports evaluator-optimizer loops, routing, and fan-out/fan-in: Agents are composed rather than built-in abstractions, so you must implement the orchestration logic yourself.
Multi-agent support: There is no built-in support for collaboration structures like voting, planning, or debate.
Token-level streaming is not supported.
No state or memory management out of the box: Applications that require persistent state, such as conversations or multi-turn workflows, will need to integrate external storage (e.g., a vector database or key-value store).
There's no native memory or "trajectory" tracking between agents: handoffs between tasks are manual. This is workable in small systems, but may require added structure in more complex workflows.
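As a minimal sketch of the @tool pattern referenced above, assuming a recent smolagents release (the CodeAgent and HfApiModel class names reflect that release and may differ across versions):

```python
# Sketch: a tool is just a typed, documented Python function; the agent calls it by name.
from smolagents import CodeAgent, HfApiModel, tool

@tool
def keyword_count(text: str) -> int:
    """Count how many words a piece of text contains.

    Args:
        text: The text to analyze.
    """
    return len(text.split())

model = HfApiModel()  # defaults to a hosted model; swap in your own provider/credentials
agent = CodeAgent(tools=[keyword_count], model=model)
result = agent.run("How many words are in the sentence 'agents compose tools'?")
print(result)
```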
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
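A minimal sketch of such a chain, assuming a recent smolagents release (class names and prompts are placeholders):

```python
# Sketch: two-step chain where the first agent's output becomes the second agent's input.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
extractor = CodeAgent(tools=[], model=model)
summarizer = CodeAgent(tools=[], model=model)

resume = "Senior engineer with 8 years of Python, Kubernetes, and team leadership experience."

# Step 1: extract keywords from the resume.
keywords = extractor.run(f"Extract the key skills from this resume as a comma-separated list:\n{resume}")

# Step 2: summarize what those keywords suggest about the candidate.
summary = summarizer.run(f"Given these skills: {keywords}\nSummarize what they suggest about the candidate.")
print(summary)
```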
Notebook: Prompt Chaining with Keyword Extraction + Summarization
The agent first extracts keywords from a resume, then summarizes what those keywords suggest.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do summaries reflect the extracted keywords?)
Check if the intermediate step (e.g. keyword extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated agent, often using lightweight classification.
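A minimal sketch of the routing step, assuming a recent smolagents release (labels, prompts, and the fallback are illustrative):

```python
# Sketch: a lightweight classifier routes each profile to the matching pipeline.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
router = CodeAgent(tools=[], model=model)
pipelines = {
    "SOFTWARE": CodeAgent(tools=[], model=model),
    "PRODUCT": CodeAgent(tools=[], model=model),
    "DESIGN": CodeAgent(tools=[], model=model),
}

profile = "Candidate with 5 years of React work and strong UX prototyping experience."
label = str(router.run(
    f"Classify this candidate as exactly one of SOFTWARE, PRODUCT, DESIGN. Reply with the label only.\n{profile}"
)).strip().upper()

# Hand off to the matching evaluation pipeline; default to SOFTWARE if the label is unrecognized.
evaluation = pipelines.get(label, pipelines["SOFTWARE"]).run(
    f"Evaluate this candidate for the {label} track:\n{profile}"
)
```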
Notebook: Candidate Interview Router
The agent classifies candidate profiles into Software, Product, or Design categories, then hands them off to the appropriate evaluation pipeline.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right department for a given candidate?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors (e.g., ambiguous or mixed-signal profiles)
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
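A minimal sketch of the loop, assuming a recent smolagents release (the APPROVED convention, retry limit, and prompts are illustrative):

```python
# Sketch: generate, critique, and revise until approval or the retry limit is hit.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
generator = CodeAgent(tools=[], model=model)
evaluator = CodeAgent(tools=[], model=model)

task = "Write a short, respectful rejection email for a software engineering candidate."
draft = str(generator.run(task))

for _ in range(3):  # retry limit
    verdict = str(evaluator.run(
        "Review this rejection email for tone, clarity, and specificity. "
        f"Reply APPROVED if acceptable, otherwise list concrete revisions:\n{draft}"
    ))
    if "APPROVED" in verdict:
        break
    # Feed the critique back to the generator for another revision.
    draft = str(generator.run(
        f"Revise the email below using this feedback.\nFeedback: {verdict}\nEmail:\n{draft}"
    ))

print(draft)
```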
Notebook: Rejection Email Generator with Evaluation Loop
An agent writes a candidate rejection email. If the evaluator agent finds the tone or feedback lacking, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the message respectful, clear, and specific?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
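Since the library leaves orchestration to you, a minimal sketch is a hand-rolled coordinator that delegates to role-specific agents and composes their outputs (class names assume a recent smolagents release; roles and prompts are placeholders):

```python
# Sketch: orchestrator-workers, with plain Python acting as the coordinator.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
resume_reviewer = CodeAgent(tools=[], model=model)
culture_assessor = CodeAgent(tools=[], model=model)
decision_maker = CodeAgent(tools=[], model=model)

candidate = "Staff engineer, 10 years of distributed systems work, mentors junior developers."

# Delegate the subtasks to specialized workers.
resume_review = str(resume_reviewer.run(f"Review this resume for technical fit:\n{candidate}"))
culture_fit = str(culture_assessor.run(f"Assess culture fit based on this profile:\n{candidate}"))

# Compose the worker outputs into a final recommendation.
recommendation = decision_maker.run(
    "Combine these evaluations into a hire/no-hire recommendation with reasons.\n"
    f"Technical review: {resume_review}\nCulture fit: {culture_fit}"
)
print(recommendation)
```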
Notebook: Recruiting Evaluator Orchestrator
The orchestrator delegates resume review, culture fit assessment, and decision-making to different agents, then composes a final recommendation.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final recommendation align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components (e.g., technical fit vs. recommendation)
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
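A minimal sketch using asyncio, assuming a recent smolagents release where agent.run is synchronous, so each call is pushed to a worker thread with asyncio.to_thread (profiles and prompts are placeholders):

```python
# Sketch: fan out independent candidate reviews concurrently with asyncio.
import asyncio
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()
profiles = [
    "Backend engineer, 6 years of Go and Postgres.",
    "Product designer, strong portfolio in mobile UX.",
    "Data scientist, NLP and experimentation background.",
]

async def review_all(profiles: list[str]) -> list[str]:
    # One agent per task keeps runs independent of each other.
    tasks = [
        asyncio.to_thread(CodeAgent(tools=[], model=model).run, f"Review this candidate profile:\n{p}")
        for p in profiles
    ]
    return [str(result) for result in await asyncio.gather(*tasks)]

reviews = asyncio.run(review_all(profiles))
```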
Notebook:
Candidate reviews are distributed using asyncio, enabling faster batch processing without compromising output quality.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are profiles processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency