AutoGen
Use Phoenix to trace and evaluate AutoGen agents
AutoGen is an open-source framework from Microsoft for building multi-agent workflows. It provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.
Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.
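As a minimal sketch, assuming a locally running Phoenix server and AutoGen v0.2 (which calls the OpenAI Python client under the hood), tracing can be enabled via OpenInference instrumentation before any agents are constructed. The project name and endpoint below are illustrative defaults:

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Point traces at a local Phoenix server (project name and endpoint are examples).
tracer_provider = register(
    project_name="autogen-agents",
    endpoint="http://localhost:6006/v1/traces",
)

# AutoGen v0.2 uses the OpenAI Python client internally, so instrumenting that
# client captures every LLM call the agents make.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

Once instrumented, each LLM call made by your agents appears as a span in the Phoenix UI, grouped under the project you registered.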
AutoGen Core Concepts
UserProxyAgent: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.
AssistantAgent: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.
GroupChat: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.
GroupChatManager: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.
Tool Integration: Agents can use external tools (e.g., Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.
Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
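To make these pieces concrete, here is a minimal sketch of the two core agents in conversation (the model name, API-key handling, and message are placeholders):

```python
import os
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]}

# The assistant performs the task; the user proxy drives the conversation.
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",      # fully autonomous; use "ALWAYS" for human-in-the-loop
    code_execution_config=False,   # disable local code execution for this example
    max_consecutive_auto_reply=2,
)

user_proxy.initiate_chat(assistant, message="Summarize the key trends in wearable tech.")
```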
Design Considerations and Limitations
Agent Roles
Poorly defined responsibilities can cause overlap or miscommunication between agents, especially in multi-agent workflows.
Termination Conditions
A GroupChat may continue even after a logical end, as the UserProxyAgent can exhaust all allowed turns before stopping unless termination is explicitly triggered (see the sketch after this list).
Human-in-the-Loop
Fully autonomous mode may miss important judgment calls without user oversight.
State Management
Excessive context can exceed token limits, while insufficient context breaks coherence.
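As noted under Termination Conditions above, stopping behavior is worth configuring explicitly. A minimal sketch of the two main controls (the TERMINATE-suffix convention matches the default instruction in AutoGen's built-in assistant prompt):

```python
from autogen import UserProxyAgent

user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
    max_consecutive_auto_reply=5,  # hard cap on automatic turns
    # Stop when the other agent signals completion by ending with "TERMINATE".
    is_termination_msg=lambda msg: (msg.get("content") or "").rstrip().endswith("TERMINATE"),
)
```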
Prompt Chaining

Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.
AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent focuses on a specialized task, while a UserProxyAgent manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.
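A minimal sketch of such a chain, assuming pyautogen v0.2 (agent names, prompts, and the example company are illustrative; each step's output is read from the ChatResult.summary field):

```python
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

trend_agent = AssistantAgent(
    "trend_analyst",
    system_message="Identify the top market trends for the given industry.",
    llm_config=llm_config,
)
company_agent = AssistantAgent(
    "company_analyst",
    system_message="Given a list of market trends, evaluate a company's strengths against them.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)

# Step 1: identify trends.
step1 = user_proxy.initiate_chat(trend_agent, message="Industry: wearable fitness devices", max_turns=1)
trends = step1.summary

# Step 2: the output of step 1 becomes part of the next prompt.
step2 = user_proxy.initiate_chat(
    company_agent,
    message=f"Trends:\n{trends}\n\nEvaluate Acme Fitness Inc. against these trends.",
    max_turns=1,
)
print(step2.summary)
```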
Notebook: Market Analysis Prompt Chaining Agent
The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.
How to evaluate:
Ensure outputs are passed into the inputs of the next step and logically build across steps (e.g., do identified trends inform the company evaluation?)
Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis
Track total latency and token counts to see which steps cause inefficiencies
Ensure there are no redundant outputs or hallucinations in multi-step reasoning
Routing

Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.
AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.
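One way to sketch this pattern (the category labels, specialist prompts, and fallback choice are illustrative assumptions, not a fixed API):

```python
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

router = AssistantAgent(
    "router",
    system_message="Classify the user's query as exactly one of: BILLING, TECH_SUPPORT, PRODUCT_INFO. Reply with the label only.",
    llm_config=llm_config,
)
specialists = {
    "BILLING": AssistantAgent("billing_agent", system_message="You resolve billing questions.", llm_config=llm_config),
    "TECH_SUPPORT": AssistantAgent("tech_agent", system_message="You resolve technical issues.", llm_config=llm_config),
    "PRODUCT_INFO": AssistantAgent("product_agent", system_message="You answer product questions.", llm_config=llm_config),
}
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)

query = "I was charged twice this month."

# The router signals its classification decision as a single label.
label = user_proxy.initiate_chat(router, message=query, max_turns=1).summary.strip().upper()

# Route to the matching specialist; fall back to product info on an unexpected label.
specialist = specialists.get(label, specialists["PRODUCT_INFO"])
user_proxy.initiate_chat(specialist, message=query, max_turns=1)
```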
Notebook: Customer Service Routing Agent
We will build an intelligent customer service system designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent.
How to evaluate:
Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)
Confirm that each query is routed to the appropriate specialized AssistantAgent without ambiguity or misdirection
Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately
Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents
Evaluator–Optimizer Loop

The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished and accurate final result.
AutoGen's GroupChat architecture is well suited to this pattern because it can manage the conversational turns between the generator and evaluator agents. The GroupChatManager facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.
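A condensed sketch of the loop (the APPROVED sentinel, round limit, and prompts are illustrative choices, not fixed API behavior):

```python
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

generator = AssistantAgent(
    "Code_Generator",
    system_message="Write Python code that satisfies the requirements. Revise based on review feedback.",
    llm_config=llm_config,
)
evaluator = AssistantAgent(
    "Code_Reviewer",
    system_message="Review the code for correctness, style, and documentation. Give actionable feedback, or reply APPROVED when all criteria are met.",
    llm_config=llm_config,
)

# Round-robin turn-taking alternates generator -> reviewer until the reviewer
# approves or max_round is reached.
groupchat = GroupChat(agents=[generator, evaluator], messages=[], max_round=8, speaker_selection_method="round_robin")
manager = GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config,
    is_termination_msg=lambda msg: "APPROVED" in (msg.get("content") or ""),
)

user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)
user_proxy.initiate_chat(manager, message="Write a function that parses ISO-8601 dates.")
```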
Notebook: Code Generator with Evaluation Loop
We'll use a Code_Generator agent to write Python code from requirements, and a Code_Reviewer agent to assess it for correctness, style, and documentation. This iterative GroupChat process improves code quality through a generation and review loop.
How to evaluate:
Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)
Confirm that the generator incorporates feedback into meaningful revisions with each iteration
Track the number of iterations required to reach an acceptable or final version to assess efficiency
Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process
Orchestrator Pattern

Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.
AutoGen simplifies this routing pattern through a central orchestrator (GroupChatManager) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but contributes only when its specific expertise is required.
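A minimal sketch of this orchestration (agent names and prompts are illustrative; speaker_selection_method="auto" is AutoGen's default LLM-driven speaker selection):

```python
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

flights = AssistantAgent("flights_agent", system_message="You handle flight search and booking questions.", llm_config=llm_config)
hotels = AssistantAgent("hotels_agent", system_message="You handle hotel recommendations and bookings.", llm_config=llm_config)
activities = AssistantAgent("activities_agent", system_message="You suggest local activities and itineraries.", llm_config=llm_config)
user_proxy = UserProxyAgent("traveler", human_input_mode="NEVER", code_execution_config=False)

# With "auto" selection, the manager's LLM picks the next speaker based on the
# conversation state, rather than following a fixed order.
groupchat = GroupChat(
    agents=[user_proxy, flights, hotels, activities],
    messages=[],
    max_round=12,
    speaker_selection_method="auto",
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="Plan a 3-day trip to Lisbon: flights from NYC, a mid-range hotel, and food-focused activities.",
)
```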
Notebook: Trip Planner Orchestrator Agent
We will build a dynamic travel planning assistant. A GroupChatManager coordinates specialized agents to adapt to the user's evolving travel needs.
How to evaluate:
Ensure the orchestrator activates only relevant agents based on the current context or user need (e.g., flights, hotels, local activities)
Confirm that agents contribute meaningfully and only when their domain expertise is required
Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents
Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically
Parallel Agent Execution

Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.
AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.
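For instance, asyncio can fan out independent agent calls using AutoGen's async a_initiate_chat (a sketch; component names and prompts are illustrative, and a fresh proxy per task is one way to keep concurrent chat histories independent):

```python
import asyncio
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "..."}]}

marketing = AssistantAgent(
    "marketing_agent",
    system_message="You write one concise section of a product description, as requested.",
    llm_config=llm_config,
)

components = ["features", "value proposition", "target customer", "tagline"]

async def generate(i: int, component: str) -> str:
    # A separate proxy per task keeps each concurrent chat's history independent.
    proxy = UserProxyAgent(f"proxy_{i}", human_input_mode="NEVER", code_execution_config=False)
    result = await proxy.a_initiate_chat(
        marketing,
        message=f"Write the {component} section for a new smartwatch.",
        max_turns=1,
    )
    return result.summary

async def main():
    # asyncio.gather runs the four independent calls concurrently.
    sections = await asyncio.gather(*[generate(i, c) for i, c in enumerate(components)])
    print("\n\n".join(sections))  # a final synthesis step would combine these

asyncio.run(main())
```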
Notebook: Product Description Parallelization Agent
We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, the results are synthesized together.
How to evaluate:
Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)
Confirm that all outputs are successfully collected and synthesized into a cohesive final product description
Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution
Test with varying product types to assess generality and stability of the parallel workflow