Workflows are the backbone of many successful LLM applications. They define how language models interact with tools, data, and users—often through a sequence of clearly orchestrated steps. Unlike fully autonomous agents, workflows offer structure and predictability, making them a practical choice for many real-world tasks.
In this guide, we share practical workflows built with a variety of agent frameworks, including AutoGen, CrewAI, and LangGraph.
Each section highlights how to use these tools effectively—showing what’s possible, where they shine, and where a simpler solution might serve you better. Whether you're orchestrating deterministic workflows or building dynamic agentic systems, the goal is to help you choose the right tool for your context and build with confidence.
Agent Routing is the process of directing a task, query, or request to the most appropriate agent based on context or capabilities. In multi-agent systems, it helps determine which agent is best suited to handle a specific input based on skills, domain expertise, or available tools. This enables more efficient, accurate, and specialized handling of complex tasks.
Prompt Chaining is the technique of breaking a complex task into multiple steps, where the output of one prompt becomes the input for the next. This allows a system to reason more effectively, maintain context across steps, and handle tasks that would be too difficult to solve in a single prompt. It's often used to simulate multi-step thinking or workflows.
Parallelization is the process of dividing a task into smaller, independent parts that can be executed simultaneously to speed up processing. It’s used to handle multiple inputs, computations, or agent responses at the same time rather than sequentially. This improves efficiency and speed, especially for large-scale or time-sensitive tasks.
An orchestrator is a central controller that manages and coordinates multiple components, agents, or processes to ensure they work together smoothly.
It decides what tasks need to be done, who or what should do them, and in what order. An orchestrator can handle things like scheduling, routing, error handling, and result aggregation. It might also manage prompt chains, route tasks to agents, and oversee parallel execution.
An evaluator assesses the quality or correctness of outputs, such as ranking responses, checking for factual accuracy, or scoring performance against a metric. An optimizer uses that evaluation to improve future outputs, either by fine-tuning models, adjusting parameters, or selecting better strategies. Together, they form a feedback loop that helps a system learn what works and refine itself over time.
For a deeper dive into the principles behind agentic systems and when to use them, see the linked guide.
gRPC and HTTP are communication protocols used to transfer data between client and server applications.
HTTP (Hypertext Transfer Protocol) is a stateless protocol primarily used for website and web application requests over the internet.
gRPC (Remote Procedure Call) is a modern, open-source communication protocol from Google that uses HTTP/2 for transport, protocol buffers as the interface description language, and provides features like bi-directional streaming, multiplexing, and flow control.
gRPC is more efficient in a tracing context than HTTP, but HTTP is more widely supported.
Phoenix can send traces over either HTTP or gRPC.
Retrieval Evals are designed to evaluate the effectiveness of retrieval systems. Retrieval systems typically return a list of chunks of length k, ordered by relevance. The most common retrieval systems in the LLM ecosystem are vector DBs.
The retrieval Eval is designed to assess the relevance of each chunk and its ability to answer the question. More information on the retrieval Eval can be found in the Retrieval Evals documentation.
The picture above shows a single query returning k=4 chunks as a list. The retrieval Eval runs across each chunk, returning a list of relevance values, one per chunk. Phoenix provides helper functions that take in a dataframe with a query column and a column containing lists of chunks, and produce a column of equal-length lists with an Eval for each chunk.
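For illustration, here is one way to produce per-chunk relevance labels with llm_classify by exploding the chunk lists into one row per chunk. The column names and model choice are placeholders, and parameter names may vary slightly across Phoenix versions:

```python
import pandas as pd

from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# One row per query, with a list of retrieved chunks per row (illustrative schema).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": [["Phoenix is an open-source observability tool...", "Unrelated text..."]],
    }
)

# Explode to one row per (query, chunk) pair so each chunk gets its own relevance Eval.
exploded = df.explode("reference").reset_index(names="query_id")

relevance = llm_classify(
    dataframe=exploded,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
)

# Re-group so each query ends up with a list of per-chunk labels, mirroring its list of chunks.
exploded["relevance"] = relevance["label"].values
per_query_evals = exploded.groupby("query_id")["relevance"].agg(list)
print(per_query_evals)
```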
We update the Phoenix version used by Phoenix Cloud on a weekly basis.
You can persist data in the notebook by setting the use_temp_dir flag to false in px.launch_app, which will persist your data in SQLite on your disk at the PHOENIX_WORKING_DIR. Alternatively, you can deploy a Phoenix instance and point to it via PHOENIX_COLLECTOR_ENDPOINT.
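As a minimal sketch (the paths and hostnames below are placeholders):

```python
import os

import phoenix as px

# Option 1: persist to SQLite on disk instead of a temporary directory.
os.environ["PHOENIX_WORKING_DIR"] = "/path/to/phoenix/data"  # illustrative path
session = px.launch_app(use_temp_dir=False)

# Option 2: point the notebook at a separately deployed Phoenix instance.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://my-phoenix-host:6006"  # illustrative host
```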
NOT_PARSABLE errors often occur when LLM responses exceed the max_tokens limit or produce incomplete JSON. Here's how to fix it:
Increase max_tokens: Update the model configuration as follows:
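For example, a sketch using Phoenix's OpenAIModel; the dataframe, template, rails, and token limit below are placeholders for your own eval setup:

```python
from phoenix.evals import OpenAIModel, llm_classify

# Raise max_tokens so the judge model can emit complete JSON (value is illustrative).
model = OpenAIModel(model="gpt-4o", max_tokens=1024)

evals_df = llm_classify(
    dataframe=df,          # your eval dataframe
    template=MY_TEMPLATE,  # the eval template you are running
    model=model,
    rails=["relevant", "irrelevant"],
)
```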
Update Phoenix: Use version ≥0.17.4, which removes token limits for OpenAI and increases defaults for other APIs.
Check Logs: Look for finish_reason="length" to confirm token limits caused the issue.
If the above doesn't work, it's possible the LLM-as-a-judge output doesn't fit into the defined rails for that particular custom Phoenix eval. Double-check that the prompt output matches the rail expectations.
LlamaTrace and Phoenix Cloud are the same tool. They are the hosted version of Phoenix provided on app.phoenix.arize.com.
Yes, in fact this is probably the preferred way to interact with OpenAI if your enterprise requires data privacy. Getting the parameters right for Azure can be a bit tricky so check out the models section for details.
Arize is the company that makes Phoenix. Phoenix is an open-source LLM observability tool offered by Arize. It can be accessed in its Cloud form online, or self-hosted and run on your own machine or server.
It can be hard to understand why an LLM responds in a specific way. The explanation feature of Phoenix allows you to get an Eval output and an explanation from the LLM at the same time. We have found this incredibly useful for debugging LLM Evals.
The flag above can be set with any of the templates or your own custom templates. The example below is from a relevance Evaluation.
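For example, passing provide_explanation=True to llm_classify returns an explanation column alongside each label. The dataframe and model below are placeholders:

```python
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

relevance_evals = llm_classify(
    dataframe=df,  # dataframe with the columns the template expects (e.g. input/reference)
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # adds an "explanation" column alongside the label
)
```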
Evaluating tasks performed by LLMs can be difficult due to their complexity and the diverse criteria involved. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) often fall short when applied to the nuanced and varied outputs of LLMs.
For instance, an AI assistant’s answer to a question can be:
not grounded in context
repetitive, repetitive, repetitive
grammatically incorrect
excessively lengthy and characterized by an overabundance of words
incoherent
The list of criteria goes on. And even if we had a limited list, each of these would be hard to measure.
To overcome this challenge, the concept of "LLM as a Judge" employs an LLM to evaluate another's output, combining human-like assessment with machine efficiency.
Here’s the step-by-step process for using an LLM as a judge:
Craft Your Evaluation Prompt - Write a prompt template that will guide the evaluation. This template should clearly define what variables are needed from both the initial prompt and the LLM's response to effectively assess the output.
Select an Evaluation LLM - Choose the most suitable LLM from our available options for conducting your specific evaluations.
Generate Evaluations and View Results - Execute the evaluations across your data. This process allows for comprehensive testing without the need for manual annotation, enabling you to iterate quickly and refine your LLM's prompts.
Using an LLM as a judge significantly enhances the scalability and efficiency of the evaluation process. By employing this method, you can run thousands of evaluations across curated data without the need for human annotation.
This capability will not only speed up the iteration process for refining your LLM's prompts but will also ensure that you can deploy your models to production with confidence.
There are two endpoints that matter in Phoenix:
Application Endpoint: The endpoint your Phoenix instance is running on
OTEL Tracing Endpoint: The endpoint through which your Phoenix instance receives OpenTelemetry traces
If you're accessing a Phoenix Cloud instance through our website, then your endpoint is https://app.phoenix.arize.com
If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default value is http://localhost:6006
To set this endpoint, use the PHOENIX_COLLECTOR_ENDPOINT environment variable. This is used by the Phoenix client package to query traces, log annotations, and retrieve prompts.
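A minimal sketch; the header-based API key shown for Phoenix Cloud is an assumption, so adjust it to your deployment:

```python
import os

import phoenix as px

# Point the Phoenix client at your instance (Phoenix Cloud shown; use your own host if self-hosting).
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=YOUR_API_KEY"  # assumption: Cloud auth via headers

client = px.Client()
spans_df = client.get_spans_dataframe()
```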
If you're accessing a Phoenix Cloud instance through our website, then your OTEL tracing endpoint is https://app.phoenix.arize.com/v1/traces
If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default values are:
Using the HTTP protocol: http://localhost:6006/v1/traces
Using the gRPC protocol: http://localhost:4317
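For example, with the phoenix.otel register helper, choose the endpoint that matches your deployment and protocol:

```python
from phoenix.otel import register

# Self-hosted defaults (pick the one your collector is configured for):
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")  # OTLP over HTTP
# tracer_provider = register(endpoint="http://localhost:4317")          # OTLP over gRPC
```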
The LLM Evals library is designed to support the building of any custom Eval templates.
Follow these steps to build your own Eval with Phoenix.
To do that, you must first identify the metric best suited for your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?
Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the “ground truth” label so that we can measure the performance of the LLM eval template. Often such labels come from human feedback.
Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above).
The Eval inferences are designed for easy benchmarking and come as pre-set, downloadable test inferences. The inferences are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.
Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
Now comes the core component that we are trying to benchmark and improve: the eval template.
You can adjust an existing template or build your own from scratch.
Be explicit about the following:
What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
To create a new template, all that is needed is to set the input string for the Eval function.
The above template shows an example of creating an easy-to-use string template. The Phoenix Eval templates support both strings and objects.
The above example shows a use of the custom created template on the df dataframe.
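As an illustration, a custom string template and its use might look like this; the template text, column names, and rails are entirely your own:

```python
from phoenix.evals import OpenAIModel, llm_classify

# A custom string template; {query} and {document} must match columns in your dataframe.
MY_CUSTOM_TEMPLATE = """
You are evaluating whether a document is relevant to a user's question.

[Question]: {query}
[Document]: {document}

Answer with a single word, either "relevant" or "irrelevant".
"""

evals_df = llm_classify(
    dataframe=df,  # must contain "query" and "document" columns
    template=MY_CUSTOM_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["relevant", "irrelevant"],
)
```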
You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
Currently, Phoenix Cloud accounts are set up to be used by a single developer. We will be adding ways to share your traces with other developers on your team shortly!
The components behind tracing
An exporter takes the spans created via instrumentation and exports them to a collector. In simple terms, it just sends the data to Phoenix. When using Phoenix, most of this is done under the hood when you call instrument on an instrumentor.
The Phoenix server is a collector and a UI that helps you troubleshoot your application in real time. When you run Phoenix (e.g., px.launch_app() or a container), it starts receiving spans from any application(s) exporting spans to it.
OpenTelemetry Protocol (or OTLP for short) is the means by which traces arrive from your application at the Phoenix collector. Phoenix supports OTLP over both HTTP and gRPC.
If you are working on an API whose endpoints perform RAG, but would like the Phoenix server not to be launched as another thread, you can run Phoenix in a separate process or container and point your application to it.
If you want to contribute to the cutting edge of LLM and ML Observability, you've come to the right place!
To get started, please check out the following:
In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.
A Core reviewer will review your PR in around one business day and provide feedback on any changes it requires to be approved. Once approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.
Your PR is now merged into Phoenix! We’ll shout out your contribution in the release notes.
There are multiple types of evaluations supported by the Phoenix library. Each category of evaluation is distinguished by its output type.
Categorical (binary) - The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.
Categorical (Multi-class) - The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.
Score - The evaluation result is a numeric value within a set range (e.g., 1-10), offering a scale of measurement.
Although score evals are an option in Phoenix, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.
Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.
"Arize" can also refer to Arize's enterprise platform, often called Arize AX, available on arize.com. Arize AX is the enterprise SaaS version of Phoenix that comes with additional features like Copilot, ML and CV support, HIPAA compliance, Security Reviews, a customer success team, and more. See of the two tools.
Identify Evaluation Criteria - First, determine what you want to evaluate, be it hallucination, toxicity, accuracy, or another characteristic. See our evaluation templates for examples of what can be assessed.
To set this endpoint, use the register(endpoint=YOUR ENDPOINT) function. This endpoint can also be set using environment variables. For more on the register function and other configuration options, see the configuration documentation.
With SageMaker notebooks, Phoenix leverages the jupyter-server-proxy to host the server under proxy/6006.
Note that Phoenix will automatically try to detect that you are running in SageMaker, but you can declare the notebook runtime via a parameter to launch_app or an environment variable.
Phoenix supports multiple users with authentication, roles, and more.
In order for an application to emit traces for analysis, the application must be instrumented. Your application can be manually or automatically instrumented. With Phoenix, there is a set of plugins (instrumentors) that can be added to your application's startup process to perform auto-instrumentation. These plugins collect spans for your application and export them for collection and visualization. For Phoenix, all the instrumentors are managed via a single repository called OpenInference. The comprehensive list of instrumentors can be found in the how-to guide.
You can do this by configuring the PHOENIX_COLLECTOR_ENDPOINT environment variable to point to the server running in a different process or container.
We encourage you to start with an issue labeled with the good first issue tag on the GitHub issue board, to get familiar with our codebase as a first-time contributor.
To submit your code, fork the repository, create a branch on your fork, and open a pull request once your work is ready for review.
To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.
Phoenix natively supports gRPC for trace collection as of the 4.0 release. See the configuration docs for details.
Use Phoenix to trace and evaluate AutoGen agents
Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.
UserProxyAgent: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.
AssistantAgent: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.
GroupChat: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.
GroupChatManager: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.
Tool Integration: Agents can use external tools (e.g. Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.
Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
| Consideration | Notes |
| --- | --- |
| Agent Roles | Poorly defined responsibilities can cause overlap or miscommunication, especially in multi-agent workflows. |
| Termination Conditions | GroupChat may continue even after a logical end, as UserProxyAgent can exhaust all allowed turns before stopping unless termination is explicitly triggered. |
| Human-in-the-Loop | Fully autonomous mode may miss important judgment calls without user oversight. |
| State Management | Excessive context can exceed token limits, while insufficient context breaks coherence. |
Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.
AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent focuses on a specialized task, while a UserProxyAgent manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.
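A minimal sketch of such a chain using the classic pyautogen API; agent names, prompts, and the model config are illustrative, and Phoenix/OpenInference instrumentation would be added per the tracing docs to capture the spans:

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}]}

trend_analyst = autogen.AssistantAgent(
    name="trend_analyst",
    system_message="Identify the key market trends for the given industry.",
    llm_config=llm_config,
)
company_evaluator = autogen.AssistantAgent(
    name="company_evaluator",
    system_message="Given a list of market trends, evaluate a company's strengths against them.",
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Step 1: identify general trends.
user_proxy.initiate_chat(trend_analyst, message="Industry: wearable fitness devices", max_turns=1)
trends = user_proxy.last_message(trend_analyst)["content"]

# Step 2: feed the trends into the next prompt in the chain.
user_proxy.initiate_chat(
    company_evaluator,
    message=f"Market trends:\n{trends}\n\nEvaluate Acme Wearables' strengths against these trends.",
    max_turns=1,
)
evaluation = user_proxy.last_message(company_evaluator)["content"]
print(evaluation)
```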
Notebook: Market Analysis Prompt Chaining Agent
The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.
How to evaluate: Ensure outputs are passed into the inputs of the next step and build logically across steps (e.g., do identified trends inform the company evaluation?)
Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis
Track total latency and token counts to see which steps cause inefficiencies
Ensure there are no redundant outputs or hallucinations in multi-step reasoning
Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.
AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.
Notebook: Customer Service Routing Agent
We will build an intelligent customer service system, designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent.
How to evaluate: Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)
Confirm that each query is routed to the appropriate specialized AssistantAgent without ambiguity or misdirection
Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately
Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents
The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished & accurate final result.
AutoGen's GroupChat architecture is good for implementing this pattern because it can manage the conversational turns between the generator and evaluator agents. The GroupChatManager facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.
Notebook: Code Generator with Evaluation Loop
We'll use a Code_Generator agent to write Python code from requirements, and a Code_Reviewer agent to assess it for correctness, style, and documentation. This iterative GroupChat process improves code quality through a generation and review loop.
How to evaluate: Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)
Confirm that the generator incorporates feedback into meaningful revisions with each iteration
Track the number of iterations required to reach an acceptable or final version to assess efficiency
Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process
Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.
Agent orchestrator workflows simplify this routing pattern through a central orchestrator (GroupChatManager) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but only contributes when its specific expertise is required.
Notebook: Trip Planner Orchestrator Agent
We will build a dynamic travel planning assistant. A GroupChatManager coordinates specialized agents to adapt to the user's evolving travel needs.
How to evaluate: Ensure the orchestrator activates only relevant agents based on the current context or user need (e.g., flights, hotels, local activities)
Confirm that agents contribute meaningfully and only when their domain expertise is required
Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents
Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically
Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.
AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.
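For example, a sketch using concurrent.futures to fan out independent agent calls; the agent setup and prompts are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import autogen

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}]}

def run_marketing_task(task: str) -> str:
    """Run one independent agent interaction and return its final message."""
    marketer = autogen.AssistantAgent(name="marketer", llm_config=llm_config)
    user_proxy = autogen.UserProxyAgent(
        name="user_proxy", human_input_mode="NEVER", code_execution_config=False
    )
    user_proxy.initiate_chat(marketer, message=task, max_turns=1)
    return user_proxy.last_message(marketer)["content"]

tasks = [
    "Write the key features section for a smartwatch.",
    "Write the value proposition for a smartwatch.",
    "Describe the target customer for a smartwatch.",
    "Write a tagline for a smartwatch.",
]

# Launch the independent agent calls concurrently, then gather and synthesize the results.
with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
    sections = list(executor.map(run_marketing_task, tasks))

final_description = "\n\n".join(sections)
print(final_description)
```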
Notebook: Product Description Parallelization Agent
We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, the results are synthesized together.
How to evaluate: Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)
Confirm that all outputs are successfully collected and synthesized into a cohesive final product description
Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution
Test with varying product types to assess generality and stability of the parallel workflow
There are varying degrees of how we can evaluate retrieval systems.
Step 1: First we care if the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?
Step 2: Once we know there's an issue, then we need metrics to trace where specifically did it go wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?
Step 3: If retrieval is not the issue, we should check if we even have the right documents to answer the question.
Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.
When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.
Phoenix surfaces up clusters of similar queries that have poor feedback.
Phoenix can help uncover when irrelevant context is being retrieved using the LLM Evals for Relevance. You can look at a cluster's aggregate relevance metric with precision @k, NDCG, MRR, etc to identify where to improve. You can also look at a single prompt/response pair and see the relevance of documents.
Phoenix can help you identify if there is context missing from your knowledge base. By visualizing query density, you can understand which topics you need additional documentation for in order to improve your chatbot's responses.
By setting the "primary" dataset as the user queries, and the "corpus" dataset as the context I have in my vector store, I can see if there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.
Looking for code to get started? Go to our Quickstart guide for Search and Retrieval.
Benchmarking Chunk Size, K and Retrieval Approach
The advent of LLMs is causing a rethinking of the possible architectures of retrieval systems that have been around for decades.
The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data, empowering the LLM to know your data and respond based on the private data you feed into the context window.
As teams set up their retrieval systems, understanding performance and configuring the parameters around RAG (type of retrieval, chunk size, and K) is currently a guessing game for most teams.
The above picture shows a typical retrieval architecture designed for RAG, consisting of a vector DB, an LLM, and an optional framework.
This section will go through a script that iterates through all possible parameterizations of setting up a retrieval system and uses Evals to understand the trade-offs.
This overview will run through the scripts in phoenix for performance analysis of RAG setup:
The scripts above power the included notebook.
The typical flow of retrieval is a user query is embedded and used to search a vector store for chunks of relevant data.
The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar but not usable to answer the question!
The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question of "Does the chunk of data contain relevant information to answer the question"?
The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.
The Evals generated on each chunk can then be used to generate more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at traditional search and retrieval metrics such as:
MRR
Precision @ K
NDCG
These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.
These metrics can be used overall, by cluster (UMAP), or on individual decisions, making them very powerful to track down problems from the simplest to the most complex.
Retrieval Evals only give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not indicate whether the final answer was correct.
The Q&A Evals work to give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares the most about and is one of the most important metrics.
The above Eval shows how the query, chunks and answer are used to create an overall assessment of the entire system.
The above Q&A Eval shows how the Query, Chunk and Answer are used to generate a % incorrect for production evaluations.
The results from the runs will be available in the experiment_data/ directory. Underneath experiment_data there are two sets of metrics:
The first set of results, experiment_data/results_zero_removed, removes the cases where there are 0 retrieved relevant documents. Some clients' test sets have a large number of questions that the documents cannot answer, which can skew the metrics a lot.
The second set of results, experiment_data/results_zero_not_removed, is unfiltered and shows the raw metrics for every retrieval.
The above picture shows the results of benchmark sweeps across your retrieval system setup. The lower the percent the better the results. This is the Q&A Eval.
The above graphs show MRR results across a sweep of different chunk sizes.
Yes, you can use either of the two methods below.
Install pyngrok on the remote machine using the command pip install pyngrok.
In the Jupyter notebook, after launching Phoenix, set its port number as the port parameter in the code below. Preferably use a default port for Phoenix so that you won't have to set up a new ngrok tunnel every time for a new port; simply restarting Phoenix will work with the same ngrok URL.
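A minimal pyngrok sketch, assuming Phoenix runs on its default port 6006:

```python
import phoenix as px
from pyngrok import ngrok

# Launch Phoenix on its default port (6006) so the ngrok tunnel can be reused across restarts.
px.launch_app()

port = 6006  # set this to the port Phoenix is actually running on
public_url = ngrok.connect(port).public_url
print(public_url)  # open this URL in your browser
```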
"Visit Site" using the newly printed public_url
and ignore warnings, if any.
An ngrok free account does not allow more than 3 tunnels over a single ngrok agent session. Tackle this error by checking active URL tunnels using ngrok.get_tunnels() and closing the required URL tunnel using ngrok.disconnect(public_url).
This assumes you have already set up ssh on both the local machine and the remote server.
If you are accessing a remote Jupyter notebook from a local machine, you can also access the Phoenix app by forwarding a local port to the remote server via ssh. In this particular case of using Phoenix on a remote server, it is recommended that you use a default port for launching Phoenix, say DEFAULT_PHOENIX_PORT.
Launch the phoenix app from jupyter notebook.
In a new terminal or command prompt, forward a local port of your choice from 49152 to 65535 (say 52362) using the command below. The remote user of the remote host must have sufficient port-forwarding/admin privileges.
If you are abruptly unable to access phoenix, check whether the ssh connection is still alive by inspecting the terminal. You can also try increasing the ssh timeout settings.
Simply run exit in the terminal/command prompt where you ran the port forwarding command.
AutoGen is an open-source framework by Microsoft for building multi-agent workflows. The AutoGen agent framework provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.
Possibly the most common use case for creating an LLM application is to connect an LLM to proprietary data such as enterprise documents or video transcriptions. Applications such as these are often built on top of LLM frameworks such as LangChain or LlamaIndex, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot looking across a company's knowledge base/context to answer specific questions.
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster.
Sign up on ngrok and verify your email. Find 'Your Authtoken' on the ngrok dashboard.
If successful, visit localhost:6006 to access Phoenix locally.
| Question | Metric | Pro | Con |
| --- | --- | --- | --- |
| Is this a bad response to the answer? | User feedback or LLM Eval for Q&A | Most relevant way to measure application | Hard to trace down specifically what to fix |
| Is the retrieved context relevant? | LLM Eval for Relevance | Directly measures effectiveness of retrieval | Requires additional LLM calls |
| Is the knowledge base missing areas of user queries? | Query density (drift) - Phoenix generated | Highlights groups of queries with large distance from context | Identifies broad topics missing from knowledge base, but not small gaps |
Use Phoenix to trace and evaluate different CrewAI agent patterns
Agents are autonomous, role-driven entities designed to perform specific functions—like a Researcher, Writer, or Support Rep. They can be richly customized with goals, backstories, verbosity settings, delegation permissions, and access to tools. This flexibility makes agents expressive and task-aware, helping model real-world team dynamics.
Tasks are the atomic units of work in CrewAI. Each task includes a description, expected output, responsible agent, and optional tools. Tasks can be executed solo or collaboratively, and they serve as the bridge between high-level goals and actionable steps.
Tools give agents capabilities beyond language generation—such as browsing the web, fetching documents, or performing calculations. Tools can be native or developer-defined using the BaseTool class, and each must have a clear name and description so agents can invoke them appropriately.
CrewAI supports multiple orchestration strategies:
Sequential: Tasks run in a fixed order—simple and predictable.
Hierarchical: A manager agent or LLM delegates tasks dynamically, enabling top-down workflows.
Consensual (planned): Future support for democratic, collaborative task routing. Each process type shapes how coordination and delegation unfold within a crew.
A crew is a collection of agents and tasks governed by a defined process. It represents a fully operational unit with an execution strategy, internal collaboration logic, and control settings for verbosity and output formatting. Think of it as the operating system for multi-agent workflows.
Pipelines chain multiple crews together, enabling multi-phase workflows where the output of one crew becomes the input to the next. This allows developers to modularize complex applications into reusable, composable segments of logic.
With planning enabled, CrewAI generates a task-by-task strategy before execution using an AgentPlanner. This enriches each task with context and sequencing logic, improving coordination—especially in multi-step or loosely defined workflows.
| Consideration | Notes |
| --- | --- |
| Agent Roles | Explicit role configuration gives flexibility, but poor design can cause overlap or miscommunication |
| State Management | Stateless by default. Developers must implement external state or context passing for continuity across tasks |
| Task Planning | Supports sequential and branching workflows, but all logic must be manually defined—no built-in planning |
| Tool Usage | Agents support tools via config. No automatic selection; all tool-to-agent mappings are manual |
| Termination Logic | No auto-termination handling. Developers must define explicit conditions to break recursive or looping behavior |
| Memory | No built-in memory layer. Integration with vector stores or databases must be handled externally |
Prompt chaining decomposes a complex task into a sequence of smaller steps, where each LLM call operates on the output of the previous one. This workflow introduces the ability to add programmatic checks (such as “gates”) between steps, validating intermediate outputs before continuing. The result is higher control, accuracy, and debuggability—at the cost of increased latency.
CrewAI makes it straightforward to build prompt chaining workflows using a sequential process. Each step is modeled as a Task, assigned to a specialized Agent, and executed in order using Process.sequential. You can insert validation logic between tasks or configure agents to flag issues before passing outputs forward.
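A minimal sketch of a sequential crew; the roles, goals, and task text are illustrative:

```python
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Researcher",
    goal="Gather key facts about the requested topic",
    backstory="A meticulous analyst who only reports verifiable information.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short, readable article",
    backstory="A clear and concise technical writer.",
)

research_task = Task(
    description="Research the topic: {topic}. List the most important facts.",
    expected_output="A bullet list of key facts.",
    agent=researcher,
)
writing_task = Task(
    description="Using the research notes, write a three-paragraph article on {topic}.",
    expected_output="A short article grounded in the research notes.",
    agent=writer,
)

# Process.sequential runs the tasks in order, passing each task's output forward.
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
)
result = crew.kickoff(inputs={"topic": "LLM observability"})
print(result)
```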
Notebook: Research-to-Content Prompt Chaining Workflow
Routing is a pattern designed to classify incoming requests and dispatch them to the single most appropriate specialist agent or workflow, ensuring each input is handled by a focused, expert-driven routine.
In CrewAI, you implement routing by defining a Router Agent that inspects each input, emits a category label, and then dynamically delegates to downstream agents (or crews) tailored for that category—each equipped with its own tools and prompts. This separation of concerns delivers more accurate, maintainable pipelines.
Notebook: Research-Content Routing Workflow
Parallelization is a powerful agent workflow where multiple tasks are executed simultaneously, enabling faster and more scalable LLM pipelines. This pattern is particularly effective when tasks are independent and don’t depend on each other’s outputs.
While CrewAI does not enforce true multithreaded execution, it provides a clean and intuitive structure for defining parallel logic through multiple agents and tasks. These can be executed concurrently in terms of logic, and then gathered or synthesized by a downstream agent.
Notebook: Parallel Research Agent
The Orchestrator-Workers workflow centers around a primary agent—the orchestrator—that dynamically decomposes a complex task into smaller, more manageable subtasks. Rather than relying on a fixed structure or pre-defined subtasks, the orchestrator decides what needs to be done based on the input itself. It then delegates each piece to the most relevant worker agent, often specialized in a particular domain like research, content synthesis, or evaluation.
CrewAI supports this pattern using the Process.hierarchical setup, where the orchestrator (as the manager agent) generates follow-up task specifications at runtime. This enables dynamic delegation and coordination without requiring the workflow to be rigidly structured up front. It's especially useful for use cases like multi-step research, document generation, or problem-solving workflows where the best structure only emerges after understanding the initial query.
Notebook: Research & Writing Delegation Agents
Use Phoenix to trace and evaluate agent frameworks built using LangGraph
This guide explains key LangGraph concepts, discusses design considerations, and walks through common architectural patterns like orchestrator-worker, evaluators, and routing. Each pattern includes a brief explanation and links to runnable Python notebooks.
LangGraph allows you to build LLM-powered applications using a graph of steps (called "nodes") and data (called "state"). Here's what you need to know to understand and customize LangGraph workflows:
A TypedDict that stores all information passed between nodes. Think of it as the memory of your workflow. Each node can read from and write to the state.
Nodes are units of computation. Most often these are functions that accept a State input and return a partial update to it. Nodes can do anything: call LLMs, trigger tools, perform calculations, or prompt users.
Directed connections that define the order in which nodes are called. LangGraph supports linear, conditional, and cyclical edges, which allows for building loops, branches, and recovery flows.
A Python function that examines the current state and returns the name of the next node to call. This allows your application to respond dynamically to LLM outputs, tool results, or even human input.
A way to dynamically launch multiple workers (nodes or subgraphs) in parallel, each with their own state. Often used in orchestrator-worker patterns where the orchestrator doesn't know how many tasks there will be ahead of time.
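To make the state, node, edge, and conditional-edge concepts concrete, here is a small illustrative graph; the routing rule is a stand-in for an LLM call, and the node names are placeholders:

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph

# State: the shared memory passed between nodes.
class State(TypedDict):
    question: str
    category: str
    answer: str

# Nodes: functions that read the state and return a partial update.
def classify(state: State) -> dict:
    # In a real graph this would call an LLM; here we use a trivial rule.
    category = "billing" if "invoice" in state["question"].lower() else "faq"
    return {"category": category}

def billing_node(state: State) -> dict:
    return {"answer": "Routing to the billing specialist..."}

def faq_node(state: State) -> dict:
    return {"answer": "Here is the relevant FAQ entry..."}

# Conditional edge: inspect the state and return the name of the next node.
def route(state: State) -> str:
    return state["category"]

builder = StateGraph(State)
builder.add_node("classify", classify)
builder.add_node("billing", billing_node)
builder.add_node("faq", faq_node)
builder.set_entry_point("classify")
builder.add_conditional_edges("classify", route, {"billing": "billing", "faq": "faq"})
builder.add_edge("billing", END)
builder.add_edge("faq", END)

graph = builder.compile()
print(graph.invoke({"question": "Where is my invoice?"}))
```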
LangGraph enables complex multi-agent orchestration using a Supervisor node that decides how to delegate tasks among a team of agents. Each agent can have its own tools, prompt structure, and output format. The Supervisor coordinates routing, manages retries, and ensures loop control.
LangGraph supports built-in persistence using checkpointing. Each execution step saves state to a database (in-memory, SQLite, or Postgres). This allows for:
Multi-turn conversations (memory)
Rewinding to past checkpoints (time travel)
Human-in-the-loop workflows (pause + resume)
LangGraph improves on LangChain by supporting more flexible and complex workflows. Here’s what to keep in mind when designing:
Cyclic workflows: LangGraph supports loops, retries, and iterative workflows that would be cumbersome in LangChain.
Debugging complexity: Deep graphs and multi-agent networks can be difficult to trace. Use Arize AX or Phoenix!
Fine-grained control: Customize prompts, tools, state updates, and edge logic for each node.
Token bloat: Cycles and retries can accumulate state and inflate token usage.
Visualize: Graph visualization makes it easier to follow logic flows and complex routing.
Requires upfront design: Graphs must be statically defined before execution. No dynamic graph construction mid-run.
Supports multi-agent coordination: Easily create agent networks with Supervisor and worker roles.
Supervisor misrouting: If not carefully designed, supervisors may loop unnecessarily or reroute outputs to the wrong agent.
A linear sequence of prompt steps, where the output of one becomes the input to the next. This workflow is optimal when the task can be simply broken down into concrete subtasks.
Use case: Multistep reasoning, query rewriting, or building up answers gradually.
Runs multiple LLMs in parallel — either by splitting tasks (sectioning) or getting multiple opinions (voting).
Use case: Combining diverse outputs, evaluating models from different angles, or running safety checks.
With the Send API, LangGraph lets you:
Launch multiple safety evaluators in parallel
Compare multiple generated hypotheses side-by-side
Run multi-agent voting workflows
This improves reliability and reduces bottlenecks in linear pipelines.
Routes an input to the most appropriate follow-up node based on its type or intent.
Use case: Customer support bots, intent classification, or model selection.
LangGraph routers enable domain-specific delegation — e.g., classify an incoming query as "billing", "technical support", or "FAQ", and send it to a specialized sub-agent. Each route can have its own tools, memory, and context. Use structured output with a routing schema to make classification more reliable.
One LLM generates content, another LLM evaluates it, and the loop repeats until the evaluation passes. LangGraph allows feedback to modify the state, making each round better than the last.
Use case: Improving code, jokes, summaries, or any generative output with measurable quality.
An orchestrator node dynamically plans subtasks and delegates each to a worker LLM. Results are then combined into a final output.
Use case: Writing research papers, refactoring code, or composing modular documents.
LangGraph’s Send API lets the orchestrator fork off tasks (e.g., subsections of a paper) and gather them into completed_sections. This is especially useful when the number of subtasks isn’t known in advance.
You can also incorporate agents like a PDF_Reader or a WebSearcher, and the orchestrator can choose when to route to these workers.
⚠️ Caution: Feedback loops or improper edge handling can cause workers to echo each other or create infinite loops. Use strict conditional routing to avoid this.
Phoenix offers key modules to measure the quality of generated results as well as modules to measure retrieval quality.
Response Evaluation: Does the response match the retrieved context? Does it also match the query?
Retrieval Evaluation: Are the retrieved sources relevant to the query?
Evaluation of generated results can be challenging. Unlike traditional ML, the predicted results are not numeric or categorical, making it hard to define quantitative metrics for this problem.
LLM Evals supports the following response evaluation criteria:
QA Correctness - Whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.
Hallucinations - Designed to detect LLM hallucinations relative to retrieved context
Toxicity - Identify if the AI response is racist, biased, or toxic
Response evaluations are a critical first step to figuring out whether your LLM App is running correctly. Response evaluations can pinpoint specific executions (a.k.a. traces) that are performing badly and can be aggregated up so that you can track how your application is running as a whole.
Phoenix also provides evaluation of retrieval independently.
The concept of retrieval evaluation is not new; given a set of relevance scores for a set of retrieved documents, we can evaluate retrievers using retrieval metrics like precision, NDCG, hit rate, and more.
LLM Evals supports the following retrieval evaluation criteria:
Relevance - Evaluates whether a retrieved document chunk contains an answer to the query.
Retrieval is possibly the most important step in any LLM application as poor and/or incorrect retrieval can be the cause of bad response generation. If your application uses RAG to power an LLM, retrieval evals can help you identify the cause of hallucinations and incorrect answers.
With Phoenix's LLM Evals, evaluation results (or just Evaluations for short) are data consisting of three main columns:
label: str [optional] - a classification label for the evaluation (e.g. "hallucinated" vs "factual"). Can be used to calculate percentages (e.g. percent hallucinated) and can be used to filter down your data (e.g. Evals["Hallucinations"].label == "hallucinated")
score: number [optional] - a numeric score for the evaluation (e.g. 1 for good, 0 for bad). Scores are a great way to sort your data to surface poorly performing examples, and can be used to filter your data by a threshold.
explanation: str [optional] - the reasoning for why the evaluation label or score was given. In the case of LLM evals, this is the evaluation model's reasoning. While explanations are optional, they can be extremely useful when trying to understand problematic areas of your application.
Let's take a look at an example list of Q&A relevance evaluations:
| label | explanation | score |
| --- | --- | --- |
| correct | The reference text explains that YC was not or... | 1 |
| correct | To determine if the answer is correct, we need... | 1 |
| incorrect | To determine if the answer is correct, we must... | 0 |
| correct | To determine if the answer is correct, we need... | 1 |
These three columns combined can drive any type of evaluation you can imagine. label provides a way to classify responses, score provides a way to assign a numeric assessment, and explanation gives you a way to get qualitative feedback.
With Phoenix, evaluations can be "attached" to the spans and documents collected. In order to facilitate this, Phoenix supports the following steps.
Querying and downloading data - query the spans collected by phoenix and materialize them into DataFrames to be used for evaluation (e.g. question and answer data, documents data).
Running Evaluations - the data queried in step 1 can be fed into LLM Evals to produce evaluation results.
Logging Evaluations - the evaluations performed in the above step can be logged back to Phoenix to be attached to spans and documents for evaluating responses and retrieval. See here on how to log evaluations to Phoenix.
Sorting and Filtering by Evaluation - once the evaluations have been logged back to Phoenix, the spans become instantly sortable and filterable by the evaluation values that you attached to the spans. (An example of an evaluation filter would be Eval["hallucination"].label == "hallucinated")
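A sketch of this flow using Phoenix's client and evals APIs; the span filter, evaluator choice, and dataframe columns are illustrative, and each evaluator expects a specific dataframe shape described in the eval docs:

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# 1. Query and download spans collected by Phoenix into a dataframe.
client = px.Client()
spans_df = client.get_spans_dataframe("span_kind == 'LLM'")  # filter string is illustrative

# 2. Run evaluations on the queried data
#    (column names must match what the evaluator expects).
hallucination_df = run_evals(
    dataframe=spans_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)[0]

# 3. Log the evaluations back to Phoenix so they attach to the spans.
client.log_evaluations(SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df))

# 4. In the UI, spans are now sortable and filterable,
#    e.g. Evals["Hallucination"].label == "hallucinated".
```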
By following the above steps, you will have a full end-to-end flow for troubleshooting, evaluating, and root-causing an LLM application. By using LLM Evals in conjunction with Traces, you will be able to surface problematic queries, get an explanation as to why the generation is problematic (e.g., hallucinated because ...), and identify which step of your generative app requires improvement (e.g., did the LLM hallucinate or was the LLM fed bad context?).
For a full tutorial on LLM Ops, check out our tutorial below.
Prompts often refer to the content of how you "prompt" an LLM, e.g., the "text" that you send to a model like OpenAI's gpt-4. Within Phoenix we expand this definition to be everything that's needed to prompt:
The prompt template of the messages to send to a completion endpoint
The invocation parameters (temperature, frequency penalty, etc.)
The tools made accessible to the LLM (e.g., a weather API)
The response format (sometimes called the output schema), used when you have JSON mode enabled.
This expanded definition of a prompt lets you more deterministically invoke LLMs with confidence as everything is snapshotted for you to use within your application.
Although the terms prompt and prompt template get used interchangeably, it's important to know the difference.
Prompts refer to the message(s) that are passed into the language model.
Prompt Templates refer to a way of formatting information so that the prompt holds the information you want (such as context and examples). Prompt templates can include placeholders (variables) for things such as examples (e.g., few-shot), outside context (RAG), or any other external data that is needed.
Every time you save a prompt within Phoenix, a snapshot of the prompt is saved as a prompt version. Phoenix does this so that you not only can view the changes to a prompt over time but also so that you can build confidence about a specific prompt version before using it within your application. With every prompt version phoenix tracks the author of the prompt and the date at which the version was saved.
Similar to the way in which you can track changes to your code via git shas, Phoenix tracks each change to your prompt with a prompt_id.
Imagine you’re working on an AI project, and you want to label specific versions of your prompts so you can control when and where they get deployed. This is where prompt version tags come in.
A prompt version tag is like a sticky note you put on a specific version of your prompt to mark it as important. Once tagged, that version won’t change, making it easy to reference later.
When building applications, different environments are often used for different stages of readiness before going live, for example:
Development – Where new features are built.
Staging – Where testing happens.
Production – The live system that users interact with.
Tagging prompt versions with environment tags can enable building, testing, and deploying prompts in the same way as an application—ensuring that prompt changes can be systematically tested and deployed.
Prompts can be formatted to include any attributes from spans or datasets. These attributes can be added as F-Strings or using Mustache formatting.
F-strings should be formatted with single curly braces:
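For example, an illustrative template whose variables are filled in with single braces:

```
Answer the question based on the context below.

Question: {question}
Context: {context}
```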
Mustache templates should be formatted with double curly braces:
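The same illustrative template in Mustache form:

```
Answer the question based on the context below.

Question: {{question}}
Context: {{context}}
```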
Tools allow LLMs to interact with the external environment. This can allow LLMs to interface with your application in more controlled ways. Given a prompt and some tools to choose from, an LLM may choose to use some (or one) of the tools, or none at all. Many LLM APIs also expose a tool choice parameter which allows you to constrain how and which tools are selected.
Here is an example of what a tool would look like for a weather API using OpenAI.
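A representative definition in the OpenAI tools format; the function name and parameters are illustrative:

```python
# A function-calling tool definition in the OpenAI format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}
```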
Some LLMs support structured responses, known as response format or output schema, allowing you to specify an exact schema for the model’s output.
Structured Outputs ensure the model consistently generates responses that adhere to a defined JSON Schema, preventing issues like missing keys or invalid values.
Reliable type-safety: Eliminates the need to validate or retry incorrectly formatted responses.
Explicit refusals: Enables programmatic detection of safety-based refusals.
Simpler prompting: Reduces reliance on strongly worded prompts for consistent formatting.
To log traces, you must instrument your application either manually or automatically. To log to a remote instance of Phoenix, you must also configure the host and port where your traces will be sent.
Tracing can be paused temporarily or disabled permanently.
Pause tracing using context manager
If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.
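For example (the chunking helper is a hypothetical stand-in for your own code):

```python
from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Spans will not be emitted for anything inside this block,
    # e.g. a document chunking step you don't want traced.
    chunks = chunk_documents(documents)  # hypothetical helper
```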
Uninstrument the auto-instrumentors permanently
Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.
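For example, with the OpenInference instrumentors (assuming the corresponding openinference-instrumentation-* packages are installed and were previously instrumented):

```python
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor

LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
```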
Datasets are integral to evaluation and experimentation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Each example within a dataset represents a single data point, consisting of an inputs dictionary, an optional output dictionary, and an optional metadata dictionary. The optional output dictionary often contains the expected LLM application output for the given input.
Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are then used to run experiments and evaluations to track improvements.
Use datasets to:
Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
Capture generations to assess quality manually or using LLM-graded evals
Store user reviewed generations to find new test cases
With Phoenix, datasets are:
Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use-cases.
Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
There are various ways to get started with datasets:
Manually Curated Examples
This is how we recommend you start. From building your application, you probably have an idea of what types of inputs you expect your application to be able to handle, and what "good" responses look like. You probably want to cover a few different common edge cases or situations you can imagine. Even 20 high quality, manually curated examples can go a long way.
Historical Logs
Once you ship an application, you start gleaning valuable information: how users are actually using it. This information can be valuable to capture and store in datasets. This allows you to test against specific use cases as you iterate on your application.
If your application is going well, you will likely get a lot of usage. How can you determine which datapoints are valuable to add? There are a few heuristics you can follow. If possible, try to collect end user feedback. You can then see which datapoints got negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future. You can also use other heuristics to identify interesting datapoints - for example, runs that took a long time to complete could be interesting to analyze and add to a dataset.
Synthetic Data
Once you have a few examples, you can try to artificially generate examples to get a lot of datapoints quickly. It's generally advised to have a few good handcrafted examples before this step, as the synthetic data will often resemble the source examples in some way.
While Phoenix doesn't have explicit dataset types, conceptually your datasets can contain:
Key-Value Pairs:
Inputs and outputs are arbitrary key-value pairs.
This dataset type is ideal for evaluating prompts, functions, and agents that require multiple inputs or generate multiple outputs.
If you have a RAG prompt template such as:
Your dataset might look like:
LLM inputs and outputs:
Simply capture the input
and output
as a single string to test the completion of an LLM.
The "inputs" dictionary contains a single "input" key mapped to the prompt string.
The "outputs" dictionary contains a single "output" key mapped to the corresponding response string.
Messages or chat:
This type of dataset is designed for evaluating LLM structured messages as inputs and outputs.
The "inputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.
The "outputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.
This type of data is useful for evaluating conversational AI systems or chatbots.
Depending on the contents of a given dataset, you might consider the dataset to be of a certain type.
A dataset that contains the inputs and the ideal "golden" outputs is often referred to as a golden dataset. These datasets are hand-labeled and are used to evaluate the performance of LLMs or prompt templates. A golden dataset could look something like this:
is an open-source framework for building and orchestrating collaborative AI agents that act like a team of specialized virtual employees. Built on LangChain, it enables users to define roles, goals, and workflows for each agent, allowing them to work together autonomously on complex tasks with minimal setup.
Phoenix offers , a module designed to measure the quality of results. This module uses a "gold" LLM (e.g. GPT-4) to decide whether the generated answer is correct in a variety of ways. Note that many of these evaluation criteria DO NOT require ground-truth labels. Evaluation can be done simply with a combination of the input (query), output (response), and context.
In addition to environment tags, custom Git tags allow teams to label code versions in a way that fits their specific workflow (`v0.0.1`
). These tags can be used to signal different stages of deployment, feature readiness, or any other meaningful status.
Prompt version tags work exactly the same way as .
For more details, check out this
When running Phoenix locally on the default port of 6006
, no additional configuration is necessary.
If you are running a remote instance of Phoenix, you can configure your instrumentation to log to that instance using the PHOENIX_HOST
and PHOENIX_PORT
environment variables.
Alternatively, you can use the PHOENIX_COLLECTOR_ENDPOINT
environment variable.
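For instance, a minimal sketch (the host and endpoint values are placeholders for your own deployment):

```python
import os

# Point instrumentation at a remote Phoenix instance.
os.environ["PHOENIX_HOST"] = "phoenix.example.com"  # placeholder host
os.environ["PHOENIX_PORT"] = "6006"

# Alternatively, set the full collector endpoint in a single variable.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://phoenix.example.com:6006"
```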
To get token counts when streaming, install openai>=1.26
and set stream_options={"include_usage": True}
when calling create
. Below is an example Python code snippet. For more info, see .
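A minimal sketch of what such a call might look like (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write a haiku about tracing."}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage is not None:  # the final chunk carries the token counts
        print("\n", chunk.usage)
```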
If you have customized a LangChain component (say a retriever), you might not get tracing for that component without some additional steps. Internally, instrumentation relies on components inheriting from LangChain base classes for their traces to show up. Below is an example of how to inherit from LangChain base classes so that traces for your custom component show up.
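A minimal sketch of a custom retriever that inherits from LangChain's BaseRetriever (the class name and document contents are illustrative):

```python
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class MyCustomRetriever(BaseRetriever):
    """A toy retriever; inheriting from BaseRetriever lets instrumentation trace it."""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # Replace with your own retrieval logic.
        return [Document(page_content=f"Placeholder document for: {query}")]
```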
{
  "query": "What is Paul Graham known for?",
  "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for..."
}
{
  "answer": "Paul Graham is known for co-founding Y Combinator, for his writing, and for his work on the Lisp programming language."
}
{ "input": "do you have to have two license plates in ontario" }
{ "output": "true" }
{ "input": "are black beans the same as turtle beans" }
{ "output": "true" }
{ "messages": [{ "role": "system", "content": "You are an expert SQL..."}] }
{ "messages": [{ "role": "assistant", "content": "select * from users"}] }
{ "messages": [{ "role": "system", "content": "You are a helpful..."}] }
{ "messages": [{ "role": "assistant", "content": "I don't know the answer to that"}] }
| input | output |
| --- | --- |
| Paris is the capital of France | True |
| Canada borders the United States | True |
| The native language of Japan is English | False |
This section introduces inferences and schemas, the starting concepts needed to use Phoenix with inferences.
Phoenix inferences are an instance of phoenix.Inferences that contains three pieces of information:
The data itself (a pandas dataframe)
A schema (a phoenix.Schema instance) that describes the columns of your dataframe
A name that appears in the UI
For example, if you have a dataframe prod_df
that is described by a schema prod_schema
, you can define inferences prod_ds
with
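a sketch like the following (it assumes prod_df and prod_schema already exist and that phoenix is imported as px):

```python
import phoenix as px

prod_ds = px.Inferences(dataframe=prod_df, schema=prod_schema, name="production")
```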
If you launch Phoenix with these inferences, you will see inferences named "production" in the UI.
You can launch Phoenix with zero, one, or two sets of inferences.
With no inferences, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single inference set, Phoenix provides insights into model performance and data quality. With two inference sets, Phoenix compares your inferences and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.
Your reference inferences provide a baseline against which to compare your primary inferences.
To compare two inference sets with Phoenix, you must select one inference set as primary and one to serve as a reference. As the name suggests, your primary inference set contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference inferences, in contrast, are usually of secondary importance and serve as a baseline against which to compare your primary inferences.
Very often, your primary inferences will contain production data and your reference inferences will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference inference sets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.
The only difference for the corpus inference set is that it needs a separate schema, because it has a different set of columns compared to the model data. See the schema section for more details.
A Phoenix schema is an instance of phoenix.Schema
that maps the columns of your dataframe to fields that Phoenix expects and understands. Use your schema to tell Phoenix what the data in your dataframe means.
For example, if you have a dataframe containing Fisher's Iris data that looks like this:
| sepal_length | sepal_width | petal_length | petal_width | target | prediction |
| --- | --- | --- | --- | --- | --- |
| 7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa |
| 6.3 | 3.3 | 4.7 | 1.6 | versicolor | versicolor |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica | setosa |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica |
your schema might look like this:
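For example, a sketch like this (the column names, including "target" and "prediction", are assumptions based on the dataframe above):

```python
import phoenix as px

schema = px.Schema(
    feature_column_names=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    actual_label_column_name="target",          # assumed ground-truth column
    prediction_label_column_name="prediction",  # assumed model output column
)
```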
Usually one, sometimes two.
Each inference set needs a schema. If your primary and reference inferences have the same format, then you only need one schema. For example, if you have dataframes train_df
and prod_df
that share an identical format described by a schema named schema
, then you can define inference sets train_ds
and prod_ds
with
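a sketch such as the following, reusing the single shared schema:

```python
import phoenix as px

train_ds = px.Inferences(dataframe=train_df, schema=schema, name="training")
prod_ds = px.Inferences(dataframe=prod_df, schema=schema, name="production")
```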
Sometimes, you'll encounter scenarios where the formats of your primary and reference inference sets differ. For example, you'll need two schemas if:
Your production data has timestamps indicating the time at which an inference was made, but your training data does not.
Your training data has ground truth (what we call actuals in Phoenix nomenclature), but your production data does not.
A new version of your model has a differing set of features from a previous version.
In cases like these, you'll need to define two schemas, one for each inference set. For example, if you have dataframes train_df
and prod_df
that are described by schemas train_schema
and prod_schema
, respectively, then you can define inference sets train_ds
and prod_ds
with
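the same sketch as above, but with one schema per inference set:

```python
import phoenix as px

train_ds = px.Inferences(dataframe=train_df, schema=train_schema, name="training")
prod_ds = px.Inferences(dataframe=prod_df, schema=prod_schema, name="production")
```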
A corpus inference set, containing documents for information retrieval, typically has a different set of columns than those found in the model data from either production or training, and requires a separate schema. Below is an example schema for a corpus inference set with three columns: the id
, text
, and embedding
for each document in the corpus.
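A sketch of what such a corpus schema might look like, using the three column names listed above:

```python
import phoenix as px

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)
```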
SmolAgents is a lightweight Python library for composing tool-using, task-oriented agents. This guide outlines common agent workflows we've implemented—covering routing, evaluation loops, task orchestration, and parallel execution. For each pattern, we include an overview, a reference notebook, and guidance on how to evaluate agent quality.
While the API is minimal—centered on Agent
, Task
, and Tool
—there are important tradeoffs and design constraints to be aware of.
API centered on Agent
, Task
, and Tool
Tools are just Python functions decorated with @tool. There's no centralized registry or schema enforcement, so developers must define conventions and structure on their own (a minimal example of such a tool follows this list).
Provides flexibility for orchestration
No retry mechanism or built-in workflow engine
Supports evaluator-optimizer loops, routing, and fan-out/fan-in
Agents are composed, not built-in abstractions
Must implement orchestration logic
Multi-Agent support
No built-in support for collaboration structures like voting, planning, or debate.
Token-level streaming is not supported
No state or memory management out of the box. Applications that require persistent state—such as conversations or multi-turn workflows—will need to integrate external storage (e.g., a vector database or key-value store).
There’s no native memory or “trajectory” tracking between agents. Handoffs between tasks are manual. This is workable in small systems, but may require structure in more complex workflows.
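As referenced above, a minimal sketch of a SmolAgents tool (the weather function is purely illustrative):

```python
from smolagents import tool


@tool
def get_weather(city: str) -> str:
    """Return a short weather description for a city.

    Args:
        city: The city to look up.
    """
    # Illustrative only; a real tool would call a weather API here.
    return f"It is sunny in {city}."
```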
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do summaries reflect the extracted keywords?)
Check if the intermediate step (e.g. keyword extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated agent, often using lightweight classification.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right department for a given candidate?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors (e.g., ambiguous or mixed-signal profiles)
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the message respectful, clear, and specific?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final recommendation align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components (e.g., technical fit vs. recommendation)
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are profiles processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
LangSmith is another LLM Observability and Evaluation platform that serves as an alternative to Arize Phoenix. Both platforms support the baseline tracing, evaluation, prompt management, and experimentation features, but there are a few key differences to be aware of:
LangSmith is closed source, while Phoenix is open source
LangSmith is part of the broader LangChain ecosystem, though it does support applications that don’t use LangChain. Phoenix is fully framework-agnostic.
Self-hosting is a paid feature within LangSmith, vs free for Phoenix.
Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.
The first and most fundamental difference: LangSmith is closed source, while Phoenix is fully open source.
This means Phoenix users have complete control over how the platform is used, modified, and integrated. Whether you're running in a corporate environment with custom compliance requirements or you're building novel agent workflows, open-source tooling allows for a degree of flexibility and transparency that closed platforms simply can’t match.
LangSmith users, on the other hand, are dependent on a vendor roadmap and pricing model, with limited ability to inspect or modify the underlying system.
LangSmith is tightly integrated with the LangChain ecosystem, and while it technically supports non-LangChain applications, the experience is optimized for LangChain-native workflows.
Phoenix is designed from the ground up to be framework-agnostic. It supports popular orchestration tools like LangChain, LlamaIndex, CrewAI, SmolAgents, and custom agents, thanks to its OpenInference instrumentation layer. This makes Phoenix a better choice for teams exploring multiple agent/orchestration frameworks—or who simply want to avoid vendor lock-in.
If self-hosting is a requirement—for reasons ranging from data privacy to performance—Phoenix offers it out-of-the-box, for free. You can launch the entire platform with a single Docker container, no license keys or paywalls required.
LangSmith, by contrast, requires a paid plan to access self-hosting options. This can be a barrier for teams evaluating tools or early in their journey, especially those that want to maintain control over their data from day one.
Arize Phoenix is intended to be a complete LLM observability solution, however for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.
The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.
Open Source ✅
Tracing ✅ ✅ ✅
Auto-Instrumentation ✅ ✅
Offline Evals ✅ ✅ ✅
Online Evals ✅ ✅
Experimentation ✅ ✅ ✅
Prompt Management ✅ ✅ ✅
Prompt Playground ✅ ✅ ✅
Run Prompts on Datasets ✅ ✅ ✅
Built-in Evaluators ✅ ✅ ✅
Agent Evaluations ✅ ✅ ✅
Human Annotations ✅ ✅ ✅
Custom Dashboards ✅
Workspaces ✅
Semantic Querying ✅
Copilot Assistant ✅
LangSmith is a strong option for teams all-in on the LangChain ecosystem and comfortable with a closed-source platform. But for those who value openness, framework flexibility, and low-friction adoption, Arize Phoenix stands out as the more accessible and extensible observability solution.
Everything you need to know about Google's GenAI framework
In April 2025, Google launched its ADK framework, which is an agent orchestration framework more comparable to the others on this list.
That said, because of the relative simplicity of the GenAI SDK, this guide serves as a good learning tool to show how some of the common agent patterns can be manually implemented.
GenAI SDK uses contents
to represent user messages, files, system messages, function calls, and invocation parameters. That creates relatively simple generation calls:
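For example, a minimal sketch (the model name is a placeholder; an API key is assumed to be configured in the environment):

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model
    contents="Explain what an LLM trace is in one sentence.",
)
print(response.text)
```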
Content objects can also be composed together in a list:
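For instance, a sketch along these lines (the conversation content is illustrative):

```python
from google import genai
from google.genai import types

client = genai.Client()

contents = [
    types.Content(role="user", parts=[types.Part.from_text(text="What is Arize Phoenix?")]),
    types.Content(role="model", parts=[types.Part.from_text(text="An open-source LLM observability library.")]),
    types.Content(role="user", parts=[types.Part.from_text(text="How does it relate to OpenTelemetry?")]),
]

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model
    contents=contents,
)
print(response.text)
```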
Google GenAI does not include built-in orchestration patterns.
GenAI has no concept of handoffs natively.
State is handled by maintaining previous messages and other data in a list of content objects. This is similar to how other model SDKs like OpenAI and Anthropic handle the concept of state. This stands in contrast to the more sophisticated state management present in agent orchestration frameworks.
GenAI does include some convenience features around tool calling. Plain Python functions passed to types.GenerateContentConfig via its tools parameter are automatically converted into tool signatures. To do this, the SDK uses the function docstring to understand its purpose and arguments.
GenAI will also automatically call the function and incorporate its return value. This goes a step beyond what similar model SDKs do on other platforms. This behavior can be disabled.
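A sketch of automatic function calling (the weather function is an illustrative stub, and the final config shows how the automatic behavior can be disabled):

```python
from google import genai
from google.genai import types


def get_weather(city: str) -> str:
    """Returns a short weather description for the given city."""
    return f"It is sunny in {city}."  # illustrative stub


client = genai.Client()

# The SDK converts the function into a tool signature from its docstring,
# calls it when the model requests it, and feeds the result back automatically.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model
    contents="What's the weather in Paris?",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
print(response.text)

# To disable automatic execution and handle function calls yourself:
manual_config = types.GenerateContentConfig(
    tools=[get_weather],
    automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
)
```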
GenAI has no built-in concept of memory.
GenAI has no built-in collaboration strategies. These must be defined manually.
GenAI supports streaming of both text and image responses:
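For text, a streaming sketch might look like this (the model name is a placeholder):

```python
from google import genai

client = genai.Client()

for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",  # placeholder model
    contents="Write a short poem about observability.",
):
    print(chunk.text, end="")
```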
GenAI is the "simplest" framework in this guide, and is closer to a pure model SDK like the OpenAI SDK, rather than an agent framework. It does go a few steps beyond these base SDKs however, notably in tool calling. It is a good option if you're using Gemini models, and want more direct control over your agent system.
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
Notebook: Research Agent The agent first researches a topic, then provides an executive summary of its results, then finally recommends future focus directions.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do key points reflect the original research?)
Check if the intermediate step (e.g. key point extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated call, often using lightweight classification.
Notebook: Simple Tool Router This agent shows a simple example of routing user inputs to different tools.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right tool for a given input?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
Notebook: Story Writing Agent An agent generates an initial draft of a story, then a critique agent decides whether the quality is high enough. If not, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the story engaging, clear, and well-written?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
Notebook: Travel Planning Agent The orchestrator delegates planning a trip for a user, and incorporates a user proxy to improve its quality. The orchestrator delegates to specific functions to plan flights, hotels, and provide general travel recommendations.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final output align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
Notebook: Parallel Research Agent Multiple research topics are examined simultaneously. Once all are complete, the topics are then synthesized into a final combined report.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are topics processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
Langfuse has an initially similar feature set to Arize Phoenix. Both tools support tracing, evaluation, experimentation, and prompt management, both in development and production. But on closer inspection there are a few notable differences:
While it is open-source, Langfuse locks certain key features like Prompt Playground and LLM-as-a-Judge evals behind a paywall. These same features are free in Phoenix.
Phoenix is significantly easier to self-host than Langfuse. Langfuse requires you to separately set up and link ClickHouse, Redis, and S3. Phoenix can be hosted out-of-the-box as a single Docker container.
Langfuse relies on outside instrumentation libraries to generate traces. Arize maintains its own layer that operates in concert with OpenTelemetry for instrumentation.
Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.
Langfuse is open-source, but several critical features are gated behind its paid offering when self-hosting. For example:
Prompt Playground
LLM-as-a-Judge evaluations
Prompt experiments
Annotation queues
These features can be crucial for building and refining LLM systems, especially in early prototyping stages. In contrast, Arize Phoenix offers these capabilities fully open-source.
Self-hosting Langfuse requires setting up and maintaining:
A ClickHouse database for analytics
Redis for caching and background jobs
S3-compatible storage for logs and artifacts
Arize Phoenix, on the other hand, can be launched with a single Docker container. No need to stitch together external services—Phoenix is designed to be drop-in simple for both experimentation and production monitoring. This “batteries-included” philosophy makes it faster to adopt and easier to maintain.
Langfuse does not provide its own instrumentation layer—instead, it relies on developers to integrate third-party libraries to generate and send trace data.
In fact, Langfuse supports OpenInference tracing as one of its options. This means that using Langfuse requires at least one additional dependency on an instrumentation provider.
Arize Phoenix is intended to be a complete LLM observability solution, however for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.
The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.
If you're choosing between Langfuse and Arize Phoenix, the right tool will depend on your needs. Langfuse has a polished UI and solid community momentum, but imposes friction around hosting and feature access. Arize Phoenix offers a more open, developer-friendly experience—especially for those who want a single-container solution with built-in instrumentation and evaluation tools.
For comprehensive descriptions of phoenix.Inferences
and phoenix.Schema
, see the .
Notebook: The agent first extracts keywords from a resume, then summarizes what those keywords suggest.
Notebook: The agent classifies candidate profiles into Software, Product, or Design categories, then hands them off to the appropriate evaluation pipeline.
Notebook: An agent writes a candidate rejection email. If the evaluator agent finds the tone or feedback lacking, it asks for a revision.
Notebook: The orchestrator delegates resume review, culture fit assessment, and decision-making to different agents, then composes a final recommendation.
Notebook:
Candidate reviews are distributed using asyncio
, enabling faster batch processing without compromising output quality.
Phoenix is backed by , the leading and best-funded AI Observability provider in the ecosystem.
Google's is a framework designed to help you interact with Gemini models and models run through VertexAI. Out of all the frameworks detailed in this guide, GenAI SDK is the closest to a base model SDK. While it does provide helpful functions and concepts to streamline tool calling, structured output, and passing files, it does not approach the level of abstraction of frameworks like CrewAI or Autogen.
Phoenix takes a different approach: it includes and maintains its own OpenTelemetry-compatible instrumentation layer, .
Phoenix is backed by , the leading and best-funded AI Observability provider in the ecosystem.
Use Zero Inference sets When:
You want to run Phoenix in the background to collect trace data from your instrumented LLM application.
Use a Single Inference set When:
You have only a single cohort of data, e.g., only training data.
You care about model performance and data quality, but not drift.
Use Two Inference sets When:
You want to compare cohorts of data, e.g., training vs. production.
You care about drift in addition to model performance and data quality.
You have corpus data for information retrieval. See Corpus Data.
Content approach streamlines message management
No built-in orchestration capabilities
Supports automatic tool calling
No state or memory management
Allows for all agent patterns, but each must be manually set up
Primarily designed to work with Gemini models
Open Source ✅ ✅
Tracing ✅ ✅ ✅
Auto-Instrumentation ✅ ✅
Offline Evals ✅ ✅ ✅
Online Evals ✅ ✅
Experimentation ✅ ✅ ✅
Prompt Management ✅ ✅ ✅
Prompt Playground ✅ ✅ ✅
Run Prompts on Datasets ✅ ✅
Built-in Evaluators ✅ ✅ ✅
Agent Evaluations ✅ ✅
Human Annotations ✅ ✅
Custom Dashboards ✅
Workspaces ✅
Semantic Querying ✅
Copilot Assistant ✅
A deep dive into the details of a trace
A span represents a unit of work or operation (think a span
of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks. A span for an LLM execution in JSON format is displayed below
Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.
A trace records the paths taken by requests (made by an application or end-user) as they propagate through multiple steps.
Without tracing, it is challenging to pinpoint the cause of performance problems in a system.
It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for LLM applications, which commonly have nondeterministic problems or are too complicated to reproduce locally.
Tracing makes debugging and understanding LLM applications less daunting by breaking down what happens within a request as it flows through a system.
A trace is made of one or more spans. The first span represents the root span. Each root span represents a request from start to finish. The spans underneath the parent provide a more in-depth context of what occurs during a request (or what steps make up a request).
A project
is a collection of traces. You can think of a project as a container for all the traces that are related to a single application or service. You can have multiple projects, and each project can have multiple traces. Projects can be useful for various use-cases such as separating out environments, logging traces for evaluation runs, etc. To learn more about how to setup projects, see the how-to guide
When a span is created, it is created as one of the following: Chain, Retriever, Reranker, LLM, Embedding, Agent, or Tool.
CHAIN
A Chain is a starting point or a link between different LLM application steps. For example, a Chain span could be used to represent the beginning of a request to an LLM application or the glue code that passes context from a retriever to an LLM call.
RETRIEVER
A Retriever is a span that represents a data retrieval step. For example, a Retriever span could be used to represent a call to a vector store or a database.
RERANKER
A Reranker is a span that represents the reranking of a set of input documents. For example, a cross-encoder may be used to compute the input documents' relevance scores with respect to a user query, and the top K documents with the highest scores are then returned by the Reranker.
LLM
An LLM is a span that represents a call to an LLM. For example, an LLM span could be used to represent a call to OpenAI or Llama.
EMBEDDING
An Embedding is a span that represents a call to an LLM for an embedding. For example, an Embedding span could be used to represent a call to OpenAI to get an ada-2 embedding for retrieval.
TOOL
A Tool is a span that represents a call to an external tool such as a calculator or a weather API.
AGENT
A span that encompasses calls to LLMs and Tools. An agent describes a reasoning block that acts on tools using the guidance of an LLM.
Attributes are key-value pairs that contain metadata that you can use to annotate a span to carry information about the operation it is tracking.
For example, if a span invokes an LLM, you can capture the model name, the invocation parameters, the token count, and so on.
Attributes have the following rules:
Keys must be non-null string values
Values must be a non-null string, boolean, floating point value, integer, or an array of these values
Additionally, there are Semantic Attributes, which are known naming conventions for metadata that is typically present in common operations. It's helpful to use semantic attribute naming wherever possible so that common kinds of metadata are standardized across systems. See for more information.
Below are example OTEL spans for each OpenInference spanKind to be used as reference when doing manual instrumentation.
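For instance, a manually created LLM span might look like the following sketch (the attribute values are illustrative, attribute names follow OpenInference semantic conventions, and a tracer provider is assumed to be configured already):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# A manually created span marked as an LLM span via OpenInference attributes.
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-4o-mini")  # illustrative model name
    span.set_attribute("input.value", "What is a trace?")
    span.set_attribute("output.value", "A trace records the path of a request...")
    span.set_attribute("llm.token_count.total", 42)
```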
Build multi-agent workflows with OpenAI Agents
Agents, which are LLMs equipped with instructions and tools
Handoffs, which allow agents to delegate to other agents for specific tasks
Guardrails, which enable the inputs to agents to be validated
This guide outlines common agent workflows using this SDK. We will walk through building an investment agent across several use cases.
Model support
First class support for OpenAI LLMs, and basic support for any LLM using a LiteLLM wrapper. Support for reasoning effort parameter to tradeoff on reducing latency or increasing accuracy.
Structured outputs
Tools
Very easy, using the @function_tool
decorator. Support for parallel tool calls to reduce latency. Built-in support for OpenAI SDK for WebSearchTool
, ComputerTool
, and FileSearchTool
Agent handoff
Very easy using handoffs
variable
Multimodal support
Voice support, no support for images or video
Guardrails
Enables validation of both inputs and outputs
Retry logic
⚠️ No retry logic, developers must manually handle failure cases
Memory
⚠️ No built-in memory management. Developers must manage their own conversation and user memory.
Code execution
⚠️ No built-in support for executing code
An LLM agent with access to tools to accomplish a task is the most basic flow. This agent answers questions about stocks and uses OpenAI web search to get real time information.
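A minimal sketch using the Agents SDK (the agent name, instructions, and question are illustrative):

```python
from agents import Agent, Runner, WebSearchTool

stock_agent = Agent(
    name="Stock Q&A Agent",
    instructions="Answer questions about stocks using up-to-date information.",
    tools=[WebSearchTool()],
)

result = Runner.run_sync(stock_agent, "What moved the S&P 500 today?")
print(result.final_output)
```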
This agent builds a portfolio of stocks and ETFs using multiple agents linked together:
Search Agent: Searches the web for information on particular stock tickers.
Report Agent: Creates a portfolio of stocks and ETFs that supports the user's investment strategy.
This agent researches stocks for you. If we want to research 5 stocks, we can force the agent to run multiple tool calls in parallel instead of sequentially.
This agent answers questions about investing using multiple agents. A central router agent chooses which worker to use.
Research Agent: Searches the web for information about stocks and ETFs.
Question Answering Agent: Answers questions about investing like Warren Buffett.
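A sketch of the routing setup using handoffs (the instructions and input are illustrative):

```python
from agents import Agent, Runner, WebSearchTool

research_agent = Agent(
    name="Research Agent",
    instructions="Search the web for information about stocks and ETFs.",
    tools=[WebSearchTool()],
)
qa_agent = Agent(
    name="Question Answering Agent",
    instructions="Answer questions about investing like Warren Buffett.",
)

router = Agent(
    name="Router",
    instructions="Send each question to the most appropriate specialist agent.",
    handoffs=[research_agent, qa_agent],
)

result = Runner.run_sync(router, "Should I hold index funds for the long term?")
print(result.final_output)
```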
When creating LLM outputs, the first generation is often unsatisfactory. You can use an agentic loop to iteratively improve the output by asking an LLM to give feedback, and then use the feedback to improve the output.
This agent pattern creates reports and evaluates itself to improve its output.
Report Agent (Generation): Creates a report on a particular stock ticker.
Evaluator Agent (Feedback): Evaluates the report and provides feedback on what to improve.
This is the most advanced pattern in the examples, using orchestrators and workers together. The orchestrator chooses which worker to use for a specific sub-task. The worker attempts to complete the sub-task and return a result. The orchestrator then uses the result to choose the next worker to use until a final result is returned.
In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.
Orchestrator: Chooses which worker to use based on the user's investment strategy.
Research Agent: Searches the web for information about stocks and ETFs that could support the user's investment strategy.
Evaluation Agent: Evaluates the research report and provides feedback on what data is missing.
Portfolio Agent: Creates a portfolio of stocks and ETFs based on the research report.
This uses the following structured outputs.
is a lightweight Python library for building agentic AI apps. It includes a few abstractions:
First-class support for OpenAI LLMs. LLMs that do not support json_schema
as a parameter are .