Workflows are the backbone of many successful LLM applications. They define how language models interact with tools, data, and users—often through a sequence of clearly orchestrated steps. Unlike fully autonomous agents, workflows offer structure and predictability, making them a practical choice for many real-world tasks.
In this guide, we share practical workflows built with a variety of agent frameworks, including AutoGen, CrewAI, LangGraph, and SmolAgents.
Each section highlights how to use these tools effectively—showing what’s possible, where they shine, and where a simpler solution might serve you better. Whether you're orchestrating deterministic workflows or building dynamic agentic systems, the goal is to help you choose the right tool for your context and build with confidence.
For a deeper dive into the principles behind agentic systems and when to use them, see Anthropic’s “Building Effective Agents”.
Agent Routing is the process of directing a task, query, or request to the most appropriate agent based on context or capabilities. In multi-agent systems, it helps determine which agent is best suited to handle a specific input based on skills, domain expertise, or available tools. This enables more efficient, accurate, and specialized handling of complex tasks.
Prompt Chaining is the technique of breaking a complex task into multiple steps, where the output of one prompt becomes the input for the next. This allows a system to reason more effectively, maintain context across steps, and handle tasks that would be too difficult to solve in a single prompt. It's often used to simulate multi-step thinking or workflows.
Parallelization is the process of dividing a task into smaller, independent parts that can be executed simultaneously to speed up processing. It’s used to handle multiple inputs, computations, or agent responses at the same time rather than sequentially. This improves efficiency and speed, especially for large-scale or time-sensitive tasks.
An orchestrator is a central controller that manages and coordinates multiple components, agents, or processes to ensure they work together smoothly.
It decides what tasks need to be done, who or what should do them, and in what order. An orchestrator can handle things like scheduling, routing, error handling, and result aggregation. It might also manage prompt chains, route tasks to agents, and oversee parallel execution.
An evaluator assesses the quality or correctness of outputs, such as ranking responses, checking for factual accuracy, or scoring performance against a metric. An optimizer uses that evaluation to improve future outputs, either by fine-tuning models, adjusting parameters, or selecting better strategies. Together, they form a feedback loop that helps a system learn what works and refine itself over time.
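As a rough, framework-agnostic illustration of this feedback loop, here is a minimal sketch; `generate` and `evaluate` are hypothetical stand-ins for your own LLM or agent calls.

```python
from typing import Callable, Optional, Tuple

def refine(
    task: str,
    generate: Callable[[str, Optional[str]], str],   # generator: (task, feedback) -> draft
    evaluate: Callable[[str], Tuple[bool, str]],      # evaluator: draft -> (passed, feedback)
    max_rounds: int = 3,
) -> str:
    """Generate a draft, evaluate it, and revise until it passes or rounds run out."""
    draft = generate(task, None)
    for _ in range(max_rounds):
        passed, feedback = evaluate(draft)
        if passed:
            break
        draft = generate(task, feedback)  # optimizer step: revise using the evaluator's feedback
    return draft
```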
Self-hosted Phoenix supports multiple users with authentication, roles, and more.
Phoenix Cloud is no longer limited to single-developer use—teams can manage access and share traces easily across their organization.
The new Phoenix Cloud supports team management and collaboration. You can spin up multiple, customized Phoenix Spaces for different teams and use cases, manage individual user access and permissions for each space, and seamlessly collaborate with additional team members on your projects.
gRPC and HTTP are communication protocols used to transfer data between client and server applications.
HTTP (Hypertext Transfer Protocol) is a stateless protocol primarily used for website and web application requests over the internet.
gRPC is a modern, open-source remote procedure call (RPC) framework from Google that uses HTTP/2 for transport, Protocol Buffers as its interface description language, and provides features like bi-directional streaming, multiplexing, and flow control.
gRPC is more efficient in a tracing context than HTTP, but HTTP is more widely supported.
Phoenix can send traces over either HTTP or gRPC.
Phoenix natively supports gRPC for trace collection as of the 4.0 release. See Configuration for details.
Yes, in fact this is probably the preferred way to interact with OpenAI if your enterprise requires data privacy. Getting the parameters right for Azure can be a bit tricky so check out the models section for details.
We update the Phoenix version used by Phoenix Cloud on a weekly basis.
You can persist data in the notebook by setting the use_temp_dir flag to false in px.launch_app, which stores your data in SQLite on your disk at the PHOENIX_WORKING_DIR. Alternatively, you can deploy a Phoenix instance and point to it via PHOENIX_COLLECTOR_ENDPOINT.
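For example, a minimal sketch of both options based on the settings above; the endpoint URL is a placeholder for your own deployment.

```python
import os
import phoenix as px

# Option 1: persist notebook data to SQLite on disk (under PHOENIX_WORKING_DIR)
session = px.launch_app(use_temp_dir=False)

# Option 2: point the notebook at a separately deployed Phoenix instance instead
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://your-phoenix-host:6006"  # placeholder
```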
Use Phoenix to trace and evaluate AutoGen agents
AutoGen is an open-source framework by Microsoft for building multi-agent workflows. The AutoGen agent framework provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.
Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.
UserProxyAgent
: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.
AssistantAgent
: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.
GroupChat
: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.
GroupChatManager
: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.
Tool Integration: Agents can use external tools (e.g. Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.
Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
Agent Roles
Poorly defined responsibilities can cause overlap or miscommunication, especially in multi-agent workflows.
Termination Conditions
GroupChat
may continue even after a logical end, as UserProxyAgent
can exhaust all allowed turns before stopping unless termination is explicitly triggered.
Human-in-the-Loop
Fully autonomous mode may miss important judgment calls without user oversight.
State Management
Excessive context can exceed token limits, while insufficient context breaks coherence.
Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.
AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent
focuses on a specialized task, while a UserProxyAgent
manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.
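A minimal sketch of such a chain using AutoGen's AssistantAgent and UserProxyAgent is below. The agent names, prompts, and model configuration are illustrative, and the llm_config assumes valid API credentials (e.g., an OPENAI_API_KEY environment variable).

```python
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # assumes OPENAI_API_KEY is set

trend_agent = AssistantAgent(
    "trend_analyst",
    system_message="Identify general market trends for the given industry.",
    llm_config=llm_config,
)
company_agent = AssistantAgent(
    "company_evaluator",
    system_message="Evaluate a company's strengths in light of the given market trends.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,  # end each chat after the assistant's first reply
    code_execution_config=False,
)

# Step 1: identify trends
user_proxy.initiate_chat(trend_agent, message="Industry: wearable fitness devices")
trends = user_proxy.last_message(trend_agent)["content"]

# Step 2: pass the trends into the next prompt
user_proxy.initiate_chat(
    company_agent, message=f"Market trends:\n{trends}\n\nEvaluate the strengths of Acme Corp."
)
evaluation = user_proxy.last_message(company_agent)["content"]
```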
Notebook: Market Analysis Prompt Chaining Agent. The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.
How to evaluate: Ensure outputs are moved into inputs for the next step and logically build across steps (e.g., do identified trends inform the company evaluation?)
Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis
Track total latency and token counts to see which steps cause inefficiencies
Ensure there are no redundant outputs or hallucinations in multi-step reasoning
Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.
AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.
Notebook: Customer Service Routing Agent
We will build an intelligent customer service system, designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent
.
How to evaluate: Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)
Confirm that each query is routed to the appropriate specialized AssistantAgent
without ambiguity or misdirection
Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately
Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents
The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished & accurate final result.
AutoGen's GroupChat
architecture is good for implementing this pattern because it can manage the conversational turns between the generator and evaluator agents. The GroupChatManager
facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.
Notebook: Code Generator with Evaluation Loop
We'll use a Code_Generator
agent to write Python code from requirements, and a Code_Reviewer
agent to assess it for correctness, style, and documentation. This iterative GroupChat
process improves code quality through a generation and review loop.
How to evaluate: Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)
Confirm that the generator incorporates feedback into meaningful revisions with each iteration
Track the number of iterations required to reach an acceptable or final version to assess efficiency
Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process
Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.
The agent orchestrator workflow simplifies this routing pattern through a central orchestrator (GroupChatManager
) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but only contributes when their specific expertise is required.
Notebook: Trip Planner Orchestrator Agent
We will build a dynamic travel planning assistant. A GroupChatManager
coordinates specialized agents to adapt to the user's evolving travel needs.
How to evaluate: Ensure the orchestrator activates only relevant agents based on the current context or user need.
(e.g., flights, hotels, local activities)
Confirm that agents contribute meaningfully and only when their domain expertise is required
Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents
Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically
Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.
AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.
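For example, a minimal sketch using ThreadPoolExecutor; generate_section is a hypothetical helper that would wrap a single agent interaction (e.g., an initiate_chat call to the marketing agent).

```python
from concurrent.futures import ThreadPoolExecutor

sections = ["features", "value proposition", "target customer", "tagline"]

def generate_section(section: str) -> str:
    # Hypothetical helper: in the notebook this would wrap a single agent
    # interaction, e.g. user_proxy.initiate_chat(marketing_agent, message=...)
    return f"<{section} copy from the marketing agent>"

# Run the independent agent calls concurrently; results come back in input order
with ThreadPoolExecutor(max_workers=len(sections)) as pool:
    results = dict(zip(sections, pool.map(generate_section, sections)))

print(results["tagline"])
```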
Notebook: Product Description Parallelization Agent. We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, results are synthesized together.
How to evaluate: Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)
Confirm that all outputs are successfully collected and synthesized into a cohesive final product description
Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution
Test with varying product types to assess generality and stability of the parallel workflow
There are multiple types of evaluations supported by the Phoenix library. Each category of evaluation is distinguished by its output type.
Categorical (binary) - The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.
Categorical (Multi-class) - The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.
Score - The evaluation results in a numeric value within a set range (e.g., 1-10), offering a scale of measurement.
Although score evals are an option in Phoenix, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.
Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.
To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.
It can be hard to understand in many cases why an LLM responds in a specific way. The explanation feature of Phoenix allows you to get an Eval output and an explanation from the LLM at the same time. We have found this incredibly useful for debugging LLM Evals.
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails hold the output to specific values based on the template.
# They remove extraneous text such as ",,," or "..." and ensure the
# binary value expected by the template is returned.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # return an explanation alongside each label
)
# relevance_classifications is a DataFrame with columns 'label' and 'explanation'
The provide_explanation flag can be set with any of the built-in templates or your own custom templates. The example above is from a relevance evaluation.
Retrieval Evals are designed to evaluate the effectiveness of retrieval systems. Retrieval systems typically return a list of chunks of length k, ordered by relevancy. The most common retrieval systems in the LLM ecosystem are vector DBs.
The retrieval Eval is designed to assess the relevance of each chunk and its ability to answer the question. More information on the Retrieval Eval can be found here
A single query returns chunks as a list. The retrieval Eval runs across each chunk, returning a list of relevance values that highlights the relevance of each specific chunk. Phoenix provides helper functions that take in a dataframe with a query column containing lists of chunks and produce a column of equal length with an Eval for each chunk.
Use Phoenix to trace and evaluate different CrewAI agent patterns
CrewAI is an open-source framework for building and orchestrating collaborative AI agents that act like a team of specialized virtual employees. Built on LangChain, it enables users to define roles, goals, and workflows for each agent, allowing them to work together autonomously on complex tasks with minimal setup.
Agents are autonomous, role-driven entities designed to perform specific functions—like a Researcher, Writer, or Support Rep. They can be richly customized with goals, backstories, verbosity settings, delegation permissions, and access to tools. This flexibility makes agents expressive and task-aware, helping model real-world team dynamics.
Tasks are the atomic units of work in CrewAI. Each task includes a description, expected output, responsible agent, and optional tools. Tasks can be executed solo or collaboratively, and they serve as the bridge between high-level goals and actionable steps.
Tools give agents capabilities beyond language generation—such as browsing the web, fetching documents, or performing calculations. Tools can be native or developer-defined using the BaseTool
class, and each must have a clear name, purpose, and description so agents can invoke and use them effectively.
CrewAI supports multiple orchestration strategies:
Sequential: Tasks run in a fixed order—simple and predictable.
Hierarchical: A manager agent or LLM delegates tasks dynamically, enabling top-down workflows.
Consensual (planned): Future support for democratic, collaborative task routing. Each process type shapes how coordination and delegation unfold within a crew.
A crew is a collection of agents and tasks governed by a defined process. It represents a fully operational unit with an execution strategy, internal collaboration logic, and control settings for verbosity and output formatting. Think of it as the operating system for multi-agent workflows.
Pipelines chain multiple crews together, enabling multi-phase workflows where the output of one crew becomes the input to the next. This allows developers to modularize complex applications into reusable, composable segments of logic.
With planning enabled, CrewAI generates a task-by-task strategy before execution using an AgentPlanner. This enriches each task with context and sequencing logic, improving coordination—especially in multi-step or loosely defined workflows.
Prompt chaining decomposes a complex task into a sequence of smaller steps, where each LLM call operates on the output of the previous one. This workflow introduces the ability to add programmatic checks (such as “gates”) between steps, validating intermediate outputs before continuing. The result is higher control, accuracy, and debuggability—at the cost of increased latency.
CrewAI makes it straightforward to build prompt chaining workflows using a sequential process. Each step is modeled as a Task
, assigned to a specialized Agent
, and executed in order using Process.sequential
. You can insert validation logic between tasks or configure agents to flag issues before passing outputs forward.
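A minimal sketch of a two-step sequential crew is shown below. The roles, task descriptions, and the {topic} input are illustrative, and a default LLM configuration (e.g., an OPENAI_API_KEY environment variable) is assumed.

```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Summarize key findings on a topic",
    backstory="A meticulous analyst",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short article",
    backstory="A clear, concise technical writer",
)

research_task = Task(
    description="Research the topic: {topic}",
    expected_output="Bullet-point findings",
    agent=researcher,
)
writing_task = Task(
    description="Write a 200-word article based on the research findings",
    expected_output="A polished short article",
    agent=writer,
)

# Tasks run in order; the second task sees the first task's output as context
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
)
result = crew.kickoff(inputs={"topic": "LLM observability"})
```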
Notebook: Research-to-Content Prompt Chaining Workflow
Routing is a pattern designed to classify incoming requests and dispatch them to the single most appropriate specialist agent or workflow, ensuring each input is handled by a focused, expert-driven routine.
In CrewAI, you implement routing by defining a Router Agent that inspects each input, emits a category label, and then dynamically delegates to downstream agents (or crews) tailored for that category—each equipped with its own tools and prompts. This separation of concerns delivers more accurate, maintainable pipelines.
Notebook: Research-Content Routing Workflow
Parallelization is a powerful agent workflow where multiple tasks are executed simultaneously, enabling faster and more scalable LLM pipelines. This pattern is particularly effective when tasks are independent and don’t depend on each other’s outputs.
While CrewAI does not enforce true multithreaded execution, it provides a clean and intuitive structure for defining parallel logic through multiple agents and tasks. These can be executed concurrently in terms of logic, and then gathered or synthesized by a downstream agent.
Notebook: Parallel Research Agent
The Orchestrator-Workers workflow centers around a primary agent—the orchestrator—that dynamically decomposes a complex task into smaller, more manageable subtasks. Rather than relying on a fixed structure or pre-defined subtasks, the orchestrator decides what needs to be done based on the input itself. It then delegates each piece to the most relevant worker agent, often specialized in a particular domain like research, content synthesis, or evaluation.
CrewAI supports this pattern using the Process.hierarchical
setup, where the orchestrator (as the manager agent) generates follow-up task specifications at runtime. This enables dynamic delegation and coordination without requiring the workflow to be rigidly structured up front. It's especially useful for use cases like multi-step research, document generation, or problem-solving workflows where the best structure only emerges after understanding the initial query.
Notebook: Research & Writing Delegation Agents
The components behind tracing
In order for an application to emit traces for analysis, the application must be instrumented. Your application can be manually or automatically instrumented. With Phoenix, a set of plugins (instrumentors) can be added to your application's startup process to perform auto-instrumentation. These plugins collect spans for your application and export them for collection and visualization. For Phoenix, all the instrumentors are managed via a single repository called OpenInference. The comprehensive list of instrumentors can be found in the how-to guide.
An exporter takes the spans created via instrumentation and exports them to a collector. In simple terms, it just sends the data to Phoenix. When using Phoenix, most of this is done under the hood when you call instrument on an instrumentor.
The Phoenix server is a collector and a UI that helps you troubleshoot your application in real time. When you run Phoenix (e.g., px.launch_app() or the container), it starts receiving spans from any application(s) exporting spans to it.
OpenTelemetry Protocol (or OTLP for short) is the means by which traces arrive from your application to the Phoenix collector. Phoenix currently supports OTLP over HTTP.
Evaluating tasks performed by LLMs can be difficult due to their complexity and the diverse criteria involved. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) often fall short when applied to the nuanced and varied outputs of LLMs.
For instance, an AI assistant’s answer to a question can be:
not grounded in context
repetitive, repetitive, repetitive
grammatically incorrect
excessively lengthy and characterized by an overabundance of words
incoherent
The list of criteria goes on, and even if we had a limited list, each of these would be hard to measure.
To overcome this challenge, the concept of "LLM as a Judge" employs an LLM to evaluate another's output, combining human-like assessment with machine efficiency.
Here’s the step-by-step process for using an LLM as a judge:
Identify Evaluation Criteria - First, determine what you want to evaluate, be it hallucination, toxicity, accuracy, or another characteristic. See our pre-built evaluation templates for examples of what can be assessed.
Craft Your Evaluation Prompt - Write a prompt template that will guide the evaluation. This template should clearly define what variables are needed from both the initial prompt and the LLM's response to effectively assess the output.
Select an Evaluation LLM - Choose the most suitable LLM from our available options for conducting your specific evaluations.
Generate Evaluations and View Results - Execute the evaluations across your data. This process allows for comprehensive testing without the need for manual annotation, enabling you to iterate quickly and refine your LLM's prompts.
Using an LLM as a judge significantly enhances the scalability and efficiency of the evaluation process. By employing this method, you can run thousands of evaluations across curated data without the need for human annotation.
This capability will not only speed up the iteration process for refining your LLM's prompts but will also ensure that you can deploy your models to production with confidence.
Arize is the company that makes Phoenix. Phoenix is an open-source LLM observability tool offered by Arize. It can be accessed in its Cloud form online, or self-hosted and run on your own machine or server.
"Arize" can also refer to Arize's enterprise platform, often called Arize AX, available on arize.com. Arize AX is the enterprise SaaS version of Phoenix that comes with additional features like Copilot, ML and CV support, HIPAA compliance, Security Reviews, a customer success team, and more. See of the two tools.
With SageMaker notebooks, Phoenix leverages the jupyter-server-proxy to host the server under proxy/6006.
Note that Phoenix will automatically try to detect that you are running in SageMaker, but you can declare the notebook runtime via a parameter to launch_app or an environment variable.
Learn about options to migrate your legacy Phoenix Cloud instance to the latest version
To move to the new Phoenix Cloud, simply sign up with a different email address. From there, you can start using a new Phoenix instance immediately. Your existing projects in your old (legacy) account will remain intact and independent, ensuring a clean transition.
Since most users don’t use Phoenix Cloud for data storage, this straightforward approach works seamlessly for migrating to the latest version.
If you need to migrate data from the legacy version to the latest version, reach out to the Phoenix team for assistance.
The easiest way to determine which version of Phoenix Cloud you’re using is by checking the URL in your browser:
The new Phoenix Cloud version will have a hostname structure like: app.phoenix.arize.com/s/[your-space-name]
If your Phoenix Cloud URL does not include /s/
followed by your space name, you are on the legacy version.
import os
os.environ["PHOENIX_NOTEBOOK_ENV"] = "sagemaker"
If you are working on an API whose endpoints perform RAG but would prefer that the Phoenix server not be launched as another thread, you can configure the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to a Phoenix server running in a different process or container.
LlamaTrace and Phoenix Cloud are the same tool. They are the hosted version of Phoenix provided on app.phoenix.arize.com.
NOT_PARSABLE
errors often occur when LLM responses exceed the max_tokens
limit or produce incomplete JSON. Here's how to fix it:
Increase max_tokens
: Update the model configuration as follows:
llm_judge_model = OpenAIModel(
    api_key=getpass("Enter your OpenAI API key..."),
    model="gpt-4o-2024-08-06",
    temperature=0.2,
    max_tokens=1000,  # Increase token limit
)
Update Phoenix: Use version ≥0.17.4, which removes token limits for OpenAI and increases defaults for other APIs.
Check Logs: Look for finish_reason="length"
to confirm token limits caused the issue.
If the above doesn't work, it's possible the llm-as-a-judge output might not fit into the defined rails for that particular custom Phoenix eval. Double check the prompt output matches the rail expectations.
If you want to contribute to the cutting edge of LLM and ML Observability, you've come to the right place!
To get started, please check out the following:
We encourage you to start with an issue labeled with the tag good first issue on the GitHub issue board, to get familiar with our codebase as a first-time contributor.
To submit your code, fork the Phoenix repository, create a new branch on your fork, and open a Pull Request (PR) once your work is ready for review.
In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.
A core reviewer will review your PR in around one business day and provide feedback on any changes required for approval. Once approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.
Your PR is now merged into Phoenix! We’ll shout out your contribution in the release notes.
Agent Roles
Explicit role configuration gives flexibility, but poor design can cause overlap or miscommunication
State Management
Stateless by default. Developers must implement external state or context passing for continuity across tasks
Task Planning
Supports sequential and branching workflows, but all logic must be manually defined—no built-in planning
Tool Usage
Agents support tools via config. No automatic selection; all tool-to-agent mappings are manual
Termination Logic
No auto-termination handling. Developers must define explicit conditions to break recursive or looping behavior
Memory
No built-in memory layer. Integration with vector stores or databases must be handled externally
Benchmarking Chunk Size, K and Retrieval Approach
The advent of LLMs is causing a rethinking of the possible architectures of retrieval systems that have been around for decades.
The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data, empowering the LLM to know your data and respond based on the private data you fit into the context window.
As teams set up their retrieval systems, understanding performance and configuring the parameters around RAG (type of retrieval, chunk size, and K) is currently a guessing game for most teams.
A typical retrieval architecture designed for RAG includes a vector DB, an LLM, and an optional framework.
This section will go through a script that iterates through all possible parameterizations of setting up a retrieval system and uses Evals to understand the trade-offs.
This overview will run through the scripts in Phoenix for performance analysis of RAG setup:
The scripts above power the included notebook.
In the typical retrieval flow, a user query is embedded and used to search a vector store for chunks of relevant data.
The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar but not usable to answer the question!
The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question of "Does the chunk of data contain relevant information to answer the question"?
The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.
The Evals generated on each chunk can then be used to generate more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at traditional search and retrieval metrics such as:
MRR
Precision @ K
NDCG
These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.
These metrics can be used overall, by cluster (UMAP), or on individual decisions, making them very powerful to track down problems from the simplest to the most complex.
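As a reference, here is a minimal sketch of how per-chunk relevance evals (1 = relevant, 0 = irrelevant) can be turned into these metrics using the standard formulas; this is plain Python, not a Phoenix API.

```python
import math

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = relevance[:k]
    return sum(top_k) / max(len(top_k), 1)

def mrr(relevance: list[int]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevance: list[int], k: int) -> float:
    """Normalized discounted cumulative gain over the top-k chunks."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance[:k], start=1))
    ideal = sum(rel / math.log2(rank + 1)
                for rank, rel in enumerate(sorted(relevance, reverse=True)[:k], start=1))
    return dcg / ideal if ideal else 0.0

# Example: evals for 4 retrieved chunks
chunk_evals = [0, 1, 1, 0]
print(precision_at_k(chunk_evals, k=2), mrr(chunk_evals), ndcg_at_k(chunk_evals, k=4))
```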
Retrieval Evals just give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not indicate whether the final answer was correct.
The Q&A Evals work to give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares the most about and is one of the most important metrics.
The Q&A Eval uses the query, retrieved chunks, and answer to create an overall assessment of the entire system, reported as a percent incorrect for production evaluations.
The results from the runs will be available in the experiment_data directory.
Underneath experiment_data
there are two sets of metrics:
The first set of results removes the cases where there are 0 retrieved relevant documents. Some clients' test sets have a large number of questions that the documents cannot answer, which can skew the metrics a lot.
The second set of results is unfiltered and shows the raw metrics for every retrieval.
The results of the benchmark sweeps across your retrieval system setup are reported as the Q&A Eval percent incorrect; the lower the percent, the better the results.
The LLM Evals library is designed to support the building of any custom Eval templates.
Follow these steps to build your own Eval with Phoenix:
First, identify the metric best suited for your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?
Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the “ground truth” label so that we can measure performance of the LLM eval template. Often such labels come from human feedback.
Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as in the benchmark dataset example).
The Eval inferences are designed for easy benchmarking and come as pre-set, downloadable test inferences. The inferences are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.
Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
Now comes the core component that we are trying to benchmark and improve: the eval template.
You can adjust an existing template or build your own from scratch.
Be explicit about the following:
What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
To create a new template, all that is needed is to set the input string to the Eval function.
The MY_CUSTOM_TEMPLATE example shows the creation of an easy-to-use string template; Phoenix Eval templates support both strings and PromptTemplate objects.
The example also shows use of the custom template on the df dataframe via llm_classify.
You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
Yes, you can use either of the two methods below.
Install pyngrok on the remote machine using the command pip install pyngrok
.
Create a free account on ngrok and verify your email. Find 'Your Authtoken' on the ngrok dashboard.
In the Jupyter notebook, after launching Phoenix, set its port number as the port
parameter in the code below. Preferably use a default port for Phoenix so that you won't have to set up an ngrok tunnel every time for a new port; simply restarting Phoenix will work with the same ngrok URL.
"Visit Site" using the newly printed public_url
and ignore warnings, if any.
The ngrok free account does not allow more than 3 tunnels over a single ngrok agent session. Handle this error by checking active URL tunnels using ngrok.get_tunnels()
and close the required URL tunnel using ngrok.disconnect(public_url)
.
This assumes you have already set up ssh on both the local machine and the remote server.
If you are accessing a remote jupyter notebook from a local machine, you can also access the phoenix app by forwarding a local port to the remote server via ssh. In this particular case of using phoenix on a remote server, it is recommended that you use a default port for launching phoenix, say DEFAULT_PHOENIX_PORT
.
Launch the phoenix app from jupyter notebook.
In a new terminal or command prompt, forward a local port of your choice from 49152 to 65535 (say 52362
) using the command below. The remote user on the remote host must have sufficient port-forwarding/admin privileges.
If successful, visit localhost:52362 to access Phoenix locally.
If you are abruptly unable to access phoenix, check whether the ssh connection is still alive by inspecting the terminal. You can also try increasing the ssh timeout settings.
Simply run exit
in the terminal/command prompt where you ran the port forwarding command.
There are two endpoints that matter in Phoenix:
Application Endpoint: The endpoint your Phoenix instance is running on
OTEL Tracing Endpoint: The endpoint through which your Phoenix instance receives OpenTelemetry traces
If you're accessing a Phoenix Cloud instance through our website, then your endpoint is available under the Hostname
field of your Settings page.
If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default value is http://localhost:6006
To set this endpoint, use the PHOENIX_COLLECTOR_ENDPOINT
environment variable. This is used by the Phoenix client package to query traces, log annotations, and retrieve prompts.
If you're accessing a Phoenix Cloud instance through our website, then your endpoint is available under the Hostname
field of your Settings page.
If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default values are:
Using the HTTP protocol: http://localhost:6006/v1/traces
Using the gRPC protocol: http://localhost:4317
To set this endpoint, use the register(endpoint=YOUR ENDPOINT)
function. This endpoint can also be set using environment variables. For more on the register function and other configuration options, see Configuration.
import getpass
from pyngrok import ngrok, conf
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()
port = 37689
# Open a ngrok tunnel to the HTTP server
public_url = ngrok.connect(port).public_url
print(" * ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}\"".format(public_url, port))
ssh -L 52362:localhost:<DEFAULT_PHOENIX_PORT> <REMOTE_USER>@<REMOTE_HOST>
Evaluating multi-agent systems involves unique challenges compared to single-agent evaluations. This guide provides clear explanations of various architectures, strategies for effective evaluation, and additional considerations.
A multi-agent system consists of multiple agents, each using an LLM (Large Language Model) to control application flows. As systems grow, you may encounter challenges such as agents struggling with too many tools, overly complex contexts, or the need for specialized domain knowledge (e.g., planning, research, mathematics). Breaking down applications into multiple smaller, specialized agents often resolves these issues.
Modularity: Easier to develop, test, and maintain.
Specialization: Expert agents handle specific domains.
Control: Explicit control over agent communication.
Multi-agent systems can connect agents in several ways:
Network
Agents can communicate freely with each other, each deciding independently whom to contact next.
Assess communication efficiency, decision quality on agent selection, and coordination complexity.
Supervisor
Agents communicate exclusively with a single supervisor that makes all routing decisions.
Evaluate supervisor decision accuracy, efficiency of routing, and effectiveness in task management.
Supervisor (Tool-calling)
Supervisor uses an LLM to invoke agents represented as tools, making explicit tool calls with arguments.
Evaluate tool-calling accuracy, appropriateness of arguments passed, and supervisor decision quality.
Hierarchical
Systems with supervisors of supervisors, allowing complex, structured flows.
Evaluate communication efficiency, decision-making at each hierarchical level, and overall system coherence.
Custom Workflow
Agents communicate within predetermined subsets, combining deterministic and agent-driven decisions.
Evaluate workflow efficiency, clarity of communication paths, and effectiveness of the predetermined control flow.
There are a few different strategies for evaluating multi-agent applications.
1. Agent Handoff Evaluation
When tasks transfer between agents, evaluate:
Appropriateness: Is the timing logical?
Information Transfer: Was context transferred effectively?
Timing: Optimal handoff moment.
2. System-Level Evaluation
Measure holistic performance:
End-to-End Task Completion
Efficiency: Number of interactions, processing speed
User Experience
3. Coordination Evaluation
Evaluate cooperative effectiveness:
Communication Quality
Conflict Resolution
Resource Management
Multi-agent systems introduce added complexity:
Complexity Management: Evaluate agents individually, in pairs, and system-wide.
Emergent Behaviors: Monitor for collective intelligence and unexpected interactions.
Evaluation Granularity:
Agent-level: Individual performance
Interaction-level: Agent interactions
System-level: Overall performance
User-level: End-user experience
Performance Metrics: Latency, throughput, scalability, reliability, operational cost
Adapt single-agent evaluation methods like tool-calling evaluations and planning assessments.
See our guide on agent evals and use our pre-built evals that you can leverage in Phoenix.
Focus evaluations on coordination efficiency, overall system efficiency, and emergent behaviors.
See our docs for creating your own custom evals in Phoenix.
Structure evaluations to match architecture:
Bottom-Up: From individual agents upward.
Top-Down: From system goals downward.
Hybrid: Combination for comprehensive coverage.
from phoenix.evals import download_benchmark_dataset
df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
df.head()
MY_CUSTOM_TEMPLATE = '''
You are evaluating the positivity or negativity of the responses to questions.
[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {response}
[END DATA]
Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative"
'''
model = OpenAIModel(model_name="gpt-4", temperature=0.6)
rails = ["positive", "negative"]  # constrain the output to the labels named in the template
positive_eval = llm_classify(
    dataframe=df,
    template=MY_CUSTOM_TEMPLATE,
    model=model,
    rails=rails,
)
#Phoenix Evals support using either strings or objects as templates
MY_CUSTOM_TEMPLATE = " ..."
MY_CUSTOM_TEMPLATE = PromptTemplate("This is a test {prompt}")
Prompts often refer to how you "prompt" an LLM, e.g., the "text" that you send to a model like OpenAI's gpt-4. Within Phoenix we expand this definition to include everything that's needed to prompt:
The prompt template of the messages to send to a completion endpoint
The invocation parameters (temperature, frequency penalty, etc.)
The tools made accessible to the LLM (e.g. weather API)
The response format (sometimes called the output schema), used when you have JSON mode enabled.
This expanded definition of a prompt lets you more deterministically invoke LLMs with confidence as everything is snapshotted for you to use within your application.
Although the terms prompt and prompt template get used interchangeably, it's important to know the difference.
Prompts refer to the message(s) that are passed into the language model.
Prompt Templates refer to a way of formatting information to get the prompt to hold the information you want (such as context and examples). Prompt templates can include placeholders (variables) for things such as examples (e.g. few-shot), outside context (RAG), or any other external data that is needed.
Every time you save a prompt within Phoenix, a snapshot of the prompt is saved as a prompt version. Phoenix does this so that you not only can view the changes to a prompt over time but also so that you can build confidence about a specific prompt version before using it within your application. With every prompt version phoenix tracks the author of the prompt and the date at which the version was saved.
Similar to the way in which you can track changes to your code via git shas, Phoenix tracks each change to your prompt with a prompt_id
.
Imagine you're working on an AI project, and you want to label specific versions of your prompts so you can control when and where they get deployed. This is where prompt version tags come in.
A prompt version tag is like a sticky note you put on a specific version of your prompt to mark it as important. Once tagged, that version won’t change, making it easy to reference later.
When building applications, different environments are often used for different stages of readiness before going live, for example:
Development – Where new features are built.
Staging – Where testing happens.
Production – The live system that users interact with.
Tagging prompt versions with environment tags can enable building, testing, and deploying prompts in the same way as an application—ensuring that prompt changes can be systematically tested and deployed.
In addition to environment tags, custom Git tags allow teams to label code versions in a way that fits their specific workflow (`v0.0.1`
). These tags can be used to signal different stages of deployment, feature readiness, or any other meaningful status.
Prompt version tags work exactly the same way as git tags.
Prompts can be formatted to include any attributes from spans or datasets. These attributes can be added as F-Strings or using Mustache formatting.
F-strings should be formatted with single {
s:
{question}
{% hint style="info" %} To escape a {
when using F-string, add a second {
in front of it, e.g., {{escaped}} {not-escaped}. Escaping variables will remove them from inputs in the Playground. {% endhint %}
Mustache should be formatted with double {{
s:
{{question}}
{% hint style="info" %} We recommend using Mustache where possible, since it supports nested attributes, e.g. attributes.input.value
, more seamlessly {% endhint %}
Tools allow LLMs to interact with the external environment. This can allow LLMs to interface with your application in more controlled ways. Given a prompt and some tools to choose from, an LLM may choose to use some (or none) of the tools. Many LLM APIs also expose a tool choice parameter, which allows you to constrain how and which tools are selected.
Here is an example of what a tool would look like for the weather API using OpenAI.
{
  "type": "function",
  "function": {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        }
      },
      "required": ["location"]
    }
  }
}
Some LLMs support structured responses, known as response format or output schema, allowing you to specify an exact schema for the model’s output.
Structured Outputs ensure the model consistently generates responses that adhere to a defined JSON Schema, preventing issues like missing keys or invalid values.
Reliable type-safety: Eliminates the need to validate or retry incorrectly formatted responses.
Explicit refusals: Enables programmatic detection of safety-based refusals.
Simpler prompting: Reduces reliance on strongly worded prompts for consistent formatting.
For more details, check out this OpenAI guide.
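For illustration, a minimal sketch of requesting a structured response with the OpenAI Python SDK's json_schema response format; the model name, prompt, and schema are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract the city and date from: 'Meet me in Paris on June 5th.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
                "required": ["city", "date"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # JSON string matching the schema
```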
Use Phoenix to trace and evaluate agent frameworks built using LangGraph
This guide explains key LangGraph concepts, discusses design considerations, and walks through common architectural patterns like orchestrator-worker, evaluators, and routing. Each pattern includes a brief explanation and links to runnable Python notebooks.
LangGraph allows you to build LLM-powered applications using a graph of steps (called "nodes") and data (called "state"). Here's what you need to know to understand and customize LangGraph workflows:
A TypedDict
that stores all information passed between nodes. Think of it as the memory of your workflow. Each node can read from and write to the state.
Nodes are units of computation. Most often these are functions that accept a State
input and return a partial update to it. Nodes can do anything: call LLMs, trigger tools, perform calculations, or prompt users.
Directed connections that define the order in which nodes are called. LangGraph supports linear, conditional, and cyclical edges, which allows for building loops, branches, and recovery flows.
A Python function that examines the current state and returns the name of the next node to call. This allows your application to respond dynamically to LLM outputs, tool results, or even human input.
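A minimal sketch tying these pieces together (state, nodes, edges, and a conditional router) is shown below; the keyword-based classifier and node bodies are placeholders for real LLM calls.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    category: str
    answer: str

def classify(state: State) -> dict:
    # In practice this would call an LLM; a keyword check keeps the sketch short
    category = "billing" if "invoice" in state["question"].lower() else "general"
    return {"category": category}

def billing_node(state: State) -> dict:
    return {"answer": f"[billing agent] handling: {state['question']}"}

def general_node(state: State) -> dict:
    return {"answer": f"[general agent] handling: {state['question']}"}

def route(state: State) -> str:
    """Conditional edge: return the name of the next node based on state."""
    return "billing" if state["category"] == "billing" else "general"

builder = StateGraph(State)
builder.add_node("classify", classify)
builder.add_node("billing", billing_node)
builder.add_node("general", general_node)
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", route, {"billing": "billing", "general": "general"})
builder.add_edge("billing", END)
builder.add_edge("general", END)

graph = builder.compile()
print(graph.invoke({"question": "Where is my invoice?"})["answer"])
```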
A way to dynamically launch multiple workers (nodes or subgraphs) in parallel, each with their own state. Often used in orchestrator-worker patterns where the orchestrator doesn't know how many tasks there will be ahead of time.
LangGraph enables complex multi-agent orchestration using a Supervisor node that decides how to delegate tasks among a team of agents. Each agent can have its own tools, prompt structure, and output format. The Supervisor coordinates routing, manages retries, and ensures loop control.
LangGraph supports built-in persistence using checkpointing. Each execution step saves state to a database (in-memory, SQLite, or Postgres). This allows for:
Multi-turn conversations (memory)
Rewinding to past checkpoints (time travel)
Human-in-the-loop workflows (pause + resume)
LangGraph improves on LangChain by supporting more flexible and complex workflows. Here’s what to keep in mind when designing:
A linear sequence of prompt steps, where the output of one becomes the input to the next. This workflow is optimal when the task can be simply broken down into concrete subtasks.
Use case: Multistep reasoning, query rewriting, or building up answers gradually.
📓
Runs multiple LLMs in parallel — either by splitting tasks (sectioning) or getting multiple opinions (voting).
Use case: Combining diverse outputs, evaluating models from different angles, or running safety checks.
With the Send
API, LangGraph lets you:
Launch multiple safety evaluators in parallel
Compare multiple generated hypotheses side-by-side
Run multi-agent voting workflows
This improves reliability and reduces bottlenecks in linear pipelines.
📓
Routes an input to the most appropriate follow-up node based on its type or intent.
Use case: Customer support bots, intent classification, or model selection.
LangGraph routers enable domain-specific delegation — e.g., classify an incoming query as "billing", "technical support", or "FAQ", and send it to a specialized sub-agent. Each route can have its own tools, memory, and context. Use structured output with a routing schema to make classification more reliable.
📓
One LLM generates content, another LLM evaluates it, and the loop repeats until the evaluation passes. LangGraph allows feedback to modify the state, making each round better than the last.
Use case: Improving code, jokes, summaries, or any generative output with measurable quality.
📓
An orchestrator node dynamically plans subtasks and delegates each to a worker LLM. Results are then combined into a final output.
Use case: Writing research papers, refactoring code, or composing modular documents.
LangGraph’s Send
API lets the orchestrator fork off tasks (e.g., subsections of a paper) and gather them into completed_sections
. This is especially useful when the number of subtasks isn’t known in advance.
You can also incorporate agents like PDF_Reader
or a WebSearcher
, and the orchestrator can choose when to route to these workers.
⚠️ Caution: Feedback loops or improper edge handling can cause workers to echo each other or create infinite loops. Use strict conditional routing to avoid this.
📓
SmolAgents is a lightweight Python library for composing tool-using, task-oriented agents. This guide outlines common agent workflows we've implemented—covering routing, evaluation loops, task orchestration, and parallel execution. For each pattern, we include an overview, a reference notebook, and guidance on how to evaluate agent quality.
While the API is minimal—centered on Agent
, Task
, and Tool
—there are important tradeoffs and design constraints to be aware of.
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
Notebook: The agent first extracts keywords from a resume, then summarizes what those keywords suggest.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do summaries reflect the extracted keywords?)
Check if the intermediate step (e.g. keyword extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated agent, often using lightweight classification.
Notebook: The agent classifies candidate profiles into Software, Product, or Design categories, then hands them off to the appropriate evaluation pipeline.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right department for a given candidate?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors (e.g., ambiguous or mixed-signal profiles)
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
Notebook: An agent writes a candidate rejection email. If the evaluator agent finds the tone or feedback lacking, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the message respectful, clear, and specific?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
Notebook: The orchestrator delegates resume review, culture fit assessment, and decision-making to different agents, then composes a final recommendation.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final recommendation align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components (e.g., technical fit vs. recommendation)
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
Notebook:
Candidate reviews are distributed using asyncio
, enabling faster batch processing without compromising output quality.
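A minimal sketch of this pattern with asyncio; review_candidate is a hypothetical stand-in for an agent call (a synchronous agent can be wrapped with asyncio.to_thread).

```python
import asyncio

async def review_candidate(profile: str) -> str:
    # Hypothetical stand-in for an agent call; a synchronous agent could be
    # invoked with `await asyncio.to_thread(agent.run, profile)`
    await asyncio.sleep(0)  # placeholder for real work
    return f"review of {profile}"

async def review_all(profiles: list[str]) -> list[str]:
    # Launch all reviews concurrently and gather results in input order
    return await asyncio.gather(*(review_candidate(p) for p in profiles))

results = asyncio.run(review_all(["profile_a", "profile_b", "profile_c"]))
print(results)
```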
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are profiles processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
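A minimal sketch of this fan-out, assuming a hypothetical review_candidate coroutine that wraps a single agent call:

```python
import asyncio

async def review_candidate(profile: str) -> str:
    # Stand-in for an LLM/agent call; replace with your real review logic.
    await asyncio.sleep(0.1)
    return f"review of {profile}"

async def review_all(profiles: list[str]) -> list[str]:
    # gather preserves input order, so results line up with the input profiles.
    return await asyncio.gather(*(review_candidate(p) for p in profiles))

results = asyncio.run(review_all(["profile_a", "profile_b", "profile_c"]))
print(results)
```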
To log traces, you must instrument your application either manually or automatically. To log to a remote instance of Phoenix, you must also configure the host and port where your traces will be sent.
When running Phoenix locally on the default port of 6006, no additional configuration is necessary.
If you are running a remote instance of Phoenix, you can configure your instrumentation to log to that instance using the PHOENIX_HOST and PHOENIX_PORT environment variables. Alternatively, you can use the PHOENIX_COLLECTOR_ENDPOINT environment variable.
Tracing can be paused temporarily or disabled permanently.
Pause tracing using context manager
If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.
Uninstrument the auto-instrumentors permanently
Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.
To get token counts when streaming, install openai>=1.26 and set stream_options={"include_usage": True} when calling create. Below is an example Python code snippet.
If you have customized a LangChain component (say a retriever), you might not get tracing for that component without some additional steps. Internally, instrumentation relies on components inheriting from LangChain base classes for their traces to show up. Below is an example of how to inherit from LangChain base classes in a custom retriever so that its traces appear.
Phoenix offers key modules to measure the quality of generated results as well as modules to measure retrieval quality.
Response Evaluation: Does the response match the retrieved context? Does it also match the query?
Retrieval Evaluation: Are the retrieved sources relevant to the query?
Evaluation of generated results can be challenging. Unlike traditional ML, the predicted results are not numeric or categorical, making it hard to define quantitative metrics for this problem.
Phoenix offers LLM Evals, a module designed to measure the quality of results. This module uses a "gold" LLM (e.g. GPT-4) to decide whether the generated answer is correct in a variety of ways. Note that many of these evaluation criteria DO NOT require ground-truth labels. Evaluation can be done simply with a combination of the input (query), output (response), and context.
LLM Evals supports the following response evaluation criteria:
QA Correctness - Whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.
Hallucinations - Designed to detect LLM hallucinations relative to retrieved context
Toxicity - Identify if the AI response is racist, biased, or toxic
Response evaluations are a critical first step to figuring out whether your LLM App is running correctly. Response evaluations can pinpoint specific executions (a.k.a. traces) that are performing badly and can be aggregated up so that you can track how your application is running as a whole.
Phoenix also provides evaluation of retrieval independently.
The concept of retrieval evaluation is not new; given a set of relevance scores for a set of retrieved documents, we can evaluate retrievers using retrieval metrics like precision, NDCG, hit rate, and more.
LLM Evals supports the following retrieval evaluation criteria:
Relevance - Evaluates whether a retrieved document chunk contains an answer to the query.
Retrieval is possibly the most important step in any LLM application as poor and/or incorrect retrieval can be the cause of bad response generation. If your application uses RAG to power an LLM, retrieval evals can help you identify the cause of hallucinations and incorrect answers.
With Phoenix's LLM Evals, evaluation results (or just Evaluations for short) are data consisting of 3 main columns:
label: str [optional] - a classification label for the evaluation (e.g. "hallucinated" vs "factual"). Labels can be used to calculate percentages (e.g. percent hallucinated) and to filter down your data (e.g. Evals["Hallucinations"].label == "hallucinated").
score: number [optional] - a numeric score for the evaluation (e.g. 1 for good, 0 for bad). Scores are a great way to sort your data to surface poorly performing examples, and they can be used to filter your data by a threshold.
explanation: str [optional] - the reasoning for why the evaluation label or score was given. In the case of LLM evals, this is the evaluation model's reasoning. While explanations are optional, they can be extremely useful when trying to understand problematic areas of your application.
Let's take a look at an example list of Q&A relevance evaluations:
These three columns combined can drive any type of evaluation you can imagine. label provides a way to classify responses, score provides a way to assign a numeric assessment, and explanation gives you a way to get qualitative feedback.
With Phoenix, evaluations can be "attached" to the spans and documents collected. In order to facilitate this, Phoenix supports the following steps.
Querying and downloading data - query the spans collected by phoenix and materialize them into DataFrames to be used for evaluation (e.g. question and answer data, documents data).
Running Evaluations - the data queried in step 1 can be fed into LLM Evals to produce evaluation results.
Logging Evaluations - the evaluations performed in the above step can be logged back to Phoenix to be attached to spans and documents for evaluating responses and retrieval. See here on how to log evaluations to Phoenix.
Sorting and Filtering by Evaluation - once the evaluations have been logged back to Phoenix, the spans become instantly sortable and filterable by the evaluation values that you attached to the spans. (An example of an evaluation filter would be Eval["hallucination"].label == "hallucinated".)
By following the above steps, you will have a full end-to-end flow for troubleshooting, evaluating, and root-causing an LLM application. By using LLM Evals in conjunction with Traces, you will be able to surface up problematic queries, get an explanation as to why the generation is problematic (e.g. hallucinated because ...), and be able to identify which step of your generative app requires improvement (e.g. did the LLM hallucinate or was the LLM fed bad context?).
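To make that flow concrete, here is a hedged sketch of the query, evaluate, and log-back steps using the Phoenix client and its evals helpers. The filter string, judge model, and column expectations depend on your Phoenix version and instrumentation, so treat this as a starting point rather than a drop-in recipe.

```python
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations

# 1. Query spans collected by Phoenix into a DataFrame (LLM spans only here).
spans_df = px.Client().get_spans_dataframe("span_kind == 'LLM'")
# Depending on your app, you may need to rename columns to match the template's
# variables (input / output / reference) before running the eval.

# 2. Run an LLM eval over the queried data (hallucination check in this sketch).
evals_df = llm_classify(
    dataframe=spans_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),  # example judge model
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# 3. Log the results back so spans become sortable and filterable by this eval.
px.Client().log_evaluations(SpanEvaluations(eval_name="Hallucinations", dataframe=evals_df))
```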
For a full tutorial on LLM Ops, check out our tutorial below.
Possibly the most common use-case for creating an LLM application is to connect an LLM to proprietary data such as enterprise documents or video transcriptions. Applications like these are often built on top of LLM frameworks such as LangChain or LlamaIndex, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot that looks across a company's knowledge base/context to answer specific questions.
There are varying degrees of how we can evaluate retrieval systems.
Step 1: First we care if the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?
Step 2: Once we know there's an issue, then we need metrics to trace where specifically did it go wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?
Step 3: If retrieval is not the issue, we should check if we even have the right documents to answer the question.
Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.
When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.
Phoenix surfaces up clusters of similar queries that have poor feedback.
Phoenix can help uncover when irrelevant context is being retrieved using the LLM Evals for Relevance. You can look at a cluster's aggregate relevance metric with precision @k, NDCG, MRR, etc to identify where to improve. You can also look at a single prompt/response pair and see the relevance of documents.
Phoenix can help you identify if there is context that is missing from your knowledge base. By visualizing query density, you can understand what topics you need to add additional documentation for in order to improve your chatbot's responses.
By setting the "primary" dataset as the user queries, and the "corpus" dataset as the context I have in my vector store, I can see if there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster.
Looking for code to get started? Go to our Quickstart guide for Search and Retrieval.
Cyclic workflows: LangGraph supports loops, retries, and iterative workflows that would be cumbersome in LangChain.
Debugging complexity: Deep graphs and multi-agent networks can be difficult to trace. Use Arize AX or Phoenix!
Fine-grained control: Customize prompts, tools, state updates, and edge logic for each node.
Token bloat: Cycles and retries can accumulate state and inflate token usage.
Visualize: Graph visualization makes it easier to follow logic flows and complex routing.
Requires upfront design: Graphs must be statically defined before execution. No dynamic graph construction mid-run.
Supports multi-agent coordination: Easily create agent networks with Supervisor and worker roles.
Supervisor misrouting: If not carefully designed, supervisors may loop unnecessarily or reroute outputs to the wrong agent.
from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Code running inside this block doesn't generate traces.
    # For example, running LLM evals here won't generate additional traces.
    ...
# Tracing will resume outside the block.
...
LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
# etc.
import openai

response = openai.OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku."}],
    max_tokens=20,
    stream=True,
    stream_options={"include_usage": True},
)
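When include_usage is set, the usage figures arrive on the final chunk of the stream (earlier chunks report usage as None), so a consumer loop along these lines can print both the text and the token counts:

```python
for chunk in response:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage:  # only populated on the last chunk
        print("\ntotal tokens:", chunk.usage.total_tokens)
```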
from typing import List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.retrievers import BaseRetriever, Document
from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
PHOENIX_COLLECTOR_ENDPOINT = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
trace_api.set_tracer_provider(tracer_provider)
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint=PHOENIX_COLLECTOR_ENDPOINT)))
LangChainInstrumentor().instrument()
class CustomRetriever(BaseRetriever):
    """
    This example is taken from the langchain docs.
    https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/custom_retriever/

    A custom retriever that contains the top k documents that contain the user query.

    This retriever only implements the sync method _get_relevant_documents.
    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.
    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementation for the retriever."""
        matching_documents: List[Document] = []
        # Custom logic to find the top k documents that contain the query
        for index in range(self.k):
            matching_documents.append(
                Document(page_content=f"dummy content at {index}", metadata={"score": 1.0})
            )
        return matching_documents


retriever = CustomRetriever(k=3)

if __name__ == "__main__":
    documents = retriever.invoke("what is the meaning of life?")
import phoenix as px
from phoenix.trace import LangChainInstrumentor
px.launch_app()
LangChainInstrumentor().instrument()
# run your LangChain application
import os
from phoenix.trace import LangChainInstrumentor
# assume phoenix is running at 162.159.135.42:6007
os.environ["PHOENIX_HOST"] = "162.159.135.42"
os.environ["PHOENIX_PORT"] = "6007"
LangChainInstrumentor().instrument() # logs to http://162.159.135.42:6007
# run your LangChain application
import os
from phoenix.trace import LangChainInstrumentor
# assume phoenix is running at 162.159.135.42:6007
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "162.159.135.42:6007"
LangChainInstrumentor().instrument() # logs to http://162.159.135.42:6007
# run your LangChain application
Datasets are integral to evaluation and experimentation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Each example within a dataset represents a single data point, consisting of an inputs dictionary, an optional output dictionary, and an optional metadata dictionary. The optional output dictionary often contains the expected LLM application output for the given input.
Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are then used to run experiments and evaluations to track improvements.
Use datasets to:
Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
Capture generations to assess quality manually or using LLM-graded evals
Store user reviewed generations to find new test cases
With Phoenix, datasets are:
Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use-cases.
Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
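As a minimal sketch of creating a manually curated dataset (method and argument names follow the current Phoenix Python client and may differ slightly across versions):

```python
import pandas as pd
import phoenix as px

examples = pd.DataFrame(
    {
        "question": ["What is Paul Graham known for?"],
        "answer": ["Co-founding Y Combinator and his essays on startups."],
    }
)

# Upload the dataframe as a versioned dataset; input/output keys map columns to example fields.
px.Client().upload_dataset(
    dataset_name="golden-qa",
    dataframe=examples,
    input_keys=["question"],
    output_keys=["answer"],
)
```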
There are various ways to get started with datasets:
Manually Curated Examples
This is how we recommend you start. From building your application, you probably have an idea of what types of inputs you expect your application to be able to handle, and what "good" responses look like. You probably want to cover a few different common edge cases or situations you can imagine. Even 20 high quality, manually curated examples can go a long way.
Historical Logs
Once you ship an application, you start gleaning valuable information: how users are actually using it. This information can be valuable to capture and store in datasets. This allows you to test against specific use cases as you iterate on your application.
If your application is going well, you will likely get a lot of usage. How can you determine which datapoints are valuable to add? There are a few heuristics you can follow. If possible, try to collect end user feedback. You can then see which datapoints got negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future. You can also use other heuristics to identify interesting datapoints - for example, runs that took a long time to complete could be interesting to analyze and add to a dataset.
Synthetic Data
Once you have a few examples, you can try to artificially generate examples to get a lot of datapoints quickly. It's generally advised to have a few good handcrafted examples before this step, as the synthetic data will often resemble the source examples in some way.
While Phoenix doesn't have formal dataset types, conceptually a dataset can contain:
Key-Value Pairs:
Inputs and outputs are arbitrary key-value pairs.
This dataset type is ideal for evaluating prompts, functions, and agents that require multiple inputs or generate multiple outputs.
If you have a RAG prompt template such as:
Given the context information and not prior knowledge, answer the query.
---------------------
{context}
---------------------
Query: {query}
Answer:
Your dataset might look like:
{
  "query": "What is Paul Graham known for?",
  "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for..."
}
{
  "answer": "Paul Graham is known for co-founding Y Combinator, for his writing, and for his work on the Lisp programming language."
}
LLM inputs and outputs: Simply capture the input and output as a single string to test the completion of an LLM.
The "inputs" dictionary contains a single "input" key mapped to the prompt string.
The "outputs" dictionary contains a single "output" key mapped to the corresponding response string.
{
  "input": "do you have to have two license plates in ontario"
}
{
  "output": "true"
}
{
  "input": "are black beans the same as turtle beans"
}
{
  "output": "true"
}
Messages or chat:
This type of dataset is designed for evaluating LLM structured messages as inputs and outputs.
The "inputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.
The "outputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.
This type of data is useful for evaluating conversational AI systems or chatbots.
{ "messages": [{ "role": "system", "content": "You are an expert SQL..."}] }
{ "messages": [{ "role": "assistant", "content": "select * from users"}] }
{ "messages": [{ "role": "system", "content": "You are a helpful..."}] }
{ "messages": [{ "role": "assistant", "content": "I don't know the answer to that"}] }
Depending on the contents of a given dataset, you might consider it to be of a certain type.
A dataset that contains inputs and the ideal "golden" outputs is often referred to as a Golden Dataset. These datasets are hand-labeled and are used to evaluate the performance of LLMs or prompt templates. A golden dataset could look something like this:
Paris is the capital of France | True
Canada borders the United States | True
The native language of Japan is English | False
API centered on Agent, Task, and Tool
Tools are just Python functions decorated with @tool. There's no centralized registry or schema enforcement, so developers must define conventions and structure on their own.
Provides flexibility for orchestration
No retry mechanism or built-in workflow engine
Supports evaluator-optimizer loops, routing, and fan-out/fan-in
Agents are composed, not built-in abstractions
Must implement orchestration logic
Multi-Agent support
No built-in support for collaboration structures like voting, planning, or debate.
Token-level streaming is not supported
No state or memory management out of the box. Applications that require persistent state—such as conversations or multi-turn workflows—will need to integrate external storage (e.g., a vector database or key-value store).
There’s no native memory or “trajectory” tracking between agents. Handoffs between tasks are manual. This is workable in small systems, but may require structure in more complex workflows.
label | explanation | score
correct | The reference text explains that YC was not or... | 1
correct | To determine if the answer is correct, we need... | 1
incorrect | To determine if the answer is correct, we must... | 0
correct | To determine if the answer is correct, we need... | 1
Question | How to measure | Pro | Con
Is this a bad response to the answer? | User feedback or LLM Eval for Q&A | Most relevant way to measure application | Hard to trace down specifically what to fix
Is the retrieved context relevant? | LLM Eval for Relevance | Directly measures effectiveness of retrieval | Requires additional LLM calls
Is the knowledge base missing areas of user queries? | Query density (drift) - Phoenix generated | Highlights groups of queries with large distance from context | Identifies broad topics missing from knowledge base, but not small gaps
Depending on what you want to do with your annotations, you may want to configure a rubric for what your annotation represents - e.g. is it a category, a number within a range (continuous), or freeform text.
Annotation type:
- Categorical: predefined labels for selection (e.g. 👍 or 👎)
- Continuous: a score across a specified range (e.g. confidence score 0-100)
- Freeform: open-ended text comments (e.g. "correct")
Optimization direction, based on your goal:
- Maximize: higher scores are better (e.g. confidence)
- Minimize: lower scores are better (e.g. hallucinations)
- None: direction optimization does not apply (e.g. tone)
Phoenix supports annotating different annotation targets to capture different levels of LLM application performance. The core annotation types include:
Span Annotations: Applied to individual spans within a trace, providing granular feedback about specific components
Document Annotations: Specifically for retrieval systems, evaluating individual documents with metrics like relevance and precision
Each annotation can include:
Labels: Text-based classifications (e.g., "helpful" or "not helpful")
Scores: Numeric evaluations (e.g., 0-1 scale for relevance)
Explanations: Detailed justifications for the annotation
These annotations can come from different sources:
Human feedback (e.g., thumbs up/down from end-users)
LLM-as-a-judge evaluations (automated assessments)
Code-based evaluations (programmatic metrics)
Phoenix also supports specialized evaluation metrics for retrieval systems, including NDCG, Precision@K, and Hit Rate, making it particularly useful for evaluating search and retrieval components of LLM applications.
Human feedback allows you to understand how your users are experiencing your application and helps draw attention to problematic traces. Phoenix makes it easy to collect feedback for traces and view it in the context of the trace, as well as filter all your traces based on the feedback annotations you send. Before anything else, you want to know if your users or customers are happy with your product. This can be as straightforward as adding 👍 👎 buttons to your application, and logging the result as annotations.
For more information on how to wire up your application to collect feedback from your users, see .
When you have large amounts of data, it can be immensely efficient and valuable to leverage LLM judges via evals to produce labels and scores to annotate your traces with. Phoenix's evals library, as well as other third-party eval libraries, can be leveraged to annotate your spans with evaluations. For details, see how to:
Generate evaluation results
Add evaluation results to spans
Sometimes you need to rely on human annotators to attach feedback to specific traces of your application. Human annotations through the UI can be thought of as manual quality assurance. While it can be a bit more labor intensive, it can help in sharing insights within a team, curating datasets of good/bad examples, and even in training an LLM judge.
Annotations can help you share valuable insight about how your application is performing. However, making these insights actionable can be difficult. With Phoenix, the annotations you add to your trace data are propagated to datasets so that you can use the annotations during experimentation.
Since Phoenix datasets preserve the annotations, you can track whether or not changes to your application (e.g. experimentation) produce better results (e.g. better scores / labels). Phoenix evaluators have access to the example metadata at evaluation time, making it possible to track improvements / regressions over previous generations (e.g. the previous annotations).
AI development currently faces challenges when evaluating LLM application outputs at scale:
Human annotation is precise but time-consuming and impossible to scale efficiently.
Existing automated methods using LLM judges require careful prompt engineering and often fall short of capturing human evaluation nuances.
Solutions requiring extensive human resources are difficult to scale and manage.
These challenges create a bottleneck in the rapid development and improvement of high-quality LLM applications.
Since Phoenix datasets preserve the annotations in the example metadata, you can use datasets to build human-preference calibrated judges using libraries and tools such as DSPy and Zenbase.
Phoenix supports three types of annotators: Human, LLM, and Code.
Annotator Kind | Source | Purpose | Strengths | Use Case
Human | Manual review | Expert judgment and quality assurance | High accuracy, nuanced understanding | Manual QA, edge cases, subjective evaluation
LLM | Language model output | Scalable evaluation of application responses | Fast, scalable, consistent across examples | Large-scale output scoring, pattern review
Code | Programmatic evaluators | Automated assessment based on rules/metrics | Objective, repeatable, useful in experiments | Model benchmarking, regression testing
Phoenix provides two interfaces for annotations: API and APP. The API interface via the REST clients enables automated feedback collection at scale, such as collecting thumbs up/down from end-users in production, providing real-time insights into LLM system performance. The APP interface via the UI offers an efficient workflow for human annotators with hotkey support and structured configurations, making it practical to create high-quality training sets for LLMs.
The combination of these interfaces creates a powerful feedback loop: human annotations through the APP interface help train and calibrate LLM evaluators, which can then be deployed at scale via the API. This cycle of human oversight and automated evaluation helps identify the most valuable examples for review while maintaining quality at scale.
Annotation configurations in Phoenix are designed to maximize efficiency for human annotators. The system allows you to define the structure of annotations (categorical or continuous values, with appropriate bounds and options) and pair these with keyboard shortcuts (hotkeys) to enable rapid annotation.
For example, a categorical annotation might be configured with specific labels that can be quickly assigned using number keys, while a continuous annotation might use arrow keys for fine-grained scoring. This combination of structured configurations and hotkey support allows annotators to provide feedback quickly, significantly reducing the effort required for manual annotation tasks.
The primary goal is to streamline the annotation workflow, enabling human annotators to process large volumes of data efficiently while maintaining quality and consistency in their feedback.
An independent deep‑dive into the science of giving language‑model agents the “right mind at the right moment.”
Context engineering is the practice of deciding exactly what information a large language model (LLM)—or a group of LLM agents—should see when doing a task. This includes what data is shown, how it’s organized, and how it’s framed.
We break context into four main parts:
Information (I): Facts, documents, or intermediate results passed into the model.
State (S): What the model needs to know about the current session—like the conversation so far or the structure of a task.
Tools (T): External systems the model can access, like APIs or data sources.
Format (F): How everything is wrapped—prompt templates, instructions, or response formats.
By treating these pieces like we treat code—versioning, testing, measuring, and improving them—we can make LLM outputs more predictable and reliable across use cases.
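One way to picture this, as a purely illustrative sketch (the class and field names are ours, not a standard API), is to bundle the four parts into a single versionable object:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    information: list[str]                           # I: facts, documents, intermediate results
    state: dict                                      # S: session state, e.g. a conversation summary
    tools: list[dict] = field(default_factory=list)  # T: schemas of tools the model may call
    format: str = "markdown"                         # F: how the bundle is rendered into the prompt

    def render(self) -> str:
        # Deterministic rendering makes bundles easy to snapshot, diff, and test.
        facts = "\n".join(f"- {fact}" for fact in self.information)
        return f"## State\n{self.state}\n\n## Facts\n{facts}\n\n## Tools\n{self.tools}"
```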
Large models perform well in single-shot tasks, but real-world systems often use agents that delegate, call APIs, and persist across time. In these long-running setups, common failure modes include:
Context drift: Agents develop conflicting views of the truth.
Bandwidth overload: Passing full histories strains context limits and slows responses.
Tool blindness: Agents get raw data but lack guidance on how to use it.
In practice, stale or inconsistent context is the leading cause of coordination failures. Even basic memory update strategies can significantly alter agent behavior over time—highlighting the need for deliberate memory management.
Prompt engineering and context engineering are related but distinct disciplines. Both shape how language models behave—but they operate at different levels of abstraction.
Prompt engineering focuses on the how: crafting the right wording, tone, and examples to guide the model’s behavior in a single interaction. It’s about writing the best possible "function call" to the model.
Context engineering, by contrast, governs the what, when, and how of the information the model observes. It spans entire workflows, manages memory across turns, and ensures the model has access to relevant tools and schemas. If prompt engineering is writing a clean function call, context engineering is architecting the full service contract—including interfaces, dependencies, and state management.
Context engineering becomes essential when systems move from isolated prompts to persistent agents and long-running applications. It enables scalable coordination, memory, and interaction across tasks—turning a language model from a tool into part of a system.
As systems grow beyond one-off prompts and into long-running workflows, context becomes a key engineering surface. These principles guide how to design and manage context for LLMs and agent-based systems.
Each principle pushes context design toward systems that are leaner, more interpretable, and better aligned with both model behavior and downstream actions.
Systems that rely on long or complex context need well-designed memory. The patterns below offer practical ways to manage context, depending on how much information your system handles and how long it needs to remember it.
Three-tier memory breaks context into three layers: short-term (exact text), mid-term (summaries), and long-term (titles or embeddings). This makes it easier to keep recent details while still remembering important older information. It’s a good fit for chats or agents that run over many turns. Hierarchical Memory Transformers (HMT) follow this design.
Recurrent compression buffers take earlier parts of a stream—like a transcript or log—and compress them into smaller representations that can be brought back later if needed. This saves space while keeping the option to recall details when relevant.
State-space backbones store memory outside the prompt using a hidden state that carries over between turns. This lets the model handle much longer sequences efficiently. It’s especially useful in devices with tight memory or speed limits, like mobile or edge systems. Mamba is one example of this pattern.
Context cache and KV-sharing spread memory across different servers by saving reusable attention patterns. This avoids repeating work and keeps prompts small, making it a strong choice for systems running many requests in parallel. MemServe uses this technique.
Hybrid retrieval combines two steps: first, it filters data using keywords or metadata; then it uses vector search for meaning. This cuts down on irrelevant results, especially in datasets with lots of similar content.
Graph-of-thought memory turns ideas into a graph, where entities and their relationships are nodes and edges. Instead of sending the whole graph to the model, only the relevant part is used. This works well for complex tasks like analysis or knowledge reasoning and is often built with tools like Neo4j or TigerGraph.
Each of these patterns offers a different way to scale memory and context depending on the problem. They help systems stay efficient, accurate, and responsive as context grows.
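To make the three-tier idea tangible, here is a rough sketch (not a reference implementation; summarize and embed are hypothetical callables you would supply):

```python
from collections import deque

class ThreeTierMemory:
    def __init__(self, summarize, embed, short_window: int = 8, mid_capacity: int = 20):
        self.short = deque(maxlen=short_window)  # short-term: exact recent turns
        self.mid: list[str] = []                 # mid-term: rolling summaries
        self.long: list[list[float]] = []        # long-term: embeddings of old summaries
        self.summarize, self.embed = summarize, embed
        self.mid_capacity = mid_capacity

    def add(self, turn: str) -> None:
        if len(self.short) == self.short.maxlen:
            # The oldest exact turn is about to be evicted; keep a summary of it.
            self.mid.append(self.summarize(self.short[0]))
        self.short.append(turn)
        if len(self.mid) > self.mid_capacity:
            # Push the oldest summary down to long-term storage as an embedding.
            self.long.append(self.embed(self.mid.pop(0)))
```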
Log every prompt and context segment. Track exactly what the model sees at each step.
Label each span. Mark whether it was used, ignored, hallucinated, or contributed to the final output.
Measure return on input (ROI). For each span, calculate: ROI = impact on accuracy ÷ token cost.
Trim low-value spans. Drop spans with low ROI. Keep references (pointers) in case retrieval is needed later.
Train a salience model. Predict which spans should be included in context automatically, based on past usefulness.
Test with adversarial context. Shuffle inputs or omit key details to probe model robustness and dependency on context structure.
Run regression evaluations. Repeatedly test the system across agent roles and tasks to catch context-related drift or failures.
Version and diff context bundles. Treat context like code—snapshot, compare, and review changes before release.
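A small sketch of the ROI and trimming steps above, with made-up numbers and hypothetical span records; the impact estimates would come from your own evals:

```python
spans = [
    {"id": "doc-123", "tokens": 900, "impact": 0.02},    # long document chunk, little lift
    {"id": "summary-7", "tokens": 120, "impact": 0.15},  # short summary, large lift
]

for span in spans:
    # Higher is better: accuracy impact gained per token spent.
    span["roi"] = span["impact"] / max(span["tokens"], 1)

THRESHOLD = 1e-4
kept = [s for s in spans if s["roi"] >= THRESHOLD]
# Trimmed spans are dropped from context but kept as pointers for later retrieval.
trimmed = [{"pointer": s["id"]} for s in spans if s["roi"] < THRESHOLD]
```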
Multi-agent systems are powerful because they divide knowledge and responsibility across roles. But that same structure becomes fragile when context is outdated, overloaded, or misaligned.
Context engineering turns prompting from trial-and-error into system design. It ensures each agent sees the right information, in the right form, at the right time.
To build reliable systems, treat context as a core artifact—not just an input. Observe it. Version it. Optimize it. With that foundation, agents stop behaving like chat interfaces and start acting like collaborators.
Dimension | Prompt engineering | Context engineering
Optimizes | Wording, tone, in-context examples | Selection, compression, memory, tool schemas
Timescale | Single request/response | Full session or workflow
Metrics | BLEU, factuality, helpfulness (per turn) | Task success vs. token cost, long-horizon consistency
Principle | Rationale | Example technique
Salience over size | More tokens don't mean more value—signal matters more than volume. | Salience scoring + reservoir sampling to retain only statistically "interesting" chunks
Structure first | Models and tools handle structured inputs more reliably than unstructured text. | Use canonical world-state objects; track changes with diff logs
Hierarchies beat flat buffers | Effective recall happens at multiple levels of detail—not in a flat sequence. | Multi-resolution memory via Hierarchical Memory Transformers
Lazy recall | Don't pay the context cost until the information is actually needed. | Use pointer IDs and on-demand retrieval (RAG)
Deterministic provenance | You can't debug what you can't trace—source tracking is critical. | Apply "git-for-thoughts" commit hashes to memory updates
Context–tool co-design | Information should be shaped for use, not just stored—tools need actionable input. | Embed tool signatures alongside payloads so the model knows how to act
A feature comparison guide for AI engineers looking for developer-friendly LangSmith alternatives.
LangSmith is another LLM Observability and Evaluation platform that serves as an alternative to Arize Phoenix. Both platforms support the baseline tracing, evaluation, prompt management, and experimentation features, but there are a few key differences to be aware of:
LangSmith is closed source, while Phoenix is open source
LangSmith is part of the broader LangChain ecosystem, though it does support applications that don’t use LangChain. Phoenix is fully framework-agnostic.
Self-hosting is a paid feature within LangSmith, vs free for Phoenix.
Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.
The first and most fundamental difference: LangSmith is closed source, while Phoenix is fully open source.
This means Phoenix users have complete control over how the platform is used, modified, and integrated. Whether you're running in a corporate environment with custom compliance requirements or you're building novel agent workflows, open-source tooling allows for a degree of flexibility and transparency that closed platforms simply can’t match.
LangSmith users, on the other hand, are dependent on a vendor roadmap and pricing model, with limited ability to inspect or modify the underlying system.
LangSmith is tightly integrated with the LangChain ecosystem, and while it technically supports non-LangChain applications, the experience is optimized for LangChain-native workflows.
Phoenix is designed from the ground up to be framework-agnostic. It supports popular orchestration tools like LangChain, LlamaIndex, CrewAI, SmolAgents, and custom agents, thanks to its OpenInference instrumentation layer. This makes Phoenix a better choice for teams exploring multiple agent/orchestration frameworks—or who simply want to avoid vendor lock-in.
If self-hosting is a requirement—for reasons ranging from data privacy to performance—Phoenix offers it out-of-the-box, for free. You can launch the entire platform with a single Docker container, no license keys or paywalls required.
LangSmith, by contrast, requires a paid plan to access self-hosting options. This can be a barrier for teams evaluating tools or early in their journey, especially those that want to maintain control over their data from day one.
Phoenix is backed by Arize AI, the leading and best-funded AI Observability provider in the ecosystem.
Arize Phoenix is intended to be a complete LLM observability solution, however for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.
The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.
Open Source: ✅
Tracing: ✅ ✅ ✅
Auto-Instrumentation: ✅ ✅
Offline Evals: ✅ ✅ ✅
Online Evals: ✅ ✅
Experimentation: ✅ ✅ ✅
Prompt Management: ✅ ✅ ✅
Prompt Playground: ✅ ✅ ✅
Run Prompts on Datasets: ✅ ✅ ✅
Built-in Evaluators: ✅ ✅ ✅
Agent Evaluations: ✅ ✅ ✅
Human Annotations: ✅ ✅ ✅
Custom Dashboards: ✅
Workspaces: ✅
Semantic Querying: ✅
Copilot Assistant: ✅
LangSmith is a strong option for teams all-in on the LangChain ecosystem and comfortable with a closed-source platform. But for those who value openness, framework flexibility, and low-friction adoption, Arize Phoenix stands out as the more accessible and extensible observability solution.
Everything you need to know about Google's GenAI framework
Google's GenAI SDK is a framework designed to help you interact with Gemini models and models run through VertexAI. Out of all the frameworks detailed in this guide, GenAI SDK is the closest to a base model SDK. While it does provide helpful functions and concepts to streamline tool calling, structured output, and passing files, it does not approach the level of abstraction of frameworks like CrewAI or Autogen.
In April 2025, Google launched its ADK framework, which is a more comparable agent orchestration framework to the others on this list.
That said, because of the relative simplicity of the GenAI SDK, this guide serves as a good learning tool to show how some of the common agent patterns can be manually implemented.
GenAI SDK uses contents to represent user messages, files, system messages, function calls, and invocation parameters. This makes for relatively simple generation calls:
file = client.files.upload(file='a11.txt')
response = client.models.generate_content(
model='gemini-2.0-flash-001',
contents=['Could you summarize this file?', file]
)
print(response.text)
Content objects can also be composed together in a list:
[
    types.UserContent(
        parts=[
            types.Part.from_text(text='What is this image about?'),
            types.Part.from_uri(
                file_uri='gs://generativeai-downloads/images/scones.jpg',
                mime_type='image/jpeg',
            ),
        ]
    )
]
Google GenAI does not include built-in orchestration patterns.
GenAI has no concept of handoffs natively.
State is handled by maintaining previous messages and other data in a list of content objects. This is similar to how other model SDKs like OpenAI's and Anthropic's handle the concept of state, and it stands in contrast to the more sophisticated state management found in agent orchestration frameworks.
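A hedged sketch of that manual state management with the google-genai SDK (the model name and helper are examples; exact Part constructors vary slightly across SDK versions):

```python
from google import genai
from google.genai import types

client = genai.Client()
history: list[types.Content] = []

def ask(question: str) -> str:
    history.append(types.UserContent(parts=[types.Part.from_text(text=question)]))
    response = client.models.generate_content(
        model="gemini-2.0-flash-001",
        contents=history,
    )
    # Append the model's reply so the next turn sees the full conversation.
    history.append(response.candidates[0].content)
    return response.text

print(ask("What is the capital of France?"))
print(ask("And roughly how many people live there?"))
```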
GenAI does include some convenience features around tool calling. Functions passed through types.GenerateContentConfig are automatically converted into tool signatures. To do this, the SDK uses the function docstring to understand the function's purpose and arguments.
def get_current_weather(location: str) -> str:
    """Returns the current weather.

    Args:
        location: The city and state, e.g. San Francisco, CA
    """
    return 'sunny'

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents='What is the weather like in Boston?',
    config=types.GenerateContentConfig(tools=[get_current_weather]),
)
print(response.text)
GenAI will also automatically call the function and incorporate its return value. This goes a step beyond what similar model SDKs do on other platforms. This behavior can be disabled.
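If you would rather receive the proposed function call than have the SDK execute it, automatic function calling can be disabled in the config. A sketch, assuming a recent google-genai version:

```python
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="What is the weather like in Boston?",
    config=types.GenerateContentConfig(
        tools=[get_current_weather],
        automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
    ),
)
# Inspect the proposed call(s) and decide how to execute them yourself.
print(response.function_calls)
```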
GenAI has no built-in concept of memory.
GenAI has no built-in collaboration strategies. These must be defined manually.
GenAI supports streaming of both text and image responses:
for chunk in client.models.generate_content_stream(
    model='gemini-2.0-flash-001', contents='Tell me a story in 300 words.'
):
    print(chunk.text, end='')
GenAI is the "simplest" framework in this guide, and is closer to a pure model SDK like the OpenAI SDK, rather than an agent framework. It does go a few steps beyond these base SDKs however, notably in tool calling. It is a good option if you're using Gemini models, and want more direct control over your agent system.
Content approach streamlines message management
No built-in orchestration capabilities
Supports automatic tool calling
No state or memory management
Allows for all agent patterns, but each must be manually set up
Primarily designed to work with Gemini models
This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
Notebook: Research Agent The agent first researches a topic, then provides an executive summary of its results, then finally recommends future focus directions.
How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do key points reflect the original research?)
Check if the intermediate step (e.g. key point extraction) is meaningful and accurate
Ensure the final output reflects or builds on the intermediate output
Compare chained vs. single-step prompting to see if chaining improves quality or structure
Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated call, often using lightweight classification.
Notebook: Simple Tool Router This agent shows a simple example of routing user inputs to different tools.
How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right tool for a given input?)
Compare routing decisions to human-labeled ground truth or expectations
Track precision/recall if framed as a classification task
Monitor for edge cases and routing errors
This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.
Notebook: Story Writing Agent An agent generates an initial draft of a story, then a critique agent decides whether the quality is high enough. If not, it asks for a revision.
How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the story engaging, clear, and well-written?)
Measure how many iterations are needed to reach an acceptable result
Evaluate final output quality against criteria like tone, clarity, and specificity
Compare the evaluator’s judgment to human reviewers to calibrate reliability
In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.
Notebook: Travel Planning Agent The orchestrator delegates planning a trip for a user, and incorporates a user proxy to improve its quality. The orchestrator delegates to specific functions to plan flights, hotels, and provide general travel recommendations.
How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final output align with the inputs from each worker agent?)
Ensure each worker agent completes its role accurately and in isolation
Check if the orchestrator integrates worker outputs into a consistent final result
Look for agreement or contradictions between components
When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.
Notebook: Parallel Research Agent Multiple research topics are examined simultaneously. Once all are complete, the topics are then synthesized into a final combined report.
How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are topics processed correctly and faster when run in parallel?)
Confirm that outputs are consistent with those from a sequential execution
Track total latency and per-task runtime to assess parallel speedup
Watch for race conditions, dropped inputs, or silent failures in concurrency
This section introduces inferences and schemas, the starting concepts needed to use Phoenix with inferences.
Phoenix inferences are an instance of phoenix.Inferences that contains three pieces of information:
The data itself (a pandas dataframe)
A schema (a phoenix.Schema instance) that describes the columns of your dataframe
A name that appears in the UI
For example, if you have a dataframe prod_df that is described by a schema prod_schema, you can define inferences prod_ds with
prod_ds = px.Inferences(prod_df, prod_schema, "production")
If you launch Phoenix with these inferences, you will see inferences named "production" in the UI.
You can launch Phoenix with zero, one, or two sets of inferences.
With no inferences, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single inference set, Phoenix provides insights into model performance and data quality. With two inference sets, Phoenix compares your inferences and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.
Your reference inferences provide a baseline against which to compare your primary inferences.
To compare two inference sets with Phoenix, you must select one inference set as primary and one to serve as a reference. As the name suggests, your primary inference set contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference inferences, in contrast, are usually of secondary importance and serve as a baseline against which to compare your primary inferences.
Very often, your primary inferences will contain production data and your reference inferences will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference inference sets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.
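For example, with a production set as primary and a training set as reference (variable names mirror the examples in this section):

```python
import phoenix as px

# prod_ds and train_ds are Inferences objects like those defined elsewhere in this section.
session = px.launch_app(primary=prod_ds, reference=train_ds)
```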
The only difference for the corpus inferences is that it needs a separate schema because it has a different set of columns compared to the model data. See the schema section for more details.
A Phoenix schema is an instance of phoenix.Schema
that maps the columns of your dataframe to fields that Phoenix expects and understands. Use your schema to tell Phoenix what the data in your dataframe means.
For example, if you have a dataframe containing Fisher's Iris data that looks like this:
sepal_length | sepal_width | petal_length | petal_width | target | prediction
7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor
5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa
6.3 | 3.3 | 4.7 | 1.6 | versicolor | versicolor
6.2 | 3.4 | 5.4 | 2.3 | virginica | setosa
5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica
your schema might look like this:
schema = px.Schema(
feature_column_names=[
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
],
actual_label_column_name="target",
prediction_label_column_name="prediction",
)
Usually one, sometimes two.
Each inference set needs a schema. If your primary and reference inferences have the same format, then you only need one schema. For example, if you have dataframes train_df and prod_df that share an identical format described by a schema named schema, then you can define inference sets train_ds and prod_ds with
train_ds = px.Inferences(train_df, schema, "training")
prod_ds = px.Inferences(prod_df, schema, "production")
Sometimes, you'll encounter scenarios where the formats of your primary and reference inference sets differ. For example, you'll need two schemas if:
Your production data has timestamps indicating the time at which an inference was made, but your training data does not.
Your training data has ground truth (what we call actuals in Phoenix nomenclature), but your production data does not.
A new version of your model has a differing set of features from a previous version.
In cases like these, you'll need to define two schemas, one for each inference set. For example, if you have dataframes train_df and prod_df that are described by schemas train_schema and prod_schema, respectively, then you can define inference sets train_ds and prod_ds with
train_ds = px.Inferences(train_df, train_schema, "training")
prod_ds = px.Inferences(prod_df, prod_schema, "production")
A corpus inference set, containing documents for information retrieval, typically has a different set of columns than those found in the model data from either production or training, and requires a separate schema. Below is an example schema for a corpus inference set with three columns: the id, text, and embedding for each document in the corpus.
corpus_schema = Schema(
    id_column_name="id",
    document_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)
corpus_ds = px.Inferences(corpus_df, corpus_schema)
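The corpus set is then passed to Phoenix alongside your primary (and optional reference) inferences; query_ds here is a hypothetical inference set of user queries:

```python
session = px.launch_app(primary=query_ds, corpus=corpus_ds)
```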
A deep dive into the details of a trace
A span represents a unit of work or operation (think a span of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains a name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks. A span for an LLM execution is displayed in JSON format below.
Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.
A trace records the paths taken by requests (made by an application or end-user) as they propagate through multiple steps.
Without tracing, it is challenging to pinpoint the cause of performance problems in a system.
It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for LLM applications, which commonly have nondeterministic problems or are too complicated to reproduce locally.
Tracing makes debugging and understanding LLM applications less daunting by breaking down what happens within a request as it flows through a system.
A trace is made of one or more spans. The first span is the root span, and each root span represents a request from start to finish. The spans underneath the parent provide more in-depth context about what occurs during a request (or what steps make up a request).
A project is a collection of traces. You can think of a project as a container for all the traces that are related to a single application or service. You can have multiple projects, and each project can have multiple traces. Projects can be useful for various use-cases such as separating out environments, logging traces for evaluation runs, etc. To learn more about how to set up projects, see the how-to guide.
When a span is created, it is created as one of the following: Chain, Retriever, Reranker, LLM, Embedding, Agent, or Tool.
CHAIN
A Chain is a starting point or a link between different LLM application steps. For example, a Chain span could be used to represent the beginning of a request to an LLM application or the glue code that passes context from a retriever to an LLM call.
RETRIEVER
A Retriever is a span that represents a data retrieval step. For example, a Retriever span could be used to represent a call to a vector store or a database.
RERANKER
A Reranker is a span that represents the reranking of a set of input documents. For example, a cross-encoder may be used to compute the input documents' relevance scores with respect to a user query, and the top K documents with the highest scores are then returned by the Reranker.
LLM
An LLM is a span that represents a call to an LLM. For example, an LLM span could be used to represent a call to OpenAI or Llama.
EMBEDDING
An Embedding is a span that represents a call to an LLM for an embedding. For example, an Embedding span could be used to represent a call to OpenAI to get an ada-2 embedding for retrieval.
TOOL
A Tool is a span that represents a call to an external tool such as a calculator or a weather API.
AGENT
A span that encompasses calls to LLMs and Tools. An agent describes a reasoning block that acts on tools using the guidance of an LLM.
Attributes are key-value pairs that contain metadata that you can use to annotate a span to carry information about the operation it is tracking.
For example, if a span invokes an LLM, you can capture the model name, the invocation parameters, the token count, and so on.
Attributes have the following rules:
Keys must be non-null string values
Values must be a non-null string, boolean, floating point value, integer, or an array of these values.
Additionally, there are Semantic Attributes, which are known naming conventions for metadata that is typically present in common operations. It's helpful to use semantic attribute naming wherever possible so that common kinds of metadata are standardized across systems. See the semantic conventions for more information.
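For reference, here is a minimal sketch of attaching attributes to a manually created span with the OpenTelemetry API, using OpenInference-style keys like those shown in the example span below:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-4-turbo-preview")
    span.set_attribute("llm.token_count.total", 123)
```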
Below are example OTEL spans for each OpenInference spanKind to be used as reference when doing manual instrumentation
Use Zero Inference sets When:
You want to run Phoenix in the background to collect trace data from your instrumented LLM application.
Use a Single Inference set When:
You have only a single cohort of data, e.g., only training data.
You care about model performance and data quality, but not drift.
Use Two Inference sets When:
You want to compare cohorts of data, e.g., training vs. production.
You care about drift in addition to model performance and data quality.
You have corpus data for information retrieval. See Corpus Data.
{
"name": "llm",
"context": {
"trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
"span_id": "0xd9bdedf0df0b7208",
"trace_state": "[]"
},
"kind": "SpanKind.INTERNAL",
"parent_id": "0x7eb5df0046c77cd2",
"start_time": "2024-05-08T21:46:11.480777Z",
"end_time": "2024-05-08T21:46:35.368042Z",
"status": {
"status_code": "OK"
},
"attributes": {
"openinference.span.kind": "LLM",
"llm.input_messages.0.message.role": "system",
"llm.input_messages.0.message.content": "\n The following is a friendly conversation between a user and an AI assistant.\n The assistant is talkative and provides lots of specific details from its context.\n If the assistant does not know the answer to a question, it truthfully says it\n does not know.\n\n Here are the relevant documents for the context:\n\n page_label: 7\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.4Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. No piece may weigh more than 70 pounds.\nb. The combined length and girth of a piece (the length of its longest side plus \nthe distance around its thickest part) may not exceed 108 inches.\nc. Lower size or weight standards apply to mail addressed to certain APOs and \nFPOs, subject to 703.2.0 and 703.4.0 and for Department of State mail, \nsubject to 703.3.0 .\n\npage_label: 6\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.2.10Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. The reply half of a double card must be used for reply only and may not be \nused to convey a message to the original addressee or to send statements \nof account. The reply half may be formatted for response purposes (e.g., contain blocks for completion by the addressee).\nb. A double card must be folded before mailing and prepared so that the \naddress on the reply half is on the inside when the double card is originally \nmailed. The address side of the reply half may be prepared as Business \nReply Mail, Courtesy Reply Mail, meter reply mail, or as a USPS Returns service label.\nc. Plain stickers, seals, or a single wire stitch (staple) may be used to fasten the \nopen edge at the top or bottom once the card is folded if affixed so that the \ninner surfaces of the cards can be readily examined. Fasteners must be \naffixed according to the applicable preparation requirements for the price claimed. Any sealing on the left and right sides of the cards, no matter the \nsealing process used, is not permitted.\nd. The first half of a double card must be detached when the reply half is \nmailed for return. \n6.2.10 Enclosures\nEnclosures in double postcards are prohibited at card prices. \n6.3 Nonmachinable Pieces\n6.3.1 Nonmachinable Letters\nLetter-size pieces (except card-size pieces) that meet one or more of the \nnonmachinable characteristics in 1.2 are subject to the nonmachinable \nsurcharge (see 133.1.7 ). \n6.3.2 Nonmachinable Flats\nFlat-size pieces that do not meet the standards in 2.0 are considered parcels, \nand the mailer must pay the applicable parcel price. \n6.4 Parcels \n[7-9-23] USPS Ground Advantage \u2014 Retail parcels are eligible for USPS \nTracking and Signature Confirmation service. A USPS Ground Advantage \u2014 \nRetail parcel is the following:\na. A mailpiece that exceeds any one of the maximum dimensions for a flat \n(large envelope). See 2.1.\nb. A flat-size mailpiece, regardless of thickness, that is rigid or nonrectangular. \nc. A flat-size mailpiece that is not uniformly thick under 2.4. 
\nd.[7-9-23] A mailpiece that does not exceed 130 inches in combined length \nand girth.\n7.0 Additional Physical Standards for Media Mail and Library \nMail\nThese standards apply to Media Mail and Library Mail:\n\npage_label: 4\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.1Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\n4.0 Additional Physical Standa rds for Priority Mail Express\nEach piece of Priority Mail Express may not weigh more than 70 pounds. The \ncombined length and girth of a piece (the length of its longest side plus the \ndistance around its thickest part) may not exceed 108 inches. Lower size or weight standards apply to Priority Mail Express addressed to certain APO/FPO \nand DPOs. Priority Mail Express items must be large enough to hold the required \nmailing labels and indicia on a single optical plane without bending or folding.\n5.0 Additional Physical St andards for Priority Mail\nThe maximum weight is 70 pounds. The combined length and girth of a piece \n(the length of its longest side plus the distance around its thickest part) may not \nexceed 108 inches. Lower size and weight standards apply for some APO/FPO \nand DPO mail subject to 703.2.0 , and 703.4.0 , and for Department of State mail \nsubject to 703.3.0 . \n[7-9-23] \n6.0 Additional Physical Standa rds for First-Class Mail and \nUSPS Ground Advantage \u2014 Retail\n[7-9-23]\n6.1 Maximum Weight\n6.1.1 First-Class Mail\nFirst-Class Mail (letters and flats) must not exceed 13 ounces. \n6.1.2 USPS Ground Advantage \u2014 Retail\nUSPS Ground Advantage \u2014 Retail mail must not exceed 70 pounds.\n6.2 Cards Claimed at Card Prices\n6.2.1 Card Price\nA card may be a single or double (reply) stamped card or a single or double postcard. Stamped cards are available from USPS with postage imprinted on \nthem. Postcards are commercially available or privately printed mailing cards. To \nbe eligible for card pricing, a card and each half of a double card must meet the physical standards in 6.2 and the applicable eligibility for the price claimed. \nIneligible cards are subject to letter-size pricing. \n6.2.2 Postcard Dimensions\nEach card and part of a double card claimed at card pricing must be the following: \na. Rectangular.b. Not less than 3-1/2 inches high, 5 inches long, and 0.007 inch thick.\nc. Not more than 4-1/4 inches high, or more than 6 inches long, or greater than \n0.016 inch thick.\nd. Not more than 3.5 ounces (Charge flat-size prices for First-Class Mail \ncard-type pieces over 3.5 ounces.)\n\n Instruction: Based on the above documents, provide a detailed answer for the user question below.\n Answer \"don't know\" if not present in the document.\n ",
"llm.input_messages.1.message.role": "user",
"llm.input_messages.1.message.content": "Hello",
"llm.model_name": "gpt-4-turbo-preview",
"llm.invocation_parameters": "{\"temperature\": 0.1, \"model\": \"gpt-4-turbo-preview\"}",
"output.value": "How are you?" },
"events": [],
"links": [],
"resource": {
"attributes": {},
"schema_url": ""
}
}
{
"name": "llm",
"context": {
"trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
"span_id": "0xd9bdedf0df0b7208",
"trace_state": "[]"
},
"kind": "SpanKind.INTERNAL",
"parent_id": "0x7eb5df0046c77cd2",
"start_time": "2024-05-08T21:46:11.480777Z",
"end_time": "2024-05-08T21:46:35.368042Z",
"status": {
"status_code": "OK"
},
"attributes": {
"openinference.span.kind": "LLM",
"llm.input_messages.0.message.role": "system",
"llm.input_messages.0.message.content": "\n The following is a friendly conversation between a user and an AI assistant.\n The assistant is talkative and provides lots of specific details from its context.\n If the assistant does not know the answer to a question, it truthfully says it\n does not know.\n\n Here are the relevant documents for the context:\n\n page_label: 7\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.4Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. No piece may weigh more than 70 pounds.\nb. The combined length and girth of a piece (the length of its longest side plus \nthe distance around its thickest part) may not exceed 108 inches.\nc. Lower size or weight standards apply to mail addressed to certain APOs and \nFPOs, subject to 703.2.0 and 703.4.0 and for Department of State mail, \nsubject to 703.3.0 .\n\npage_label: 6\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.2.10Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. The reply half of a double card must be used for reply only and may not be \nused to convey a message to the original addressee or to send statements \nof account. The reply half may be formatted for response purposes (e.g., contain blocks for completion by the addressee).\nb. A double card must be folded before mailing and prepared so that the \naddress on the reply half is on the inside when the double card is originally \nmailed. The address side of the reply half may be prepared as Business \nReply Mail, Courtesy Reply Mail, meter reply mail, or as a USPS Returns service label.\nc. Plain stickers, seals, or a single wire stitch (staple) may be used to fasten the \nopen edge at the top or bottom once the card is folded if affixed so that the \ninner surfaces of the cards can be readily examined. Fasteners must be \naffixed according to the applicable preparation requirements for the price claimed. Any sealing on the left and right sides of the cards, no matter the \nsealing process used, is not permitted.\nd. The first half of a double card must be detached when the reply half is \nmailed for return. \n6.2.10 Enclosures\nEnclosures in double postcards are prohibited at card prices. \n6.3 Nonmachinable Pieces\n6.3.1 Nonmachinable Letters\nLetter-size pieces (except card-size pieces) that meet one or more of the \nnonmachinable characteristics in 1.2 are subject to the nonmachinable \nsurcharge (see 133.1.7 ). \n6.3.2 Nonmachinable Flats\nFlat-size pieces that do not meet the standards in 2.0 are considered parcels, \nand the mailer must pay the applicable parcel price. \n6.4 Parcels \n[7-9-23] USPS Ground Advantage \u2014 Retail parcels are eligible for USPS \nTracking and Signature Confirmation service. A USPS Ground Advantage \u2014 \nRetail parcel is the following:\na. A mailpiece that exceeds any one of the maximum dimensions for a flat \n(large envelope). See 2.1.\nb. A flat-size mailpiece, regardless of thickness, that is rigid or nonrectangular. \nc. A flat-size mailpiece that is not uniformly thick under 2.4. 
\nd.[7-9-23] A mailpiece that does not exceed 130 inches in combined length \nand girth.\n7.0 Additional Physical Standards for Media Mail and Library \nMail\nThese standards apply to Media Mail and Library Mail:\n\npage_label: 4\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.1Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\n4.0 Additional Physical Standa rds for Priority Mail Express\nEach piece of Priority Mail Express may not weigh more than 70 pounds. The \ncombined length and girth of a piece (the length of its longest side plus the \ndistance around its thickest part) may not exceed 108 inches. Lower size or weight standards apply to Priority Mail Express addressed to certain APO/FPO \nand DPOs. Priority Mail Express items must be large enough to hold the required \nmailing labels and indicia on a single optical plane without bending or folding.\n5.0 Additional Physical St andards for Priority Mail\nThe maximum weight is 70 pounds. The combined length and girth of a piece \n(the length of its longest side plus the distance around its thickest part) may not \nexceed 108 inches. Lower size and weight standards apply for some APO/FPO \nand DPO mail subject to 703.2.0 , and 703.4.0 , and for Department of State mail \nsubject to 703.3.0 . \n[7-9-23] \n6.0 Additional Physical Standa rds for First-Class Mail and \nUSPS Ground Advantage \u2014 Retail\n[7-9-23]\n6.1 Maximum Weight\n6.1.1 First-Class Mail\nFirst-Class Mail (letters and flats) must not exceed 13 ounces. \n6.1.2 USPS Ground Advantage \u2014 Retail\nUSPS Ground Advantage \u2014 Retail mail must not exceed 70 pounds.\n6.2 Cards Claimed at Card Prices\n6.2.1 Card Price\nA card may be a single or double (reply) stamped card or a single or double postcard. Stamped cards are available from USPS with postage imprinted on \nthem. Postcards are commercially available or privately printed mailing cards. To \nbe eligible for card pricing, a card and each half of a double card must meet the physical standards in 6.2 and the applicable eligibility for the price claimed. \nIneligible cards are subject to letter-size pricing. \n6.2.2 Postcard Dimensions\nEach card and part of a double card claimed at card pricing must be the following: \na. Rectangular.b. Not less than 3-1/2 inches high, 5 inches long, and 0.007 inch thick.\nc. Not more than 4-1/4 inches high, or more than 6 inches long, or greater than \n0.016 inch thick.\nd. Not more than 3.5 ounces (Charge flat-size prices for First-Class Mail \ncard-type pieces over 3.5 ounces.)\n\n Instruction: Based on the above documents, provide a detailed answer for the user question below.\n Answer \"don't know\" if not present in the document.\n ",
"llm.input_messages.1.message.role": "user",
"llm.input_messages.1.message.content": "Hello",
"llm.model_name": "gpt-4-turbo-preview",
"llm.invocation_parameters": "{\"temperature\": 0.1, \"model\": \"gpt-4-turbo-preview\"}",
"output.value": "How are you?" },
"events": [],
"links": [],
"resource": {
"attributes": {},
"schema_url": ""
}
}
{
"name": "retrieve",
"context": {
"trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
"span_id": "0x03f3466720f4bfc7",
"trace_state": "[]"
},
"kind": "SpanKind.INTERNAL",
"parent_id": "0x7eb5df0046c77cd2",
"start_time": "2024-05-08T21:46:11.044464Z",
"end_time": "2024-05-08T21:46:11.465803Z",
"status": {
"status_code": "OK"
},
"attributes": {
"openinference.span.kind": "RETRIEVER",
"input.value": "tell me about postal service",
"retrieval.documents.0.document.id": "6d4e27be-1d6d-4084-a619-351a44834f38",
"retrieval.documents.0.document.score": 0.7711453293100421,
"retrieval.documents.0.document.content": "<document-chunk-1>",
"retrieval.documents.0.document.metadata": "{\"page_label\": \"7\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}",
"retrieval.documents.1.document.id": "869d9f6d-db9a-43c4-842f-74bd8d505147",
"retrieval.documents.1.document.score": 0.7672439175862021,
"retrieval.documents.1.document.content": "<document-chunk-2>",
"retrieval.documents.1.document.metadata": "{\"page_label\": \"6\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}",
"retrieval.documents.2.document.id": "72b5cb6b-464f-4460-b497-cc7c09d1dbef",
"retrieval.documents.2.document.score": 0.7647611816897794,
"retrieval.documents.2.document.content": "<document-chunk-3>",
"retrieval.documents.2.document.metadata": "{\"page_label\": \"4\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}"
},
"events": [],
"links": [],
"resource": {
"attributes": {},
"schema_url": ""
}
}
Build multi-agent workflows with OpenAI Agents
OpenAI-Agents is a lightweight Python library for building agentic AI apps. It includes a few abstractions:
Agents, which are LLMs equipped with instructions and tools
Handoffs, which allow agents to delegate to other agents for specific tasks
Guardrails, which enable validation of the inputs to agents
This guide outlines common agent workflows using this SDK. We will walk through building an investment agent across several use cases.
Model support
First-class support for OpenAI LLMs, and basic support for any other LLM via a LiteLLM wrapper. A reasoning-effort parameter lets you trade off lower latency against higher accuracy.
Structured outputs
First-class support with OpenAI LLMs. LLMs that do not support json_schema as a parameter are not supported for structured outputs.
Tools
Very easy, using the @function_tool decorator. Support for parallel tool calls to reduce latency. Built-in OpenAI-hosted tools: WebSearchTool, ComputerTool, and FileSearchTool.
Agent handoff
Very easy, using the handoffs parameter on an Agent
Multimodal support
Voice support, no support for images or video
Guardrails
Enables validation of both inputs and outputs
Retry logic
⚠️ No retry logic, developers must manually handle failure cases
Memory
⚠️ No built-in memory management. Developers must manage their own conversation and user memory.
Code execution
⚠️ No built-in support for executing code
The most basic flow is an LLM agent with access to tools to accomplish a task. This agent answers questions about stocks and uses OpenAI web search to get real-time information.
from agents import Agent, Runner, WebSearchTool
agent = Agent(
name="Finance Agent",
instructions="You are a finance agent that can answer questions about stocks. Use web search to retrieve up‑to‑date context. Then, return a brief, concise answer that is one sentence long.",
tools=[WebSearchTool()],
model="gpt-4.1-mini",
)
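To run the agent, you can use the SDK's Runner. A minimal sketch, assuming an OPENAI_API_KEY is set in the environment; the question string is illustrative.
from agents import Runner

result = Runner.run_sync(agent, "What was Apple's most recent quarterly revenue?")
print(result.final_output)  # a one-sentence answer, per the agent's instructions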
This agent builds a portfolio of stocks and ETFs using multiple agents linked together:
Search Agent: Searches the web for information on particular stock tickers.
Report Agent: Creates a portfolio of stocks and ETFs that supports the user's investment strategy.
from agents import Agent, ModelSettings, WebSearchTool

# Note: Portfolio is the Pydantic output model shown later in this guide.
portfolio_agent = Agent(
name="Portfolio Agent",
instructions="You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation.",
model="o4-mini",
output_type=Portfolio,
)
research_agent = Agent(
name="FinancialSearchAgent",
instructions="You are a research assistant specializing in financial topics. Given an investment strategy, use web search to retrieve up‑to‑date context and produce a short summary of stocks that support the investment strategy at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
model="gpt-4.1",
tools=[WebSearchTool()],
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
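A minimal sketch of chaining the two agents, where the research agent's output becomes the portfolio agent's input. The strategy prompt is an illustrative assumption.
from agents import Runner

research = Runner.run_sync(research_agent, "Find stocks that fit a dividend-growth strategy.")
portfolio = Runner.run_sync(portfolio_agent, research.final_output)
print(portfolio.final_output)  # a Portfolio instance, since output_type=Portfolio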
This agent researches stocks for you. If we want to research 5 stocks, we can force the agent to run multiple tool calls in parallel instead of sequentially.
from textwrap import dedent

from agents import Agent, ModelSettings, WebSearchTool, function_tool

@function_tool
def get_stock_data(ticker_symbol: str) -> dict:
"""
Get stock data for a given ticker symbol.
Args:
ticker_symbol: The ticker symbol of the stock to get data for.
Returns:
A dictionary containing stock data such as price, market cap, and more.
"""
import yfinance as yf
stock = yf.Ticker(ticker_symbol)
return stock.info
research_agent = Agent(
name="FinancialSearchAgent",
instructions=dedent(
"""You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
),
model="gpt-4.1",
tools=[WebSearchTool(), get_stock_data],  # the @function_tool-decorated function is itself a tool
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
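One way to fan the research out across several tickers is to launch the runs concurrently with asyncio. The sketch below assumes the research_agent defined above; the ticker list is illustrative.
import asyncio

from agents import Runner

async def research_many(tickers: list[str]) -> dict[str, str]:
    # Kick off all runs at once; each run may also issue parallel tool calls internally.
    results = await asyncio.gather(*(Runner.run(research_agent, f"Research {t}") for t in tickers))
    return {t: r.final_output for t, r in zip(tickers, results)}

summaries = asyncio.run(research_many(["AAPL", "MSFT", "NVDA", "AMZN", "GOOG"]))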
This agent answers questions about investing using multiple agents. A central router agent chooses which worker to use.
Research Agent: Searches the web for information about stocks and ETFs.
Question Answering Agent: Answers questions about investing like Warren Buffett.
qa_agent = Agent(
name="Investing Q&A Agent",
instructions="You are Warren Buffett. You are answering questions about investing.",
model="gpt-4.1",
)
research_agent = Agent(
name="Financial Search Agent",
instructions="You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
model="gpt-4.1",
tools=[WebSearchTool()],
)
orchestrator_agent = Agent(
name="Routing Agent",
instructions="You are a senior financial analyst. Your task is to handoff to the appropriate agent or tool.",
model="gpt-4.1",
handoffs=[research_agent,qa_agent],
)
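Routing then amounts to sending the user's question to the orchestrator, which hands off to whichever worker fits. A minimal sketch with an illustrative question:
from agents import Runner

result = Runner.run_sync(orchestrator_agent, "How should I think about margin of safety when picking stocks?")
print(result.final_output)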
When creating LLM outputs, the first generation is often unsatisfactory. You can use an agentic loop to iteratively improve the output: ask an LLM for feedback, then use that feedback to improve the next generation.
This agent pattern creates reports and evaluates itself to improve its output.
Report Agent (Generation): Creates a report on a particular stock ticker.
Evaluator Agent (Feedback): Evaluates the report and provides feedback on what to improve.
from typing import Literal

from pydantic import BaseModel, Field

class EvaluationFeedback(BaseModel):
feedback: str = Field(
description=f"What is missing from the research report on positive and negative catalysts for a particular stock ticker. Catalysts include changes in {CATALYSTS}.")
score: Literal["pass", "needs_improvement", "fail"] = Field(
description="A score on the research report. Pass if the report is complete and contains at least 3 positive and 3 negative catalysts for the right stock ticker, needs_improvement if the report is missing some information, and fail if the report is completely wrong.")
report_agent = Agent(
name="Catalyst Report Agent",
instructions=dedent(
"""You are a research assistant specializing in stock research. Given a stock ticker, generate a report of 3 positive and 3 negative catalysts that could move the stock price in the future in 50 words or less."""
),
model="gpt-4.1",
)
evaluation_agent = Agent(
name="Evaluation Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
),
model="gpt-4.1",
output_type=EvaluationFeedback,
)
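A minimal sketch of the evaluator-optimizer loop with these two agents: generate a report, score it, and regenerate with the feedback until the evaluator returns a pass. The ticker, revision prompt, and retry cap are illustrative assumptions.
from agents import Runner

ticker = "NVDA"
report = Runner.run_sync(report_agent, f"Write a catalyst report for {ticker}.").final_output

for _ in range(3):  # cap the number of revision rounds
    evaluation = Runner.run_sync(evaluation_agent, report).final_output  # an EvaluationFeedback
    if evaluation.score == "pass":
        break
    report = Runner.run_sync(
        report_agent,
        f"Revise the catalyst report for {ticker} using this feedback.\n\nReport:\n{report}\n\nFeedback:\n{evaluation.feedback}",
    ).final_output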
This is the most advanced pattern in the examples, using orchestrators and workers together. The orchestrator chooses which worker to use for a specific sub-task. The worker attempts to complete the sub-task and return a result. The orchestrator then uses the result to choose the next worker to use until a final result is returned.
In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.
Orchestrator: Chooses which worker to use based on the user's investment strategy.
Research Agent: Searches the web for information about stocks and ETFs that could support the user's investment strategy.
Evaluation Agent: Evaluates the research report and provides feedback on what data is missing.
Portfolio Agent: Creates a portfolio of stocks and ETFs based on the research report.
evaluation_agent = Agent(
name="Evaluation Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
),
model="gpt-4.1",
output_type=EvaluationFeedback,
)
portfolio_agent = Agent(
name="Portfolio Agent",
instructions=dedent(
"""You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation."""
),
model="o4-mini",
output_type=Portfolio,
)
research_agent = Agent(
name="FinancialSearchAgent",
instructions=dedent(
"""You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
),
model="gpt-4.1",
tools=[WebSearchTool()],
model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
orchestrator_agent = Agent(
name="Routing Agent",
instructions=dedent("""You are a senior financial analyst. You are trying to create a portfolio based on my stated investment strategy. Your task is to handoff to the appropriate agent or tool.
First, handoff to the research_agent to give you a report on stocks and ETFs that could support the user's stated investment strategy.
Then, handoff to the evaluation_agent to give you a score on the research report. If the evaluation_agent returns a needs_improvement or fail, continue using the research_agent to gather more information.
Once the evaluation_agent returns a pass, handoff to the portfolio_agent to create a portfolio."""),
model="gpt-4.1",
handoffs=[
research_agent,
evaluation_agent,
portfolio_agent,
],
)
This uses the following structured outputs.
class PortfolioItem(BaseModel):
ticker: str = Field(description="The ticker of the stock or ETF.")
allocation: float = Field(
description="The percentage allocation of the ticker in the portfolio. The sum of all allocations should be 100."
)
reason: str = Field(description="The reason why this ticker is included in the portfolio.")
class Portfolio(BaseModel):
tickers: list[PortfolioItem] = Field(
description="A list of tickers that could support the user's stated investment strategy."
)
class EvaluationFeedback(BaseModel):
feedback: str = Field(
description="What data is missing in order to create a portfolio of stocks and ETFs based on the user's investment strategy."
)
score: Literal["pass", "needs_improvement", "fail"] = Field(
description="A score on the research report. Pass if you have at least 5 tickers with data that supports the user's investment strategy to create a portfolio, needs_improvement if you do not have enough supporting data, and fail if you have no tickers."
)
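To kick off the orchestrator-workers flow, hand the user's strategy to the routing agent and let it loop through research, evaluation, and portfolio construction. A sketch, with an illustrative strategy prompt and turn limit:
from agents import Runner

result = Runner.run_sync(
    orchestrator_agent,
    "Build a portfolio for a long-term, dividend-focused investment strategy.",
    max_turns=20,  # leave room for several handoffs between the worker agents
)
print(result.final_output)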
Dive into the differences between Braintrust and Phoenix for open-source LLM evaluation and tracing
Braintrust is an evaluation platform that serves as an alternative to Arize Phoenix. Both platforms support core AI application needs such as evaluating AI applications, prompt management, tracing executions, and experimentation. However, there are a few major differences.
Braintrust is a proprietary LLM-observability platform that often hits roadblocks when AI engineers need open code, friction-free self-hosting, or capabilities like agent tracing and online evaluation. Arize Phoenix is a fully open-source alternative that fills those gaps while remaining free to run anywhere.
Top Differences (TL;DR)
Phoenix:
100% open source
Free self-hosting forever - no feature gates, no restrictions
Deploy with a single Docker container - truly "batteries included"
Your data stays on your infrastructure from day one
Braintrust:
Proprietary closed-source platform
Self-hosting locked behind paid Enterprise tier (custom pricing)
Free tier severely limited: 14-day retention, 5 users max, 1GB storage
$249/month minimum for meaningful usage ($1.50 per 1,000 scores beyond limit)
Phoenix:
Framework agnostic - works with LangChain, LlamaIndex, DSPy, custom agents, anything
Built on OpenTelemetry/OpenInference standard - no proprietary lock-in
Auto-instrumentation that just works across ecosystems
Deploy anywhere: Docker, Kubernetes, AWS, your laptop - your choice
Braintrust:
Platform-dependent approach
Requires learning their specific APIs and workflows
Limited deployment flexibility on free/Pro tiers
Forces you into their ecosystem and pricing model
Phoenix:
Unlimited evaluations - run as many as you need
Pre-built evaluators: hallucination detection, toxicity, relevance, Q&A correctness
Custom evaluators with code or natural language
Human annotation capabilities built-in
Real-time tracing with full visibility into LLM applications
Braintrust:
10,000 scores on free tier ($1.50 per 1,000 additional)
50,000 scores on Pro ($249/month) - can get expensive fast
Good evaluation features, but pay-per-use model creates cost anxiety
Enterprise features locked behind custom pricing
Phoenix deploys with one Docker command and is free/unlimited to run on-prem or in the cloud. Braintrust’s self-hosting is reserved for paid enterprise plans and uses a hybrid model: the control plane (UI, metadata DB) stays in Braintrust’s cloud while you run API and storage services (Brainstore) yourself, plus extra infra wiring (note: you still pay seat / eval / retention fees, with the free tier capped at 1M spans, 10K scores, 14 days retention).
Phoenix ships OpenInference, an OTel-compatible auto-instrumentation layer that captures every prompt, tool call, and agent step with sub-second latency. Braintrust supports 5 instrumentation options, versus 50+ instrumentations for Arize AX and Phoenix.
Arize AX and Phoenix are the leaders in agent tracing solutions; Braintrust does not trace agents today. Braintrust accepts OTel spans but has no auto-instrumentors or semantic conventions, so most teams embed an SDK or proxy into their code, adding developer effort and potential latency.
Phoenix offers built-in and custom evaluators, “golden” datasets, and high-scale evaluation scoring (millions/day) with sampling, logs and failure debugging. Braintrust’s UI is great for prompt trials but lacks benchmarking on labeled data and has weaker online-eval debugging.
The Phoenix evaluation library is tested against public datasets and is community supported. It is an open-source, battle-tested library with millions of downloads that has run in production for over two years across tens of thousands of organizations, including top enterprises.
Phoenix and Arize AX include annotation queues that let reviewers label any trace or dataset and automatically recompute metrics. Braintrust lacks queues; its “Review” mode is manual and disconnected from evals.
Phoenix and Arize AX have released extensive agent evaluation capabilities, including path, convergence, and session-level evaluations. This reflects more than a year of research and engineering from the Arize team, making Arize a leader in agent evaluation.
One of the most fundamental differences is Phoenix’s open-source nature versus Braintrust’s proprietary approach. Phoenix is fully open source, meaning teams can inspect the code, customize the platform, and self-host it on their own infrastructure without licensing fees. This openness provides transparency and control that many organizations value. In contrast, Braintrust is a closed-source platform, which limits users’ ability to customize or extend it.
Moreover, Phoenix is built on open standards like OpenTelemetry and OpenInference for trace instrumentation. From day one, Phoenix and Arize AX have embraced open standards, ensuring compatibility with a wide range of tools and preventing vendor lock-in. Braintrust relies on its own SDK/proxy approach for logging and does not offer the same degree of open extensibility. Its proprietary design means that while it can be integrated into apps, it ties you into Braintrust's way of operating (and can introduce an LLM proxy layer for logging that some teams see as a potential point of latency or risk).
Teams that prioritize transparency, community-driven development, and long-term flexibility often prefer an open solution like Phoenix.
Prototype & iterate fast? → Phoenix (open, free, unlimited instrumentation & evals).
Scale, governance, compliance? → Arize AX (also free to start, petabyte storage, 99.9% SLA, HIPAA, RBAC, AI-powered analytics).
Open source
OSS
Closed source
Single Docker
Enterprise-only hybrid
LLM Evaluation Library
OSS Pipeline Library and UI
UI Centric Workflows
Open source
✅
–
❌
1-command self-host
✅
✅
❌
Free
✅
Free Tier
Free Tier
✅
✅
✅
✅
✅
❌
✅
✅
✅
✅
✅
❌
✅
✅
❌
✅
✅
✅
✅
✅
❌
🔸 built-in
✅ advanced
❌
❌
✅ full
❌
✅
✅
✅
Online evals (debuggable)
❌
✅
⚠️ limited
Coming Soon
✅
✅
❌
✅
❌
AI-powered search & analytics
❌
✅
❌
❌
✅
❌
✅
✅
⚠️ SOC-2 only
HIPAA / on-prem
–
✅
❌
Langfuse has a feature set that initially looks similar to Arize Phoenix's. Both tools support tracing, evaluation, experimentation, and prompt management, both in development and production. But on closer inspection there are a few notable differences:
While it is open-source, Langfuse locks certain key features like Prompt Playground and LLM-as-a-Judge evals behind a paywall. These same features are free in Phoenix.
Phoenix is significantly easier to self-host than Langfuse. Langfuse requires you to separately set up and link ClickHouse, Redis, and S3. Phoenix can be hosted out of the box as a single Docker container.
Langfuse relies on outside instrumentation libraries to generate traces. Arize maintains its own instrumentation layer, OpenInference, which operates in concert with OpenTelemetry.
Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.
Langfuse is open-source, but several critical features are gated behind its paid offering when self-hosting. For example:
Prompt Playground
LLM-as-a-Judge evaluations
Prompt experiments
Annotation queues
These features can be crucial for building and refining LLM systems, especially in early prototyping stages. In contrast, Arize Phoenix offers these capabilities fully open-source.
Self-hosting Langfuse requires setting up and maintaining:
A ClickHouse database for analytics
Redis for caching and background jobs
S3-compatible storage for logs and artifacts
Arize Phoenix, on the other hand, can be launched with a single Docker container. No need to stitch together external services—Phoenix is designed to be drop-in simple for both experimentation and production monitoring. This “batteries-included” philosophy makes it faster to adopt and easier to maintain.
Langfuse does not provide its own instrumentation layer—instead, it relies on developers to integrate third-party libraries to generate and send trace data.
Phoenix takes a different approach: it includes and maintains its own OpenTelemetry-compatible instrumentation layer, OpenInference.
In fact, Langfuse supports OpenInference tracing as one of its options. This means that using Langfuse requires at least one additional dependency on an instrumentation provider.
Phoenix is backed by Arize AI, the leading and best-funded AI Observability provider in the ecosystem.
Arize Phoenix is intended to be a complete LLM observability solution; however, for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.
The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.
Open Source
✅
✅
Tracing
✅
✅
✅
Auto-Instrumentation
✅
✅
Offline Evals
✅
✅
✅
Online Evals
✅
✅
Experimentation
✅
✅
✅
Prompt Management
✅
✅
✅
Prompt Playground
✅
✅
✅
Run Prompts on Datasets
✅
✅
Built-in Evaluators
✅
✅
✅
Agent Evaluations
✅
✅
Human Annotations
✅
✅
Custom Dashboards
✅
Workspaces
✅
Semantic Querying
✅
Copilot Assistant
✅
If you're choosing between Langfuse and Arize Phoenix, the right tool will depend on your needs. Langfuse has a polished UI and solid community momentum, but imposes friction around hosting and feature access. Arize Phoenix offers a more open, developer-friendly experience—especially for those who want a single-container solution with built-in instrumentation and evaluation tools.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Tracing CrewAI with Arize Phoenix - Routing Workflow</h1>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -q arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp crewai crewai_tools openinference-instrumentation-crewai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5-gPdVmIndw9"
},
"source": [
"## Set up Keys and Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: For this colab you'll need:\n",
"\n",
"* OpenAI API key (https://openai.com/)\n",
"* Serper API key (https://serper.dev/)\n",
"* Phoenix API key (https://app.phoenix.arize.com/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# Prompt the user for their API keys if they haven't been set\n",
"openai_key = os.getenv(\"OPENAI_API_KEY\", \"OPENAI_API_KEY\")\n",
"serper_key = os.getenv(\"SERPER_API_KEY\", \"SERPER_API_KEY\")\n",
"\n",
"if openai_key == \"OPENAI_API_KEY\":\n",
" openai_key = getpass.getpass(\"Please enter your OPENAI_API_KEY: \")\n",
"\n",
"if serper_key == \"SERPER_API_KEY\":\n",
" serper_key = getpass.getpass(\"Please enter your SERPER_API_KEY: \")\n",
"\n",
"# Set the environment variables with the provided keys\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_key\n",
"os.environ[\"SERPER_API_KEY\"] = serper_key\n",
"\n",
"if \"PHOENIX_API_KEY\" not in os.environ:\n",
" os.environ[\"PHOENIX_API_KEY\"] = getpass.getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.environ['PHOENIX_API_KEY']}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com/\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r9X87mdGnpbc"
},
"source": [
"## Configure Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"crewai-agents\", endpoint=\"https://app.phoenix.arize.com/v1/traces\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vYT-EU56ni94"
},
"source": [
"# Instrument CrewAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation.crewai import CrewAIInstrumentor\n",
"\n",
"CrewAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Working Agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"from crewai import Agent, Crew, Process, Task\n",
"from crewai.flow import Flow, listen, router, start\n",
"from pydantic import BaseModel\n",
"\n",
"research_analyst = Agent(\n",
" role=\"Senior Research Analyst\",\n",
" goal=\"Gather and summarize data on the requested topic.\",\n",
" backstory=\"Expert in tech market trends.\",\n",
" allow_delegation=False,\n",
")\n",
"\n",
"content_strategist = Agent(\n",
" role=\"Tech Content Strategist\",\n",
" goal=\"Craft an article outline based on provided research.\",\n",
" backstory=\"Storyteller who turns data into narratives.\",\n",
" allow_delegation=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, there are two ways to do this -- through a routing Agent or through ```@router()``` decorator in Flows (allows you to define conditional routing logic based on the output of a method)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 1: Define your logic for Router Agent to classify the query & run corresponding Agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"routerAgent = Agent(\n",
" role=\"Router\",\n",
" goal=\"Classify each query as either 'research' or 'content outline'.\",\n",
" backstory=\"Triage bot for content workflows.\",\n",
" verbose=False,\n",
")\n",
"\n",
"\n",
"def route(user_input: str, router):\n",
" router_task = Task(\n",
" description=user_input, agent=router, expected_output=\"One word: 'research' or 'content'\"\n",
" )\n",
" router_classify = Crew(\n",
" agents=[router], tasks=[router_task], process=Process.sequential, verbose=False\n",
" )\n",
" router_results = router_classify.kickoff()\n",
" return router_results\n",
"\n",
"\n",
"def type_of_task(router_results):\n",
" if isinstance(router_results, list):\n",
" result = router_results[0]\n",
" result_text = result.text if hasattr(result, \"text\") else str(result)\n",
" else:\n",
" result_text = (\n",
" router_results.text if hasattr(router_results, \"text\") else str(router_results)\n",
" )\n",
" task_type = result_text.strip().lower()\n",
"\n",
" return task_type\n",
"\n",
"\n",
"def working_agent(task_type, user_input: str):\n",
" if \"research\" in task_type:\n",
" agent = research_analyst\n",
" label = \"Research Analyst\"\n",
" else:\n",
" agent = content_strategist\n",
" label = \"Content Strategist\"\n",
"\n",
" work_task = Task(description=user_input, agent=agent, expected_output=\"Agent response\")\n",
" worker_crew = Crew(agents=[agent], tasks=[work_task], process=Process.sequential, verbose=True)\n",
" work_results = worker_crew.kickoff()\n",
" if isinstance(work_results, list):\n",
" output = work_results[0].text if hasattr(work_results[0], \"text\") else str(work_results[0])\n",
" else:\n",
" output = work_results.text if hasattr(work_results, \"text\") else str(work_results)\n",
"\n",
" print(f\"\\n=== Routed to {label} ({task_type}) ===\\n{output}\\n\")\n",
" return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Examples Runs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# âââ 4) Example Runs âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ\n",
"for query in [\n",
" \"Please research the latest AI safety papers.\",\n",
" \"Outline an article on AI safety trends.\",\n",
"]:\n",
" router_output = route(query, routerAgent)\n",
" task_output = type_of_task(router_output)\n",
" working_agent(task_output, query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Define your logic for ```@router()``` Decorator to define routing logic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nest_asyncio.apply()\n",
"\n",
"\n",
"# Define Flow State\n",
"class RoutingState(BaseModel):\n",
" query: str = \"\"\n",
" route: str = \"\"\n",
"\n",
"\n",
"# Define Structured Flow\n",
"class RoutingFlow(Flow[RoutingState]):\n",
" def __init__(self, query: str):\n",
" super().__init__(state=RoutingState(query=query))\n",
"\n",
" @start()\n",
" def handle_query(self):\n",
" print(f\"ð¥ Incoming Query: {self.state.query}\")\n",
"\n",
" @router(handle_query)\n",
" def decide_route(self):\n",
" if \"research\" in self.state.query.lower():\n",
" self.state.route = \"research\"\n",
" return \"research\"\n",
" else:\n",
" self.state.route = \"outline\"\n",
" return \"outline\"\n",
"\n",
" @listen(\"research\")\n",
" def run_research(self):\n",
" task = Task(\n",
" description=self.state.query,\n",
" expected_output=\"Summary of findings on AI safety\",\n",
" agent=research_analyst,\n",
" )\n",
" crew = Crew(\n",
" agents=[research_analyst], tasks=[task], process=Process.sequential, verbose=True\n",
" )\n",
" crew.kickoff()\n",
"\n",
" @listen(\"outline\")\n",
" def run_content_strategy(self):\n",
" task = Task(\n",
" description=self.state.query,\n",
" expected_output=\"An article outline about the given topic\",\n",
" agent=content_strategist,\n",
" )\n",
" crew = Crew(\n",
" agents=[content_strategist], tasks=[task], process=Process.sequential, verbose=True\n",
" )\n",
" crew.kickoff()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Examples Runs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"queries = [\n",
" \"Please research the latest AI safety papers.\",\n",
" \"Outline an article on AI safety trends.\",\n",
"]\n",
"\n",
"for query in queries:\n",
" flow = RoutingFlow(query=query)\n",
" flow.kickoff()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fH0uVMgxpLql"
},
"source": [
"### Check your Phoenix project to view the traces and spans from your runs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Tracing CrewAI with Arize Phoenix - Parallelization Workflow</h1>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp crewai crewai_tools openinference-instrumentation-crewai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5-gPdVmIndw9"
},
"source": [
"# Set up Keys and Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: For this colab you'll need:\n",
"\n",
"* OpenAI API key (https://openai.com/)\n",
"* Serper API key (https://serper.dev/)\n",
"* Phoenix API key (https://app.phoenix.arize.com/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# Prompt the user for their API keys if they haven't been set\n",
"openai_key = os.getenv(\"OPENAI_API_KEY\", \"OPENAI_API_KEY\")\n",
"serper_key = os.getenv(\"SERPER_API_KEY\", \"SERPER_API_KEY\")\n",
"\n",
"if openai_key == \"OPENAI_API_KEY\":\n",
" openai_key = getpass.getpass(\"Please enter your OPENAI_API_KEY: \")\n",
"\n",
"if serper_key == \"SERPER_API_KEY\":\n",
" serper_key = getpass.getpass(\"Please enter your SERPER_API_KEY: \")\n",
"\n",
"# Set the environment variables with the provided keys\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_key\n",
"os.environ[\"SERPER_API_KEY\"] = serper_key\n",
"\n",
"if \"PHOENIX_API_KEY\" not in os.environ:\n",
" os.environ[\"PHOENIX_API_KEY\"] = getpass.getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.environ['PHOENIX_API_KEY']}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com/\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r9X87mdGnpbc"
},
"source": [
"## Configure Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"crewai-agents\", endpoint=\"https://app.phoenix.arize.com/v1/traces\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vYT-EU56ni94"
},
"source": [
"# Instrument CrewAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation.crewai import CrewAIInstrumentor\n",
"\n",
"CrewAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from crewai import Agent, Crew, Task\n",
"from crewai.process import Process\n",
"\n",
"researcher_1 = Agent(\n",
" role=\"LLM Researcher A\",\n",
" goal=\"Research trend #1 in AI and summarize it clearly.\",\n",
" backstory=\"Specializes in model safety and governance.\",\n",
" verbose=True,\n",
")\n",
"\n",
"researcher_2 = Agent(\n",
" role=\"LLM Researcher B\",\n",
" goal=\"Research trend #2 in AI and summarize it clearly.\",\n",
" backstory=\"Expert in multimodal and frontier models.\",\n",
" verbose=True,\n",
")\n",
"\n",
"researcher_3 = Agent(\n",
" role=\"LLM Researcher C\",\n",
" goal=\"Research trend #3 in AI and summarize it clearly.\",\n",
" backstory=\"Focused on AI policy and alignment.\",\n",
" verbose=True,\n",
")\n",
"\n",
"aggregator = Agent(\n",
" role=\"Aggregator\",\n",
" goal=\"Combine and synthesize all research into a single summary report.\",\n",
" backstory=\"Information architect skilled at summarizing multiple sources.\",\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Tasks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define parallel research tasks\n",
"task1 = Task(\n",
" description=\"Summarize a major trend in AI safety and model alignment.\",\n",
" expected_output=\"Concise summary of trend #1\",\n",
" agent=researcher_1,\n",
")\n",
"\n",
"task2 = Task(\n",
" description=\"Summarize a key innovation in multimodal or frontier AI systems.\",\n",
" expected_output=\"Concise summary of trend #2\",\n",
" agent=researcher_2,\n",
")\n",
"\n",
"task3 = Task(\n",
" description=\"Summarize a current topic in AI policy, regulation, or social impact.\",\n",
" expected_output=\"Concise summary of trend #3\",\n",
" agent=researcher_3,\n",
")\n",
"\n",
"# Aggregation task\n",
"aggregation_task = Task(\n",
" description=\"Combine the three AI trend summaries into a cohesive single report.\",\n",
" expected_output=\"A synthesized report capturing all three trends.\",\n",
" agent=aggregator,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Crew"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crew = Crew(\n",
" agents=[researcher_1, researcher_2, researcher_3, aggregator],\n",
" tasks=[task1, task2, task3, aggregation_task],\n",
" process=Process.sequential,\n",
" verbose=True,\n",
")\n",
"\n",
"result = crew.kickoff()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fH0uVMgxpLql"
},
"source": [
"### Check your Phoenix project to view the traces and spans from your runs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Tracing CrewAI with Arize Phoenix - Orchestrator-Workers Workflow</h1>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -q arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp crewai crewai_tools openinference-instrumentation-crewai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5-gPdVmIndw9"
},
"source": [
"# Set up Keys and Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: For this colab you'll need:\n",
"\n",
"* OpenAI API key (https://openai.com/)\n",
"* Serper API key (https://serper.dev/)\n",
"* Phoenix API key (https://app.phoenix.arize.com/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# Prompt the user for their API keys if they haven't been set\n",
"openai_key = os.getenv(\"OPENAI_API_KEY\", \"OPENAI_API_KEY\")\n",
"serper_key = os.getenv(\"SERPER_API_KEY\", \"SERPER_API_KEY\")\n",
"\n",
"if openai_key == \"OPENAI_API_KEY\":\n",
" openai_key = getpass.getpass(\"Please enter your OPENAI_API_KEY: \")\n",
"\n",
"if serper_key == \"SERPER_API_KEY\":\n",
" serper_key = getpass.getpass(\"Please enter your SERPER_API_KEY: \")\n",
"\n",
"# Set the environment variables with the provided keys\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_key\n",
"os.environ[\"SERPER_API_KEY\"] = serper_key\n",
"\n",
"if \"PHOENIX_API_KEY\" not in os.environ:\n",
" os.environ[\"PHOENIX_API_KEY\"] = getpass.getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.environ['PHOENIX_API_KEY']}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com/\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r9X87mdGnpbc"
},
"source": [
"## Configure Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"crewai-agents\", endpoint=\"https://app.phoenix.arize.com/v1/traces\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vYT-EU56ni94"
},
"source": [
"# Instrument CrewAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation.crewai import CrewAIInstrumentor\n",
"\n",
"CrewAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from crewai import Agent, Crew, Task\n",
"\n",
"# Define worker agents\n",
"trend_researcher = Agent(\n",
" role=\"AI Trend Researcher\",\n",
" goal=\"Analyze current advancements in AI\",\n",
" backstory=\"Expert in tracking and analyzing new trends in artificial intelligence.\",\n",
" verbose=True,\n",
")\n",
"\n",
"policy_analyst = Agent(\n",
" role=\"AI Policy Analyst\",\n",
" goal=\"Examine the implications of AI regulations and governance\",\n",
" backstory=\"Tracks AI policy developments across governments and organizations.\",\n",
" verbose=True,\n",
")\n",
"\n",
"risk_specialist = Agent(\n",
" role=\"AI Risk Specialist\",\n",
" goal=\"Identify potential risks in frontier AI development\",\n",
" backstory=\"Focuses on safety, alignment, and misuse risks related to advanced AI.\",\n",
" verbose=True,\n",
")\n",
"\n",
"synthesizer = Agent(\n",
" role=\"Synthesis Writer\",\n",
" goal=\"Summarize all findings into a final cohesive report\",\n",
" backstory=\"Expert at compiling research insights into executive-level narratives.\",\n",
" verbose=True,\n",
")\n",
"\n",
"orchestrator = Agent(\n",
" role=\"Orchestrator\",\n",
" goal=(\n",
" \"Your job is to delegate research and writing tasks to the correct coworker using the 'Delegate work to coworker' tool.\\n\"\n",
" \"For each task you assign, you MUST call the tool with the following JSON input:\\n\\n\"\n",
" \"{\\n\"\n",
" ' \"task\": \"Short summary of the task to do (plain string)\",\\n'\n",
" ' \"context\": \"Why this task is important or part of the report (plain string)\",\\n'\n",
" ' \"coworker\": \"One of: AI Trend Researcher, AI Policy Analyst, AI Risk Specialist, Synthesis Writer\"\\n'\n",
" \"}\\n\\n\"\n",
" \"IMPORTANT:\\n\"\n",
" \"- Do NOT format 'task' or 'context' as dictionaries.\\n\"\n",
" \"- Do NOT include types or nested descriptions.\\n\"\n",
" \"- Only use plain strings for both.\\n\"\n",
" \"- Call the tool multiple times, one per coworker.\"\n",
" ),\n",
" backstory=\"You are responsible for assigning each part of an AI report to the right specialist.\",\n",
" verbose=True,\n",
" allow_delegation=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Tasks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the initial task only for the orchestrator\n",
"initial_task = Task(\n",
" description=\"Create an AI trends report. It should include recent innovations, policy updates, and safety risks. Then synthesize it into a unified summary.\",\n",
" expected_output=\"Assign subtasks via the DelegateWorkTool and return a final report.\",\n",
" agent=orchestrator,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Crew"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crew = Crew(\n",
" agents=[trend_researcher, policy_analyst, risk_specialist, synthesizer],\n",
" tasks=[initial_task],\n",
" manager_agent=orchestrator,\n",
" verbose=True,\n",
")\n",
"# Run the full workflow\n",
"result = crew.kickoff()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fH0uVMgxpLql"
},
"source": [
"### Check your Phoenix project to view the traces and spans from your runs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Tracing CrewAI with Arize Phoenix - Prompt Chaining Workflow</h1>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -q arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp crewai crewai_tools openinference-instrumentation-crewai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5-gPdVmIndw9"
},
"source": [
"# Set up Keys and Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: For this colab you'll need:\n",
"\n",
"* OpenAI API key (https://openai.com/)\n",
"* Serper API key (https://serper.dev/)\n",
"* Phoenix API key (https://app.phoenix.arize.com/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# Prompt the user for their API keys if they haven't been set\n",
"openai_key = os.getenv(\"OPENAI_API_KEY\", \"OPENAI_API_KEY\")\n",
"serper_key = os.getenv(\"SERPER_API_KEY\", \"SERPER_API_KEY\")\n",
"\n",
"if openai_key == \"OPENAI_API_KEY\":\n",
" openai_key = getpass.getpass(\"Please enter your OPENAI_API_KEY: \")\n",
"\n",
"if serper_key == \"SERPER_API_KEY\":\n",
" serper_key = getpass.getpass(\"Please enter your SERPER_API_KEY: \")\n",
"\n",
"# Set the environment variables with the provided keys\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_key\n",
"os.environ[\"SERPER_API_KEY\"] = serper_key\n",
"\n",
"if \"PHOENIX_API_KEY\" not in os.environ:\n",
" os.environ[\"PHOENIX_API_KEY\"] = getpass.getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.environ['PHOENIX_API_KEY']}\"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com/\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r9X87mdGnpbc"
},
"source": [
"## Configure Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"crewai-agents\", endpoint=\"https://app.phoenix.arize.com/v1/traces\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vYT-EU56ni94"
},
"source": [
"# Instrument CrewAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation.crewai import CrewAIInstrumentor\n",
"\n",
"CrewAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from crewai import Agent, Crew, Task\n",
"from crewai.process import Process\n",
"\n",
"research_analyst = Agent(\n",
" role=\"Senior Research Analyst\",\n",
" goal=\"Research cutting-edge AI topics and summarize the top 3 trends.\",\n",
" backstory=\"Expert in AI research and trend analysis.\",\n",
" verbose=True,\n",
")\n",
"\n",
"content_strategist = Agent(\n",
" role=\"Tech Content Strategist\",\n",
" goal=\"Create a structured article outline from the research.\",\n",
" backstory=\"Technical storyteller who crafts engaging outlines.\",\n",
" verbose=True,\n",
")\n",
"\n",
"content_reviewer = Agent(\n",
" role=\"Content Reviewer\",\n",
" goal=\"Validate outline for clarity, tone, and completeness.\",\n",
" backstory=\"Editorial expert with a focus on technical accuracy.\",\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define your Tasks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"research_task = Task(\n",
" description=\"Summarize the top 3 trends in open-source LLM development.\",\n",
" agent=research_analyst,\n",
" expected_output=\"Bullet points of top 3 trends with brief explanations.\",\n",
")\n",
"\n",
"outline_task = Task(\n",
" description=\"Generate an article outline for CTOs based on the research.\",\n",
" agent=content_strategist,\n",
" expected_output=\"Outline with title, sections, and key points.\",\n",
")\n",
"\n",
"review_task = Task(\n",
" description=\"Review the outline for quality and alignment.\",\n",
" agent=content_reviewer,\n",
" expected_output=\"Reviewed outline with suggestions or approval.\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Crew"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crew = Crew(\n",
" agents=[research_analyst, content_strategist, content_reviewer],\n",
" tasks=[research_task, outline_task, review_task],\n",
" process=Process.sequential,\n",
" verbose=True,\n",
" full_output=True,\n",
")\n",
"\n",
"result = crew.kickoff()\n",
"print(result)"
]
},
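{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on your installed CrewAI version, the object returned by `kickoff()` may also expose per-task outputs alongside the final result. The sketch below reads them defensively with `getattr`, so it degrades gracefully if the attribute names differ in your version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: inspect per-task outputs from the crew run. Attribute names\n",
"# (e.g. tasks_output / raw) vary across CrewAI versions, so read them defensively.\n",
"for task_output in getattr(result, \"tasks_output\", []):\n",
"    print(\"### TASK OUTPUT ###\")\n",
"    print(getattr(task_output, \"raw\", task_output))"
]
},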
{
"cell_type": "markdown",
"metadata": {
"id": "fH0uVMgxpLql"
},
"source": [
"### Check your Phoenix project to view the traces and spans from your runs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: evaluator optimizer agent</center>\n",
"\n",
"A starter guide for building an agent which iteratively generates an output based on LLM feedback using the `openai-agents` library.\n",
"\n",
"When creating LLM outputs, often times the first generation is unsatisfactory. You can use an agentic loop to iteratively improve the output by asking an LLM to give feedback, and then use the feedback to improve the output.\n",
"\n",
"In the following example, we'll build a financial report system using this pattern:\n",
"1. **Report Agent (Generation):** Creates a report on a particular stock ticker.\n",
"2. **Evaluator Agent (Feedback):** Evaluates the report and provides feedback on what to improve."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents pydantic\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the agent\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from textwrap import dedent\n",
"from typing import Literal\n",
"\n",
"from agents import Agent, Runner, TResponseInputItem\n",
"from pydantic import BaseModel, Field\n",
"\n",
"CATALYSTS = \"\"\"topline revenue growth, margin expansion, moat expansion, free cash flow generation, usage, pricing, distribution, share buyback, dividend, new products, regulation, competition, management team, mergers, acquisitions, analyst ratings, trading volume, technical indicators, price momentum\"\"\"\n",
"\n",
"\n",
"class EvaluationFeedback(BaseModel):\n",
" feedback: str = Field(\n",
" description=f\"What is missing from the research report on positive and negative catalysts for a particular stock ticker. Catalysts include changes in {CATALYSTS}.\"\n",
" )\n",
" score: Literal[\"pass\", \"needs_improvement\", \"fail\"] = Field(\n",
" description=\"A score on the research report. Pass if the report is complete and contains at least 3 positive and 3 negative catalysts for the right stock ticker, needs_improvement if the report is missing some information, and fail if the report is completely wrong.\"\n",
" )\n",
"\n",
"\n",
"report_agent = Agent(\n",
" name=\"Catalyst Report Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a research assistant specializing in stock research. Given a stock ticker, generate a report of 3 positive and 3 negative catalysts that could move the stock price in the future in 50 words or less.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
")\n",
"\n",
"evaluation_agent = Agent(\n",
" name=\"Evaluation Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" output_type=EvaluationFeedback,\n",
")\n",
"\n",
"report_feedback = \"fail\"\n",
"input_items: list[TResponseInputItem] = [{\"content\": \"AAPL\", \"role\": \"user\"}]\n",
"\n",
"while report_feedback != \"pass\":\n",
" report = await Runner.run(report_agent, input_items)\n",
" print(\"### REPORT ###\")\n",
" print(report.final_output)\n",
" input_items = report.to_input_list()\n",
"\n",
" evaluation = await Runner.run(evaluation_agent, str(report.final_output))\n",
" evaluation_feedback = evaluation.final_output_as(EvaluationFeedback)\n",
" print(\"### EVALUATION ###\")\n",
" print(str(evaluation_feedback))\n",
" report_feedback = evaluation_feedback.score\n",
"\n",
" if report_feedback != \"pass\":\n",
" print(\"Re-running with feedback\")\n",
" input_items.append({\"content\": f\"Feedback: {evaluation_feedback.feedback}\", \"role\": \"user\"})"
]
}
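,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above runs until the evaluator returns `pass`, so a strict evaluator could in principle keep it running indefinitely. Below is a minimal sketch of the same generate/evaluate loop with an iteration cap; `MAX_ITERATIONS` is an arbitrary value chosen for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: the same generate/evaluate loop with an iteration cap so a\n",
"# strict evaluator cannot keep the loop running forever.\n",
"MAX_ITERATIONS = 3  # arbitrary cap for illustration\n",
"\n",
"input_items = [{\"content\": \"AAPL\", \"role\": \"user\"}]\n",
"for iteration in range(MAX_ITERATIONS):\n",
"    report = await Runner.run(report_agent, input_items)\n",
"    input_items = report.to_input_list()\n",
"\n",
"    evaluation = await Runner.run(evaluation_agent, str(report.final_output))\n",
"    feedback = evaluation.final_output_as(EvaluationFeedback)\n",
"    print(f\"Iteration {iteration}: {feedback.score}\")\n",
"    if feedback.score == \"pass\":\n",
"        break\n",
"    # Feed the evaluator's critique back into the next generation.\n",
"    input_items.append({\"content\": f\"Feedback: {feedback.feedback}\", \"role\": \"user\"})\n",
"\n",
"print(report.final_output)"
]
}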
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: basic agent</center>\n",
"\n",
"A starter guide for building a basic agent with tool calling using the `openai-agents` library. \n",
"\n",
"Here we've setup a basic agent that can answer questions about stocks using `web_search`. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bLVAqLi5_KAi"
},
"source": [
"## Create your basic agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from agents import Agent, Runner, WebSearchTool\n",
"\n",
"agent = Agent(\n",
" name=\"Finance Agent\",\n",
" instructions=\"You are a finance agent that can answer questions about stocks. Use web search to retrieve upâtoâdate context. Then, return a brief, concise answer that is one sentence long.\",\n",
" tools=[WebSearchTool(search_context_size=\"low\")],\n",
" model=\"gpt-4.1-mini\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"result = await Runner.run(agent, \"what is the latest news on Apple?\")\n",
"\n",
"# Get the final output\n",
"print(result.final_output)\n",
"\n",
"# Get the entire list of messages recorded to generate the final output\n",
"pprint(result.to_input_list())"
]
}
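,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `to_input_list()` returns the full conversation so far, you can carry it into a follow-up run. A minimal sketch (the follow-up question is just an example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: continue the conversation by appending a follow-up question\n",
"# to the recorded message list and running the agent again.\n",
"input_items = result.to_input_list()\n",
"input_items.append({\"content\": \"How did the stock react to that news?\", \"role\": \"user\"})\n",
"\n",
"followup = await Runner.run(agent, input_items)\n",
"print(followup.final_output)"
]
}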
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: parallelization</center>\n",
"\n",
"A starter guide for building an agent with parallel LLM and tool calling using the `openai-agents` library.\n",
"\n",
"In this example, we are building a stock research agent. If we want to research 5 stocks, we can force the agent to run multiple tool calls, instead of sequentially."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents yfinance\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating your agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"import yfinance as yf\n",
"\n",
"\n",
"def get_stock_data(ticker_symbol):\n",
" stock = yf.Ticker(ticker_symbol)\n",
" data = stock.info\n",
" return {\n",
" \"symbol\": data.get(\"symbol\"),\n",
" \"current_price\": data.get(\"currentPrice\"),\n",
" \"market_cap\": data.get(\"marketCap\"),\n",
" \"sector\": data.get(\"sector\"),\n",
" \"industry\": data.get(\"industry\"),\n",
" \"description\": data.get(\"longBusinessSummary\"),\n",
" \"trailing_pe\": data.get(\"trailingPE\"),\n",
" \"forward_pe\": data.get(\"forwardPE\"),\n",
" \"dividend_yield\": data.get(\"dividendYield\"),\n",
" \"beta\": data.get(\"beta\"),\n",
" \"fifty_two_week_high\": data.get(\"fiftyTwoWeekHigh\"),\n",
" \"fifty_two_week_low\": data.get(\"fiftyTwoWeekLow\"),\n",
" \"fifty_day_moving_average\": data.get(\"fiftyDayAverage\"),\n",
" \"two_hundred_day_moving_average\": data.get(\"twoHundredDayAverage\"),\n",
" \"recommendation_key\": data.get(\"recommendationKey\"),\n",
" \"revenue_growth\": data.get(\"revenueGrowth\"),\n",
" \"earnings_growth\": data.get(\"earningsGrowth\"),\n",
" \"profit_margins\": data.get(\"profitMargins\"),\n",
" }\n",
"\n",
"\n",
"pprint(get_stock_data(\"AAPL\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from textwrap import dedent\n",
"\n",
"from agents import Agent, Runner, TResponseInputItem, WebSearchTool, function_tool\n",
"from agents.model_settings import ModelSettings\n",
"\n",
"\n",
"@function_tool\n",
"def get_stock_data_tool(ticker_symbol: str) -> dict:\n",
" \"\"\"\n",
" Get stock data for a given ticker symbol.\n",
" Args:\n",
" ticker_symbol: The ticker symbol of the stock to get data for.\n",
" Returns:\n",
" A dictionary containing stock data such as price, market cap, and more.\n",
" \"\"\"\n",
" return get_stock_data(ticker_symbol)\n",
"\n",
"\n",
"research_agent = Agent(\n",
" name=\"FinancialSearchAgent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve upâtoâdate context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" tools=[WebSearchTool(), get_stock_data_tool],\n",
" model_settings=ModelSettings(tool_choice=\"required\", parallel_tool_calls=True),\n",
")\n",
"\n",
"user_input = input(\"Enter the stock tickers you want to research: \")\n",
"input_items: list[TResponseInputItem] = [{\"content\": user_input, \"role\": \"user\"}]\n",
"\n",
"orchestrator = await Runner.run(research_agent, input_items)\n",
"orchestrator_output = orchestrator.final_output\n",
"pprint(orchestrator_output)"
]
}
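,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also parallelize at the level of whole agent runs. The sketch below fans out one `Runner.run` call per ticker with `asyncio.gather`; the ticker list is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: parallelize whole agent runs by fanning out one Runner.run\n",
"# call per ticker with asyncio.gather. The ticker list is just an example.\n",
"import asyncio\n",
"\n",
"tickers = [\"AAPL\", \"MSFT\", \"NVDA\"]\n",
"\n",
"results = await asyncio.gather(\n",
"    *[Runner.run(research_agent, f\"Research {ticker}\") for ticker in tickers]\n",
")\n",
"\n",
"for ticker, run_result in zip(tickers, results):\n",
"    print(f\"### {ticker} ###\")\n",
"    print(run_result.final_output)"
]
}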
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: prompt chaining agent</center>\n",
"\n",
"A starter guide for building an agent which chains two prompts together to generate an output using the `openai-agents` library.\n",
"\n",
"In the following example, we'll build a stock portfolio creation system using this pattern:\n",
"1. **Search Agent (Generation):** Searches the web for information on particular stock tickers.\n",
"2. **Report Agent (Generation):** Creates a portfolio of stocks and ETFs that supports the user's investment strategy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents pydantic\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the agent\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from textwrap import dedent\n",
"\n",
"from agents import Agent, Runner, TResponseInputItem, WebSearchTool\n",
"from agents.model_settings import ModelSettings\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class PortfolioItem(BaseModel):\n",
" ticker: str = Field(description=\"The ticker of the stock or ETF.\")\n",
" allocation: float = Field(\n",
" description=\"The percentage allocation of the ticker in the portfolio. The sum of all allocations should be 100.\"\n",
" )\n",
" reason: str = Field(description=\"The reason why this ticker is included in the portfolio.\")\n",
"\n",
"\n",
"class Portfolio(BaseModel):\n",
" tickers: list[PortfolioItem] = Field(\n",
" description=\"A list of tickers that could support the user's stated investment strategy.\"\n",
" )\n",
"\n",
"\n",
"portfolio_agent = Agent(\n",
" name=\"Portfolio Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation.\"\"\"\n",
" ),\n",
" model=\"o4-mini\",\n",
" output_type=Portfolio,\n",
")\n",
"\n",
"research_agent = Agent(\n",
" name=\"FinancialSearchAgent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a research assistant specializing in financial topics. Given an investment strategy, use web search to retrieve upâtoâdate context and produce a short summary of stocks that support the investment strategy at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" tools=[WebSearchTool()],\n",
" model_settings=ModelSettings(tool_choice=\"required\", parallel_tool_calls=True),\n",
")\n",
"\n",
"user_input = input(\"Enter your investment strategy: \")\n",
"input_items: list[TResponseInputItem] = [\n",
" {\"content\": f\"My investment strategy: {user_input}\", \"role\": \"user\"}\n",
"]\n",
"\n",
"\n",
"research_output = await Runner.run(research_agent, input_items)\n",
"pprint(research_output.final_output)\n",
"\n",
"input_items = research_output.to_input_list()\n",
"portfolio_output = await Runner.run(portfolio_agent, input_items)\n",
"pprint(portfolio_output.final_output)"
]
}
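,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the portfolio agent returns a structured `Portfolio` object, the allocations can be inspected directly rather than parsed out of free text. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: work with the structured Portfolio output directly.\n",
"portfolio = portfolio_output.final_output_as(Portfolio)\n",
"\n",
"for item in portfolio.tickers:\n",
"    print(f\"{item.ticker}: {item.allocation:.1f}% - {item.reason}\")\n",
"\n",
"print(f\"Total allocation: {sum(item.allocation for item in portfolio.tickers):.1f}%\")"
]
}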
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: orchestrator and workers</center>\n",
"\n",
"A starter guide for building an agent loop using the `openai-agents` library.\n",
"\n",
"This pattern uses orchestators and workers. The orchestrator chooses which worker to use for a specific sub-task. The worker attempts to complete the sub-task and return a result. The orchestrator then uses the result to choose the next worker to use until a final result is returned.\n",
"\n",
"In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.\n",
"1. **Orchestrator:** Chooses which worker to use based on the user's investment strategy.\n",
"2. **Research Agent:** Searches the web for information about stocks and ETFs that could support the user's investment strategy.\n",
"3. **Evaluation Agent:** Evaluates the research report and provides feedback on what data is missing.\n",
"4. **Portfolio Agent:** Creates a portfolio of stocks and ETFs based on the research report."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from textwrap import dedent\n",
"from typing import Literal\n",
"\n",
"from agents import Agent, Runner, TResponseInputItem, WebSearchTool\n",
"from agents.model_settings import ModelSettings\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class PortfolioItem(BaseModel):\n",
" ticker: str = Field(description=\"The ticker of the stock or ETF.\")\n",
" allocation: float = Field(\n",
" description=\"The percentage allocation of the ticker in the portfolio. The sum of all allocations should be 100.\"\n",
" )\n",
" reason: str = Field(description=\"The reason why this ticker is included in the portfolio.\")\n",
"\n",
"\n",
"class Portfolio(BaseModel):\n",
" tickers: list[PortfolioItem] = Field(\n",
" description=\"A list of tickers that could support the user's stated investment strategy.\"\n",
" )\n",
"\n",
"\n",
"class EvaluationFeedback(BaseModel):\n",
" feedback: str = Field(\n",
" description=\"What data is missing in order to create a portfolio of stocks and ETFs based on the user's investment strategy.\"\n",
" )\n",
" score: Literal[\"pass\", \"needs_improvement\", \"fail\"] = Field(\n",
" description=\"A score on the research report. Pass if you have at least 5 tickers with data that supports the user's investment strategy to create a portfolio, needs_improvement if you do not have enough supporting data, and fail if you have no tickers.\"\n",
" )\n",
"\n",
"\n",
"evaluation_agent = Agent(\n",
" name=\"Evaluation Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" output_type=EvaluationFeedback,\n",
")\n",
"\n",
"portfolio_agent = Agent(\n",
" name=\"Portfolio Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation.\"\"\"\n",
" ),\n",
" model=\"o4-mini\",\n",
" output_type=Portfolio,\n",
")\n",
"\n",
"research_agent = Agent(\n",
" name=\"FinancialSearchAgent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve upâtoâdate context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" tools=[WebSearchTool()],\n",
" model_settings=ModelSettings(tool_choice=\"required\", parallel_tool_calls=True),\n",
")\n",
"\n",
"orchestrator_agent = Agent(\n",
" name=\"Routing Agent\",\n",
" instructions=dedent(\"\"\"You are a senior financial analyst. You are trying to create a portfolio based on my stated investment strategy. Your task is to handoff to the appropriate agent or tool.\n",
"\n",
" First, handoff to the research_agent to give you a report on stocks and ETFs that could support the user's stated investment strategy.\n",
" Then, handoff to the evaluation_agent to give you a score on the research report. If the evaluation_agent returns a needs_improvement or fail, continue using the research_agent to gather more information.\n",
" Once the evaluation_agent returns a pass, handoff to the portfolio_agent to create a portfolio.\"\"\"),\n",
" model=\"gpt-4.1\",\n",
" handoffs=[\n",
" research_agent,\n",
" evaluation_agent,\n",
" portfolio_agent,\n",
" ],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"user_input = input(\"Enter your investment strategy: \")\n",
"input_items: list[TResponseInputItem] = [{\"content\": user_input, \"role\": \"user\"}]\n",
"\n",
"while True:\n",
" orchestrator = await Runner.run(orchestrator_agent, input_items)\n",
" orchestrator_output = orchestrator.final_output\n",
" pprint(orchestrator_output)\n",
"\n",
" input_items = orchestrator.to_input_list()\n",
" if isinstance(orchestrator_output, Portfolio):\n",
" break\n",
" print(\"Going back to orchestrator\")\n",
" # input_items.append({\"content\": f\"Keep going\", \"role\": \"user\"})\n",
"\n",
"print(\"AGENT COMPLETE\")"
]
}
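,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since control passes between agents via handoffs, it can help to log which agent handled each turn and to cap the number of turns. Below is a minimal sketch of the same loop with those additions; `MAX_TURNS` is an arbitrary value chosen for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: the same orchestrator loop, logging which agent handled each\n",
"# turn and capping the number of turns so a failing evaluation cannot loop forever.\n",
"MAX_TURNS = 10  # arbitrary cap for illustration\n",
"\n",
"input_items = [{\"content\": user_input, \"role\": \"user\"}]\n",
"for turn in range(MAX_TURNS):\n",
"    run = await Runner.run(orchestrator_agent, input_items)\n",
"    print(f\"[turn {turn}] handled by: {run.last_agent.name}\")\n",
"    pprint(run.final_output)\n",
"\n",
"    input_items = run.to_input_list()\n",
"    if isinstance(run.final_output, Portfolio):\n",
"        break"
]
}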
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"# <center>OpenAI agent pattern: routing</center>\n",
"\n",
"A starter guide for building an agent loop using the `openai-agents` library.\n",
"\n",
"This pattern uses routing to choose which specialized agent to use for a specific sub-task. The specialized agent attempts to complete the sub-task and return a result.\n",
"\n",
"In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.\n",
"1. **Router Agent:** Chooses which worker to use based on the user's investment strategy.\n",
"2. **Research Agent:** Searches the web for information about stocks and ETFs that could support the user's investment strategy.\n",
"3. **Question Answering Agent:** Answers questions about investing like Warren Buffett."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install base libraries for OpenAI\n",
"!pip install -q openai openai-agents\n",
"\n",
"# Install optional libraries for OpenInference/OpenTelemetry tracing\n",
"!pip install -q arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys\n",
"\n",
"Add your OpenAI API key to the environment variable `OPENAI_API_KEY`.\n",
"\n",
"Copy your Phoenix `API_KEY` from your settings page at [app.phoenix.arize.com](https://app.phoenix.arize.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"OPENAI_API_KEY = globals().get(\"OPENAI_API_KEY\") or getpass(\"ð Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from textwrap import dedent\n",
"\n",
"from agents import Agent, Runner, TResponseInputItem, WebSearchTool\n",
"from agents.model_settings import ModelSettings\n",
"\n",
"qa_agent = Agent(\n",
" name=\"Investing Q&A Agent\",\n",
" instructions=dedent(\"\"\"You are Warren Buffett. You are answering questions about investing.\"\"\"),\n",
" model=\"gpt-4.1\",\n",
")\n",
"\n",
"research_agent = Agent(\n",
" name=\"Financial Search Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve upâtoâdate context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" tools=[WebSearchTool()],\n",
" model_settings=ModelSettings(tool_choice=\"required\", parallel_tool_calls=True),\n",
")\n",
"\n",
"orchestrator_agent = Agent(\n",
" name=\"Routing Agent\",\n",
" instructions=dedent(\n",
" \"\"\"You are a senior financial analyst. Your task is to handoff to the appropriate agent or tool.\"\"\"\n",
" ),\n",
" model=\"gpt-4.1\",\n",
" handoffs=[\n",
" research_agent,\n",
" qa_agent,\n",
" ],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_items: list[TResponseInputItem] = []\n",
"\n",
"while True:\n",
" user_input = input(\"Enter your question: \")\n",
" if user_input == \"exit\":\n",
" break\n",
" input_item = {\"content\": user_input, \"role\": \"user\"}\n",
" input_items.append(input_item)\n",
" orchestrator = await Runner.run(orchestrator_agent, input_items)\n",
" orchestrator_output = orchestrator.final_output\n",
" pprint(orchestrator.last_agent)\n",
" pprint(orchestrator_output)\n",
" input_items = orchestrator.to_input_list()"
]
}
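,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a non-interactive run (for example, in a script), you can route a single question directly; the question below is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: route a single question without the interactive loop.\n",
"single_run = await Runner.run(\n",
"    orchestrator_agent,\n",
"    [{\"content\": \"What does Warren Buffett look for in a business?\", \"role\": \"user\"}],\n",
")\n",
"print(f\"Handled by: {single_run.last_agent.name}\")\n",
"print(single_run.final_output)"
]
}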
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}