Agent Workflow Patterns

Workflows are the backbone of many successful LLM applications. They define how language models interact with tools, data, and users—often through a sequence of clearly orchestrated steps. Unlike fully autonomous agents, workflows offer structure and predictability, making them a practical choice for many real-world tasks.

In this guide, we share practical workflows using a variety of agent frameworks, including:

  • AutoGen

  • CrewAI

  • Google GenAI SDK

  • OpenAI Agents

  • LangGraph

Each section highlights how to use these tools effectively—showing what’s possible, where they shine, and where a simpler solution might serve you better. Whether you're orchestrating deterministic workflows or building dynamic agentic systems, the goal is to help you choose the right tool for your context and build with confidence.

Routing

Agent Routing is the process of directing a task, query, or request to the most appropriate agent based on context or capabilities. In multi-agent systems, it helps determine which agent is best suited to handle a specific input based on skills, domain expertise, or available tools. This enables more efficient, accurate, and specialized handling of complex tasks.

Prompt Chaining

Prompt Chaining is the technique of breaking a complex task into multiple steps, where the output of one prompt becomes the input for the next. This allows a system to reason more effectively, maintain context across steps, and handle tasks that would be too difficult to solve in a single prompt. It's often used to simulate multi-step thinking or workflows.

Parallelization

Parallelization is the process of dividing a task into smaller, independent parts that can be executed simultaneously to speed up processing. It’s used to handle multiple inputs, computations, or agent responses at the same time rather than sequentially. This improves efficiency and speed, especially for large-scale or time-sensitive tasks.

Orchestrator-workers

An orchestrator is a central controller that manages and coordinates multiple components, agents, or processes to ensure they work together smoothly.

It decides what tasks need to be done, who or what should do them, and in what order. An orchestrator can handle things like scheduling, routing, error handling, and result aggregation. It might also manage prompt chains, route tasks to agents, and oversee parallel execution.

Evaluator-Optimizer

An evaluator assesses the quality or correctness of outputs, such as ranking responses, checking for factual accuracy, or scoring performance against a metric. An optimizer uses that evaluation to improve future outputs, either by fine-tuning models, adjusting parameters, or selecting better strategies. Together, they form a feedback loop that helps a system learn what works and refine itself over time.

For a deeper dive into the principles behind agentic systems and when to use them, see Anthropic's "Building Effective Agents".


What is the difference between GRPC and HTTP?

gRPC and HTTP are communication protocols used to transfer data between client and server applications.

  • HTTP (Hypertext Transfer Protocol) is a stateless protocol primarily used for website and web application requests over the internet.

  • gRPC (gRPC Remote Procedure Call) is a modern, open-source communication protocol from Google that uses HTTP/2 for transport, Protocol Buffers as the interface description language, and provides features like bi-directional streaming, multiplexing, and flow control.

gRPC is more efficient in a tracing context than HTTP, but HTTP is more widely supported.

Phoenix can send traces over either HTTP or gRPC.
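As a rough sketch of what the choice looks like in code, the standard OpenTelemetry OTLP exporters target one protocol or the other; the endpoints below are common Phoenix defaults and may differ in your deployment.

# Sketch: selecting an OTLP exporter protocol (endpoints are assumed defaults)
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter as HTTPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter as GRPCSpanExporter

# HTTP: spans are POSTed to the collector's /v1/traces route
http_exporter = HTTPSpanExporter(endpoint="http://localhost:6006/v1/traces")

# gRPC: spans are streamed to the collector's dedicated gRPC port
grpc_exporter = GRPCSpanExporter(endpoint="http://localhost:4317")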

Retrieval Evals on Document Chunks

Retrieval Evals are designed to evaluate the effectiveness of retrieval systems. Retrieval systems typically return a list of chunks of length k, ordered by relevance. The most common retrieval systems in the LLM ecosystem are vector DBs.

The retrieval Eval is designed to assess the relevance of each chunk and its ability to answer the question. More information on the retrieval Eval can be found here.

The picture above shows a single query returning k=4 chunks as a list. The retrieval Eval runs across each chunk, returning a list of relevance values, one per chunk. Phoenix provides helper functions that take in a dataframe with a query column containing lists of chunks and produce a column of equal-length lists with an Eval for each chunk.
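A rough sketch of this flow, assuming a dataframe whose reference column holds the list of retrieved chunks per query; the column names, model choice, and re-grouping step are illustrative and should be adapted to your data, template variables, and Phoenix version.

import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per (query, chunk) pair; the original query index is kept in query_id
exploded = df.explode("reference").reset_index(names="query_id")

chunk_evals = llm_classify(
    dataframe=exploded,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model_name="gpt-4"),
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
)

# Re-group the per-chunk labels into lists of length k, one list per original query
per_query_labels = (
    chunk_evals.assign(query_id=exploded["query_id"].values).groupby("query_id")["label"].agg(list)
)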

Will Phoenix Cloud be on the latest version of Phoenix?

We update the Phoenix version used by Phoenix Cloud on a weekly basis.

Can I persist data in a notebook?

You can persist data in the notebook either by setting the use_temp_dir flag to False in px.launch_app, which will persist your data in SQLite on disk at the PHOENIX_WORKING_DIR, or by deploying a Phoenix instance and pointing to it via PHOENIX_COLLECTOR_ENDPOINT.
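A minimal sketch of the first option (the working directory value is just an example; if unset, Phoenix uses its default location):

import os
import phoenix as px

os.environ["PHOENIX_WORKING_DIR"] = "/path/to/phoenix-data"  # example location, adjust as needed

# Data is persisted to SQLite on disk instead of a temporary directory
session = px.launch_app(use_temp_dir=False)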

How do I resolve Phoenix Evals showing NOT_PARSABLE?

NOT_PARSABLE errors often occur when LLM responses exceed the max_tokens limit or produce incomplete JSON. Here's how to fix it:

  1. Increase max_tokens: Update the model configuration as follows:

    from getpass import getpass
    from phoenix.evals import OpenAIModel

    llm_judge_model = OpenAIModel(
        api_key=getpass("Enter your OpenAI API key..."),
        model="gpt-4o-2024-08-06",
        temperature=0.2,
        max_tokens=1000,  # Increase token limit
    )
  2. Update Phoenix: Use version ≥0.17.4, which removes token limits for OpenAI and increases defaults for other APIs.

  3. Check Logs: Look for finish_reason="length" to confirm token limits caused the issue.

  4. Check the rails: If the above doesn't work, it's possible the LLM-as-a-judge output doesn't fit into the defined rails for that particular custom Phoenix eval. Double-check that the prompt output matches the rail expectations.

What is LlamaTrace vs Phoenix Cloud?

LlamaTrace and Phoenix Cloud are the same tool. They are the hosted version of Phoenix provided on app.phoenix.arize.com.

Can I use Azure OpenAI?

Yes. In fact, this is probably the preferred way to interact with OpenAI if your enterprise requires data privacy. Getting the parameters right for Azure can be a bit tricky, so check out the models section for details.
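As a hedged sketch only (parameter names vary across phoenix versions and Azure setups; confirm against the models section), configuring an Azure-backed eval model looks roughly like this:

from phoenix.evals import OpenAIModel

# Assumed Azure parameters for illustration; replace the placeholders with your deployment's values
azure_model = OpenAIModel(
    model="gpt-4o",                                            # the model behind your Azure deployment
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_version="2024-02-01",
    api_key="<AZURE_OPENAI_API_KEY>",
)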

What is the difference between Phoenix and Arize?

Arize is the company that makes Phoenix. Phoenix is an open-source LLM observability tool offered by Arize. It can be accessed in its Cloud form online, or self-hosted and run on your own machine or server.

"Arize" can also refer to Arize's enterprise platform, often called Arize AX, available on arize.com. Arize AX is the enterprise SaaS version of Phoenix that comes with additional features like Copilot, ML and CV support, HIPAA compliance, security reviews, a customer success team, and more. See here for a breakdown of the two tools.

Evals With Explanations

In many cases it can be hard to understand why an LLM responds in a specific way. The explanation feature of Phoenix allows you to get an Eval output and an explanation from the LLM at the same time. We have found this incredibly useful for debugging LLM Evals.

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails are used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,  # df must contain the columns referenced by the template
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  #generate an explanation alongside each label
)
#relevance_classifications is a Dataframe with columns 'label' and 'explanation'

The provide_explanation flag above can be set with any of the built-in templates or your own custom templates. The example above is from a relevance Evaluation.

LLM as a Judge

Evaluating tasks performed by LLMs can be difficult due to their complexity and the diverse criteria involved. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) often fall short when applied to the nuanced and varied outputs of LLMs.

For instance, an AI assistant’s answer to a question can be:

  • not grounded in context

  • repetitive, repetitive, repetitive

  • grammatically incorrect

  • excessively lengthy and characterized by an overabundance of words

  • incoherent

The list of criteria goes on. And even if we had a limited list, each of these would be hard to measure.

To overcome this challenge, the concept of "LLM as a Judge" employs an LLM to evaluate another's output, combining human-like assessment with machine efficiency.

How It Works

Here’s the step-by-step process for using an LLM as a judge:

  1. Identify Evaluation Criteria - First, determine what you want to evaluate, be it hallucination, toxicity, accuracy, or another characteristic. See our pre-built evaluators for examples of what can be assessed.

  2. Craft Your Evaluation Prompt - Write a prompt template that will guide the evaluation. This template should clearly define what variables are needed from both the initial prompt and the LLM's response to effectively assess the output.

  3. Select an Evaluation LLM - Choose the most suitable LLM from our available options for conducting your specific evaluations.

  4. Generate Evaluations and View Results - Execute the evaluations across your data. This process allows for comprehensive testing without the need for manual annotation, enabling you to iterate quickly and refine your LLM's prompts.

Using an LLM as a judge significantly enhances the scalability and efficiency of the evaluation process. By employing this method, you can run thousands of evaluations across curated data without the need for human annotation.

This capability will not only speed up the iteration process for refining your LLM's prompts but will also ensure that you can deploy your models to production with confidence.

What is my Phoenix Endpoint?

There are two endpoints that matter in Phoenix:

  1. Application Endpoint: The endpoint your Phoenix instance is running on

  2. OTEL Tracing Endpoint: The endpoint through which your Phoenix instance receives OpenTelemetry traces

Application Endpoint

If you're accessing a Phoenix Cloud instance through our website, then your endpoint is https://app.phoenix.arize.com

If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default value is http://localhost:6006

To set this endpoint, use the PHOENIX_COLLECTOR_ENDPOINT environment variable. This is used by the Phoenix client package to query traces, log annotations, and retrieve prompts.

OTEL Tracing Endpoint

If you're accessing a Phoenix Cloud instance through our website, then your OTEL tracing endpoint is https://app.phoenix.arize.com/v1/traces

If you're self-hosting Phoenix, then you choose the endpoint when you set up the app. The default values are:

  • Using the HTTP protocol: http://localhost:6006/v1/traces

  • Using the gRPC protocol: http://localhost:4317

To set this endpoint, use the register(endpoint=YOUR ENDPOINT) function. This endpoint can also be set using environment variables. For more on the register function and other configuration options, see here.

As of May 2025, Phoenix Cloud only supports trace collection via HTTP.
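For example, a minimal sketch pointing a tracer at a self-hosted instance over HTTP (swap in your own endpoint):

from phoenix.otel import register

# Registers an OpenTelemetry tracer provider that exports spans to Phoenix
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")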

Can I run Phoenix on Sagemaker?

With SageMaker notebooks, Phoenix leverages jupyter-server-proxy to host the server under proxy/6006. Note that Phoenix will automatically try to detect that you are running in SageMaker, but you can declare the notebook runtime via a parameter to launch_app or an environment variable:

import os

os.environ["PHOENIX_NOTEBOOK_ENV"] = "sagemaker"

Custom Task Evaluation

Customize Your Own Eval Templates

The LLM Evals library is designed to support the building of any custom Eval templates.

Steps to Building Your Own Eval

Follow these steps to easily build your own Eval with Phoenix.

1. Choose a Metric

To do that, you must identify the metric best suited for your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?

2. Build a Golden Dataset

Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the “ground truth” label so that we can measure performance of the LLM eval template. Often such labels come from human feedback.

Building such a dataset is laborious, but you can often find a standardized one for the most common use cases, as in the example below:

from phoenix.evals import download_benchmark_dataset

df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
df.head()

The Eval inferences are designed for easy benchmarking and come as pre-set, downloadable test inferences. The inferences are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.

3. Decide Which LLM to use For Evaluation

Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

4. Build the Eval Template

Now comes the core component that we are trying to benchmark and improve: the eval template.

You can adjust an existing template or build your own from scratch.

Be explicit about the following:

  • What is the input? In our example, it is the documents/context that was retrieved and the query from the user.

  • What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query

  • What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

To create a new template, all that is needed is to set the input string to the Eval function:

MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be single word, either "positive" or "negative"
    '''

The above template shows an example creation of an easy-to-use string template. The Phoenix Eval templates support both strings and objects:

#Phoenix Evals support using either strings or objects as templates
MY_CUSTOM_TEMPLATE = " ..."
MY_CUSTOM_TEMPLATE = PromptTemplate("This is a test {prompt}")

The example below shows a use of the custom-created template on the df dataframe:

model = OpenAIModel(model_name="gpt-4", temperature=0.6)
positive_eval = llm_classify(
    dataframe=df,
    template=MY_CUSTOM_TEMPLATE,
    model=model,
    rails=["positive", "negative"],  # hold the output to the values expected by the template
)

5. Run Eval on your Golden Dataset and Benchmark Performance

You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
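For instance, a minimal sketch of benchmarking with scikit-learn, assuming the golden dataset carries a ground_truth column and the eval produced a label column (both names are illustrative):

from sklearn.metrics import classification_report

# Compare the eval's predicted labels against the golden dataset's ground-truth labels
print(
    classification_report(
        df["ground_truth"],
        positive_eval["label"],
        labels=["positive", "negative"],
    )
)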

Can I add other users to my Phoenix Instance?

Currently, Phoenix Cloud accounts are set up to be used specifically by one developer. We will be adding ways to share your traces with other developers on your team shortly! Self-hosted Phoenix supports multiple users with authentication, roles, and more.

How Tracing Works

The components behind tracing

Instrumentation

In order for an application to emit traces for analysis, the application must be instrumented. Your application can be manually instrumented or automatically instrumented. With Phoenix, there is a set of plugins (instrumentors) that can be added to your application's startup process to perform auto-instrumentation. These plugins collect spans for your application and export them for collection and visualization. For Phoenix, all the instrumentors are managed via a single repository called OpenInference. The comprehensive list of instrumentors can be found in the how-to guide.

Exporter

An exporter takes the spans created via instrumentation and exports them to a collector. In simple terms, it just sends the data to Phoenix. When using Phoenix, most of this is done under the hood when you call instrument on an instrumentor.

Collector

The Phoenix server is a collector of traces over OTLP and a UI that helps you troubleshoot your application in real time. When you run Phoenix (e.g. px.launch_app(), or via a container), it starts receiving spans from any application(s) exporting spans to it.

OpenTelemetry Protocol

OpenTelemetry Protocol (or OTLP for short) is the means by which traces arrive from your application to the Phoenix collector. Phoenix supports OTLP over both HTTP and gRPC.

How can I configure the backend to send the data to the phoenix UI in another container?

If you are working on an API whose endpoints perform RAG, but would like the Phoenix server not to be launched as another thread, you can configure the PHOENIX_COLLECTOR_ENDPOINT environment variable to point to the server running in a different process or container.
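A minimal sketch, assuming the Phoenix container is reachable at a hypothetical phoenix hostname inside your network:

import os

# Point the API process at the Phoenix container instead of launching an in-process server
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://phoenix:6006"

from phoenix.otel import register

tracer_provider = register()  # picks up the collector endpoint from the environment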

Contribute to Phoenix

If you want to contribute to the cutting edge of LLM and ML Observability, you've come to the right place!

To get started, please check out the following:

  • Our development guide

  • Code of conduct

  • Contribution License Agreement

Picking a GitHub Issue

We encourage you to start with an issue labeled with the good first issue tag on the GitHub issue board to get familiar with our codebase as a first-time contributor.

Submit Your Code

To submit your code, fork the Phoenix repository, create a new branch on your fork, and open a Pull Request (PR) once your work is ready for review.

In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.

A Core reviewer will review your PR in around one business day and provide feedback on any changes it requires to be approved. Once approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.

Your PR is now merged into Phoenix! We’ll shout out your contribution in the release notes.

Eval Data Types

There are multiple types of evaluations supported by the Phoenix library. Each category of evaluation is differentiated by its output type.

  • Categorical (binary) - The evaluation results in a binary output, such as true/false or yes/no, which can be easily represented as 1/0. This simplicity makes it straightforward for decision-making processes but lacks the ability to capture nuanced judgements.

  • Categorical (Multi-class) - The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.

  • Score - The evaluation result is a numeric value within a set range (e.g. 1-10), offering a scale of measurement.

Although score evals are an option in Phoenix, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.

Categorical evals, especially multi-class, strike a balance between simplicity and the ability to convey distinct evaluative outcomes, making them more suitable for applications where precise and consistent decision-making is important.

To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.

Can I use gRPC for trace collection?

"Arize" can also refer to Arize's enterprise platform, often called Arize AX, available on arize.com. Arize AX is the enterprise SaaS version of Phoenix that comes with additional features like Copilot, ML and CV support, HIPAA compliance, Security Reviews, a customer success team, and more. See of the two tools.

Identify Evaluation Criteria - First, determine what you want to evaluate, be it hallucination, toxicity, accuracy, or another characteristic. See our for examples of what can be assessed.

To set this endpoint, use the register(endpoint=YOUR ENDPOINT) function. This endpoint can also be set using environment variables. For more on the register function and other configuration options, .

With SageMaker notebooks, phoenix leverages the to host the server under proxy/6006.Note, that phoenix will automatically try to detect that you are running in SageMaker but you can declare the notebook runtime via a parameter to launch_app or an environment variable

supports multiple user with , roles, and more.

In order for an application to emit traces for analysis, the application must be instrumented. Your application can be manually instrumented or be automatically instrumented. With phoenix, there a set of plugins (instrumentors) that can be added to your application's startup process that perform auto-instrumentation. These plugins collect spans for your application and export them for collection and visualization. For phoenix, all the instrumentors are managed via a single repository called . The comprehensive list of instrumentors can be found in the how-to guide.

You can do this by configuring the following the variable PHOENIX_COLLECTOR_ENDPOINT to point to the server running in a different process or container.

We encourage you to start with an issue labeled with the tag on theGitHub issue board, to get familiar with our codebase as a first-time contributor.

To submit your code, , create a on your fork, and open once your work is ready for review.

To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out on LLM eval data types.

Phoenix does natively support gRPC for trace collection post 4.0 release. See for details.

here for a breakdown
pre-built evaluators
see here
jupyter-server-proy
from phoenix.evals import download_benchmark_dataset

df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
df.head()
MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be single word, either "positive" or "negative"
    '''
model = OpenAIModel(model_name="gpt-4",temperature=0.6)
positive_eval = llm_classify(
    dataframe=df,
    template= MY_CUSTOM_TEMPLATE,
    model=model
)
#Phoenix Evals support using either strings or objects as templates
MY_CUSTOM_TEMPLATE = " ..."
MY_CUSTOM_TEMPLATE = PromptTemplate("This is a test {prompt}")
Self-hosted Phoenix
authentication
OpenInference
environment
Our development guide
Code of conduct
Contribution License Agreement
good first issue
fork the Phoenix repository
new branch
a Pull Request (PR)
our research
Configuration

AutoGen

Use Phoenix to trace and evaluate AutoGen agents

AutoGen is an open-source framework by Microsoft for building multi-agent workflows. The AutoGen agent framework provides tools to define, manage, and orchestrate agents, including customizable behaviors, roles, and communication protocols.

Phoenix can be used to trace AutoGen agents by instrumenting their workflows, allowing you to visualize agent interactions, message flows, and performance metrics across multi-agent chains.


AutoGen Core Concepts

  • UserProxyAgent: Acts on behalf of the user to initiate tasks, guide the conversation, and relay feedback between agents. It can operate in auto or human-in-the-loop mode and control the flow of multi-agent interactions.

  • AssistantAgent: Performs specialized tasks such as code generation, review, or analysis. It supports role-specific prompts, memory of prior turns, and can be equipped with tools to enhance its capabilities.

  • GroupChat: Coordinates structured, turn-based conversations among multiple agents. It maintains shared context, controls agent turn-taking, and stops the chat when completion criteria are met.

  • GroupChatManager: Manages the flow and logic of the GroupChat, including termination rules, turn assignment, and optional message routing customization.

  • Tool Integration: Agents can use external tools (e.g. Python, web search, RAG retrievers) to perform actions beyond text generation, enabling more grounded or executable outputs.

  • Memory and Context Tracking: Agents retain and access conversation history, enabling coherent and stateful dialogue over multiple turns.
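A minimal sketch of these concepts wired together with the classic AutoGen (pyautogen) API; the llm_config values are placeholders:

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4o-mini", "api_key": "<OPENAI_API_KEY>"}  # placeholder configuration

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",     # fully automatic; use "ALWAYS" for human-in-the-loop
    code_execution_config=False,  # no local code execution in this sketch
)

# The user proxy initiates the task and relays messages to and from the assistant
user_proxy.initiate_chat(
    assistant,
    message="Summarize the benefits of multi-agent workflows in three bullet points.",
    max_turns=2,
)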


Design Considerations and Limitations

  • Agent Roles: Poorly defined responsibilities can cause overlap or miscommunication, especially between multi-agent workflows.

  • Termination Conditions: GroupChat may continue even after a logical end, as UserProxyAgent can exhaust all allowed turns before stopping unless termination is explicitly triggered.

  • Human-in-the-Loop: Fully autonomous mode may miss important judgment calls without user oversight.

  • State Management: Excessive context can exceed token limits, while insufficient context breaks coherence.


Prompt Chaining

Prompt chaining is a method where a complex task is broken into smaller, linked subtasks, with the output of one step feeding into the next. This workflow is ideal when a task can be cleanly decomposed into fixed subtasks, making each LLM call simpler and more accurate — trading off latency for better overall performance.

AutoGen makes it easy to build these chains by coordinating multiple agents. Each AssistantAgent focuses on a specialized task, while a UserProxyAgent manages the conversation flow and passes key outputs between steps. With Phoenix tracing, we can visualize the entire sequence, monitor individual agent calls, and debug the chain easily.

Notebook: Market Analysis Prompt Chaining Agent. The agent conducts a multi-step market analysis workflow, starting with identifying general trends and culminating in an evaluation of company strengths.

How to evaluate:

  • Ensure outputs are moved into inputs for the next step and logically build across steps (e.g., do identified trends inform the company evaluation?)

  • Confirm that each prompt step produces relevant and distinct outputs that contribute to the final analysis

  • Track total latency and token counts to see which steps cause inefficiencies

  • Ensure there are no redundant outputs or hallucinations in multi-step reasoning
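A rough sketch of the chaining mechanic described above, assuming the pyautogen API; the agents, prompts, and market-analysis steps are illustrative:

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4o-mini", "api_key": "<OPENAI_API_KEY>"}  # placeholder configuration

trend_analyst = AssistantAgent("trend_analyst", llm_config=llm_config)
company_evaluator = AssistantAgent("company_evaluator", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", code_execution_config=False)

# Step 1: identify general market trends
step1 = user_proxy.initiate_chat(
    trend_analyst, message="List three current trends in wearable technology.", max_turns=1
)

# Step 2: the step-1 summary becomes the input to the company evaluation prompt
step2 = user_proxy.initiate_chat(
    company_evaluator,
    message=f"Given these trends:\n{step1.summary}\nEvaluate the strengths of a smartwatch startup.",
    max_turns=1,
)
print(step2.summary)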


Routing

Routing is a pattern designed to handle incoming requests by classifying them and directing them to the single most appropriate specialized agent or workflow.

AutoGen simplifies implementing this pattern by enabling a dedicated 'Router Agent' to analyze incoming messages and signal its classification decision. Based on this classification, the workflow explicitly directs the query to the appropriate specialist agent for a focused, separate interaction. The specialist agent is equipped with tools to carry out the request.

Notebook: Customer Service Routing Agent. We will build an intelligent customer service system designed to efficiently handle diverse user queries by directing them to a specialized AssistantAgent.

How to evaluate:

  • Ensure the Router Agent consistently classifies incoming queries into the correct category (e.g., billing, technical support, product info)

  • Confirm that each query is routed to the appropriate specialized AssistantAgent without ambiguity or misdirection

  • Test with edge cases and overlapping intents to assess the router’s ability to disambiguate accurately

  • Watch for routing failures, incorrect classifications, or dropped queries during handoff between agents


Evaluator–Optimizer Loop

The Evaluator-Optimizer pattern employs a loop where one agent acts as a generator, creating an initial output (like text or code), while a second agent serves as an evaluator, providing critical feedback against criteria. This feedback guides the generator through successive revisions, enabling iterative refinement. This approach trades increased interactions for a more polished & accurate final result.

AutoGen's GroupChat architecture is good for implementing this pattern because it can manage the conversational turns between the generator and evaluator agents. The GroupChatManager facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.

Notebook: Code Generator with Evaluation Loop. We'll use a Code_Generator agent to write Python code from requirements, and a Code_Reviewer agent to assess it for correctness, style, and documentation. This iterative GroupChat process improves code quality through a generation and review loop.

How to evaluate:

  • Ensure the evaluator provides specific, actionable feedback aligned with criteria (e.g., correctness, style, documentation)

  • Confirm that the generator incorporates feedback into meaningful revisions with each iteration

  • Track the number of iterations required to reach an acceptable or final version to assess efficiency

  • Watch for repetitive feedback loops, regressions, or ignored suggestions that signal breakdowns in the refinement process


Orchestrator Pattern

Orchestration enables collaboration among multiple specialized agents, activating only the most relevant one based on the current subtask context. Instead of relying on a fixed sequence, agents dynamically participate depending on the state of the conversation.

Agent orchestrator workflows simplify this routing pattern through a central orchestrator (GroupChatManager) that selectively delegates tasks to the appropriate agents. Each agent monitors the conversation but contributes only when its specific expertise is required.

Notebook: Trip Planner Orchestrator Agent. We will build a dynamic travel planning assistant. A GroupChatManager coordinates specialized agents to adapt to the user's evolving travel needs.

How to evaluate:

  • Ensure the orchestrator activates only relevant agents based on the current context or user need (e.g., flights, hotels, local activities)

  • Confirm that agents contribute meaningfully and only when their domain expertise is required

  • Track the conversation flow to verify smooth handoffs and minimal overlap or redundancy among agents

  • Test with evolving and multi-intent queries to assess the orchestrator’s ability to adapt and reassign tasks dynamically


Parallel Agent Execution

Parallelization is a powerful agent pattern where multiple tasks are run concurrently, significantly speeding up the overall process. Unlike purely sequential workflows, this approach is suitable when tasks are independent and can be processed simultaneously.

AutoGen doesn't have a built-in parallel execution manager, but its core agent capabilities integrate seamlessly with standard Python concurrency libraries. We can use these libraries to launch multiple agent interactions concurrently.

Notebook: Product Description Parallelization Agent. We'll generate different components of a product description for a smartwatch (features, value proposition, target customer, tagline) by calling a marketing agent. At the end, the results are synthesized together.

How to evaluate:

  • Ensure each parallel agent call produces a distinct and relevant component (e.g., features, value proposition, target customer, tagline)

  • Confirm that all outputs are successfully collected and synthesized into a cohesive final product description

  • Track per-task runtime and total execution time to measure parallel speedup vs. sequential execution

  • Test with varying product types to assess generality and stability of the parallel workflow
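A minimal sketch of the concurrency mechanic using Python's standard library; generate_component is a hypothetical helper that wraps a single agent interaction:

from concurrent.futures import ThreadPoolExecutor

def generate_component(section: str) -> str:
    # Hypothetical helper: run one agent interaction for the given section and return its text
    ...

sections = ["features", "value proposition", "target customer", "tagline"]

# Launch the independent agent calls concurrently, then synthesize the results downstream
with ThreadPoolExecutor(max_workers=len(sections)) as pool:
    results = dict(zip(sections, pool.map(generate_component, sections)))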

Frequently Asked Questions

  • What is the difference between Phoenix and Arize?

  • What is my Phoenix Endpoint?

  • What is LlamaTrace vs Phoenix Cloud?

  • Langfuse alternative? Arize Phoenix vs Langfuse

  • LangSmith alternative? Arize Phoenix vs LangSmith

  • Will Phoenix Cloud be on the latest version of Phoenix?

  • Can I add other users to my Phoenix Instance?

  • Can I use Azure OpenAI?

  • Can I use Phoenix locally from a remote Jupyter instance?

  • How can I configure the backend to send the data to the phoenix UI in another container?

  • Can I persist data in a notebook?

  • What is the difference between GRPC and HTTP?

  • Can I use gRPC for trace collection?

  • How do I resolve Phoenix Evals showing NOT_PARSABLE?

Retrieval with Embeddings

Overview

Possibly the most common use case for creating an LLM application is to connect an LLM to proprietary data such as enterprise documents or video transcriptions. Applications such as these are often built on top of LLM frameworks such as LangChain or llama_index, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot looking across a company's knowledge base/context to answer specific questions.

Q&A with Retrieval at a Glance

LLM Input: User Query + retrieved document

LLM Output: Response based on query + document

Evaluation Metrics:

  1. Did the LLM answer the question correctly (correctness)

  2. For each retrieved document, is the document relevant to answer the user query?

How to Evaluate Retrieval Systems

There are varying degrees to which we can evaluate retrieval systems.

Step 1: First we care if the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?

Step 2: Once we know there's an issue, we need metrics to trace down where specifically it went wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?

Step 3: If retrieval is not the issue, we should check if we even have the right documents to answer the question.

Each question maps to a metric with its own trade-offs:

  • Is this a bad response to the answer? Metric: user feedback or LLM Eval for Q&A. Pros: most relevant way to measure the application. Cons: hard to trace down specifically what to fix.

  • Is the retrieved context relevant? Metric: LLM Eval for relevance. Pros: directly measures the effectiveness of retrieval. Cons: requires additional LLM calls.

  • Is the knowledge base missing areas of user queries? Metric: query density (drift), Phoenix generated. Pros: highlights groups of queries with large distance from context. Cons: identifies broad topics missing from the knowledge base, but not small gaps.

Using Phoenix Traces & Spans

Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.

When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.

Using Phoenix Inferences to Analyze RAG (Retrieval Augmented Generation)

Step 1. Identifying Clusters of Bad Responses

Phoenix surfaces up clusters of similar queries that have poor feedback.

Step 2: Irrelevant Documents Being Retrieved

Phoenix can help uncover when irrelevant context is being retrieved using the LLM Evals for Relevance. You can look at a cluster's aggregate relevance metric with precision @k, NDCG, MRR, etc to identify where to improve. You can also look at a single prompt/response pair and see the relevance of documents.

Step 3: Don't Have Any Documents Close Enough

Phoenix can help you identify if there is context that is missing from your knowledge base. By visualizing query density, you can understand what topics you need to add additional documentation for in order to improve your chatbot's responses.

By setting the "primary" dataset as the user queries, and the "corpus" dataset as the context I have in my vector store, I can see if there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.

Troubleshooting Tip: Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster. Try out the colab here.

Looking for code to get started? Go to our Quickstart guide for Search and Retrieval.

Benchmarking Retrieval

Benchmarking Chunk Size, K and Retrieval Approach

The advent of LLMs is causing a rethinking of the possible architectures of retrieval systems that have been around for decades.

The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data, empowering the LLM to know your data and respond based on the private data you fit into the context window.

As teams set up their retrieval systems, understanding performance and configuring the parameters around RAG (type of retrieval, chunk size, and K) is currently a guessing game for most.

The above picture shows a typical retrieval architecture designed for RAG, with a vector DB, an LLM, and an optional framework.

This section will go through a script that iterates through all possible parameterizations of setting up a retrieval system and uses Evals to understand the trade-offs.

This overview will run through the scripts in Phoenix for performance analysis of a RAG setup:

The scripts above power the included notebook.

Retrieval Performance Analysis

In the typical flow of retrieval, a user query is embedded and used to search a vector store for chunks of relevant data.

The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar but not usable to answer the question!

The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question of "Does the chunk of data contain relevant information to answer the question"?

The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.

The Evals generated on each chunk can then be used to generate more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at traditional search and retrieval metrics such as:

  • MRR

  • Precision @ K

  • NDCG

These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.

These metrics can be used overall, by cluster (UMAP), or on individual decisions, making them very powerful to track down problems from the simplest to the most complex.
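As a small illustrative sketch (plain Python, not a Phoenix API), precision@k and MRR can be computed directly from the per-chunk relevance labels produced by the Eval:

def precision_at_k(relevance: list[int], k: int) -> float:
    # Fraction of the top-k retrieved chunks that are relevant (1 = relevant, 0 = irrelevant)
    return sum(relevance[:k]) / k if k else 0.0

def mean_reciprocal_rank(relevance_lists: list[list[int]]) -> float:
    # Average of 1/rank of the first relevant chunk per query (0 if no chunk is relevant)
    reciprocal_ranks = []
    for relevance in relevance_lists:
        rank = next((i + 1 for i, rel in enumerate(relevance) if rel), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(precision_at_k([1, 0, 1, 0], k=4))                   # 0.5
print(mean_reciprocal_rank([[0, 1, 0, 0], [1, 0, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75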

Retrieval Evals only give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not indicate whether the final answer was correct.

Q&A Evals

The Q&A Evals work to give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares the most about and is one of the most important metrics.

The above Eval shows how the query, chunks and answer are used to create an overall assessment of the entire system.

The above Q&A Eval shows how the Query, Chunk and Answer are used to generate a % incorrect for production evaluations.

Results

The results from the runs will be available in the directory:

experiment_data/

Underneath experiment_data there are two sets of metrics:

The first set of results removes the cases where there are 0 retrieved relevant documents. There are cases where some clients' test sets have a large number of questions that the documents cannot answer. This can skew the metrics a lot.

experiment_data/results_zero_removed

The second set of results is unfiltered and shows the raw metrics for every retrieval.

experiment_data/results_zero_not_removed

The above picture shows the results of benchmark sweeps across your retrieval system setup. The lower the percent the better the results. This is the Q&A Eval.

The above graphs show MRR results across a sweep of different chunk sizes.

Can I use Phoenix locally from a remote Jupyter instance?

Yes, you can use either of the two methods below.

1. Via ngrok (Preferred)

  • Install pyngrok on the remote machine using the command pip install pyngrok.

  • Create a free account on ngrok and verify your email. Find 'Your Authtoken' on the dashboard.

  • In the jupyter notebook, after launching phoenix, set its port number as the port parameter in the code below. Preferably use a default port for phoenix so that you won't have to set up an ngrok tunnel every time for a new port; simply restarting phoenix will work on the same ngrok URL.

import getpass
from pyngrok import ngrok, conf
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()
port = 37689
# Open a ngrok tunnel to the HTTP server
public_url = ngrok.connect(port).public_url
print(" * ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}\"".format(public_url, port))

  • "Visit Site" using the newly printed public_url and ignore warnings, if any.

NOTE: The ngrok free account does not allow more than 3 tunnels over a single ngrok agent session. Tackle this error by checking active URL tunnels using ngrok.get_tunnels() and closing the required URL tunnel using ngrok.disconnect(public_url).

2. Via SSH

This assumes you have already set up ssh on both the local machine and the remote server.

If you are accessing a remote jupyter notebook from a local machine, you can also access the phoenix app by forwarding a local port to the remote server via ssh. In this particular case of using phoenix on a remote server, it is recommended that you use a default port for launching phoenix, say DEFAULT_PHOENIX_PORT.

  • Launch the phoenix app from jupyter notebook.

  • In a new terminal or command prompt, forward a local port of your choice from 49152 to 65535 (say 52362) using the command below. The remote user of the remote host must have sufficient port-forwarding/admin privileges.

ssh -L 52362:localhost:<DEFAULT_PHOENIX_PORT> <REMOTE_USER>@<REMOTE_HOST>

  • If successful, visit localhost:52362 to access phoenix locally.

If you are abruptly unable to access phoenix, check whether the ssh connection is still alive by inspecting the terminal. You can also try increasing the ssh timeout settings.

Closing ssh tunnel:

Simply run exit in the terminal/command prompt where you ran the port forwarding command.


CrewAI

Use Phoenix to trace and evaluate different CrewAI agent patterns

Core Concepts of CrewAI

Agents

Agents are autonomous, role-driven entities designed to perform specific functions—like a Researcher, Writer, or Support Rep. They can be richly customized with goals, backstories, verbosity settings, delegation permissions, and access to tools. This flexibility makes agents expressive and task-aware, helping model real-world team dynamics.

Tasks

Tasks are the atomic units of work in CrewAI. Each task includes a description, expected output, responsible agent, and optional tools. Tasks can be executed solo or collaboratively, and they serve as the bridge between high-level goals and actionable steps.

Tools

Tools give agents capabilities beyond language generation, such as browsing the web, fetching documents, or performing calculations. Tools can be native or developer-defined using the BaseTool class, and each must have a clear name, purpose, and description so agents can invoke and use them effectively.

Processes

CrewAI supports multiple orchestration strategies:

  • Sequential: Tasks run in a fixed order—simple and predictable.

  • Hierarchical: A manager agent or LLM delegates tasks dynamically, enabling top-down workflows.

  • Consensual (planned): Future support for democratic, collaborative task routing.

Each process type shapes how coordination and delegation unfold within a crew.

Crews

A crew is a collection of agents and tasks governed by a defined process. It represents a fully operational unit with an execution strategy, internal collaboration logic, and control settings for verbosity and output formatting. Think of it as the operating system for multi-agent workflows.

Pipelines

Pipelines chain multiple crews together, enabling multi-phase workflows where the output of one crew becomes the input to the next. This allows developers to modularize complex applications into reusable, composable segments of logic.

Planning

With planning enabled, CrewAI generates a task-by-task strategy before execution using an AgentPlanner. This enriches each task with context and sequencing logic, improving coordination—especially in multi-step or loosely defined workflows.
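A minimal sketch of agents, tasks, and a crew wired together with the sequential process; the roles, task text, and model credentials (expected in the environment) are illustrative:

from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Collect key facts about a topic",
    backstory="A meticulous analyst who summarizes findings clearly.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short article",
    backstory="A concise technical writer.",
)

research_task = Task(
    description="Research the benefits of multi-agent workflows.",
    expected_output="Bullet-point research notes.",
    agent=researcher,
)
writing_task = Task(
    description="Write a 100-word summary based on the research notes.",
    expected_output="A short article.",
    agent=writer,
)

# Sequential process: the writer's task runs after the researcher's task completes
crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task], process=Process.sequential)
result = crew.kickoff()
print(result)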


Design Considerations and Limitations

  • Agent Roles: Explicit role configuration gives flexibility, but poor design can cause overlap or miscommunication.

  • State Management: Stateless by default. Developers must implement external state or context passing for continuity across tasks.

  • Task Planning: Supports sequential and branching workflows, but all logic must be manually defined; no built-in planning.

  • Tool Usage: Agents support tools via config. No automatic selection; all tool-to-agent mappings are manual.

  • Termination Logic: No auto-termination handling. Developers must define explicit conditions to break recursive or looping behavior.

  • Memory: No built-in memory layer. Integration with vector stores or databases must be handled externally.

Agent Design Patterns

Prompt Chaining

Prompt chaining decomposes a complex task into a sequence of smaller steps, where each LLM call operates on the output of the previous one. This workflow introduces the ability to add programmatic checks (such as “gates”) between steps, validating intermediate outputs before continuing. The result is higher control, accuracy, and debuggability—at the cost of increased latency.

CrewAI makes it straightforward to build prompt chaining workflows using a sequential process. Each step is modeled as a Task, assigned to a specialized Agent, and executed in order using Process.sequential. You can insert validation logic between tasks or configure agents to flag issues before passing outputs forward.

Notebook: Research-to-Content Prompt Chaining Workflow

Routing

Routing is a pattern designed to classify incoming requests and dispatch them to the single most appropriate specialist agent or workflow, ensuring each input is handled by a focused, expert-driven routine.

In CrewAI, you implement routing by defining a Router Agent that inspects each input, emits a category label, and then dynamically delegates to downstream agents (or crews) tailored for that category—each equipped with its own tools and prompts. This separation of concerns delivers more accurate, maintainable pipelines.

Notebook: Research-Content Routing Workflow

Parallelization

Parallelization is a powerful agent workflow where multiple tasks are executed simultaneously, enabling faster and more scalable LLM pipelines. This pattern is particularly effective when tasks are independent and don’t depend on each other’s outputs.

While CrewAI does not enforce true multithreaded execution, it provides a clean and intuitive structure for defining parallel logic through multiple agents and tasks. These can be executed concurrently in terms of logic, and then gathered or synthesized by a downstream agent.

Notebook: Parallel Research Agent

Orchestrator-Workers

The Orchestrator-Workers workflow centers around a primary agent—the orchestrator—that dynamically decomposes a complex task into smaller, more manageable subtasks. Rather than relying on a fixed structure or pre-defined subtasks, the orchestrator decides what needs to be done based on the input itself. It then delegates each piece to the most relevant worker agent, often specialized in a particular domain like research, content synthesis, or evaluation.

CrewAI supports this pattern using the Process.hierarchical setup, where the orchestrator (as the manager agent) generates follow-up task specifications at runtime. This enables dynamic delegation and coordination without requiring the workflow to be rigidly structured up front. It's especially useful for use cases like multi-step research, document generation, or problem-solving workflows where the best structure only emerges after understanding the initial query.

Notebook: Research & Writing Delegation Agents

LangGraph

Use Phoenix to trace and evaluate agent frameworks built using LangGraph

This guide explains key LangGraph concepts, discusses design considerations, and walks through common architectural patterns like orchestrator-worker, evaluators, and routing. Each pattern includes a brief explanation and links to runnable Python notebooks.

Core LangGraph Concepts

LangGraph allows you to build LLM-powered applications using a graph of steps (called "nodes") and data (called "state"). Here's what you need to know to understand and customize LangGraph workflows:

State

A TypedDict that stores all information passed between nodes. Think of it as the memory of your workflow. Each node can read from and write to the state.

Nodes

Nodes are units of computation. Most often these are functions that accept a State input and return a partial update to it. Nodes can do anything: call LLMs, trigger tools, perform calculations, or prompt users.

Edges

Directed connections that define the order in which nodes are called. LangGraph supports linear, conditional, and cyclical edges, which allows for building loops, branches, and recovery flows.

Conditional Routing

A Python function that examines the current state and returns the name of the next node to call. This allows your application to respond dynamically to LLM outputs, tool results, or even human input.
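A minimal sketch of state, nodes, edges, and conditional routing; the classification logic and node bodies are placeholders standing in for LLM calls:

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    category: str
    answer: str

def classify(state: State) -> dict:
    # Placeholder classification; in practice this would call an LLM
    return {"category": "billing" if "invoice" in state["question"].lower() else "faq"}

def billing_node(state: State) -> dict:
    return {"answer": "Routed to the billing specialist."}

def faq_node(state: State) -> dict:
    return {"answer": "Routed to the FAQ bot."}

def route(state: State) -> str:
    # Conditional routing: return the name of the next node based on the current state
    return state["category"]

builder = StateGraph(State)
builder.add_node("classify", classify)
builder.add_node("billing", billing_node)
builder.add_node("faq", faq_node)
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", route, {"billing": "billing", "faq": "faq"})
builder.add_edge("billing", END)
builder.add_edge("faq", END)

graph = builder.compile()
print(graph.invoke({"question": "Where is my invoice?", "category": "", "answer": ""}))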

Send API

A way to dynamically launch multiple workers (nodes or subgraphs) in parallel, each with their own state. Often used in orchestrator-worker patterns where the orchestrator doesn't know how many tasks there will be ahead of time.

Agent Supervision

LangGraph enables complex multi-agent orchestration using a Supervisor node that decides how to delegate tasks among a team of agents. Each agent can have its own tools, prompt structure, and output format. The Supervisor coordinates routing, manages retries, and ensures loop control.

Checkpointing and Persistence

LangGraph supports built-in persistence using checkpointing. Each execution step saves state to a database (in-memory, SQLite, or Postgres). This allows for:

  • Multi-turn conversations (memory)

  • Rewinding to past checkpoints (time travel)

  • Human-in-the-loop workflows (pause + resume)

Design Considerations & Limitations

LangGraph improves on LangChain by supporting more flexible and complex workflows. Here’s what to keep in mind when designing:

Benefits:

  • Cyclic workflows: LangGraph supports loops, retries, and iterative workflows that would be cumbersome in LangChain.

  • Fine-grained control: Customize prompts, tools, state updates, and edge logic for each node.

  • Visualize: Graph visualization makes it easier to follow logic flows and complex routing.

  • Supports multi-agent coordination: Easily create agent networks with Supervisor and worker roles.

Limitations:

  • Debugging complexity: Deep graphs and multi-agent networks can be difficult to trace. Use Arize AX or Phoenix!

  • Token bloat: Cycles and retries can accumulate state and inflate token usage.

  • Requires upfront design: Graphs must be statically defined before execution. No dynamic graph construction mid-run.

  • Supervisor misrouting: If not carefully designed, supervisors may loop unnecessarily or reroute outputs to the wrong agent.

Patterns

Prompt Chaining

A linear sequence of prompt steps, where the output of one becomes the input to the next. This workflow is optimal when the task can be simply broken down into concrete subtasks.

Use case: Multistep reasoning, query rewriting, or building up answers gradually.

Parallelization

Runs multiple LLMs in parallel — either by splitting tasks (sectioning) or getting multiple opinions (voting).

Use case: Combining diverse outputs, evaluating models from different angles, or running safety checks.

With the Send API, LangGraph lets you:

  • Launch multiple safety evaluators in parallel

  • Compare multiple generated hypotheses side-by-side

  • Run multi-agent voting workflows

This improves reliability and reduces bottlenecks in linear pipelines.

Router

Routes an input to the most appropriate follow-up node based on its type or intent.

Use case: Customer support bots, intent classification, or model selection.

LangGraph routers enable domain-specific delegation — e.g., classify an incoming query as "billing", "technical support", or "FAQ", and send it to a specialized sub-agent. Each route can have its own tools, memory, and context. Use structured output with a routing schema to make classification more reliable.

Evaluator–Optimizer Loop

One LLM generates content, another LLM evaluates it, and the loop repeats until the evaluation passes. LangGraph allows feedback to modify the state, making each round better than the last.

Use case: Improving code, jokes, summaries, or any generative output with measurable quality.

Orchestrator–Worker

An orchestrator node dynamically plans subtasks and delegates each to a worker LLM. Results are then combined into a final output.

Use case: Writing research papers, refactoring code, or composing modular documents.

LangGraph’s Send API lets the orchestrator fork off tasks (e.g., subsections of a paper) and gather them into completed_sections. This is especially useful when the number of subtasks isn’t known in advance.

You can also incorporate agents like PDF_Reader or a WebSearcher, and the orchestrator can choose when to route to these workers.

⚠️ Caution: Feedback loops or improper edge handling can cause workers to echo each other or create infinite loops. Use strict conditional routing to avoid this.

Evaluators

Phoenix offers key modules to measure the quality of generated results as well as modules to measure retrieval quality.

  • Response Evaluation: Does the response match the retrieved context? Does it also match the query?

  • Retrieval Evaluation: Are the retrieved sources relevant to the query?

Response Evaluation

Evaluation of generated results can be challenging. Unlike traditional ML, the predicted results are not numeric or categorical, making it hard to define quantitative metrics for this problem.

LLM Evals supports the following response evaluation criteria:

  • QA Correctness - Whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.

  • Hallucinations - Designed to detect LLM hallucinations relative to retrieved context

  • Toxicity - Identify if the AI response is racist, biased, or toxic

Response evaluations are a critical first step to figuring out whether your LLM App is running correctly. Response evaluations can pinpoint specific executions (a.k.a. traces) that are performing badly and can be aggregated up so that you can track how your application is running as a whole.

Retrieval Evaluation

Phoenix also provides evaluation of retrieval independently.

The concept of retrieval evaluation is not new; given a set of relevance scores for a set of retrieved documents, we can evaluate retrievers using retrieval metrics like precision, NDCG, hit rate and more.

LLM Evals supports the following retrieval evaluation criteria:

  • Relevance - Evaluates whether a retrieved document chunk contains an answer to the query.

Retrieval is possibly the most important step in any LLM application as poor and/or incorrect retrieval can be the cause of bad response generation. If your application uses RAG to power an LLM, retrieval evals can help you identify the cause of hallucinations and incorrect answers.

Evaluations

With Phoenix's LLM Evals, evaluation results (or just Evaluations for short) are data consisting of three main columns:

  • label: str [optional] - a classification label for the evaluation (e.g. "hallucinated" vs "factual"). Can be used to calculate percentages (e.g. percent hallucinated) and can be used to filter down your data (e.g. Evals["Hallucinations"].label == "hallucinated")

  • score: number [optional] - a numeric score for the evaluation (e.g. 1 for good, 0 for bad). Scores are a great way to sort your data to surface poorly performing examples and can be used to filter your data by a threshold.

  • explanation: str [optional] - the reasoning for why the evaluation label or score was given. In the case of LLM evals, this is the evaluation model's reasoning. While explanations are optional, they can be extremely useful when trying to understand problematic areas of your application.

Let's take a look at an example list of Q&A relevance evaluations:

  • label: correct, score: 1, explanation: "The reference text explains that YC was not or..."

  • label: correct, score: 1, explanation: "To determine if the answer is correct, we need..."

  • label: incorrect, score: 0, explanation: "To determine if the answer is correct, we must..."

  • label: correct, score: 1, explanation: "To determine if the answer is correct, we need..."

These three columns combined can drive any type of evaluation you can imagine. label provides a way to classify responses, score provides a way to assign a numeric assessment, and explanation gives you a way to get qualitative feedback.

Evaluating Traces

With Phoenix, evaluations can be "attached" to the spans and documents collected. In order to facilitate this, Phoenix supports the following steps.

  1. Querying and downloading data - query the spans collected by phoenix and materialize them into DataFrames to be used for evaluation (e.g. question and answer data, documents data).

  2. Running Evaluations - the data queried in step 1 can be fed into LLM Evals to produce evaluation results.

  3. Logging Evaluations - the evaluations performed in the above step can be logged back to Phoenix to be attached to spans and documents for evaluating responses and retrieval. See here on how to log evaluations to Phoenix.

  4. Sorting and Filtering by Evaluation - once the evaluations have been logged back to Phoenix, the spans become instantly sortable and filterable by the evaluation values that you attached to the spans. (An example of an evaluation filter would be Eval["hallucination"].label == "hallucinated")
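A condensed sketch of this flow, assuming a running Phoenix instance and an evals dataframe indexed by span_id with label/score/explanation columns (here called evals_df):

import phoenix as px
from phoenix.trace import SpanEvaluations

client = px.Client()

# 1. Query the spans collected by Phoenix into a dataframe
spans_df = client.get_spans_dataframe()

# 2. Run an LLM Eval (e.g. llm_classify) over the queried data to produce evals_df,
#    a dataframe indexed by span_id with label/score/explanation columns.

# 3. Log the evaluations back to Phoenix so they attach to the corresponding spans
# client.log_evaluations(SpanEvaluations(eval_name="Hallucination", dataframe=evals_df))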

By following the above steps, you will have a full end-to-end flow for troubleshooting, evaluating, and root-causing an LLM application. By using LLM Evals in conjunction with Traces, you will be able to surface problematic queries, get an explanation as to why the generation is problematic (e.g. hallucinated because ...), and identify which step of your generative app requires improvement (e.g. did the LLM hallucinate, or was the LLM fed bad context?).

For a full tutorial on LLM Ops, check out our tutorial below.

Prompts Concepts

Prompt

Prompts often refer to the content you use to "prompt" an LLM, e.g. the text that you send to a model like OpenAI's gpt-4. Within Phoenix we expand this definition to be everything that's needed to prompt:

  • The prompt template of the messages to send to a completion endpoint

  • The invocation parameters (temperature, frequency penalty, etc.)

  • The tools made accessible to the LLM (e.g. a weather API)

  • The response format (sometimes called the output schema), used when you have JSON mode enabled.

This expanded definition of a prompt lets you more deterministically invoke LLMs with confidence as everything is snapshotted for you to use within your application.

Prompt Templates

Although the terms prompt and prompt template get used interchangeably, it's important to know the difference.

Prompts refer to the message(s) that are passed into the language model.

Prompt Templates refer to a way of formatting information so that the prompt holds the information you want (such as context and examples). Prompt templates can include placeholders (variables) for things such as examples (e.g. few-shot), outside context (RAG), or any other external data that is needed.

Prompt Version

Every time you save a prompt within Phoenix, a snapshot of the prompt is saved as a prompt version. Phoenix does this so that you can not only view the changes to a prompt over time but also build confidence about a specific prompt version before using it within your application. With every prompt version, Phoenix tracks the author of the prompt and the date at which the version was saved.

Similar to the way in which you can track changes to your code via git shas, Phoenix tracks each change to your prompt with a prompt_id.

Prompt Version Tag

Imagine you’re working on an AI project, and you want to label specific versions of your prompts so you can control when and where they get deployed. This is where prompt version tags come in.

A prompt version tag is like a sticky note you put on a specific version of your prompt to mark it as important. Once tagged, that version won’t change, making it easy to reference later.

When building applications, different environments are often used for different stages of readiness before going live, for example:

  1. Development – Where new features are built.

  2. Staging – Where testing happens.

  3. Production – The live system that users interact with.

Tagging prompt versions with environment tags can enable building, testing, and deploying prompts in the same way as an application—ensuring that prompt changes can be systematically tested and deployed.

Prompt Format

Prompts can be formatted to include any attributes from spans or datasets. These attributes can be added as F-Strings or using Mustache formatting.

F-strings should be formatted with single curly braces:

{question}

To escape a { when using F-strings, add a second { in front of it, e.g., {{escaped}} {not-escaped}. Escaping variables will remove them from inputs in the Playground.

Mustache should be formatted with double curly braces:

{{question}}

We recommend using Mustache where possible, since it supports nested attributes (e.g. attributes.input.value) more seamlessly.
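As a small illustration of the two syntaxes (plain Python; Mustache rendering itself is handled by Phoenix, e.g. in the Playground):

# F-string style: single braces, filled here with str.format for illustration.
fstring_template = "Answer the question: {question}"
print(fstring_template.format(question="What is Phoenix?"))

# Mustache style: double braces; Phoenix substitutes these at runtime,
# so the raw template is left untouched in application code.
mustache_template = "Answer the question: {{question}}"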

Tools

Tools allow LLMs to interact with the external environment. This can allow LLMs to interface with your application in more controlled ways. Given a prompt and some tools to choose from, an LLM may choose to use some tools (or none at all). Many LLM APIs also expose a tool choice parameter which allows you to constrain how and which tools are selected.

Here is an example of what a tool would look like for a weather API using OpenAI:

{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
            },
            "required": ["location"],
        }
    }
}

Response Format

Some LLMs support structured responses, known as response format or output schema, allowing you to specify an exact schema for the model’s output.

Structured Outputs ensure the model consistently generates responses that adhere to a defined JSON Schema, preventing issues like missing keys or invalid values.

Benefits of Structured Outputs:

  • Reliable type-safety: Eliminates the need to validate or retry incorrectly formatted responses.

  • Explicit refusals: Enables programmatic detection of safety-based refusals.

  • Simpler prompting: Reduces reliance on strongly worded prompts for consistent formatting.
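As a hedged sketch of what this looks like in practice, using the OpenAI Python SDK's parse helper with a Pydantic schema (the model name and schema here are illustrative):

from openai import OpenAI
from pydantic import BaseModel


class WeatherReport(BaseModel):
    city: str
    temperature_celsius: float
    conditions: str


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give a short weather report for Paris."}],
    response_format=WeatherReport,  # the output schema the model must adhere to
)
report = completion.choices[0].message.parsed  # a WeatherReport instance, not free-form text
print(report.conditions)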

FAQs: Tracing

How to log traces

To log traces, you must instrument your application either manually or automatically. To log to a remote instance of Phoenix, you must also configure the host and port where your traces will be sent.

How to turn off tracing

Tracing can be paused temporarily or disabled permanently.

Pause tracing using context manager

If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.

Uninstrument the auto-instrumentors permanently

Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.

For OpenAI, how do I get token counts when streaming?

Using a custom LangChain component

Datasets Concepts

Datasets

Datasets are integral to evaluation and experimentation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Each example within a dataset represents a single data point, consisting of an inputs dictionary, an optional output dictionary, and an optional metadata dictionary. The optional output dictionary often contains the expected LLM application output for the given input.
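As a rough sketch, a few such examples could be uploaded to Phoenix like this (assuming the client's upload_dataset helper; the column and dataset names are illustrative):

import pandas as pd
import phoenix as px

df = pd.DataFrame(
    {
        "question": ["What is Paul Graham known for?"],
        "answer": ["Co-founding Y Combinator, his essays, and his work on Lisp."],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="qa-examples",
    dataframe=df,
    input_keys=["question"],   # becomes each example's inputs dictionary
    output_keys=["answer"],    # becomes each example's (optional) output dictionary
)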

Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are then used to run experiments and evaluations to track improvements.

Use datasets to:

  • Store evaluation test cases for your eval script instead of managing large JSONL or CSV files

  • Capture generations to assess quality manually or using LLM-graded evals

  • Store user reviewed generations to find new test cases

With Phoenix, datasets are:

  • Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use-cases.

  • Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.

Creating Datasets

There are various ways to get started with datasets:

Manually Curated Examples

This is how we recommend you start. From building your application, you probably have an idea of what types of inputs you expect your application to be able to handle, and what "good" responses look like. You probably want to cover a few different common edge cases or situations you can imagine. Even 20 high quality, manually curated examples can go a long way.

Historical Logs

Once you ship an application, you start gleaning valuable information: how users are actually using it. This information can be valuable to capture and store in datasets. This allows you to test against specific use cases as you iterate on your application.

If your application is going well, you will likely get a lot of usage. How can you determine which datapoints are valuable to add? There are a few heuristics you can follow. If possible, try to collect end user feedback. You can then see which datapoints got negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future. You can also use other heuristics to identify interesting datapoints - for example, runs that took a long time to complete could be interesting to analyze and add to a dataset.

Synthetic Data

Once you have a few examples, you can try to artificially generate examples to get a lot of datapoints quickly. It's generally advised to have a few good handcrafted examples before this step, as the synthetic data will often resemble the source examples in some way.
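A hedged sketch of bootstrapping synthetic examples from a handful of handcrafted seeds, using the OpenAI Python SDK (the model name and prompt are illustrative):

from openai import OpenAI

client = OpenAI()
seed_examples = [
    {"input": "do you have to have two license plates in ontario", "output": "true"},
    {"input": "are black beans the same as turtle beans", "output": "true"},
]

prompt = (
    "Here are example question/answer pairs:\n"
    f"{seed_examples}\n"
    "Generate 5 new, similar question/answer pairs as a JSON list."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # review before adding to your dataset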

Dataset Contents

While Phoenix doesn't enforce dataset types, conceptually a dataset can contain:

Key-Value Pairs:

  • Inputs and outputs are arbitrary key-value pairs.

  • This dataset type is ideal for evaluating prompts, functions, and agents that require multiple inputs or generate multiple outputs.

If you have a RAG prompt template such as:

Your dataset might look like:

LLM inputs and outputs:

  • Simply capture the input and output as a single string to test the completion of an LLM.

  • The "inputs" dictionary contains a single "input" key mapped to the prompt string.

  • The "outputs" dictionary contains a single "output" key mapped to the corresponding response string.

Messages or chat:

  • This type of dataset is designed for evaluating LLM structured messages as inputs and outputs.

  • The "inputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.

  • The "outputs" dictionary contains a "messages" key mapped to a list of serialized chat messages.

  • This type of data is useful for evaluating conversational AI systems or chatbots.

Types of Datasets

Depending on the contents of a given dataset, you might consider the dataset to be of a certain type.

Golden Dataset

A dataset that contains the inputs and the ideal "golden" outputs is often referred to as a Golden Dataset. These datasets are hand-labeled and are used to evaluate the performance of LLMs or prompt templates. A golden dataset could look something like this:

| Input | Output |
| --- | --- |
| Paris is the capital of France | True |
| Canada borders the United States | True |
| The native language of Japan is English | False |

CrewAI is an open-source framework for building and orchestrating collaborative AI agents that act like a team of specialized virtual employees. Built on LangChain, it enables users to define roles, goals, and workflows for each agent, allowing them to work together autonomously on complex tasks with minimal setup.


Phoenix offers LLM Evals, a module designed to measure the quality of results. This module uses a "gold" LLM (e.g. GPT-4) to decide whether the generated answer is correct in a variety of ways. Note that many of these evaluation criteria DO NOT require ground-truth labels. Evaluation can be done simply with a combination of the input (query), output (response), and context.

Evaluations can be aggregated across executions to be used as KPIs
Retrieval Evaluations can be run directly on application traces
Inferences that contain generative records can be fed into evals to produce evaluations for analysis
Adding evaluations on traces can highlight problematic areas that require further analysis
End-to-end evaluation flow
In the above screenshot you can see how poor retrieval directly correlates with hallucinations
A phoenix prompt captures everything needed to invoke an LLM
Prompt templates have placeholders for variables that are dynamically filled at runtime

In addition to environment tags, custom tags allow teams to label prompt versions in a way that fits their specific workflow (e.g. `v0.0.1`). These tags can be used to signal different stages of deployment, feature readiness, or any other meaningful status. Prompt version tags work exactly the same way as git tags.

For more details, check out this

When running Phoenix locally on the default port of 6006, no additional configuration is necessary.

If you are running a remote instance of Phoenix, you can configure your instrumentation to log to that instance using the PHOENIX_HOST and PHOENIX_PORT environment variables.

Alternatively, you can use the PHOENIX_COLLECTOR_ENDPOINT environment variable.

To get token counts when streaming, install openai>=1.26 and set stream_options={"include_usage": True} when calling create. Below is an example Python code snippet. For more info, see the OpenAI guide.

If you have customized a LangChain component (say a retriever), you might not get tracing for that component without some additional steps. Internally, instrumentation relies on components inheriting from LangChain base classes for the traces to show up. Below is an example of how to inherit from LangChain base classes to make a custom retriever and make its traces show up.

from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Code running inside this block doesn't generate traces.
    # For example, running LLM evals here won't generate additional traces.
    ...
# Tracing will resume outside the block.
...
LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
# etc.
response = openai.OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku."}],
    max_tokens=20,
    stream=True,
    stream_options={"include_usage": True},
)
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.retrievers import BaseRetriever, Document
from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

PHOENIX_COLLECTOR_ENDPOINT = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
trace_api.set_tracer_provider(tracer_provider)
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint=PHOENIX_COLLECTOR_ENDPOINT)))

LangChainInstrumentor().instrument()


class CustomRetriever(BaseRetriever):
    """
    This example is taken from langchain docs.
    https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/custom_retriever/
    A custom retriever that contains the top k documents that contain the user query.
    This retriever only implements the sync method _get_relevant_documents.
    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.
    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
        matching_documents: List[Document] = []

        # Custom logic to find the top k documents that contain the query

        for index in range(self.k):
            matching_documents.append(Document(page_content=f"dummy content at {index}", score=1.0))
        return matching_documents


retriever = CustomRetriever(k=3)


if __name__ == "__main__":
    documents = retriever.invoke("what is the meaning of life?")
Given the context information and not prior knowledge, answer the query.
---------------------
{context}
---------------------

Query: {query}
Answer:  

{
  "query": "What is Paul Graham known for?",
  "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for..."
}

{
  "answer": "Paul Graham is known for co-founding Y Combinator, for his writing, and for his work on the Lisp programming language."
}

{ "input": "do you have to have two license plates in ontario" }

{ "output": "true" }

{ "input": "are black beans the same as turtle beans" }

{ "output": "true" }

{ "messages": [{ "role": "system", "content": "You are an expert SQL..."}] }

{ "messages": [{ "role": "assistant", "content": "select * from users"}] }

{ "messages": [{ "role": "system", "content": "You are a helpful..."}] }

{ "messages": [{ "role": "assistant", "content": "I don't know the answer to that"}] }



Inferences Concepts

This section introduces inferences and schemas, the starting concepts needed to use Phoenix with inferences.

  • For tips on creating your own Phoenix inferences and schemas, see the how-to guide.

Inferences

A Phoenix inference set is an instance of phoenix.Inferences that contains three pieces of information:

  • The data itself (a pandas dataframe)

  • A name that appears in the UI

  • A schema (an instance of phoenix.Schema) that describes the columns of your dataframe

For example, if you have a dataframe prod_df that is described by a schema prod_schema, you can define inferences prod_ds with

prod_ds = px.Inferences(prod_df, prod_schema, "production")

If you launch Phoenix with these inferences, you will see inferences named "production" in the UI.

How many inferences do I need?

You can launch Phoenix with zero, one, or two sets of inferences.

With no inferences, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single inference set, Phoenix provides insights into model performance and data quality. With two inference sets, Phoenix compares your inferences and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.

Which inference set is which?

Your reference inferences provide a baseline against which to compare your primary inferences.

To compare two inference sets with Phoenix, you must select one inference set as primary and one to serve as a reference. As the name suggests, your primary inference set contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference inferences, in contrast, is usually of secondary importance and serves as a baseline against which to compare your primary inferences.

Very often, your primary inferences will contain production data and your reference inferences will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference inference sets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.

Corpus Inference set (Information Retrieval)

The only difference for the corpus inferences is that they need a separate schema, because they have a different set of columns compared to the model data. See the schema section for more details.

Schemas

A Phoenix schema is an instance of phoenix.Schema that maps the columns of your dataframe to fields that Phoenix expects and understands. Use your schema to tell Phoenix what the data in your dataframe means.

For example, if you have a dataframe containing Fisher's Iris data that looks like this:

| sepal_length | sepal_width | petal_length | petal_width | target | prediction |
| --- | --- | --- | --- | --- | --- |
| 7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa |
| 6.3 | 3.3 | 4.7 | 1.6 | versicolor | versicolor |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica | setosa |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica |

your schema might look like this:

schema = px.Schema(
    feature_column_names=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    actual_label_column_name="target",
    prediction_label_column_name="prediction",
)

How many schemas do I need?

Usually one, sometimes two.

Each inference set needs a schema. If your primary and reference inferences have the same format, then you only need one schema. For example, if you have dataframes train_df and prod_df that share an identical format described by a schema named schema, then you can define inference sets train_ds and prod_ds with

train_ds = px.Inferences(train_df, schema, "training")
prod_ds = px.Inferences(prod_df, schema, "production")

Sometimes, you'll encounter scenarios where the formats of your primary and reference inference sets differ. For example, you'll need two schemas if:

  • Your production data has timestamps indicating the time at which an inference was made, but your training data does not.

  • Your training data has ground truth (what we call actuals in Phoenix nomenclature), but your production data does not.

  • A new version of your model has a differing set of features from a previous version.

In cases like these, you'll need to define two schemas, one for each inference set. For example, if you have dataframes train_df and prod_df that are described by schemas train_schema and prod_schema, respectively, then you can define inference sets train_ds and prod_ds with

train_ds = px.Inferences(train_df, train_schema, "training")
prod_ds = px.Inferences(prod_df, prod_schema, "production")

Schema for Corpus Inferences (Information Retrieval)

A corpus inference set, containing documents for information retrieval, typically has a different set of columns than those found in the model data from either production or training, and requires a separate schema. Below is an example schema for a corpus inference set with three columns: the id, text, and embedding for each document in the corpus.

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)
corpus_ds = px.Inferences(corpus_df, corpus_schema)

Smolagents

SmolAgents is a lightweight Python library for composing tool-using, task-oriented agents. This guide outlines common agent workflows we've implemented—covering routing, evaluation loops, task orchestration, and parallel execution. For each pattern, we include an overview, a reference notebook, and guidance on how to evaluate agent quality.


Design Considerations and Limitations

While the API is minimal—centered on Agent, Task, and Tool—there are important tradeoffs and design constraints to be aware of.

Design Considerations
Limitations

API centered on Agent, Task, and Tool

Tools are just Python functions decorated with @tool. There’s no centralized registry or schema enforcement, so developers must define conventions and structure on their own.

Provides flexibility for orchestration

No retry mechanism or built-in workflow engine

Supports evaluator-optimizer loops, routing, and fan-out/fan-in

Agents are composed, not built-in abstractions

Must implement orchestration logic

Multi-Agent support

No built-in support for collaboration structures like voting, planning, or debate.

Token-level streaming is not supported

No state or memory management out of the box. Applications that require persistent state—such as conversations or multi-turn workflows—will need to integrate external storage (e.g., a vector database or key-value store).

There’s no native memory or “trajectory” tracking between agents. Handoffs between tasks are manual. This is workable in small systems, but may require structure in more complex workflows.

Prompt Chaining

This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.
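To make the chaining concrete, here is a minimal sketch using smolagents (the HfApiModel default model and the resume text are assumptions for illustration):

from smolagents import CodeAgent, HfApiModel

model = HfApiModel()  # assumes a Hugging Face token is configured in your environment
extractor = CodeAgent(tools=[], model=model)
summarizer = CodeAgent(tools=[], model=model)

resume = "10 years of Python, led a team of 5, shipped ML pipelines on AWS..."
# Step 1: extract keywords; Step 2: feed the extracted keywords into the summarizer.
keywords = extractor.run(f"Extract the key skills from this resume: {resume}")
summary = summarizer.run(f"Summarize what these skills suggest about the candidate: {keywords}")
print(summary)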

How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do summaries reflect the extracted keywords?)

  • Check if the intermediate step (e.g. keyword extraction) is meaningful and accurate

  • Ensure the final output reflects or builds on the intermediate output

  • Compare chained vs. single-step prompting to see if chaining improves quality or structure


Router

Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated agent, often using lightweight classification.

How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right department for a given candidate?)

  • Compare routing decisions to human-labeled ground truth or expectations

  • Track precision/recall if framed as a classification task

  • Monitor for edge cases and routing errors (e.g., ambiguous or mixed-signal profiles)


Evaluator–Optimizer Loop

This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.

How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the message respectful, clear, and specific?)

  • Measure how many iterations are needed to reach an acceptable result

  • Evaluate final output quality against criteria like tone, clarity, and specificity

  • Compare the evaluator’s judgment to human reviewers to calibrate reliability


Orchestrator + Worker Pattern

In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.

How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final recommendation align with the inputs from each worker agent?)

  • Ensure each worker agent completes its role accurately and in isolation

  • Check if the orchestrator integrates worker outputs into a consistent final result

  • Look for agreement or contradictions between components (e.g., technical fit vs. recommendation)


Parallel Agent Execution

When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.

How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are profiles processed correctly and faster when run in parallel?)

  • Confirm that outputs are consistent with those from a sequential execution

  • Track total latency and per-task runtime to assess parallel speedup

  • Watch for race conditions, dropped inputs, or silent failures in concurrency


Langsmith alternatives? Arize Phoenix vs LangSmith: key differences

What is the difference between Arize Phoenix and LangSmith

LangSmith is another LLM Observability and Evaluation platform that serves as an alternative to Arize Phoenix. Both platforms support the baseline tracing, evaluation, prompt management, and experimentation features, but there are a few key differences to be aware of:

  1. LangSmith is closed source, while Phoenix is open source

  2. LangSmith is part of the broader LangChain ecosystem, though it does support applications that don’t use LangChain. Phoenix is fully framework-agnostic.

  3. Self-hosting is a paid feature within LangSmith, vs free for Phoenix.

  4. Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.


Open vs. Closed Source

The first and most fundamental difference: LangSmith is closed source, while Phoenix is fully open source.

This means Phoenix users have complete control over how the platform is used, modified, and integrated. Whether you're running in a corporate environment with custom compliance requirements or you're building novel agent workflows, open-source tooling allows for a degree of flexibility and transparency that closed platforms simply can’t match.

LangSmith users, on the other hand, are dependent on a vendor roadmap and pricing model, with limited ability to inspect or modify the underlying system.


Ecosystem Lock-In vs. Ecosystem-Agnostic

LangSmith is tightly integrated with the LangChain ecosystem, and while it technically supports non-LangChain applications, the experience is optimized for LangChain-native workflows.

Phoenix is designed from the ground up to be framework-agnostic. It supports popular orchestration tools like LangChain, LlamaIndex, CrewAI, SmolAgents, and custom agents, thanks to its OpenInference instrumentation layer. This makes Phoenix a better choice for teams exploring multiple agent/orchestration frameworks—or who simply want to avoid vendor lock-in.


Self-Hosting: Free vs. Paid

If self-hosting is a requirement—for reasons ranging from data privacy to performance—Phoenix offers it out-of-the-box, for free. You can launch the entire platform with a single Docker container, no license keys or paywalls required.

LangSmith, by contrast, requires a paid plan to access self-hosting options. This can be a barrier for teams evaluating tools or early in their journey, especially those that want to maintain control over their data from day one.


Backed by Arize AI

Arize Phoenix is intended to be a complete LLM observability solution. However, for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.

The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.


Feature Comparison

| Feature | Arize Phoenix | Arize AX | LangSmith |
| --- | --- | --- | --- |
| Open Source | ✅ | | |
| Tracing | ✅ | ✅ | ✅ |
| Auto-Instrumentation | ✅ | ✅ | |
| Offline Evals | ✅ | ✅ | ✅ |
| Online Evals | | ✅ | ✅ |
| Experimentation | ✅ | ✅ | ✅ |
| Prompt Management | ✅ | ✅ | ✅ |
| Prompt Playground | ✅ | ✅ | ✅ |
| Run Prompts on Datasets | ✅ | ✅ | ✅ |
| Built-in Evaluators | ✅ | ✅ | ✅ |
| Agent Evaluations | ✅ | ✅ | ✅ |
| Human Annotations | ✅ | ✅ | ✅ |
| Custom Dashboards | | ✅ | |
| Workspaces | | ✅ | |
| Semantic Querying | | ✅ | |
| Copilot Assistant | | ✅ | |


Final Thoughts

LangSmith is a strong option for teams all-in on the LangChain ecosystem and comfortable with a closed-source platform. But for those who value openness, framework flexibility, and low-friction adoption, Arize Phoenix stands out as the more accessible and extensible observability solution.

Google GenAI SDK (Manual Orchestration)

Everything you need to know about Google's GenAI framework

In April 2025, Google launched its ADK framework, an agent orchestration framework more comparable to the others on this list.

That said, because of the relative simplicity of the GenAI SDK, this guide serves as a good learning tool to show how some of the common agent patterns can be manually implemented.

Framework Primitives

GenAI SDK uses contents to represent user messages, files, system messages, function calls, and invocation parameters. That creates relatively simple generation calls:

Content objects can also be composed together in a list:

Patterns

Google GenAI does not include built-in orchestration patterns.

Handoffs and State

GenAI has no concept of handoffs natively.

State is handled by maintaining a list of previous messages and other data in a list of content objects. This is similar to how other model SDKs like OpenAI and Anthropic handle the concept of state. This stands in contrast to the more sophisticated state management present in agent orchestration frameworks.
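A minimal sketch of carrying state across turns by appending to a contents list (the model name is illustrative and an API key is assumed to be configured in your environment):

from google import genai
from google.genai import types

client = genai.Client()
history = []  # state is just a running list of content objects

def ask(question: str) -> str:
    history.append(types.UserContent(parts=[types.Part.from_text(text=question)]))
    response = client.models.generate_content(
        model='gemini-2.0-flash-001',
        contents=history,
    )
    # Append the model's reply so the next turn sees the full conversation.
    history.append(response.candidates[0].content)
    return response.text

print(ask('My name is Ada.'))
print(ask('What is my name?'))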

Tools

GenAI does include some convenience features around tool calling. types.GenerateContentConfig can automatically convert plain Python functions into tool signatures. To do this, the SDK will use the function docstring to understand its purpose and arguments.

GenAI will also automatically call the function and incorporate its return value. This goes a step beyond what similar model SDKs do on other platforms. This behavior can be disabled.
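For example, automatic execution can be disabled so the model only proposes the call and your code decides whether to run it. A sketch follows (the AutomaticFunctionCallingConfig usage reflects our understanding of the SDK, and an API key is assumed in your environment):

from google import genai
from google.genai import types

client = genai.Client()

def get_current_weather(location: str) -> str:
    """Returns the current weather.

    Args:
      location: The city and state, e.g. San Francisco, CA
    """
    return 'sunny'

response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents='What is the weather like in Boston?',
    config=types.GenerateContentConfig(
        tools=[get_current_weather],
        # Disable automatic execution; the SDK returns the function call instead.
        automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
    ),
)
print(response.candidates[0].content.parts[0].function_call)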

Memory

GenAI has no built-in concept of memory.

Multi-Agent Collaboration

GenAI has no built-in collaboration strategies. These must be defined manually.

Streaming

GenAI supports streaming of both text and image responses:


Design Considerations and Limitations

GenAI is the "simplest" framework in this guide, and is closer to a pure model SDK like the OpenAI SDK, rather than an agent framework. It does go a few steps beyond these base SDKs however, notably in tool calling. It is a good option if you're using Gemini models, and want more direct control over your agent system.


Agent Design Patterns

Prompt Chaining

This workflow breaks a task into smaller steps, where the output of one agent becomes the input to another. It’s useful when a single prompt can’t reliably handle the full complexity or when you want clarity in intermediate reasoning.

Notebook: Research Agent The agent first researches a topic, then provides an executive summary of its results, then finally recommends future focus directions.

How to evaluate: Check whether each step performs its function correctly and whether the final result meaningfully depends on the intermediate output (e.g., do key points reflect the original research?)

  • Check if the intermediate step (e.g. key point extraction) is meaningful and accurate

  • Ensure the final output reflects or builds on the intermediate output

  • Compare chained vs. single-step prompting to see if chaining improves quality or structure


Router

Routing is used to send inputs to the appropriate downstream agent or workflow based on their content. The routing logic is handled by a dedicated call, often using lightweight classification.

Notebook: Simple Tool Router This agent shows a simple example of routing user inputs to different tools.

How to evaluate: Compare the routing decision to human judgment or labeled examples (e.g., did the router choose the right tool for a given input?)

  • Compare routing decisions to human-labeled ground truth or expectations

  • Track precision/recall if framed as a classification task

  • Monitor for edge cases and routing errors


Evaluator–Optimizer Loop

This pattern uses two agents in a loop: one generates a solution, the other critiques it. The generator revises until the evaluator accepts the result or a retry limit is reached. It’s useful when quality varies across generations.

Notebook: Story Writing Agent An agent generates an initial draft of a story, then a critique agent decides whether the quality is high enough. If not, it asks for a revision.

How to evaluate: Track how many iterations are needed to converge and whether final outputs meet predefined criteria (e.g., is the story engaging, clear, and well-written?)

  • Measure how many iterations are needed to reach an acceptable result

  • Evaluate final output quality against criteria like tone, clarity, and specificity

  • Compare the evaluator’s judgment to human reviewers to calibrate reliability


Orchestrator + Worker Pattern

In this approach, a central agent coordinates multiple agents, each with a specialized role. It’s helpful when tasks can be broken down and assigned to domain-specific workers.

Notebook: Travel Planning Agent The orchestrator delegates planning a trip for a user, and incorporates a user proxy to improve its quality. The orchestrator delegates to specific functions to plan flights, hotels, and provide general travel recommendations.

How to evaluate: Assess consistency between subtasks and whether the final output reflects the combined evaluations (e.g., does the final output align with the inputs from each worker agent?)

  • Ensure each worker agent completes its role accurately and in isolation

  • Check if the orchestrator integrates worker outputs into a consistent final result

  • Look for agreement or contradictions between components


Parallel Agent Execution

When you need to process many inputs using the same logic, parallel execution improves speed and resource efficiency. Agents can be launched concurrently without changing their individual behavior.

Notebook: Parallel Research Agent Multiple research topics are examined simultaneously. Once all are complete, the topics are then synthesized into a final combined report.

How to evaluate: Ensure results remain consistent with sequential runs and monitor for improvements in latency and throughput (e.g., are topics processed correctly and faster when run in parallel?)

  • Confirm that outputs are consistent with those from a sequential execution

  • Track total latency and per-task runtime to assess parallel speedup

  • Watch for race conditions, dropped inputs, or silent failures in concurrency

Langfuse alternative? Arize Phoenix vs Langfuse: key differences

What is the difference between Arize Phoenix and Langfuse?

Langfuse has an initially similar feature set to Arize Phoenix. Both tools support tracing, evaluation, experimentation, and prompt management, both in development and production. But on closer inspection there are a few notable differences:

  1. While it is open-source, Langfuse locks certain key features like Prompt Playground and LLM-as-a-Judge evals behind a paywall. These same features are free in Phoenix.

  2. Phoenix is significantly easier to self-host than Langfuse. Langfuse requires you to separately set up and link ClickHouse, Redis, and S3. Phoenix can be hosted out-of-the-box as a single Docker container.

  3. Langfuse relies on outside instrumentation libraries to generate traces. Arize maintains its own layer that operates in concert with OpenTelemetry for instrumentation.

  4. Phoenix is backed by Arize AI. Phoenix users always have the option to graduate into Arize AX, with additional features, a customer success org, infosec team, and dedicated support. Meanwhile, Phoenix is able to focus entirely on providing the best fully open-source solution in the ecosystem.


Feature Access

Langfuse is open-source, but several critical features are gated behind its paid offering when self-hosting. For example:

  • Prompt Playground

  • LLM-as-a-Judge evaluations

  • Prompt experiments

  • Annotation queues

These features can be crucial for building and refining LLM systems, especially in early prototyping stages. In contrast, Arize Phoenix offers these capabilities fully open-source.


Ease of Self-Hosting

Self-hosting Langfuse requires setting up and maintaining:

  • A ClickHouse database for analytics

  • Redis for caching and background jobs

  • S3-compatible storage for logs and artifacts

Arize Phoenix, on the other hand, can be launched with a single Docker container. No need to stitch together external services—Phoenix is designed to be drop-in simple for both experimentation and production monitoring. This “batteries-included” philosophy makes it faster to adopt and easier to maintain.


Instrumentation Approach

Langfuse does not provide its own instrumentation layer—instead, it relies on developers to integrate third-party libraries to generate and send trace data.

In fact, Langfuse supports OpenInference tracing as one of its options. This means that using Langfuse requires at least one additional dependency on an instrumentation provider.


Backed by Arize AI

Arize Phoenix is intended to be a complete LLM observability solution. However, for users who do not want to self-host, or who need additional features like Custom Dashboards, Copilot, Dedicated Support, or HIPAA compliance, there is a seamless upgrade path to Arize AX.

The success of Arize means that Phoenix does not need to be heavily commercialized. It can focus entirely on providing the best open-source solution for LLM Observability & Evaluation.


Feature Comparison


Final Thoughts

If you're choosing between Langfuse and Arize Phoenix, the right tool will depend on your needs. Langfuse has a polished UI and solid community momentum, but imposes friction around hosting and feature access. Arize Phoenix offers a more open, developer-friendly experience—especially for those who want a single-container solution with built-in instrumentation and evaluation tools.

Eval-Optimizer Loop
Routing Flow
Prompt Chaining Flow
Orchestrator Flow

For comprehensive descriptions of phoenix.Inferences and phoenix.Schema, see the API reference.


Notebook: Prompt Chaining with Keyword Extraction + Summarization The agent first extracts keywords from a resume, then summarizes what those keywords suggest.

Notebook: Candidate Interview Router The agent classifies candidate profiles into Software, Product, or Design categories, then hands them off to the appropriate evaluation pipeline.

Notebook: Rejection Email Generator with Evaluation Loop An agent writes a candidate rejection email. If the evaluator agent finds the tone or feedback lacking, it asks for a revision.

Notebook: Recruiting Evaluator Orchestrator The orchestrator delegates resume review, culture fit assessment, and decision-making to different agents, then composes a final recommendation.

Notebook: Candidate reviews are distributed using asyncio, enabling faster batch processing without compromising output quality.

Phoenix is backed by Arize AI, the leading and best-funded AI Observability provider in the ecosystem.

Google's GenAI SDK is a framework designed to help you interact with Gemini models and models run through Vertex AI. Out of all the frameworks detailed in this guide, GenAI SDK is the closest to a base model SDK. While it does provide helpful functions and concepts to streamline tool calling, structured output, and passing files, it does not approach the level of abstraction of frameworks like CrewAI or AutoGen.


Phoenix takes a different approach: it includes and maintains its own OpenTelemetry-compatible instrumentation layer, OpenInference.

Phoenix is backed by Arize AI, the leading and best-funded AI Observability provider in the ecosystem.


Use Zero Inference sets When:

  • You want to run Phoenix in the background to collect trace data from your instrumented LLM application.

Use a Single Inference set When:

  • You have only a single cohort of data, e.g., only training data.

  • You care about model performance and data quality, but not drift.

Use Two Inference sets When:

  • You want to compare cohorts of data, e.g., training vs. production.

  • You care about drift in addition to model performance and data quality.

  • You have corpus data for information retrieval. See Corpus Data.

file = client.files.upload(file='a11.txt')
response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents=['Could you summarize this file?', file]
)
print(response.text)
[
  types.UserContent(
    parts=[
      types.Part.from_text(text='What is this image about?'),
      types.Part.from_uri(
        file_uri='gs://generativeai-downloads/images/scones.jpg',
        mime_type='image/jpeg',
      ),
    ]
  )
]
def get_current_weather(location: str) -> str:
    """Returns the current weather.

    Args:
      location: The city and state, e.g. San Francisco, CA
    """
    return 'sunny'


response = client.models.generate_content(
    model='gemini-2.0-flash-001',
    contents='What is the weather like in Boston?',
    config=types.GenerateContentConfig(tools=[get_current_weather]),
)

print(response.text)
for chunk in client.models.generate_content_stream(
    model='gemini-2.0-flash-001', contents='Tell me a story in 300 words.'
):
    print(chunk.text, end='')

| Design Considerations | Limitations |
| --- | --- |
| Content approach streamlines message management | No built-in orchestration capabilities |
| Supports automatic tool calling | No state or memory management |
| Allows for all agent patterns, but each must be manually set up | Primarily designed to work with Gemini models |

| Feature | Arize Phoenix | Arize AX | Langfuse |
| --- | --- | --- | --- |
| Open Source | ✅ | | ✅ |
| Tracing | ✅ | ✅ | ✅ |
| Auto-Instrumentation | ✅ | ✅ | |
| Offline Evals | ✅ | ✅ | ✅ |
| Online Evals | | ✅ | ✅ |
| Experimentation | ✅ | ✅ | ✅ |
| Prompt Management | ✅ | ✅ | ✅ |
| Prompt Playground | ✅ | ✅ | ✅ |
| Run Prompts on Datasets | ✅ | ✅ | |
| Built-in Evaluators | ✅ | ✅ | ✅ |
| Agent Evaluations | ✅ | ✅ | |
| Human Annotations | ✅ | ✅ | |
| Custom Dashboards | | ✅ | |
| Workspaces | | ✅ | |
| Semantic Querying | | ✅ | |
| Copilot Assistant | | ✅ | |

Parallelization Flow
import phoenix as px
from phoenix.trace import LangChainInstrumentor

px.launch_app()

LangChainInstrumentor().instrument()

# run your LangChain application
import os
from phoenix.trace import LangChainInstrumentor

# assume phoenix is running at 162.159.135.42:6007
os.environ["PHOENIX_HOST"] = "162.159.135.42"
os.environ["PHOENIX_PORT"] = "6007"

LangChainInstrumentor().instrument()  # logs to http://162.159.135.42:6007

# run your LangChain application
import os
from phoenix.trace import LangChainInstrumentor

# assume phoenix is running at 162.159.135.42:6007
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "162.159.135.42:6007"

LangChainInstrumentor().instrument()  # logs to http://162.159.135.42:6007

# run your LangChain application

What are Traces

A deep dive into the details of a trace

Spans

A span represents a unit of work or operation (think a span of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.

A span contains name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks. A span for an LLM execution in JSON format is displayed below

{
   "name": "llm",
   "context": {
       "trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
       "span_id": "0xd9bdedf0df0b7208",
       "trace_state": "[]"
   },
   "kind": "SpanKind.INTERNAL",
   "parent_id": "0x7eb5df0046c77cd2",
   "start_time": "2024-05-08T21:46:11.480777Z",
   "end_time": "2024-05-08T21:46:35.368042Z",
   "status": {
       "status_code": "OK"
   },
   "attributes": {
       "openinference.span.kind": "LLM",
       "llm.input_messages.0.message.role": "system",
       "llm.input_messages.0.message.content": "\n  The following is a friendly conversation between a user and an AI assistant.\n  The assistant is talkative and provides lots of specific details from its context.\n  If the assistant does not know the answer to a question, it truthfully says it\n  does not know.\n\n  Here are the relevant documents for the context:\n\n  page_label: 7\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.4Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. No piece may weigh more than 70 pounds.\nb. The combined length and girth of a piece (the length of its longest side plus \nthe distance around its thickest part) may not exceed 108 inches.\nc. Lower size or weight standards apply to mail addressed to certain APOs and \nFPOs, subject to 703.2.0  and 703.4.0  and for Department of State mail, \nsubject to 703.3.0 .\n\npage_label: 6\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.2.10Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. The reply half of a double card must be used for reply only and may not be \nused to convey a message to the original addressee or to send statements \nof account. The reply half may be formatted for response purposes (e.g., contain blocks for completion by the addressee).\nb. A double card must be folded before mailing and prepared so that the \naddress on the reply half is on the inside when the double card is originally \nmailed. The address side of the reply half may be prepared as Business \nReply Mail, Courtesy Reply Mail, meter reply mail, or as a USPS Returns service label.\nc. Plain stickers, seals, or a single wire stitch (staple) may be used to fasten the \nopen edge at the top or bottom once the card is folded if affixed so that the \ninner surfaces of the cards can be readily examined. Fasteners must be \naffixed according to the applicable preparation requirements for the price claimed. Any sealing on the left and right sides of the cards, no matter the \nsealing process used, is not permitted.\nd. The first half of a double card must be detached when the reply half is \nmailed for return. \n6.2.10   Enclosures\nEnclosures in double postcards are prohibited at card prices. \n6.3 Nonmachinable Pieces\n6.3.1   Nonmachinable Letters\nLetter-size pieces (except card-size pieces) that meet one or more of the \nnonmachinable characteristics in 1.2 are subject to the nonmachinable \nsurcharge (see 133.1.7 ). \n6.3.2   Nonmachinable Flats\nFlat-size pieces that do not meet the standards in 2.0 are considered parcels, \nand the mailer must pay the applicable parcel price.  \n6.4 Parcels \n[7-9-23]  USPS Ground Advantage \u2014 Retail parcels are eligible for USPS \nTracking and Signature Confirmation service. A USPS Ground Advantage \u2014 \nRetail parcel is the following:\na. A mailpiece that exceeds any one of the maximum dimensions for a flat \n(large envelope). See 2.1.\nb. A flat-size mailpiece, regardless of thickness, that is rigid or nonrectangular. \nc. A flat-size mailpiece that is not uniformly thick under 2.4. 
\nd.[7-9-23]  A mailpiece that does not exceed 130 inches in combined length \nand girth.\n7.0 Additional Physical Standards for Media Mail and Library \nMail\nThese standards apply to Media Mail and Library Mail:\n\npage_label: 4\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.1Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\n4.0 Additional Physical Standa rds for Priority Mail Express\nEach piece of Priority Mail Express may not weigh more than 70 pounds. The \ncombined length and girth of a piece (the length of its longest side plus the \ndistance around its thickest part) may not exceed 108 inches. Lower size or weight standards apply to Priority Mail Express addressed to certain APO/FPO \nand DPOs. Priority Mail Express items must be large enough to hold the required \nmailing labels and indicia on a single optical plane without bending or folding.\n5.0 Additional Physical St andards for Priority Mail\nThe maximum weight is 70 pounds. The combined length and girth of a piece \n(the length of its longest side plus the distance around its thickest part) may not \nexceed 108 inches. Lower size and weight standards apply for some APO/FPO \nand DPO mail subject to 703.2.0 , and 703.4.0 , and for Department of State mail \nsubject to 703.3.0 . \n[7-9-23] \n6.0 Additional Physical Standa rds for First-Class Mail and \nUSPS Ground Advantage \u2014 Retail\n[7-9-23]\n6.1 Maximum Weight\n6.1.1   First-Class Mail\nFirst-Class Mail (letters and flats) must not exceed 13 ounces. \n6.1.2   USPS Ground Advantage \u2014 Retail\nUSPS Ground Advantage \u2014 Retail mail must not exceed 70 pounds.\n6.2 Cards Claimed at Card Prices\n6.2.1   Card Price\nA card may be a single or double (reply) stamped card or a single or double postcard. Stamped cards are available from USPS with postage imprinted on \nthem. Postcards are commercially available or privately printed mailing cards. To \nbe eligible for card pricing, a card and each half of a double card must meet the physical standards in 6.2 and the applicable eligibility for the price claimed. \nIneligible cards are subject to letter-size pricing. \n6.2.2   Postcard Dimensions\nEach card and part of a double card claimed at card pricing must be the following: \na. Rectangular.b. Not less than 3-1/2 inches high, 5 inches long, and 0.007 inch thick.\nc. Not more than 4-1/4 inches high, or more than 6 inches long, or greater than \n0.016 inch thick.\nd. Not more than 3.5 ounces (Charge flat-size prices for First-Class Mail \ncard-type pieces over 3.5 ounces.)\n\n  Instruction: Based on the above documents, provide a detailed answer for the user question below.\n  Answer \"don't know\" if not present in the document.\n  ",
       "llm.input_messages.1.message.role": "user",
       "llm.input_messages.1.message.content": "Hello",
       "llm.model_name": "gpt-4-turbo-preview",
       "llm.invocation_parameters": "{\"temperature\": 0.1, \"model\": \"gpt-4-turbo-preview\"}",
       "output.value": "How are you?" },
   "events": [],
   "links": [],
   "resource": {
       "attributes": {},
       "schema_url": ""
   }
}

Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.

Traces

A trace records the paths taken by requests (made by an application or end-user) as they propagate through multiple steps.

Without tracing, it is challenging to pinpoint the cause of performance problems in a system.

It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for LLM applications, which commonly have nondeterministic problems or are too complicated to reproduce locally.

Tracing makes debugging and understanding LLM applications less daunting by breaking down what happens within a request as it flows through a system.

A trace is made of one or more spans. The first span is the root span; each root span represents a request from start to finish. The spans underneath the parent provide a more in-depth context of what occurs during a request (or what steps make up a request).

Projects

A project is a collection of traces. You can think of a project as a container for all the traces that are related to a single application or service. You can have multiple projects, and each project can have multiple traces. Projects can be useful for various use cases such as separating out environments, logging traces for evaluation runs, etc. To learn more about how to set up projects, see the how-to guide.
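A minimal sketch, assuming your Phoenix setup respects the PHOENIX_PROJECT_NAME environment variable (the project name is illustrative):

import os

# Set before instrumenting or launching your app; traces sent while this is set
# are grouped under the named project in the Phoenix UI.
os.environ["PHOENIX_PROJECT_NAME"] = "my-llm-app-dev"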

Span Kind

When a span is created, it is created as one of the following: Chain, Retriever, Reranker, LLM, Embedding, Agent, or Tool.

CHAIN

A Chain is a starting point or a link between different LLM application steps. For example, a Chain span could be used to represent the beginning of a request to an LLM application or the glue code that passes context from a retriever to an LLM call.

RETRIEVER

A Retriever is a span that represents a data retrieval step. For example, a Retriever span could be used to represent a call to a vector store or a database.

RERANKER

A Reranker is a span that represents the reranking of a set of input documents. For example, a cross-encoder may be used to compute the input documents' relevance scores with respect to a user query, and the top K documents with the highest scores are then returned by the Reranker.

LLM

An LLM is a span that represents a call to an LLM. For example, an LLM span could be used to represent a call to OpenAI or Llama.

EMBEDDING

An Embedding is a span that represents a call to an LLM for an embedding. For example, an Embedding span could be used to represent a call to OpenAI to get an ada-002 embedding for retrieval.

TOOL

A Tool is a span that represents a call to an external tool such as a calculator or a weather API.

AGENT

A span that encompasses calls to LLMs and Tools. An agent describes a reasoning block that acts on tools using the guidance of an LLM.

Span Attributes

Attributes are key-value pairs that contain metadata that you can use to annotate a span to carry information about the operation it is tracking.

For example, if a span invokes an LLM, you can capture the model name, the invocation parameters, the token count, and so on.

Attributes have the following rules:

  • Keys must be non-null string values
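As a minimal sketch (plain OpenTelemetry API, assuming a tracer provider has already been configured as shown earlier in this document), attributes like these can be set on a manually created span:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-4-turbo-preview")
    span.set_attribute("llm.invocation_parameters", '{"temperature": 0.1}')
    # ... invoke the LLM here ...
    span.set_attribute("output.value", "How are you?")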

Example OTEL Spans

Below are example OTEL spans for each OpenInference spanKind to be used as reference when doing manual instrumentation

{
   "name": "llm",
   "context": {
       "trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
       "span_id": "0xd9bdedf0df0b7208",
       "trace_state": "[]"
   },
   "kind": "SpanKind.INTERNAL",
   "parent_id": "0x7eb5df0046c77cd2",
   "start_time": "2024-05-08T21:46:11.480777Z",
   "end_time": "2024-05-08T21:46:35.368042Z",
   "status": {
       "status_code": "OK"
   },
   "attributes": {
       "openinference.span.kind": "LLM",
       "llm.input_messages.0.message.role": "system",
       "llm.input_messages.0.message.content": "\n  The following is a friendly conversation between a user and an AI assistant.\n  The assistant is talkative and provides lots of specific details from its context.\n  If the assistant does not know the answer to a question, it truthfully says it\n  does not know.\n\n  Here are the relevant documents for the context:\n\n  page_label: 7\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.4Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. No piece may weigh more than 70 pounds.\nb. The combined length and girth of a piece (the length of its longest side plus \nthe distance around its thickest part) may not exceed 108 inches.\nc. Lower size or weight standards apply to mail addressed to certain APOs and \nFPOs, subject to 703.2.0  and 703.4.0  and for Department of State mail, \nsubject to 703.3.0 .\n\npage_label: 6\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.2.10Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\na. The reply half of a double card must be used for reply only and may not be \nused to convey a message to the original addressee or to send statements \nof account. The reply half may be formatted for response purposes (e.g., contain blocks for completion by the addressee).\nb. A double card must be folded before mailing and prepared so that the \naddress on the reply half is on the inside when the double card is originally \nmailed. The address side of the reply half may be prepared as Business \nReply Mail, Courtesy Reply Mail, meter reply mail, or as a USPS Returns service label.\nc. Plain stickers, seals, or a single wire stitch (staple) may be used to fasten the \nopen edge at the top or bottom once the card is folded if affixed so that the \ninner surfaces of the cards can be readily examined. Fasteners must be \naffixed according to the applicable preparation requirements for the price claimed. Any sealing on the left and right sides of the cards, no matter the \nsealing process used, is not permitted.\nd. The first half of a double card must be detached when the reply half is \nmailed for return. \n6.2.10   Enclosures\nEnclosures in double postcards are prohibited at card prices. \n6.3 Nonmachinable Pieces\n6.3.1   Nonmachinable Letters\nLetter-size pieces (except card-size pieces) that meet one or more of the \nnonmachinable characteristics in 1.2 are subject to the nonmachinable \nsurcharge (see 133.1.7 ). \n6.3.2   Nonmachinable Flats\nFlat-size pieces that do not meet the standards in 2.0 are considered parcels, \nand the mailer must pay the applicable parcel price.  \n6.4 Parcels \n[7-9-23]  USPS Ground Advantage \u2014 Retail parcels are eligible for USPS \nTracking and Signature Confirmation service. A USPS Ground Advantage \u2014 \nRetail parcel is the following:\na. A mailpiece that exceeds any one of the maximum dimensions for a flat \n(large envelope). See 2.1.\nb. A flat-size mailpiece, regardless of thickness, that is rigid or nonrectangular. \nc. A flat-size mailpiece that is not uniformly thick under 2.4. 
\nd.[7-9-23]  A mailpiece that does not exceed 130 inches in combined length \nand girth.\n7.0 Additional Physical Standards for Media Mail and Library \nMail\nThese standards apply to Media Mail and Library Mail:\n\npage_label: 4\nfile_path: /Users/mikeldking/work/openinference/python/examples/llama-index-new/backend/data/101.pdf\n\nDomestic Mail Manual \u2022 Updated 7-9-23101\n101.6.1Retail Mail: Physical Standards for Letters, Cards, Flats, and Parcels\n4.0 Additional Physical Standa rds for Priority Mail Express\nEach piece of Priority Mail Express may not weigh more than 70 pounds. The \ncombined length and girth of a piece (the length of its longest side plus the \ndistance around its thickest part) may not exceed 108 inches. Lower size or weight standards apply to Priority Mail Express addressed to certain APO/FPO \nand DPOs. Priority Mail Express items must be large enough to hold the required \nmailing labels and indicia on a single optical plane without bending or folding.\n5.0 Additional Physical St andards for Priority Mail\nThe maximum weight is 70 pounds. The combined length and girth of a piece \n(the length of its longest side plus the distance around its thickest part) may not \nexceed 108 inches. Lower size and weight standards apply for some APO/FPO \nand DPO mail subject to 703.2.0 , and 703.4.0 , and for Department of State mail \nsubject to 703.3.0 . \n[7-9-23] \n6.0 Additional Physical Standa rds for First-Class Mail and \nUSPS Ground Advantage \u2014 Retail\n[7-9-23]\n6.1 Maximum Weight\n6.1.1   First-Class Mail\nFirst-Class Mail (letters and flats) must not exceed 13 ounces. \n6.1.2   USPS Ground Advantage \u2014 Retail\nUSPS Ground Advantage \u2014 Retail mail must not exceed 70 pounds.\n6.2 Cards Claimed at Card Prices\n6.2.1   Card Price\nA card may be a single or double (reply) stamped card or a single or double postcard. Stamped cards are available from USPS with postage imprinted on \nthem. Postcards are commercially available or privately printed mailing cards. To \nbe eligible for card pricing, a card and each half of a double card must meet the physical standards in 6.2 and the applicable eligibility for the price claimed. \nIneligible cards are subject to letter-size pricing. \n6.2.2   Postcard Dimensions\nEach card and part of a double card claimed at card pricing must be the following: \na. Rectangular.b. Not less than 3-1/2 inches high, 5 inches long, and 0.007 inch thick.\nc. Not more than 4-1/4 inches high, or more than 6 inches long, or greater than \n0.016 inch thick.\nd. Not more than 3.5 ounces (Charge flat-size prices for First-Class Mail \ncard-type pieces over 3.5 ounces.)\n\n  Instruction: Based on the above documents, provide a detailed answer for the user question below.\n  Answer \"don't know\" if not present in the document.\n  ",
       "llm.input_messages.1.message.role": "user",
       "llm.input_messages.1.message.content": "Hello",
       "llm.model_name": "gpt-4-turbo-preview",
       "llm.invocation_parameters": "{\"temperature\": 0.1, \"model\": \"gpt-4-turbo-preview\"}",
       "output.value": "How are you?" },
   "events": [],
   "links": [],
   "resource": {
       "attributes": {},
       "schema_url": ""
   }
}
{
     "name": "retrieve",
     "context": {
         "trace_id": "0x6c80880dbeb609e2ed41e06a6397a0dd",
         "span_id": "0x03f3466720f4bfc7",
         "trace_state": "[]"
     },
     "kind": "SpanKind.INTERNAL",
     "parent_id": "0x7eb5df0046c77cd2",
     "start_time": "2024-05-08T21:46:11.044464Z",
     "end_time": "2024-05-08T21:46:11.465803Z",
     "status": {
         "status_code": "OK"
     },
     "attributes": {
         "openinference.span.kind": "RETRIEVER",
         "input.value": "tell me about postal service",
         "retrieval.documents.0.document.id": "6d4e27be-1d6d-4084-a619-351a44834f38",
         "retrieval.documents.0.document.score": 0.7711453293100421,
         "retrieval.documents.0.document.content": "<document-chunk-1>",       
         "retrieval.documents.0.document.metadata": "{\"page_label\": \"7\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}",
         "retrieval.documents.1.document.id": "869d9f6d-db9a-43c4-842f-74bd8d505147",
         "retrieval.documents.1.document.score": 0.7672439175862021,
         "retrieval.documents.1.document.content": "<document-chunk-2>",
         "retrieval.documents.1.document.metadata": "{\"page_label\": \"6\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}",
         "retrieval.documents.2.document.id": "72b5cb6b-464f-4460-b497-cc7c09d1dbef",
         "retrieval.documents.2.document.score": 0.7647611816897794,
         "retrieval.documents.2.document.content": "<document-chunk-3>",
         "retrieval.documents.2.document.metadata": "{\"page_label\": \"4\", \"file_name\": \"/data/101.pdf\", \"file_path\": \"/data/101.pdf\", \"file_type\": \"application/pdf\", \"file_size\": 47931, \"creation_date\": \"2024-04-12\", \"last_modified_date\": \"2024-04-12\"}"
     },
     "events": [],
     "links": [],
     "resource": {
         "attributes": {},
         "schema_url": ""
     }
 }

OpenAI Agents

Build multi-agent workflows with OpenAI Agents

OpenAI Agents is a lightweight Python library for building agentic AI apps. It includes a few abstractions:

  • Agents, which are LLMs equipped with instructions and tools

  • Handoffs, which allow agents to delegate to other agents for specific tasks

  • Guardrails, which enable the inputs to agents to be validated

This guide outlines common agent workflows using this SDK. We will walk through building an investment agent across several use cases.

Design Considerations and Limitations

Key features and limitations:

  • Model support: First-class support for OpenAI LLMs and basic support for any LLM via a LiteLLM wrapper. Supports the reasoning effort parameter to trade off reduced latency against increased accuracy.

  • Structured outputs: First-class support with OpenAI LLMs. LLMs that do not support json_schema as a parameter are not supported.

  • Tools: Very easy, using the @function_tool decorator. Supports parallel tool calls to reduce latency. Built-in support in the OpenAI SDK for WebSearchTool, ComputerTool, and FileSearchTool.

  • Agent handoff: Very easy using the handoffs variable.

  • Multimodal support: Voice support; no support for images or video.

  • Guardrails: Enables validation of both inputs and outputs.

  • Retry logic: ⚠️ No retry logic; developers must manually handle failure cases.

  • Memory: ⚠️ No built-in memory management; developers must manage their own conversation and user memory.

  • Code execution: ⚠️ No built-in support for executing code.

Simple agent

An LLM agent with access to tools is the most basic flow. This agent answers questions about stocks and uses OpenAI web search to get real-time information.

from agents import Agent, Runner, WebSearchTool

agent = Agent(
    name="Finance Agent",
    instructions="You are a finance agent that can answer questions about stocks. Use web search to retrieve up‑to‑date context. Then, return a brief, concise answer that is one sentence long.",
    tools=[WebSearchTool()],
    model="gpt-4.1-mini",
)
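
As a quick usage sketch (assuming an OPENAI_API_KEY in the environment; Runner is imported above), the agent can be run and its final answer printed:

result = Runner.run_sync(agent, "What is Apple's current stock price?")
print(result.final_output)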

Prompt chaining

This agent builds a portfolio of stocks and ETFs using multiple agents linked together:

  1. Search Agent: Searches the web for information on particular stock tickers.

  2. Report Agent: Creates a portfolio of stocks and ETFs that supports the user's investment strategy.

from agents import Agent, ModelSettings, WebSearchTool

# Portfolio is the structured output model defined later in this guide (see "Orchestrator worker").
portfolio_agent = Agent(
    name="Portfolio Agent",
    instructions="You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation.",
    model="o4-mini",
    output_type=Portfolio,
)

research_agent = Agent(
    name="FinancialSearchAgent",
    instructions="You are a research assistant specializing in financial topics. Given an investment strategy, use web search to retrieve up‑to‑date context and produce a short summary of stocks that support the investment strategy at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
    model="gpt-4.1",
    tools=[WebSearchTool()],
    model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
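
A minimal sketch of chaining the two agents (assuming the SDK's Runner and a user-provided strategy string); the research summary from the first agent becomes the input of the second:

import asyncio

from agents import Runner

async def build_portfolio(strategy: str):
    # Step 1: research stocks and ETFs that fit the strategy
    research = await Runner.run(research_agent, strategy)
    # Step 2: pass the research summary to the portfolio agent
    result = await Runner.run(portfolio_agent, research.final_output)
    return result.final_output  # a Portfolio instance

portfolio = asyncio.run(build_portfolio("Dividend growth with moderate risk"))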

Parallelization

This agent researches stocks for you. If we want to research 5 stocks, we can force the agent to run multiple tool calls in parallel instead of sequentially.

from textwrap import dedent

from agents import Agent, ModelSettings, WebSearchTool, function_tool

@function_tool
def get_stock_data(ticker_symbol: str) -> dict:
    """
    Get stock data for a given ticker symbol.
    Args:
        ticker_symbol: The ticker symbol of the stock to get data for.
    Returns:
        A dictionary containing stock data such as price, market cap, and more.
    """
    import yfinance as yf
    stock = yf.Ticker(ticker_symbol)
    return stock.info

research_agent = Agent(
    name="FinancialSearchAgent",
    instructions=dedent(
        """You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
    ),
    model="gpt-4.1",
    tools=[WebSearchTool(), get_stock_data],
    model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)
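
To research several tickers at once, one option (a sketch using asyncio with the SDK's Runner) is to start one agent run per ticker and gather the results concurrently:

import asyncio

from agents import Runner

async def research_many(tickers: list[str]) -> list[str]:
    # Launch one research run per ticker and wait for all of them together
    results = await asyncio.gather(
        *(Runner.run(research_agent, f"Summarize recent news and data for {t}") for t in tickers)
    )
    return [r.final_output for r in results]

summaries = asyncio.run(research_many(["AAPL", "MSFT", "NVDA", "AMZN", "GOOG"]))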

Router agent

This agent answers questions about investing using multiple agents. A central router agent chooses which worker to use.

  1. Research Agent: Searches the web for information about stocks and ETFs.

  2. Question Answering Agent: Answers questions about investing like Warren Buffett.

from agents import Agent, WebSearchTool

qa_agent = Agent(
    name="Investing Q&A Agent",
    instructions="You are Warren Buffett. You are answering questions about investing.",
    model="gpt-4.1",
)

research_agent = Agent(
    name="Financial Search Agent",
    instructions="You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst.",
    model="gpt-4.1",
    tools=[WebSearchTool()],
)

orchestrator_agent = Agent(
    name="Routing Agent",
    instructions="You are a senior financial analyst. Your task is to handoff to the appropriate agent or tool.",
    model="gpt-4.1",
    handoffs=[research_agent,qa_agent],
)
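
A minimal usage sketch (assuming the SDK's Runner); the routing agent hands the question off to whichever worker fits it:

from agents import Runner

result = Runner.run_sync(orchestrator_agent, "What does Warren Buffett look for when picking a stock?")
print(result.final_output)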

Evaluator-Optimizer

When creating LLM outputs, the first generation is often unsatisfactory. You can use an agentic loop to iteratively improve the output: ask an LLM to give feedback, then use that feedback to improve the output.

This agent pattern creates reports and evaluates itself to improve its output.

  1. Report Agent (Generation): Creates a report on a particular stock ticker.

  2. Evaluator Agent (Feedback): Evaluates the report and provides feedback on what to improve.

from textwrap import dedent
from typing import Literal

from pydantic import BaseModel, Field

from agents import Agent

# CATALYSTS is assumed to be defined elsewhere in the tutorial (a string listing catalyst categories).
class EvaluationFeedback(BaseModel):
    feedback: str = Field(
        description=f"What is missing from the research report on positive and negative catalysts for a particular stock ticker. Catalysts include changes in {CATALYSTS}.")
    score: Literal["pass", "needs_improvement", "fail"] = Field(
        description="A score on the research report. Pass if the report is complete and contains at least 3 positive and 3 negative catalysts for the right stock ticker, needs_improvement if the report is missing some information, and fail if the report is completely wrong.")


report_agent = Agent(
    name="Catalyst Report Agent",
    instructions=dedent(
        """You are a research assistant specializing in stock research. Given a stock ticker, generate a report of 3 positive and 3 negative catalysts that could move the stock price in the future in 50 words or less."""
    ),
    model="gpt-4.1",
)

evaluation_agent = Agent(
    name="Evaluation Agent",
    instructions=dedent(
        """You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
    ),
    model="gpt-4.1",
    output_type=EvaluationFeedback,
)
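
A rough sketch of the generate-evaluate loop (assuming the SDK's Runner and a small retry cap); the evaluator's feedback is folded into the next generation prompt until the report passes:

import asyncio

from agents import Runner

async def write_report(ticker: str, max_rounds: int = 3) -> str:
    prompt = f"Write a catalyst report for {ticker}."
    report = ""
    for _ in range(max_rounds):
        report = (await Runner.run(report_agent, prompt)).final_output
        feedback = (await Runner.run(evaluation_agent, report)).final_output
        if feedback.score == "pass":
            break
        # Ask for a revision that addresses the evaluator's feedback
        prompt = f"Revise this catalyst report for {ticker}: {report}\nAddress this feedback: {feedback.feedback}"
    return report

report = asyncio.run(write_report("NVDA"))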

Orchestrator worker

This is the most advanced pattern in the examples, using orchestrators and workers together. The orchestrator chooses which worker to use for a specific sub-task. The worker attempts to complete the sub-task and returns a result. The orchestrator then uses that result to choose the next worker, until a final result is returned.

In the following example, we'll build an agent which creates a portfolio of stocks and ETFs based on a user's investment strategy.

  1. Orchestrator: Chooses which worker to use based on the user's investment strategy.

  2. Research Agent: Searches the web for information about stocks and ETFs that could support the user's investment strategy.

  3. Evaluation Agent: Evaluates the research report and provides feedback on what data is missing.

  4. Portfolio Agent: Creates a portfolio of stocks and ETFs based on the research report.

from textwrap import dedent

from agents import Agent, ModelSettings, WebSearchTool

evaluation_agent = Agent(
    name="Evaluation Agent",
    instructions=dedent(
        """You are a senior financial analyst. You will be provided with a stock research report with positive and negative catalysts. Your task is to evaluate the report and provide feedback on what to improve."""
    ),
    model="gpt-4.1",
    output_type=EvaluationFeedback,
)

portfolio_agent = Agent(
    name="Portfolio Agent",
    instructions=dedent(
        """You are a senior financial analyst. You will be provided with a stock research report. Your task is to create a portfolio of stocks and ETFs that could support the user's stated investment strategy. Include facts and data from the research report in the stated reasons for the portfolio allocation."""
    ),
    model="o4-mini",
    output_type=Portfolio,
)

research_agent = Agent(
    name="FinancialSearchAgent",
    instructions=dedent(
        """You are a research assistant specializing in financial topics. Given a stock ticker, use web search to retrieve up‑to‑date context and produce a short summary of at most 50 words. Focus on key numbers, events, or quotes that will be useful to a financial analyst."""
    ),
    model="gpt-4.1",
    tools=[WebSearchTool()],
    model_settings=ModelSettings(tool_choice="required", parallel_tool_calls=True),
)

orchestrator_agent = Agent(
    name="Routing Agent",
    instructions=dedent("""You are a senior financial analyst. You are trying to create a portfolio based on my stated investment strategy. Your task is to handoff to the appropriate agent or tool.

    First, handoff to the research_agent to give you a report on stocks and ETFs that could support the user's stated investment strategy.
    Then, handoff to the evaluation_agent to give you a score on the research report. If the evaluation_agent returns a needs_improvement or fail, continue using the research_agent to gather more information.
    Once the evaluation_agent returns a pass, handoff to the portfolio_agent to create a portfolio."""),
    model="gpt-4.1",
    handoffs=[
        research_agent,
        evaluation_agent,
        portfolio_agent,
    ],
)

This uses the following structured outputs.

from typing import Literal

from pydantic import BaseModel, Field

class PortfolioItem(BaseModel):
    ticker: str = Field(description="The ticker of the stock or ETF.")
    allocation: float = Field(
        description="The percentage allocation of the ticker in the portfolio. The sum of all allocations should be 100."
    )
    reason: str = Field(description="The reason why this ticker is included in the portfolio.")


class Portfolio(BaseModel):
    tickers: list[PortfolioItem] = Field(
        description="A list of tickers that could support the user's stated investment strategy."
    )


class EvaluationFeedback(BaseModel):
    feedback: str = Field(
        description="What data is missing in order to create a portfolio of stocks and ETFs based on the user's investment strategy."
    )
    score: Literal["pass", "needs_improvement", "fail"] = Field(
        description="A score on the research report. Pass if you have at least 5 tickers with data that supports the user's investment strategy to create a portfolio, needs_improvement if you do not have enough supporting data, and fail if you have no tickers."
    )
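
A hedged usage sketch (assuming the SDK's Runner): a single run of the routing agent drives the research, evaluation, and portfolio steps through handoffs, and the final output is a Portfolio:

from agents import Runner

result = Runner.run_sync(
    orchestrator_agent,
    "Create a portfolio focused on renewable energy with moderate risk.",
)
print(result.final_output)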