Not sure where to start? Try a quickstart:
Track and analyze multi-turn conversations
Sessions enable tracking and organizing related traces across multi-turn conversations with your AI application. When building conversational AI, maintaining context between interactions is critical - Sessions make this possible from an observability perspective.
With Sessions in Phoenix, you can:
Track the entire history of a conversation in a single thread
View conversations in a chatbot-like UI showing inputs and outputs of each turn
Search through sessions to find specific interactions
Track token usage and latency per conversation
This feature is particularly valuable for applications where context builds over time, like chatbots, virtual assistants, or any other multi-turn interaction. By tagging spans with a consistent session ID, you create a connected view that reveals how your application performs across an entire user journey.
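As a minimal sketch (assuming the using_session helper from the openinference-instrumentation package; the handler function is illustrative), tagging spans with a consistent session ID might look like this:

```python
from openinference.instrumentation import using_session


def handle_user_message(session_id: str, user_message: str) -> str:
    # Every span created inside this block carries the same session.id attribute,
    # so Phoenix can group the turns of the conversation into a single session.
    with using_session(session_id=session_id):
        return run_llm_turn(user_message)  # hypothetical application logic
```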
Learn how to use the phoenix.otel library.
Learn how you can use basic OpenTelemetry to instrument your application.
Learn how to use Phoenix's decorators to easily instrument specific methods or code blocks in your application.
Setup tracing for your TypeScript application.
Learn about Projects in Phoenix, and how to use them.
Understand Sessions and how they can be used to group user conversations.
Tracing is a critical part of AI Observability and should be used both in production and development
Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.
This section contains details on Tracing features:
Learn how to block PII from logging to Phoenix
Learn how to selectively block or turn off tracing
Learn how to send only certain spans to Phoenix
Learn how to trace images
Tracing can be paused temporarily or disabled permanently.
If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.
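A minimal sketch, assuming suppress_tracing is importable from phoenix.trace and a hypothetical chunking function:

```python
from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Spans created here (e.g. during document chunking) are not sent to Phoenix.
    chunks = chunk_documents(documents)  # hypothetical helper
```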
Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.
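For illustration, the calls typically look like this, assuming the corresponding OpenInference instrumentor packages are installed:

```python
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor

# Remove tracing permanently for each auto-instrumentor.
LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
```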
Learn more about options.
Check out how to
Tracing can be augmented and customized by adding Metadata. Metadata includes your own custom attributes, user ids, session ids, prompt templates, and more.
Add Attributes, Metadata, Users
Learn how to add custom metadata and attributes to your traces
Instrument Prompt Templates and Prompt Variables
Learn how to define custom prompt templates and variables in your tracing.
Use projects to organize your LLM traces
Projects provide organizational structure for your AI applications, allowing you to logically separate your observability data. This separation is essential for maintaining clarity and focus.
With Projects, you can:
Segregate traces by environment (development, staging, production)
Isolate different applications or use cases
Track separate experiments without cross-contamination
Maintain dedicated evaluation spaces for specific initiatives
Create team-specific workspaces for collaborative analysis
Projects act as containers that keep related traces and conversations together while preventing them from interfering with unrelated work. This organization becomes increasingly valuable as you scale - allowing you to easily switch between contexts without losing your place or mixing data.
The Project structure also enables comparative analysis across different implementations, models, or time periods. You can run parallel versions of your application in separate projects, then analyze the differences to identify improvements or regressions.
In order to improve your LLM application iteratively, it's vital to collect feedback, annotate data during human review, as well as to establish an evaluation pipeline so that you can monitor your application. In Phoenix we capture this type of feedback in the form of annotations.
Phoenix gives you the ability to annotate traces with feedback from the UI, your application, or wherever you would like to perform evaluation. Phoenix's annotation model is simple yet powerful - given an entity such as a span that is collected, you can assign a label and/or a score to that entity.
Navigate to the Feedback tab in this demo trace to see how LLM-based evaluations appear in Phoenix:
Learn more about the underlying concepts in Concepts: Annotations.
Configure Annotation Configs to guide human annotations.
Learn how to run evals on traces (Running Evals on Traces).
Learn how to log annotations via the client from your app or in a notebook
Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:
Track performance over time
Identify areas for improvement
Compare different model versions or prompts
Gather data for fine-tuning or retraining
Provide stakeholders with concrete metrics on system effectiveness
Phoenix allows you to annotate traces through the Client, the REST API, or the UI.
To learn how to configure annotations and to annotate through the UI, see Annotating in the UI
To learn how to add human labels to your traces, either manually or programmatically, see Annotating via the Client
To learn how to evaluate traces captured in Phoenix, see Running Evals on Traces
To learn how to upload your own evaluation labels into Phoenix, see Log Evaluation Results
For more background on the concept of annotations, see Annotations
Before accessing px.Client(), be sure you've set the following environment variables:
If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.
You can also launch a temporary version of Phoenix in your local notebook to quickly view the traces. But be warned, this Phoenix instance will only last as long as your notebook environment is running.
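For example, a temporary in-notebook instance can be launched like this (a sketch using the phoenix package):

```python
import phoenix as px

# Launch an in-memory Phoenix server scoped to this notebook session.
session = px.launch_app()
print(session.url)  # open the UI in your browser
```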
Version and track changes made to prompt templates
Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse and consistency, and experiment with variations across different models and inputs.
Key benefits of prompt management include:
Reusability: Store and load prompts across different use cases.
Versioning: Track changes over time to ensure that the best performing version is deployed for use in your application.
Collaboration: Share prompts with others to maintain consistency and facilitate iteration.
To learn how to get started with prompt management, see Create a prompt
Unlike traditional software, AI applications are non-deterministic and depend on natural language to provide context and guide model output. The pieces of natural language and associated model parameters embedded in your program are known as “prompts.”
Optimizing your prompts is typically the highest-leverage way to improve the behavior of your application, but “prompt engineering” comes with its own set of challenges. You want to be confident that changes to your prompts have the intended effect and don’t introduce regressions.
To get started, jump to Quickstart: Prompts.
Phoenix offers a comprehensive suite of features to streamline your prompt engineering workflow.
Pull and push prompt changes via Phoenix's Python and TypeScript Clients
Using Phoenix as a backend, Prompts can be managed and manipulated via code by using our Python or TypeScript SDKs.
With the Phoenix Client SDK you can:
To learn more about managing Prompts in code, see Using a prompt
Testing your prompts before you ship them is vital to deploying reliable AI applications
The Playground is a fast and efficient way to refine prompt variations. You can load previous prompts and validate their performance by applying different variables.
Each single-run test in the Playground is recorded as a span in the Playground project, allowing you to revisit and analyze LLM invocations later. These spans can be added to datasets or reloaded for further testing.
The ideal way to test a prompt is to construct a golden dataset where each example contains the variables to be applied to the prompt in its inputs and the ideal answer you want from the LLM in its outputs. This way you can run a given prompt over N examples at once and compare the synthesized answers against the golden answers.
Prompt Playground supports side-by-side comparisons of multiple prompt variants. Click + Compare to add a new variant. Whether using Span Replay or testing prompts over a Dataset, the Playground processes inputs through each variant and displays the results for easy comparison.
Replay LLM spans traced in your application directly in the playground
Have you ever wanted to go back into a multi-step LLM chain and just replay one step to see if you could get a better outcome? Well you can with Phoenix's Span Replay. LLM spans that are stored within Phoenix can be loaded into the Prompt Playground and replayed. Replaying spans inside of Playground enables you to debug and improve the performance of your LLM systems by comparing LLM provider outputs, tweaking model parameters, changing prompt text, and more.
Chat completions generated inside of Playground are automatically instrumented, and the recorded spans are immediately available to be replayed inside of Playground.
This example:
Continuously queries a LangChain application to send new traces and spans to your Phoenix session
Queries new spans once per minute and runs evals, including:
Hallucination
Q&A Correctness
Relevance
Logs evaluations back to Phoenix so they appear in the UI
The evaluation script is run as a cron job, enabling you to adjust the frequency of the evaluation job:
The above script can be run periodically to augment Evals in Phoenix.
The Phoenix app can be run in various environments such as Colab and SageMaker notebooks, as well as be served via the terminal or a docker container.
If you're using Phoenix Cloud, be sure to set the proper environment variables to connect to your instance:
To start phoenix in a notebook environment, run:
This will start a local Phoenix server. You can initialize the phoenix server with various kinds of data (traces, inferences).
If you want to start a phoenix server to collect traces, you can also run phoenix directly from the command line:
This will start the phoenix server on port 6006. If you are running your instrumented notebook or application on the same machine, traces should automatically be exported to http://127.0.0.1:6006, so no additional configuration is needed. However, if the server is running remotely, you will have to modify the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to that machine (e.g. http://<my-remote-machine>:<port>).
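For instance, pointing an instrumented application at a remote collector might look like the following sketch (the hostname and port are placeholders):

```python
import os

# Must be set before tracing is initialized.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://my-remote-machine:6006"
```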
Tracing the execution of LLM applications using Telemetry
Phoenix traces AI applications, via OpenTelemetry and has first-class integrations with LlamaIndex, Langchain, OpenAI, and others.
LLM tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation to provide a detailed timeline of the request's execution.
Using Phoenix's tracing capabilities can provide important insights into the inner workings of your LLM application. By analyzing the collected trace data, you can identify and address various performance and operational issues and improve the overall reliability and efficiency of your system.
Application Latency: Identify and address slow invocations of LLMs, Retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
Token Usage: Gain a detailed breakdown of token usage for your LLM calls, allowing you to identify and optimize the most expensive LLM invocations.
Runtime Exceptions: Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
Retrieved Documents: Inspect the documents retrieved during a Retriever call, including the score and order in which they were returned to provide insight into the retrieval process.
Embeddings: Examine the embedding text used for retrieval and the underlying embedding model to allow you to validate and refine your embedding strategies.
LLM Parameters: Inspect the parameters used when calling an LLM, such as temperature and system prompts, to ensure optimal configuration and debugging.
Prompt Templates: Understand the prompt templates used during the prompting step and the variables that were applied, allowing you to fine-tune and improve your prompting strategies.
Tool Descriptions: View the descriptions and function signatures of the tools your LLM has been given access to in order to better understand and control your LLM’s capabilities.
LLM Function Calls: For LLMs with function call capabilities (e.g., OpenAI), you can inspect the function selection and function messages in the input to the LLM, further improving your ability to debug and optimize your application.
By using tracing in Phoenix, you can gain increased visibility into your LLM application, empowering you to identify and address performance bottlenecks, optimize resource utilization, and ensure the overall reliability and effectiveness of your system.
Phoenix uses projects to group traces. If left unspecified, all traces are sent to a default project.
Projects work by setting something called the Resource attributes (as seen in the OTEL example above). The phoenix server uses the project name attribute to group traces into the appropriate project.
Typically you want traces for an LLM app to all be grouped in one project. However, while working with Phoenix inside a notebook, we provide a utility to temporarily associate spans with different projects. You can use this to trace things like evaluations.
How to annotate traces in the UI for analysis and dataset curation
To annotate data in the UI, you will first want to set up a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create various different types of annotations: Categorical, Continuous, and Freeform.
Once you have annotations configured, you can associate annotations to the data that you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application.
You can also take notes as you go by either clicking on the explain link or by adding your notes to the bottom messages UI.
You can always come back and edit or delete your annotations. Annotations can be deleted from the table view under the Annotations tab.
Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.
As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. whether the annotation was performed by a human, LLM, or code), and so on. This can be particularly useful if you want to see whether different annotators disagree.
Once you have collected feedback in the form of annotations, you can filter your traces by the annotation values to narrow down to interesting samples (e.g. LLM spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for things like experimentation, fine-tuning, or building a human-aligned eval.
Span annotations can be an extremely valuable basis for improving your application. The Phoenix client provides useful ways to pull down spans and their associated annotations. This information can be used to:
build new LLM judges
form the basis for new datasets
help identify ideas for improving your application
If you only want the spans that contain a specific annotation, you can pass in a query that filters on annotation names, scores, or labels.
The queries can also filter by annotation scores and labels.
This spans dataframe can be used to pull associated annotations.
Instead of an input dataframe, you can also pass in a list of ids:
The annotations and spans dataframes can be easily joined to produce a one-row-per-annotation dataframe that can be used to analyze the annotations!
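As a sketch, assuming you have already pulled a spans dataframe and an annotations dataframe with the Phoenix client (the column names below are illustrative), the join is a standard pandas operation:

```python
import pandas as pd

# spans_df: one row per span, indexed by span ID (pulled via the Phoenix client)
# annotations_df: one row per annotation, with a span_id column and a score column
joined_df = annotations_df.join(spans_df, on="span_id", how="left")

# One-row-per-annotation view; e.g. inspect low-scoring spans.
low_scores = joined_df[joined_df["score"] < 0.5]
```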
The velocity of AI application development is bottlenecked by quality evaluations because AI engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost. High quality evaluations are critical as they can help developers answer these types of questions with greater confidence.
Datasets are integral to evaluation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are used to run experiments and evaluations to track improvements to your prompt, LLM, or other parts of your LLM application.
In AI development, it's hard to understand how a change will affect performance. This breaks the dev flow, making iteration more guesswork than engineering.
Experiments and evaluations solve this, helping distill the indeterminism of LLMs into tangible feedback that helps you ship a more reliable product.
Specifically, good evals help you:
Understand whether an update is an improvement or a regression
Drill down into good / bad examples
Compare specific examples vs. prior runs
Avoid guesswork
Want to just use the contents of your dataset in another context? Simply click on the export to CSV button on the dataset page and you are good to go!
Fine-tuning lets you get more out of the models available by providing:
Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests. Phoenix natively exports OpenAI Fine-Tuning JSONL as long as the dataset contains compatible inputs and outputs.
Learn how to load a file of traces into Phoenix
Learn how to export trace data from Phoenix
When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway they took to get there. You want most of your runs to be consistent and not take unnecessary or wrong actions.
One way of doing this is to calculate convergence:
Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score: avg(minimum steps taken / steps taken for this run)
This will give a convergence score of 0-1, with 1 being a perfect score.
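A small sketch of this calculation in Python:

```python
def convergence_score(steps_per_run: list[int]) -> float:
    """avg(minimum steps taken / steps taken for this run), between 0 and 1."""
    min_steps = min(steps_per_run)
    return sum(min_steps / steps for steps in steps_per_run) / len(steps_per_run)


# Example: four runs of the agent on similar queries.
print(convergence_score([3, 3, 4, 6]))  # 1.0 would mean every run took the minimal path
```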
Sometimes while instrumenting your application, you may want to filter out or modify certain spans before they are sent to Phoenix. For example, you may want to filter out spans that contain sensitive or redundant information.
To do this, you can use a custom SpanProcessor and attach it to the OpenTelemetry TracerProvider.
Datasets are critical assets for building robust prompts, evals, fine-tuning, and more.
Phoenix Evals come with:
Speed - Phoenix evals are designed for maximum speed and throughput. Evals run in batches and typically run 10x faster than calling the APIs directly.
The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.
The following is the structure of the EMOTION_PROMPT_TEMPLATE:
The prompt and evaluation logic are part of the phoenix.evals.default_audio_templates module and are defined as:
EMOTION_AUDIO_RAILS: Output options for the evaluation template.
EMOTION_PROMPT_TEMPLATE: Prompt used for evaluating audio emotions.
Use to mark functions and code blocks.
Use to capture all calls made to supported frameworks.
Use instrumentation. Supported in and
Phoenix supports loading data that contains . This allows you to load an existing dataframe of traces into your Phoenix instance.
Usually these will be traces you've previously saved using .
- Create, store, modify, and deploy prompts for interacting with LLMs
- Play with prompts, models, invocation parameters and track your progress via tracing and experiments
- Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
- Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.
Phoenix's Prompt Playground makes the process of iterating and testing prompts quick and easy. Phoenix's playground supports a variety of model providers (OpenAI, Anthropic, Gemini, Azure) as well as custom model endpoints, making it the ideal prompt IDE for you to build, experiment with, and evaluate prompts and models for your task.
Speed: Rapidly test variations in the , model, invocation parameters, , and output format.
Reproducibility: All runs of the playground are recorded, unlocking annotations and evaluation.
Datasets: Use as a fixture to run a prompt variant through its paces and to evaluate it systematically.
Prompt Management: directly within the playground.
prompts dynamically
templates by name, version, or tag
templates with runtime variables and use them in your code. Native support for OpenAI, Anthropic, Gemini, Vercel AI SDK, and more. No proprietary client necessary.
Support for and Execute tools defined within the prompt. Phoenix prompts encompasses more than just the text and messages.
Playground integrates with to help you iterate and incrementally improve your prompts. Experiment runs are automatically recorded and available for subsequent evaluation to help you understand how changes to your prompts, LLM model, or invocation parameters affect performance.
Sometimes you may want to test a prompt and run evaluations on a given prompt. This can be particularly useful when custom manipulation is needed (e.g. you are trying to iterate on a system prompt with a variety of different chat messages). This tutorial is coming soon.
Prompts in Phoenix can be created, iterated on, versioned, tagged, and used either via the UI or our Python/TS SDKs. The UI option also includes our Prompt Playground, which allows you to compare prompt variations side-by-side in the Phoenix UI.
You can use cron to run evals client-side as your traces and spans are generated, augmenting your dataset with evaluations in an online manner. View the .
If you are set up, see to start using Phoenix in your preferred environment.
Phoenix Cloud provides free-to-use Phoenix instances that are preconfigured for you with 10GB of storage space. Phoenix Cloud instances are a great starting point; however, if you need more storage or more control over your instance, self-hosting options could be a better fit.
See .
Tracing is a helpful tool for understanding how your LLM application works. Phoenix offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework. Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks, SDKs, and languages (Python, JavaScript, etc.).
To get started, check out the .
Read more about and
Check out the for specific tutorials.
In the notebook, you can set the PHOENIX_PROJECT_NAME environment variable before adding instrumentation or running any of your code.
In Python this would look like:
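For example (a sketch; the project name is a placeholder):

```python
import os

# Must be set BEFORE instrumentation is initialized.
os.environ["PHOENIX_PROJECT_NAME"] = "my-llm-app"
```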
Note that setting a project via an environment variable only works in a notebook and must be done BEFORE instrumentation is initialized. If you are using OpenInference Instrumentation, see the Server tab for how to set the project name in the Resource attributes.
Alternatively, you can set the project name in your register function call:
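For example, a sketch using phoenix.otel (the project name is a placeholder):

```python
from phoenix.otel import register

# Spans created with this tracer provider are grouped under the given project.
tracer_provider = register(project_name="my-llm-app")
```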
If you are using Phoenix as a collector and running your application separately, you can set the project name in the Resource attributes for the trace provider.
Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. OpenAI Evals offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals. Phoenix can natively export the OpenAI Evals format as JSONL so you can use it with OpenAI Evals. See for details.
- how to configure API keys for OpenAI, Anthropic, Gemini, and more.
- how to create, update, and track prompt changes
- how to test changes to a prompt in the playground and in the notebook
- how to mark certain prompt versions as ready for
- how to integrate prompts into your code and experiments
- how to setup the playground and how to test prompt changes via datasets and experiments.
In this example, we're filtering out any spans that have the name "secret_span" by bypassing the on_start and on_end hooks of the inherited BatchSpanProcessor.
Notice that this logic can be extended to modify a span and redact sensitive information if preserving the span is preferred.
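The described processor might be sketched as follows (the exporter endpoint is a placeholder):

```python
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class FilteringSpanProcessor(BatchSpanProcessor):
    def _filter(self, span) -> bool:
        # Decide which spans to drop; extend this to redact attributes instead.
        return span.name == "secret_span"

    def on_start(self, span, parent_context=None):
        if self._filter(span):
            return  # skip spans we don't want to record
        super().on_start(span, parent_context)

    def on_end(self, span):
        if self._filter(span):
            return  # never queue the span for export
        super().on_end(span)


tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    FilteringSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
```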
- how to quickly download a dataset to use elsewhere
- want to fine tune an LLM for better accuracy and cost? Export llm examples for fine-tuning.
- have some good examples to use for benchmarking of llms using OpenAI evals? export to OpenAI evals format.
The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using an LLM. The Phoenix evals library is designed for simple, fast, and accurate LLM-based evaluation.
Pre-built evals - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling. Learn more about pretested templates . Each eval is pre-tested on a variety of eval models. Find the most up-to-date template on .
Run evals on your own data - takes a dataframe as its primary input and output, making it easy to run evaluations on your own data - whether that's logs, traces, or datasets downloaded for benchmarking.
Built-in Explanations - All Phoenix evaluations include an that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your eval.
- Phoenix lets you configure which foundation model you'd like to use as a judge. This includes OpenAI, Anthropic, Gemini, and much more. See
This prompt template is heavily inspired by the paper: .
SQL generation is a common way to use an LLM. In many cases the goal is to take a human description of a query and generate SQL that matches that description.
Example of a Question: How many artists have names longer than 10 characters?
Example Query Generated:
SELECT COUNT(ArtistId) FROM artists WHERE LENGTH(Name) > 10
The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.
Run evaluations via a job to visualize in the UI as traces stream in.
Evaluate traces captured in Phoenix and export results to the Phoenix UI.
Evaluate tasks with multiple inputs/outputs (ex: text, audio, image) using versatile evaluation tasks.
Teams that are using conversation bots and assistants desire to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back and forth or an entire span to detect whether a user has become frustrated by the conversation.
The following code snippet shows how to use the above eval template:
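A hedged sketch using phoenix.evals' llm_classify; the template and rails names below are assumed to be the user-frustration defaults shipped with phoenix.evals, and the dataframe df is assumed to hold the conversation to evaluate (check your installed version for exact names):

```python
from phoenix.evals import (
    OpenAIModel,
    USER_FRUSTRATION_PROMPT_RAILS_MAP,
    USER_FRUSTRATION_PROMPT_TEMPLATE,
    llm_classify,
)

model = OpenAIModel(model="gpt-4o", temperature=0.0)
rails = list(USER_FRUSTRATION_PROMPT_RAILS_MAP.values())

# df contains the single back-and-forth (or full conversation) to evaluate.
frustration_evals = llm_classify(
    dataframe=df,
    template=USER_FRUSTRATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
)
```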
How to import prompts and responses from a Large Language Model (LLM)
Below is a relevant subsection of the dataframe. The embedding of the prompt is also shown.

| prompt | embedding | response |
| --- | --- | --- |
| who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | Neil Alden Armstrong |
| who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | Francis Michael Forde |
Define the inferences by pairing the dataframe with the schema.
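A sketch of the pairing, assuming the column names shown above (prompt, embedding, response):

```python
import phoenix as px

primary_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="prompt",
    ),
    response_column_names="response",
)
# df is the dataframe shown above.
primary_inferences = px.Inferences(dataframe=df, schema=primary_schema, name="primary")
```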
How to create Phoenix inferences and schemas for the corpus data
Below is an example dataframe containing Wikipedia articles along with their embedding vectors.

| id | text | embedding |
| --- | --- | --- |
| 1 | Voyager 2 is a spacecraft used by NASA to expl... | [-0.02785328, -0.04709944, 0.042922903, 0.0559... |
| 2 | The Saturn Nebula is a planetary nebula in th... | [0.03544901, 0.039175965, 0.014074919, -0.0307... |
| 3 | Eris is a dwarf planet and a trans-Neptunian o... | [0.05506449, 0.0031612846, -0.020452883, -0.02... |
Below is an appropriate schema for the dataframe above. It specifies the id column and that embedding belongs to text. Other columns, if they exist, will be detected automatically and need not be specified by the schema.
Define the inferences by pairing the dataframe with the schema.
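A sketch of the corpus pairing, assuming the id, text, and embedding columns shown above:

```python
import phoenix as px

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)
# corpus_df is the dataframe of Wikipedia articles shown above.
corpus_inferences = px.Inferences(dataframe=corpus_df, schema=corpus_schema, name="corpus")
```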
Many LLM applications use a technique called Retrieval Augmented Generation. These applications retrieve data from their knowledge base to help the LLM accomplish tasks with the appropriate context.
However, these retrieval systems can still hallucinate or provide answers that are not relevant to the user's input query. We can evaluate retrieval systems by checking for:
Are there certain types of questions the chatbot gets wrong more often?
Are the documents that the system retrieves irrelevant? Do we have the right documents to answer the question?
Does the response match the provided documents?
Phoenix supports retrieval troubleshooting and evaluation on both traces and inferences, but inferences are currently required to visualize your retrievals using UMAP. See below for the differences.
| | Traces & Spans | Inferences |
| --- | --- | --- |
| Troubleshooting for LLM applications | ✅ | ✅ |
| Follow the entirety of an LLM workflow | ✅ | 🚫 support for spans only |
| Embeddings Visualizer | 🚧 on the roadmap | ✅ |
How to import data for the Retrieval-Augmented Generation (RAG) use case
| query | embedding | retrieved_document_ids | relevance_scores |
| --- | --- | --- | --- |
| who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | [7395, 567965, 323794, ... | [11.30, 7.67, 5.85, ... |
| who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | [38906, 38909, 38912, ... | [11.28, 9.10, 8.39, ... |
| why is amino group in aniline an ortho para di... | [-0.0431, -0.0407, -0.0597, ... | [779579, 563725, 309367, ... | [-10.89, -10.90, -10.94, ... |
Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.
Define the inferences by pairing the dataframe with the schema.
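A sketch of such a schema, assuming the column names shown above and the RetrievalEmbeddingColumnNames helper:

```python
import phoenix as px

query_schema = px.Schema(
    prompt_column_names=px.RetrievalEmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="query",
        context_retrieval_ids_column_name="retrieved_document_ids",
        context_retrieval_scores_column_name="relevance_scores",
    ),
)
# query_df is the dataframe of queries, retrievals, and scores shown above.
query_inferences = px.Inferences(dataframe=query_df, schema=query_schema, name="queries")
```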
AI Observability and Evaluation
Running Phoenix for the first time? Select a quickstart below.
Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.
Add instrumentation for popular packages and libraries such as OpenAI, LangGraph, Vercel AI SDK and more.
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
Phoenix supports three main options to collect traces:
This example uses options 1 and 2.
To collect traces from your application, you must configure an OpenTelemetry TracerProvider to send traces to Phoenix.
Functions can be traced using decorators:
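A minimal sketch, assuming the OpenInference tracer decorators exposed via phoenix.otel (the function below is illustrative):

```python
from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")
tracer = tracer_provider.get_tracer(__name__)


@tracer.chain
def my_func(question: str) -> str:
    # The decorator records this call as a span in Phoenix.
    return f"You asked: {question}"


my_func("What is Phoenix?")
```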
Input and output attributes are set automatically based on my_func's parameters and return value.
OpenInference libraries must be installed before calling the register function
You should now see traces in Phoenix!
phoenix.otel is a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults.
These defaults are aware of environment variables you may have set to configure Phoenix:
PHOENIX_COLLECTOR_ENDPOINT
PHOENIX_PROJECT_NAME
PHOENIX_CLIENT_HEADERS
PHOENIX_API_KEY
PHOENIX_GRPC_PORT
phoenix.otel.register
The phoenix.otel module provides a high-level register function to configure OpenTelemetry tracing by setting a global TracerProvider. The register function can also configure headers and whether or not to process spans one by one or by batch.
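For example, a sketch of a register call (the project name and header value are placeholders):

```python
from phoenix.otel import register

tracer_provider = register(
    project_name="my-llm-app",
    headers={"authorization": "Bearer <your-api-key>"},
    batch=True,  # process spans in batches instead of one by one
)
```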
If the PHOENIX_API_KEY environment variable is set, register will automatically add an authorization header to each span payload.
There are two ways to configure the collector endpoint:
Using environment variables
Using the endpoint keyword argument
If you're setting the PHOENIX_COLLECTOR_ENDPOINT environment variable, register will automatically try to send spans to your Phoenix server using gRPC.
When passing in the endpoint argument directly, you must specify the fully qualified endpoint. If the PHOENIX_GRPC_PORT environment variable is set, it will override the default gRPC port.
The HTTP transport protocol is inferred from the endpoint
The GRPC transport protocol is inferred from the endpoint
Additionally, the protocol argument can be used to enforce the OTLP transport protocol regardless of the endpoint. This might be useful in cases such as when the GRPC endpoint is bound to a different port than the default (4317). The valid protocols are: "http/protobuf" and "grpc".
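Two hedged examples (hosts and ports are placeholders):

```python
from phoenix.otel import register

# Fully qualified HTTP endpoint; the http/protobuf protocol is inferred.
register(endpoint="http://localhost:6006/v1/traces")

# gRPC endpoint bound to a non-default port; force the protocol explicitly.
register(endpoint="http://localhost:9999", protocol="grpc")
```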
register can be configured with different keyword arguments:
project_name: the Phoenix project name (or set the PHOENIX_PROJECT_NAME env. var)
headers: headers to send along with each span payload (or set the PHOENIX_CLIENT_HEADERS env. var)
batch: whether or not to process spans in batch
Once you've connected your application to your Phoenix instance using phoenix.otel.register, you need to instrument your application. You have a few options to do this:
In some situations, you may need to modify the observability level of your tracing. For instance, you may want to keep sensitive information from being logged for security reasons, or you may want to limit the size of the base64-encoded images logged in order to reduce payload size.
The OpenInference Specification defines a set of environment variables you can configure to suit your observability needs. In addition, the OpenInference auto-instrumentors accept a trace config which allows you to set these values in code without having to set environment variables, if that's what you prefer.
The possible settings are:
To set up this configuration you can either:
Set environment variables as specified above
Define the configuration in code as shown below
Do nothing and fall back to the default values
Use a combination of the three; the order of precedence is:
Values set in the TraceConfig in code
Environment variables
Default values
Below is an example of how to set these values in code using our OpenAI Python and JavaScript instrumentors; however, the config is respected by all of our auto-instrumentors.
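A sketch for the Python OpenAI instrumentor, assuming the TraceConfig options shown (field names may differ by version):

```python
from openinference.instrumentation import TraceConfig
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")

config = TraceConfig(
    hide_inputs=True,                # keep sensitive inputs out of span attributes
    base64_image_max_length=10_000,  # truncate large base64-encoded images
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider, config=config)
```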
Prompt playground can be accessed from the left navbar of Phoenix.
From here, you can directly prompt your model by modifying either the system or user prompt, and pressing the Run button on the top right.
Let's start by comparing a few different prompt variations. Add two additional prompts using the +Prompt button, and update the system and user prompts like so:
System prompt #1:
System prompt #2:
System prompt #3:
User prompt (use this for all three):
Your playground should look something like this:
Let's run it and compare results:
Your prompt will be saved in the Prompts tab:
Now you're ready to see how that prompt performs over a larger dataset of examples.
Next, create a new dataset from the Datasets tab in Phoenix, and specify the input and output columns like so:
Now we can return to Prompt Playground, and this time choose our new dataset from the "Test over dataset" dropdown.
You can also load in your saved Prompt:
We'll also need to update our prompt to look for the {{input_article}} column in our dataset. After adding this in, be sure to save your prompt once more!
Now if we run our prompt(s), each row of the dataset will be run through each variation of our prompt.
And if you return to view your dataset, you'll see the details of that run saved as an experiment.
You can now easily modify your prompt or compare different versions side-by-side. Let's say you've found a stronger version of the prompt. Save your updated prompt once again, and you'll see it added as a new version under your existing prompt:
You can also tag which version you've deemed ready for production, and view code to access your prompt in code further down the page.
Now you're ready to create, test, save, and iterate on your Prompts in Phoenix! Check out our other quickstarts to see how to use Prompts in code.
General guidelines on how to use Phoenix's prompt playground
If successful you should see the LLM output stream out in the Output section of the UI.
Every prompt instance can be configured to use a specific LLM and set of invocation parameters. Click on the model configuration button at the top of the prompt editor and configure your LLM of choice. Click on the "save as default" option to make your configuration sticky across playground sessions.
The Prompt Playground offers the capability to compare multiple prompt variants directly within the playground. Simply click the + Compare button at the top of the first prompt to create duplicate instances. Each prompt variant manages its own independent template, model, and parameters. This allows you to quickly compare prompts (labeled A, B, C, and D in the UI) and run experiments to determine which prompt and model configuration is optimal for the given task.
All invocations of an LLM via the playground are recorded for analysis, annotations, evaluations, and dataset curation.
If you simply run an LLM in the playground using the free-form inputs (i.e. not using a dataset), your spans will be recorded in a project aptly titled "playground".
If however you run a prompt over dataset examples, the outputs and spans from your playground runs will be captured as an experiment. Each experiment will be named according to the prompt you ran the experiment over.
We are continually iterating our templates, view the most up-to-date template .
(llm_classify)
(llm_generate)
We are continually iterating our templates, view the most up-to-date template .
For the Retrieval-Augmented Generation (RAG) use case, see the section.
See for the Retrieval-Augmented Generation (RAG) use case where relevant documents are retrieved for the question before constructing the context for the LLM.
In information retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding; then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. Corpus inferences can be imported into Phoenix as shown below.
The launcher accepts the corpus dataset through the corpus= parameter.
Check out our to get started. Look at our to better understand how to troubleshoot and evaluate different kinds of retrieval systems. For a high level overview on evaluation, check out our .
In Retrieval-Augmented Generation (RAG), the retrieval step returns from a (proprietary) knowledge base (a.k.a. ) a list of documents relevant to the user query, then the generation step adds the retrieved documents to the prompt context to improve response accuracy of the Large Language Model (LLM). The IDs of the retrieval documents along with the relevance scores, if present, can be imported into Phoenix as follows.
Below shows only the relevant subsection of the dataframe. The retrieved_document_ids should match the ids in the corpus data. Note that for each row, the list under the relevance_scores column has the same length as the one under the retrievals column, but it's not necessary for all retrieval lists to have the same length.
Phoenix is an open-source observability tool designed for experimentation, evaluation, and troubleshooting of AI and LLM applications. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve. Phoenix is built by Arize AI, the company behind the industry-leading AI observability platform, and a set of core contributors.
Phoenix works with OpenTelemetry and instrumentation. See for details.
Phoenix offers tools to streamline your prompt engineering workflow.
- Create, store, modify, and deploy prompts for interacting with LLMs
- Play with prompts, models, invocation parameters and track your progress via tracing and experiments
- Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
- Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.
Use to mark functions and code blocks.
Use to capture all calls made to supported frameworks.
Use instrumentation. Supported in and , among many other languages.
Sign up for an Arize Phoenix account at
Grab your API key from the Keys option on the left bar.
In your code, set your endpoint and API key:
Having trouble finding your endpoint? Check out
Run Phoenix using Docker, local terminal, Kubernetes etc. For more information, .
In your code, set your endpoint:
Having trouble finding your endpoint? Check out
Phoenix can also capture all calls made to supported libraries automatically. Just install the :
Explore tracing
View use cases to see
Using OpenInference auto-instrumentors. If you've used the auto_instrument flag above, then any instrumentor packages in your environment will be called automatically. For a full list of OpenInference packages, see
Using .
Using .
Instrumenting prompt templates and variables allows you to track and visualize prompt changes. These can also be combined with to measure the performance changes driven by each of your prompts.
We provide a using_prompt_template context manager to add a prompt template (including its version and variables) to the current OpenTelemetry Context. OpenInference will read this Context and pass the prompt template fields as span attributes, following the OpenInference semantic conventions. Its inputs must be of the following types:
Template: non-empty string.
Version: non-empty string.
Variables: a dictionary with string keys. This dictionary will be serialized to JSON when saved to the OTEL Context and remain a JSON string when sent as a span attribute.
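A minimal sketch of the context manager (the template and variables are placeholders):

```python
from openinference.instrumentation import using_prompt_template

prompt_template = "Please describe the weather forecast for {city} on {date}."
prompt_template_variables = {"city": "Johannesburg", "date": "July 11"}

with using_prompt_template(
    template=prompt_template,
    variables=prompt_template_variables,
    version="v1.0",
):
    # Spans created here carry the template, version, and variables as attributes.
    ...
```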
It can also be used as a decorator:
template - a string with templated variables ex. "hello {{name}}"
variables - an object with variable names and their values ex. {name: "world"}
version - a string version of the template ex. v1.0
All of these are optional. Application of variables to a template will typically happen before the call to an LLM and may not be picked up by auto-instrumentation, so adding this can help ensure you can see the templates and variables while troubleshooting.
It looks like the second option is doing the most concise summary. Go ahead and .
Prompt playground can be used to run a series of dataset rows through your prompts. To start off, we'll need a dataset. Phoenix has many options for creating datasets; to keep things simple here, we'll directly upload a CSV. Download the article summaries file linked below:
From here, you could to test its performance, or add complexity to your prompts by including different tools, output schemas, and models to test against.
To get started, you will first need to navigate to the Prompt Playground. In the playground view, create a valid prompt for the LLM and click Run on the top right (or use the keyboard shortcut mod + enter).
The prompt editor (typically on the left side of the screen) is where you define the prompt template. You select the template language (mustache or f-string) on the toolbar. Whenever you type a variable placeholder in the prompt (say {{question}} for mustache), the variable to fill will show up in the inputs section. Input variables must either be filled in by hand or can be filled in via a dataset (where each row has key/value pairs for the input).
Phoenix lets you run a prompt (or multiple prompts) on a dataset. Simply load a dataset containing the input variables you want to use in your prompt template. When you click Run, Phoenix will apply each configured prompt to every example in the dataset, invoking the LLM for all possible prompt-example combinations. The result of your playground runs will be tracked as an experiment under the loaded dataset (see )
Tracing is a helpful tool for understanding how your LLM application works. Phoenix's open-source library offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework.
Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks, SDKs, and languages (Python, JavaScript, etc.).
Phoenix is built to help you evaluate your LLM applications and understand their true performance. To accomplish this, Phoenix includes:
A standalone library to run LLM-based evaluations on your own datasets. This can be used either with the Phoenix library, or independently over your own data.
into the Phoenix dashboard. Phoenix is built to be agnostic, and so these evals can be generated using Phoenix's library, or an external library like , , or .
to attach human ground truth labels to your data in Phoenix.
let you test different versions of your application, store relevant traces for evaluation and analysis, and build robust evaluations into your development process.
to test and compare different iterations of your application
, or directly upload Datasets from code / CSV
, export them in fine-tuning format, or attach them to an Experiment.
We provide a setPromptTemplate function which allows you to set a template, version, and variables on context. You can use this utility in conjunction with to set the active context. OpenInference will then pick up these attributes and add them to any spans created within the context.with callback. The components of a prompt template are: