Not sure where to start? Try a quickstart:
Tracing is a critical part of AI Observability and should be used both in production and development
Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.
This section contains details on Tracing features:
Learn more about options.
Learn how to use the phoenix.otel library.
Learn how you can use basic OpenTelemetry to instrument your application.
Learn how to use Phoenix's decorators to easily instrument specific methods or code blocks in your application.
Setup tracing for your TypeScript application.
Learn about Projects in Phoenix, and how to use them.
Understand Sessions and how they can be used to group user conversations.
Track and analyze multi-turn conversations
Sessions enable tracking and organizing related traces across multi-turn conversations with your AI application. When building conversational AI, maintaining context between interactions is critical - Sessions make this possible from an observability perspective.
With Sessions in Phoenix, you can:
Track the entire history of a conversation in a single thread
View conversations in a chatbot-like UI showing inputs and outputs of each turn
Search through sessions to find specific interactions
Track token usage and latency per conversation
This feature is particularly valuable for applications where context builds over time, like chatbots, virtual assistants, or any other multi-turn interaction. By tagging spans with a consistent session ID, you create a connected view that reveals how your application performs across an entire user journey.
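A minimal sketch of this tagging, assuming the OpenInference instrumentation package is installed; the session id below is a hypothetical example:

```python
from openinference.instrumentation import using_session

# Spans created inside this block share the same session id, so Phoenix
# can group them into a single conversation thread.
with using_session(session_id="conversation-42"):
    ...  # call your LLM or chain here
```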
Check out how to Setup Sessions
Use projects to organize your LLM traces
Projects provide organizational structure for your AI applications, allowing you to logically separate your observability data. This separation is essential for maintaining clarity and focus.
With Projects, you can:
Segregate traces by environment (development, staging, production)
Isolate different applications or use cases
Track separate experiments without cross-contamination
Maintain dedicated evaluation spaces for specific initiatives
Create team-specific workspaces for collaborative analysis
Projects act as containers that keep related traces and conversations together while preventing them from interfering with unrelated work. This organization becomes increasingly valuable as you scale - allowing you to easily switch between contexts without losing your place or mixing data.
The Project structure also enables comparative analysis across different implementations, models, or time periods. You can run parallel versions of your application in separate projects, then analyze the differences to identify improvements or regressions.
AI Observability and Evaluation
Running Phoenix for the first time? Select a quickstart below.
Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.
Add instrumentation for popular packages and libraries such as OpenAI, LangGraph, Vercel AI SDK and more.
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
The Phoenix app can be run in various environments such as Colab and SageMaker notebooks, as well as be served via the terminal or a Docker container.
If you're using Phoenix Cloud, be sure to set the proper environment variables to connect to your instance:
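For example, the connection is typically configured through environment variables like these; the endpoint and API key below are placeholders:

```python
import os

# Placeholders; substitute your own Phoenix Cloud endpoint and API key.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=<your-api-key>"
```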
To start Phoenix in a notebook environment, run:
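A minimal sketch, assuming the arize-phoenix package is installed:

```python
import phoenix as px

# Launches a local Phoenix server and returns a session handle.
session = px.launch_app()
```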
This will start a local Phoenix server. You can initialize the Phoenix server with various kinds of data (traces, inferences).
If you want to start a Phoenix server to collect traces, you can also run Phoenix directly from the command line:
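For example, with the arize-phoenix package installed, the server can typically be started with:

```
phoenix serve
```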
This will start the Phoenix server on port 6006. If you are running your instrumented notebook or application on the same machine, traces should automatically be exported to http://127.0.0.1:6006, so no additional configuration is needed. However, if the server is running remotely, you will have to set the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to that machine (e.g. http://<my-remote-machine>:<port>).
Tracing the execution of LLM applications using Telemetry
Phoenix traces AI applications via OpenTelemetry and has first-class integrations with LlamaIndex, LangChain, OpenAI, and others.
LLM tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation to provide a detailed timeline of the request's execution.
Using Phoenix's tracing capabilities can provide important insights into the inner workings of your LLM application. By analyzing the collected trace data, you can identify and address various performance and operational issues and improve the overall reliability and efficiency of your system.
Application Latency: Identify and address slow invocations of LLMs, Retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
Token Usage: Gain a detailed breakdown of token usage for your LLM calls, allowing you to identify and optimize the most expensive LLM invocations.
Runtime Exceptions: Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
Retrieved Documents: Inspect the documents retrieved during a Retriever call, including the score and order in which they were returned to provide insight into the retrieval process.
Embeddings: Examine the embedding text used for retrieval and the underlying embedding model to allow you to validate and refine your embedding strategies.
LLM Parameters: Inspect the parameters used when calling an LLM, such as temperature and system prompts, to ensure optimal configuration and debugging.
Prompt Templates: Understand the prompt templates used during the prompting step and the variables that were applied, allowing you to fine-tune and improve your prompting strategies.
Tool Descriptions: View the descriptions and function signatures of the tools your LLM has been given access to in order to better understand and control your LLM’s capabilities.
LLM Function Calls: For LLMs with function call capabilities (e.g., OpenAI), you can inspect the function selection and function messages in the input to the LLM, further improving your ability to debug and optimize your application.
By using tracing in Phoenix, you can gain increased visibility into your LLM application, empowering you to identify and address performance bottlenecks, optimize resource utilization, and ensure the overall reliability and effectiveness of your system.
In order to improve your LLM application iteratively, it's vital to collect feedback, annotate data during human review, and establish an evaluation pipeline so that you can monitor your application. In Phoenix, we capture this type of feedback in the form of annotations.
Phoenix gives you the ability to annotate traces with feedback from the UI, your application, or wherever you would like to perform evaluation. Phoenix's annotation model is simple yet powerful: given an entity such as a span that is collected, you can assign a label and/or a score to that entity.
Navigate to the Feedback tab in this demo trace to see how LLM-based evaluations appear in Phoenix:
Configure Annotation Configs to guide human annotations.
Learn how to log annotations via the client from your app or in a notebook
Phoenix is an open-source observability tool designed for experimentation, evaluation, and troubleshooting of AI and LLM applications. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve. Phoenix is built by Arize AI, the company behind the industry-leading AI observability platform, and a set of core contributors.
Phoenix works with OpenTelemetry and OpenInference instrumentation. See Integrations: Tracing for details.
Phoenix offers tools to streamline your prompt engineering workflow.
- Create, store, modify, and deploy prompts for interacting with LLMs
- Play with prompts, models, invocation parameters and track your progress via tracing and experiments
- Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
- Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.
Use decorators to mark functions and code blocks.
Use automatic instrumentation to capture all calls made to supported frameworks.
Use base OpenTelemetry instrumentation. Supported in Python and TypeScript.
If you are set up, see to start using Phoenix in your preferred environment.
Phoenix Cloud provides free-to-use Phoenix instances that are preconfigured for you with 10 GB of storage space. Phoenix Cloud instances are a great starting point; however, if you need more storage or more control over your instance, self-hosting options could be a better fit.
See .
Tracing is a helpful tool for understanding how your LLM application works. Phoenix offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework. Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks, SDKs, and languages (Python, JavaScript, etc.).
To get started, check out the .
Read more about and
Check out the for specific tutorials.
Learn more about the concepts
How to run
Tracing is a helpful tool for understanding how your LLM application works. Phoenix's open-source library offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework.
Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks, SDKs, and languages (Python, JavaScript, etc.).
Phoenix is built to help you evaluate your LLM applications and understand their true performance. To accomplish this, Phoenix includes:
A standalone library to run evaluations on your own datasets. This can be used either with the Phoenix library, or independently over your own data.
The ability to log evaluation results into the Phoenix dashboard. Phoenix is built to be agnostic, so these evals can be generated using Phoenix's library or an external library.
The ability to attach human ground truth labels to your data in Phoenix.
Experiments let you test different versions of your application, store relevant traces for evaluation and analysis, and build robust evaluations into your development process.
to test and compare different iterations of your application
, or directly upload Datasets from code / CSV
, export them in fine-tuning format, or attach them to an Experiment.
Tracing can be augmented and customized by adding Metadata. Metadata includes your own custom attributes, user ids, session ids, prompt templates, and more.
Add Attributes, Metadata, Users
Learn how to add custom metadata and attributes to your traces
Instrument Prompt Templates and Prompt Variables
Learn how to define custom prompt templates and variables in your tracing.
Learn how to load a file of traces into Phoenix
Learn how to export trace data from Phoenix
Tracing can be paused temporarily or disabled permanently.
If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.
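A minimal sketch, assuming the suppress_tracing helper from the Phoenix library; chunk_documents is a hypothetical function standing in for the code you don't want traced:

```python
from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Spans are not created or exported for anything inside this block.
    chunks = chunk_documents(documents)  # hypothetical chunking step
```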
Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.
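A sketch of what this might look like, assuming the corresponding OpenInference instrumentors are installed:

```python
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor

# Permanently remove tracing from each instrumented library.
LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
```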
Version and track changes made to prompt templates
Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse, consistency, and experiment with variations across different models and inputs.
Key benefits of prompt management include:
Reusability: Store and load prompts across different use cases.
Versioning: Track changes over time to ensure that the best performing version is deployed for use in your application.
Collaboration: Share prompts with others to maintain consistency and facilitate iteration.
To learn how to get started with prompt management, see Create a prompt
Replay LLM spans traced in your application directly in the playground
Have you ever wanted to go back into a multi-step LLM chain and replay one step to see if you could get a better outcome? You can with Phoenix's Span Replay. LLM spans that are stored within Phoenix can be loaded into the Prompt Playground and replayed. Replaying spans inside the Playground enables you to debug and improve the performance of your LLM systems by comparing LLM provider outputs, tweaking model parameters, changing prompt text, and more.
Chat completions generated inside of Playground are automatically instrumented, and the recorded spans are immediately available to be replayed inside of Playground.
Want to just use the contents of your dataset in another context? Simply click on the export to CSV button on the dataset page and you are good to go!
Fine-tuning lets you get more out of the models available by providing:
Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests. Phoenix natively exports OpenAI Fine-Tuning JSONL as long as the dataset contains compatible inputs and outputs.
Phoenix supports three main options to collect traces:
This example uses options 1 and 2.
To collect traces from your application, you must configure an OpenTelemetry TracerProvider to send traces to Phoenix.
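A minimal sketch using phoenix.otel; the project name is an arbitrary example, and the auto_instrument flag is available in recent versions of the library:

```python
from phoenix.otel import register

tracer_provider = register(
    project_name="my-llm-app",  # example project name
    auto_instrument=True,       # turn on any installed OpenInference instrumentors
)
tracer = tracer_provider.get_tracer(__name__)
```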
Functions can be traced using decorators:
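A sketch using the tracer from the previous snippet; my_func is just an illustrative function:

```python
@tracer.chain
def my_func(input: str) -> str:
    # The span's input/output attributes are populated from the call.
    return "output"
```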
Input and output attributes are set automatically based on my_func's parameters and return value.
OpenInference libraries must be installed before calling the register function
You should now see traces in Phoenix!
Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:
Track performance over time
Identify areas for improvement
Compare different model versions or prompts
Gather data for fine-tuning or retraining
Provide stakeholders with concrete metrics on system effectiveness
Phoenix allows you to annotate traces through the Client, the REST API, or the UI.
How to annotate traces in the UI for analysis and dataset curation
To annotate data in the UI, you will first want to set up a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create various types of annotations: Categorical, Continuous, and Freeform.
Once you have annotations configured, you can associate annotations with the data that you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application.
You can also take notes as you go by either clicking on the explain link or by adding your notes to the bottom messages UI.
You can always come back and edit or delete your annotations. Annotations can be deleted from the table view under the Annotations tab.
Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.
As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. whether the annotation was performed by a human, LLM, or code), and so on. This can be particularly useful if you want to see whether different annotators disagree.
Once you have collected feedback in the form of annotations, you can filter your traces by the annotation values to narrow down to interesting samples (e.g. LLM spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for things like experimentation, fine-tuning, or building a human-aligned eval.
Before accessing px.Client(), be sure you've set the following environment variables:
If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.
You can also launch a temporary version of Phoenix in your local notebook to quickly view the traces. But be warned: this Phoenix instance will only last as long as your notebook environment is running.
Span annotations can be an extremely valuable basis for improving your application. The Phoenix client provides useful ways to pull down spans and their associated annotations. This information can be used to:
build new LLM judges
form the basis for new datasets
help identify ideas for improving your application
If you only want the spans that contain a specific annotation, you can pass in a query that filters on annotation names, scores, or labels.
The queries can also filter by annotation scores and labels.
This spans dataframe can be used to pull associated annotations.
Instead of an input dataframe, you can also pass in a list of ids:
The annotations and spans dataframes can be easily joined to produce a one-row-per-annotation dataframe that can be used to analyze the annotations!
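For instance, a sketch with pandas, assuming you've already pulled `spans_df` (indexed by span id) and `annotations_df` (with a span_id column) via the Phoenix client; the column names used below are assumptions:

```python
# One row per annotation, enriched with the columns of its span.
analysis_df = annotations_df.join(spans_df, on="span_id", rsuffix="_span")

# Example: average score per annotation name (column names are assumptions).
print(analysis_df.groupby("annotation_name")["score"].mean())
```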
Learn how to block PII from logging to Phoenix
Learn how to selectively block or turn off tracing
Learn how to send only certain spans to Phoenix
Learn how to trace images
Sometimes while instrumenting your application, you may want to filter out or modify certain spans before they are sent to Phoenix. For example, you may want to filter out spans that contain sensitive or redundant information.
To do this, you can use a custom SpanProcessor and attach it to the OpenTelemetry TracerProvider.
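For illustration, a sketch of a filtering processor attached to a provider that exports to a local Phoenix collector; the span name being dropped and the endpoint are assumptions:

```python
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import ReadableSpan, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class RedactingSpanProcessor(BatchSpanProcessor):
    """Drops spans whose name marks them as sensitive (hypothetical name)."""

    def on_end(self, span: ReadableSpan) -> None:
        if span.name == "sensitive_operation":
            return  # never export this span
        super().on_end(span)


tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    RedactingSpanProcessor(OTLPSpanExporter("http://127.0.0.1:6006/v1/traces"))
)
```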
Pull and push prompt changes via Phoenix's Python and TypeScript Clients
Using Phoenix as a backend, Prompts can be managed and manipulated via code by using our Python or TypeScript SDKs.
With the Phoenix Client SDK you can:
Testing your prompts before you ship them is vital to deploying reliable AI applications
The Playground is a fast and efficient way to refine prompt variations. You can load previous prompts and validate their performance by applying different variables.
Each single-run test in the Playground is recorded as a span in the Playground project, allowing you to revisit and analyze LLM invocations later. These spans can be added to datasets or reloaded for further testing.
The ideal way to test a prompt is to construct a golden dataset where each dataset example contains the variables to be applied to the prompt in the inputs, and the outputs contain the ideal answer you want from the LLM. This way you can run a given prompt over N examples all at once and compare the synthesized answers against the golden answers.
Prompt Playground supports side-by-side comparisons of multiple prompt variants. Click + Compare to add a new variant. Whether using Span Replay or testing prompts over a Dataset, the Playground processes inputs through each variant and displays the results for easy comparison.
The velocity of AI application development is bottlenecked by quality evaluations because AI engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost. High quality evaluations are critical as they can help developers answer these types of questions with greater confidence.
Datasets are integral to evaluation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are used to run experiments and evaluations to track improvements to your prompt, LLM, or other parts of your LLM application.
In AI development, it's hard to understand how a change will affect performance. This breaks the dev flow, making iteration more guesswork than engineering.
Experiments and evaluations solve this, helping distill the indeterminism of LLMs into tangible feedback that helps you ship a more reliable product.
Specifically, good evals help you:
Understand whether an update is an improvement or a regression
Drill down into good / bad examples
Compare specific examples vs. prior runs
Avoid guesswork
Datasets are critical assets for building robust prompts, evals, fine-tuning, and more.
Phoenix Evals come with:
Speed - Phoenix evals are designed for maximum speed and throughput. Evals run in batches and typically run 10x faster than calling the APIs directly.
Run evaluations via a job to visualize in the UI as traces stream in.
Evaluate traces captured in Phoenix and export results to the Phoenix UI.
Evaluate tasks with multiple inputs/outputs (ex: text, audio, image) using versatile evaluation tasks.
Teams that are using conversation bots and assistants want to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back-and-forth or an entire span to detect whether a user has become frustrated by the conversation.
The following code snippet shows how to use the eval template above:
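A sketch of how this might look with phoenix.evals, assuming the user-frustration template and rails exported by the library, an OpenAI judge model, and a dataframe `df` whose columns match the template variables:

```python
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.evals.default_templates import (
    USER_FRUSTRATION_PROMPT_RAILS_MAP,
    USER_FRUSTRATION_PROMPT_TEMPLATE,
)

# `df` is assumed to hold the conversation inputs/outputs to evaluate.
model = OpenAIModel(model="gpt-4o")  # example judge model
rails = list(USER_FRUSTRATION_PROMPT_RAILS_MAP.values())

frustration_evals = llm_classify(
    dataframe=df,
    template=USER_FRUSTRATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # ask the judge to explain its label
)
```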
When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway they took to get there. You want most of your runs to be consistent and not take unnecessary, frivolous, or wrong actions.
One way of doing this is to calculate convergence:
Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score: avg(minimum steps taken / steps taken for this run)
This will give a convergence score of 0-1, with 1 being a perfect score.
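A minimal sketch of that calculation in Python, with made-up step counts:

```python
# Hypothetical step counts for a set of similar queries.
steps_per_run = [3, 5, 3, 4, 6]

optimal = min(steps_per_run)
convergence = sum(optimal / steps for steps in steps_per_run) / len(steps_per_run)
print(f"convergence score: {convergence:.2f}")  # 1.0 means every run was optimal
```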
The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.
The following is the structure of the EMOTION_PROMPT_TEMPLATE:
The prompt and evaluation logic are part of the phoenix.evals.default_audio_templates module and are defined as:
EMOTION_AUDIO_RAILS: output options for the evaluation template.
EMOTION_PROMPT_TEMPLATE: prompt used for evaluating audio emotions.
This example:
Continuously queries a LangChain application to send new traces and spans to your Phoenix session
Queries new spans once per minute and runs evals, including:
Hallucination
Q&A Correctness
Relevance
Logs evaluations back to Phoenix so they appear in the UI
The evaluation script is run as a cron job, enabling you to adjust the frequency of the evaluation job:
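For instance, a hypothetical crontab entry that runs the script once per minute (the interpreter and script path are placeholders):

```
* * * * * /usr/bin/env python /path/to/run_evals.py
```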
The above script can be run periodically to augment Evals in Phoenix.
Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. OpenAI Evals offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals. Phoenix can natively export the OpenAI Evals format as JSONL so you can use it with OpenAI Evals. See for details.
Use decorators to mark functions and code blocks.
Use automatic instrumentation to capture all calls made to supported frameworks.
Use base OpenTelemetry instrumentation. Supported in Python and TypeScript, among many other languages.
Sign up for an Arize Phoenix account at
Grab your API key from the Keys option on the left bar.
In your code, set your endpoint and API key:
Having trouble finding your endpoint? Check out
Run Phoenix using Docker, local terminal, Kubernetes etc. For more information, .
In your code, set your endpoint:
Having trouble finding your endpoint? Check out
Phoenix can also capture all calls made to supported libraries automatically. Just install the respective OpenInference instrumentation package:
Explore tracing
View use cases to see
To learn how to configure annotations and to annotate through the UI, see
To learn how to add human labels to your traces, either manually or programmatically, see
To learn how to evaluate traces captured in Phoenix, see
To learn how to upload your own evaluation labels into Phoenix, see
For more background on the concept of annotations, see
Phoenix supports loading data that contains . This allows you to load an existing dataframe of traces into your Phoenix instance.
Usually these will be traces you've previously saved using .
In this example, we're filtering out any spans that have the name "secret_span" by bypassing the on_start and on_end hooks of the inherited BatchSpanProcessor.
Notice that this logic can be extended to modify a span and redact sensitive information if preserving the span is preferred.
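A sketch of what that processor might look like; the exporter wiring is omitted, and "secret_span" matches the description above:

```python
from typing import Optional

from opentelemetry.context import Context
from opentelemetry.sdk.trace import ReadableSpan, Span
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class FilteringSpanProcessor(BatchSpanProcessor):
    def _filter_condition(self, span_name: str) -> bool:
        return span_name == "secret_span"

    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        if self._filter_condition(span.name):
            return  # bypass the parent hook so the span is never tracked
        super().on_start(span, parent_context)

    def on_end(self, span: ReadableSpan) -> None:
        if self._filter_condition(span.name):
            return  # bypass the parent hook so the span is never exported
        super().on_end(span)
```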
Phoenix's Prompt Playground makes the process of iterating on and testing prompts quick and easy. The playground supports a wide range of model providers (OpenAI, Anthropic, Gemini, Azure) as well as custom model endpoints, making it the ideal prompt IDE for you to build, experiment with, and evaluate prompts and models for your task.
Speed: Rapidly test variations in the prompt, model, invocation parameters, tools, and output format.
Reproducibility: All runs of the playground are recorded as traces, unlocking annotations and evaluation.
Datasets: Use datasets as fixtures to run a prompt variant through its paces and to evaluate it systematically.
Prompt Management: Create, edit, and save prompts directly within the playground.
To learn more on how to use the playground, see
prompts dynamically
templates by name, version, or tag
templates with runtime variables and use them in your code. Native support for OpenAI, Anthropic, Gemini, Vercel AI SDK, and more. No proprietary client necessary.
Support for and execution of tools defined within the prompt. Phoenix prompts encompass more than just the text and messages.
To learn more about managing Prompts in code, see
Prompts in Phoenix can be created, iterated on, versioned, tagged, and used either via the UI or our Python/TS SDKs. The UI option also includes our Prompt Playground, which allows you to compare prompt variations side-by-side in the Phoenix UI.
- how to configure API keys for OpenAI, Anthropic, Gemini, and more.
- how to create, update, and track prompt changes
- how to test changes to a prompt in the playground and in the notebook
- how to mark certain prompt versions as ready for
- how to integrate prompts into your code and experiments
- how to setup the playground and how to test prompt changes via datasets and experiments.
Playground integrates with to help you iterate and incrementally improve your prompts. Experiment runs are automatically recorded and available for subsequent evaluation to help you understand how changes to your prompts, LLM model, or invocation parameters affect performance.
Sometimes you may want to test a prompt and run evaluations on a given prompt. This can be particularly useful when custom manipulation is needed (e.g. you are trying to iterate on a system prompt with a variety of different chat messages). This tutorial is coming soon.
- how to quickly download a dataset to use elsewhere
- want to fine-tune an LLM for better accuracy and cost? Export LLM examples for fine-tuning.
- have some good examples to use for benchmarking LLMs with OpenAI Evals? Export to OpenAI Evals format.
The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using an LLM. The Phoenix evals library is designed for simple, fast, and accurate LLM-based evaluations.
Pre-built evals - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling. Each eval is pre-tested on a variety of eval models. Find the most up-to-date templates in the Phoenix documentation.
Run evals on your own data - the Phoenix evals library takes a dataframe as its primary input and output, making it easy to run evaluations on your own data, whether that's logs, traces, or datasets downloaded for benchmarking.
Built-in Explanations - all Phoenix evaluations include an explanation option that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your eval.
- Phoenix lets you configure which foundation model you'd like to use as a judge. This includes OpenAI, Anthropic, Gemini, and much more.
(llm_classify)
(llm_generate)
We are continually iterating on our templates; view the most up-to-date templates in the Phoenix documentation.
You can use cron to run evals client-side as your traces and spans are generated, augmenting your dataset with evaluations in an online manner. View the .
Phoenix is a comprehensive platform designed to enable observability across every layer of an LLM-based system, empowering teams to build, optimize, and maintain high-quality applications and agents efficiently.
During the development phase, Phoenix offers essential tools for debugging, experimentation, evaluation, prompt tracking, and search and retrieval.
Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.
Leverage experiments to measure prompt and model performance. Typically during this early stage, you'll focus on gathering a robust set of test cases and evaluation metrics to test initial iterations of your application. Experiments at this stage may resemble unit tests, as they're geared towards ensuring your application performs correctly.
Either as a part of experiments or a standalone feature, evaluations help you understand how your app is performing at a granular level. Typical evaluations might be correctness evals compared against a ground truth data set, or LLM-as-a-judge evals to detect hallucinations or relevant RAG output.
Prompt engineering is critical to how a model behaves. While there are other methods such as fine-tuning to change behavior, prompt engineering is the simplest way to get started and often has the best ROI.
Instrument prompt and prompt variable collection to associate iterations of your app with the performance measured through evals and experiments. Phoenix tracks prompt templates, variables, and versions during execution to help you identify improvements and degradations.
Phoenix's search and retrieval optimization tools include an embeddings visualizer that helps teams understand how their data is being represented and clustered. This visual insight can guide decisions on indexing strategies, similarity measures, and data organization to improve the relevance and efficiency of search results.
In the testing and staging environment, Phoenix supports comprehensive evaluation, benchmarking, and data curation. Traces, experimentation, prompt tracking, and embedding visualizer remain important in the testing and staging phase, helping teams identify and resolve issues before deployment.
With a stable set of test cases and evaluations defined, you can now easily iterate on your application and view performance changes in Phoenix right away. Swap out models, prompts, or pipeline logic, and run your experiment to immediately see the impact on performance.
Phoenix's flexible evaluation framework supports thorough testing of LLM outputs. Teams can define custom metrics, collect user feedback, and leverage separate LLMs for automated assessment. Phoenix offers tools for analyzing evaluation results, identifying trends, and tracking improvements over time.
Phoenix assists in curating high-quality data for testing and fine-tuning. It provides tools for data exploration, cleaning, and labeling, enabling teams to curate representative data that covers a wide range of use cases and edge conditions.
Add guardrails to your application to prevent malicious and erroneous inputs and outputs. Guardrails will be visualized in Phoenix, and can be attached to spans and traces in the same fashion as evaluation metrics.
In production, Phoenix works hand-in-hand with Arize, which focuses on the production side of the LLM lifecycle. The integration ensures a smooth transition from development to production, with consistent tooling and metrics across both platforms.
Phoenix and Arize use the same collector frameworks in development and production. This allows teams to monitor latency, token usage, and other performance metrics, setting up alerts when thresholds are exceeded.
Phoenix's evaluation framework can be used to generate ongoing assessments of LLM performance in production. Arize complements this with online evaluations, enabling teams to set up alerts if evaluation metrics, such as hallucination rates, go beyond acceptable thresholds.
Phoenix and Arize together help teams identify data points for fine-tuning based on production performance and user feedback. This targeted approach ensures that fine-tuning efforts are directed towards the most impactful areas, maximizing the return on investment.
Phoenix, in collaboration with Arize, empowers teams to build, optimize, and maintain high-quality LLM applications throughout the entire lifecycle. By providing a comprehensive observability platform and seamless integration with production monitoring tools, Phoenix and Arize enable teams to deliver exceptional LLM-driven experiences with confidence and efficiency.
Guides on how to use traces
How to set custom attributes and semantic attributes to child spans and spans created by auto-instrumentors.
Create and customize spans for your use-case
How to query spans to construct DataFrames for evaluation
How to log evaluation results to annotate traces with evals
Phoenix uses projects to group traces. If left unspecified, all traces are sent to a default project.
Projects work by setting the Resource attributes (as seen in the OTEL example above). The Phoenix server uses the project name attribute to group traces into the appropriate project.
Typically you want traces for an LLM app to all be grouped in one project. However, while working with Phoenix inside a notebook, we provide a utility to temporarily associate spans with different projects. You can use this to trace things like evaluations.
Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse, consistency, and experiment with variations across different models and inputs.
Unlike traditional software, AI applications are non-deterministic and depend on natural language to provide context and guide model output. The pieces of natural language and associated model parameters embedded in your program are known as “prompts.”
Optimizing your prompts is typically the highest-leverage way to improve the behavior of your application, but “prompt engineering” comes with its own set of challenges. You want to be confident that changes to your prompts have the intended effect and don’t introduce regressions.
To get started, jump to Quickstart: Prompts.
Phoenix offers a comprehensive suite of features to streamline your prompt engineering workflow.
General guidelines on how to use Phoenix's prompt playground
To get started, you will first Configure AI Providers. In the playground view, create a valid prompt for the LLM and click Run in the top right (or use mod + enter).
If successful you should see the LLM output stream out in the Output section of the UI.
Every prompt instance can be configured to use a specific LLM and set of invocation parameters. Click on the model configuration button at the top of the prompt editor and configure your LLM of choice. Click on the "save as default" option to make your configuration sticky across playground sessions.
The Prompt Playground offers the capability to compare multiple prompt variants directly within the playground. Simply click the + Compare button at the top of the first prompt to create duplicate instances. Each prompt variant manages its own independent template, model, and parameters. This allows you to quickly compare prompts (labeled A, B, C, and D in the UI) and run experiments to determine which prompt and model configuration is optimal for the given task.
All invocations of an LLM via the playground are recorded for analysis, annotations, evaluations, and dataset curation.
If you simply run an LLM in the playground using the free-form inputs (i.e. not using a dataset), your spans will be recorded in a project aptly titled "playground".
If however you run a prompt over dataset examples, the outputs and spans from your playground runs will be captured as an experiment. Each experiment will be named according to the prompt you ran the experiment over.
Many LLM applications use a technique called Retrieval Augmented Generation. These applications retrieve data from their knowledge base to help the LLM accomplish tasks with the appropriate context.
However, these retrieval systems can still hallucinate or provide answers that are not relevant to the user's input query. We can evaluate retrieval systems by checking for:
Are there certain types of questions the chatbot gets wrong more often?
Are the documents that the system retrieves irrelevant? Do we have the right documents to answer the question?
Does the response match the provided documents?
Phoenix supports retrieval troubleshooting and evaluation on both traces and inferences, but inferences are currently required to visualize your retrievals using a UMAP. See the differences below.
|  | Traces & Spans | Inferences |
| --- | --- | --- |
| Troubleshooting for LLM applications | ✅ | ✅ |
| Follow the entirety of an LLM workflow | ✅ | 🚫 support for spans only |
| Embeddings Visualizer | 🚧 on the roadmap | ✅ |
SQL generation is a common way to use an LLM. In many cases the goal is to take a human description of a query and generate SQL that matches that description.
Example of a Question: How many artists have names longer than 10 characters?
Example Query Generated:
SELECT COUNT(ArtistId)
FROM artists
WHERE LENGTH(Name) > 10
The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.
How to create Phoenix inferences and schemas for the corpus data
Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
Below is an appropriate schema for the dataframe above. It specifies the id column and indicates that the embedding vector belongs to the text column. Other columns, if they exist, will be detected automatically and need not be specified by the schema.
Define the inferences by pairing the dataframe with the schema.
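A sketch of what this might look like; the dataframe name (corpus_df) and its column names (id, text, embedding) are assumptions matching the example above:

```python
import phoenix as px

# Schema for the corpus: the embedding vector is paired with the raw text it encodes.
corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

# Pair the dataframe with the schema to define the corpus inferences.
corpus_inferences = px.Inferences(dataframe=corpus_df, schema=corpus_schema, name="corpus")
```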
Prompt Playground
Setup Tracing in Python or TypeScript
Add Integrations via
your application
Phoenix natively works with a variety of frameworks and SDKs across Python and JavaScript via OpenTelemetry auto-instrumentation. Phoenix can also be natively integrated with AI platforms.
In the notebook, you can set the PHOENIX_PROJECT_NAME environment variable before adding instrumentation or running any of your code. In Python this would look like:
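For example (the project name is a placeholder):

```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "my-project"  # placeholder project name
```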
Note that setting a project via an environment variable only works in a notebook and must be done BEFORE instrumentation is initialized. If you are using OpenInference Instrumentation, see the Server tab for how to set the project name in the Resource attributes.
Alternatively, you can set the project name in your register function call:
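A minimal sketch, again with a placeholder project name:

```python
from phoenix.otel import register

tracer_provider = register(project_name="my-project")
```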
If you are using Phoenix as a collector and running your application separately, you can set the project name in the Resource attributes for the trace provider.
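A sketch of how this could look using the OpenInference resource attribute for the project name; the project name itself is a placeholder:

```python
from openinference.semconv.resource import ResourceAttributes
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# The Phoenix server groups traces by this resource attribute.
resource = Resource.create({ResourceAttributes.PROJECT_NAME: "my-project"})
tracer_provider = TracerProvider(resource=resource)
```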
- Create, store, modify, and deploy prompts for interacting with LLMs
- Play with prompts, models, invocation parameters and track your progress via tracing and experiments
- Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
- Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.
The prompt editor (typically on the left side of the screen) is where you define the prompt template. You select the template language (mustache or f-string) on the toolbar. Whenever you type a variable placeholder in the prompt (say {{question}} for mustache), the variable to fill will show up in the inputs section. Input variables must either be filled in by hand or can be filled in via a dataset (where each row has key/value pairs for the input).
Phoenix lets you run a prompt (or multiple prompts) on a dataset. Simply load a dataset containing the input variables you want to use in your prompt template. When you click Run, Phoenix will apply each configured prompt to every example in the dataset, invoking the LLM for all possible prompt-example combinations. The results of your playground runs will be tracked as an experiment under the loaded dataset.
Check out our to get started. Look at our to better understand how to troubleshoot and evaluate different kinds of retrieval systems. For a high level overview on evaluation, check out our .
Instrumenting prompt templates and variables allows you to track and visualize prompt changes. These can also be combined with evaluations to measure the performance changes driven by each of your prompts.
We provide a using_prompt_template context manager to add a prompt template (including its version and variables) to the current OpenTelemetry Context. OpenInference will read this Context and pass the prompt template fields as span attributes, following the OpenInference semantic conventions. Its inputs must be of the following type:
Template: non-empty string.
Version: non-empty string.
Variables: a dictionary with string keys. This dictionary will be serialized to JSON when saved to the OTEL Context and remain a JSON string when sent as a span attribute.
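For example, a sketch using the context manager; the template, variables, and version values are illustrative:

```python
from openinference.instrumentation import using_prompt_template

prompt_template = "Please describe the weather forecast for {city} on {date}"
prompt_template_variables = {"city": "Johannesburg", "date": "July 11"}

with using_prompt_template(
    template=prompt_template,
    variables=prompt_template_variables,
    version="v1.0",
):
    # LLM calls made here carry the prompt template attributes on their spans.
    ...
```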
It can also be used as a decorator:
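A sketch of the decorator form, reusing the names from the example above; call_llm is a hypothetical function:

```python
@using_prompt_template(
    template=prompt_template,
    variables=prompt_template_variables,
    version="v1.0",
)
def call_llm(*args, **kwargs):
    ...
```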
template - a string with templated variables ex. "hello {{name}}"
variables - an object with variable names and their values ex. {name: "world"}
version - a string version of the template ex. v1.0
All of these are optional. Application of variables to a template will typically happen before the call to an LLM and may not be picked up by auto-instrumentation. So, this can be helpful to add to ensure you can see the templates and variables while troubleshooting.
We are continually iterating on our templates; view the most up-to-date templates in the Phoenix documentation.
For the Retrieval-Augmented Generation (RAG) use case, see the section.
See for the Retrieval-Augmented Generation (RAG) use case where relevant documents are retrieved for the question before constructing the context for the LLM.
In information retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding; then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. Corpus inferences can be imported into Phoenix as shown below.
The launcher accepts the corpus dataset through the corpus= parameter.
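A sketch, assuming corpus_inferences was defined as above and query_inferences is a hypothetical primary inference set for your application's queries:

```python
import phoenix as px

session = px.launch_app(primary=query_inferences, corpus=corpus_inferences)
```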
This prompt template is heavily inspired by the paper: .
We provide a setPromptTemplate function which allows you to set a template, version, and variables on context. You can use this utility in conjunction with context.with to set the active context. OpenInference will then pick up these attributes and add them to any spans created within the context.with callback. The components of a prompt template are: