Quickstarts

Not sure where to start? Try a quickstart:

Phoenix runs seamlessly in notebooks, containers, Phoenix Cloud, or from the terminal.

Sessions

Track and analyze multi-turn conversations

Sessions enable tracking and organizing related traces across multi-turn conversations with your AI application. When building conversational AI, maintaining context between interactions is critical - Sessions make this possible from an observability perspective.

With Sessions in Phoenix, you can:

  • Track the entire history of a conversation in a single thread

  • View conversations in a chatbot-like UI showing inputs and outputs of each turn

  • Search through sessions to find specific interactions

  • Track token usage and latency per conversation

This feature is particularly valuable for applications where context builds over time, like chatbots, virtual assistants, or any other multi-turn interaction. By tagging spans with a consistent session ID, you create a connected view that reveals how your application performs across an entire user journey.
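As an illustration, when instrumenting in Python with OpenInference you can stamp spans with a session ID using the using_session context manager. The sketch below assumes the openinference-instrumentation package and phoenix.otel; the project name, session ID, and run_chat_turn function are placeholders for your own application.

from openinference.instrumentation import using_session
from phoenix.otel import register

tracer = register(project_name="chat-app").get_tracer(__name__)

@tracer.chain
def run_chat_turn(user_message: str) -> str:
    # Stand-in for your real LLM call
    return f"echo: {user_message}"

def handle_user_message(session_id: str, user_message: str) -> str:
    # Spans created inside this block carry the session.id attribute,
    # so Phoenix can group every turn under a single conversation.
    with using_session(session_id=session_id):
        return run_chat_turn(user_message)

handle_user_message("session-1234", "Hello!")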

Next Steps

Setup Tracing

  • Learn how to use the phoenix.otel library

  • Learn how you can use basic OpenTelemetry to instrument your application.

  • Learn how to use Phoenix's decorators to easily instrument specific methods or code blocks in your application.

  • Setup tracing for your TypeScript application.

  • Learn about Projects in Phoenix, and how to use them.

  • Understand Sessions and how they can be used to group user conversations.

Features: Tracing

Tracing is a critical part of AI Observability and should be used both in production and development

Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.

This section contains details on Tracing features:

Advanced

  • Learn how to block PII from logging to Phoenix

  • Learn how to selectively block or turn off tracing

  • Learn how to send only certain spans to Phoenix

  • Learn how to trace images

Suppress Tracing

How to turn off tracing

Tracing can be paused temporarily or disabled permanently.

Pause tracing using context manager

If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.

from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Code running inside this block doesn't generate traces.
    # For example, running LLM evals here won't generate additional traces.
    ...
# Tracing will resume outside the block.
...

Uninstrument the auto-instrumentors permanently

Calling .uninstrument() on the auto-instrumentors removes tracing permanently. Below are examples for LangChain, LlamaIndex, and OpenAI, respectively.

from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor

LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
# etc.

Add Metadata

Tracing can be augmented and customized by adding Metadata. Metadata includes your own custom attributes, user ids, session ids, prompt templates, and more.

Add Attributes, Metadata, Users

  • Learn how to add custom metadata and attributes to your traces

Instrument Prompt Templates and Prompt Variables

  • Learn how to define custom prompt templates and variables in your tracing.
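One way to do this in Python is the using_prompt_template context manager from the openinference-instrumentation package, which attaches the template, its variables, and a version to any spans created inside the block. A minimal sketch (the template text, variables, and version string are illustrative):

from openinference.instrumentation import using_prompt_template

prompt_template = "Please describe the weather forecast for {city} on {date}."
prompt_template_variables = {"city": "Johannesburg", "date": "July 11"}

with using_prompt_template(
    template=prompt_template,
    variables=prompt_template_variables,
    version="v1.0",
):
    # Any LLM spans created here carry the prompt template, variables,
    # and version as attributes, which Phoenix surfaces in the UI.
    ...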

Quickstart: Tracing

Phoenix supports three main options to collect traces:

Quickstarts

Explore a Demo Trace

Projects

Use projects to organize your LLM traces

Projects provide organizational structure for your AI applications, allowing you to logically separate your observability data. This separation is essential for maintaining clarity and focus.

With Projects, you can:

  • Segregate traces by environment (development, staging, production)

  • Isolate different applications or use cases

  • Track separate experiments without cross-contamination

  • Maintain dedicated evaluation spaces for specific initiatives

  • Create team-specific workspaces for collaborative analysis

Projects act as containers that keep related traces and conversations together while preventing them from interfering with unrelated work. This organization becomes increasingly valuable as you scale - allowing you to easily switch between contexts without losing your place or mixing data.

The Project structure also enables comparative analysis across different implementations, models, or time periods. You can run parallel versions of your application in separate projects, then analyze the differences to identify improvements or regressions.

Explore a Demo Project

Annotations

In order to improve your LLM application iteratively, it's vital to collect feedback, annotate data during human review, and establish an evaluation pipeline so that you can monitor your application. In Phoenix we capture this type of feedback in the form of annotations.

Phoenix gives you the ability to annotate traces with feedback from the UI, your application, or wherever you would like to perform evaluation. Phoenix's annotation model is simple yet powerful - given an entity such as a span that is collected, you can assign a label and/or a score to that entity.

Navigate to the Feedback tab in this demo trace to see how LLM-based evaluations appear in Phoenix:

Next Steps

  • Learn more about the underlying concepts in Concepts: Annotations

  • Configure Annotation Configs to guide human annotations.

  • Learn how to run evals on your traces with Running Evals on Traces

  • Learn how to log annotations via the client from your app or in a notebook

Annotate Traces

Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:

  1. Track performance over time

  2. Identify areas for improvement

  3. Compare different model versions or prompts

  4. Gather data for fine-tuning or retraining

  5. Provide stakeholders with concrete metrics on system effectiveness

Phoenix allows you to annotate traces through the Client, the REST API, or the UI.

Guides

  • To learn how to configure annotations and to annotate through the UI, see Annotating in the UI

  • To learn how to add human labels to your traces, either manually or programmatically, see Annotating via the Client

  • To learn how to evaluate traces captured in Phoenix, see Running Evals on Traces

  • To learn how to upload your own evaluation labels into Phoenix, see Log Evaluation Results

For more background on the concept of annotations, see Annotations

Import Existing Traces

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

import os

os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key=..."
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.

Importing Traces to an Existing Phoenix Instance

import phoenix as px

# Re-launch the app using trace data
px.launch_app(trace=px.TraceDataset(df))

# Load traces into an existing Phoenix instance
px.Client().log_traces(trace_dataset=px.TraceDataset(df))

# Load traces into an existing Phoenix instance from a local file
px.launch_app(trace=px.TraceDataset.load('f7733fda-6ad6-4427-a803-55ad2182b662', directory="/my_saved_traces/"))

Launching a new Phoenix Instance with Saved Traces

You can also launch a temporary version of Phoenix in your local notebook to quickly view the traces. But be warned, this Phoenix instance will only last as long as your notebook environment is running.

# Load traces from a dataframe
px.launch_app(trace=px.TraceDataset(df))

# Load traces from a local file
px.launch_app(trace=px.TraceDataset.load('f7733fda-6ad6-4427-a803-55ad2182b662', directory="/my_saved_traces/"))

Prompt Management

Version and track changes made to prompt templates

Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse and consistency, and experiment with variations across different models and inputs.

Key benefits of prompt management include:

  • Reusability: Store and load prompts across different use cases.

  • Versioning: Track changes over time to ensure that the best performing version is deployed for use in your application.

  • Collaboration: Share prompts with others to maintain consistency and facilitate iteration.

To learn how to get started with prompt management, see Create a prompt

Overview: Prompts

Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse and consistency, and experiment with variations across different models and inputs.

Unlike traditional software, AI applications are non-deterministic and depend on natural language to provide context and guide model output. The pieces of natural language and associated model parameters embedded in your program are known as “prompts.”

Optimizing your prompts is typically the highest-leverage way to improve the behavior of your application, but “prompt engineering” comes with its own set of challenges. You want to be confident that changes to your prompts have the intended effect and don’t introduce regressions.

To get started, jump to Quickstart: Prompts.

Prompt Engineering Features

Phoenix offers a comprehensive suite of features to streamline your prompt engineering workflow.

Explore Demo Prompts

Prompt Playground

To learn more on how to use the playground, see Using the Playground

Prompts in Code

Pull and push prompt changes via Phoenix's Python and TypeScript Clients

Using Phoenix as a backend, Prompts can be managed and manipulated via code by using our Python or TypeScript SDKs.

With the Phoenix Client SDK you can create, update, pull, and format prompts directly from your code.

To learn more about managing Prompts in code, see Using a prompt
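As a rough sketch of what this looks like with the Python client (the prompt name "article-summarizer" and the variables passed are assumptions for illustration, not part of the docs):

from openai import OpenAI
from phoenix.client import Client

# Pull a prompt that was previously saved in Phoenix
prompt = Client().prompts.get(prompt_identifier="article-summarizer")

# Format the template with runtime variables and pass it straight to OpenAI
formatted = prompt.format(variables={"article": "Phoenix is an open-source AI observability platform."})
response = OpenAI().chat.completions.create(**formatted)
print(response.choices[0].message.content)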

Test a prompt

Testing your prompts before you ship them is vital to deploying reliable AI applications

Testing in the Playground

Testing a prompt in the playground

The Playground is a fast and efficient way to refine prompt variations. You can load previous prompts and validate their performance by applying different variables.

Each single-run test in the Playground is recorded as a span in the Playground project, allowing you to revisit and analyze LLM invocations later. These spans can be added to datasets or reloaded for further testing.

Testing a prompt over a dataset

The ideal way to test a prompt is to construct a golden dataset where each dataset example contains the variables to be applied to the prompt in its inputs and the ideal answer you want from the LLM in its outputs. This way you can run a given prompt over N examples at once and compare the synthesized answers against the golden answers.
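One way to do this from code is with Phoenix datasets and experiments. The sketch below assumes a dataset named "golden-dataset" with an "article" input column and a "summary" expected output; call_llm stands in for your own prompt plus model call.

import phoenix as px
from phoenix.experiments import run_experiment

# Assumes a golden dataset was already uploaded to Phoenix
dataset = px.Client().get_dataset(name="golden-dataset")

def call_llm(article: str) -> str:
    # Placeholder: substitute your prompt + LLM invocation here
    return article[:100]

def task(input):
    # Apply the prompt to this example's input
    return call_llm(article=input["article"])

def matches_golden(output, expected) -> bool:
    # Toy evaluator comparing the synthesized answer to the golden answer
    return output.strip() == expected["summary"].strip()

run_experiment(dataset, task, evaluators=[matches_golden], experiment_name="prompt-v2")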

Testing prompt variations side-by-side

Prompt Playground supports side-by-side comparisons of multiple prompt variants. Click + Compare to add a new variant. Whether using Span Replay or testing prompts over a Dataset, the Playground processes inputs through each variant and displays the results for easy comparison.

Testing a prompt using code

Quickstart: Prompts

Quickstarts

Span Replay

Replay LLM spans traced in your application directly in the playground

Have you ever wanted to go back into a multi-step LLM chain and just replay one step to see if you could get a better outcome? Well you can with Phoenix's Span Replay. LLM spans that are stored within Phoenix can be loaded into the Prompt Playground and replayed. Replaying spans inside of Playground enables you to debug and improve the performance of your LLM systems by comparing LLM provider outputs, tweaking model parameters, changing prompt text, and more.

Chat completions generated inside of Playground are automatically instrumented, and the recorded spans are immediately available to be replayed inside of Playground.

Online Evals

This example:

  • Continuously queries a LangChain application to send new traces and spans to your Phoenix session

  • Queries new spans once per minute and runs evals, including:

    • Hallucination

    • Q&A Correctness

    • Relevance

  • Logs evaluations back to Phoenix so they appear in the UI

The evaluation script is run as a cron job, enabling you to adjust the frequency of the evaluation job:

* * * * * /path/to/python /path/to/run_evals.py

The above script can be run periodically to augment Evals in Phoenix.
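A condensed sketch of what such a run_evals.py script might contain, assuming a running Phoenix instance and the arize-phoenix[evals] extras installed (the exact evaluators and any time-window filtering are up to you):

# run_evals.py
import phoenix as px
from phoenix.evals import HallucinationEvaluator, QAEvaluator, OpenAIModel, run_evals
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

client = px.Client()

# Pull Q&A spans (with retrieved reference text) ingested so far
queries_df = get_qa_with_reference(client)

model = OpenAIModel(model_name="gpt-4")
hallucination_df, qa_df = run_evals(
    dataframe=queries_df,
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    provide_explanation=True,
)

# Log the results back to Phoenix so they appear in the UI
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_df),
)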

Environments

The Phoenix app can be run in various environments such as Colab and SageMaker notebooks, as well as be served via the terminal or a docker container.

Phoenix Cloud

If you're using Phoenix Cloud, be sure to set the proper environment variables to connect to your instance:

Container

Notebooks

To start phoenix in a notebook environment, run:

This will start a local Phoenix server. You can initialize the phoenix server with various kinds of data (traces, inferences).

By default, Phoenix does not persist your data when run in a notebook.

Terminal

If you want to start a phoenix server to collect traces, you can also run phoenix directly from the command line:

This will start the phoenix server on port 6006. If you are running your instrumented notebook or application on the same machine, traces should automatically be exported to http://127.0.0.1:6006, so no additional configuration is needed. However, if the server is running remotely, you will have to modify the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to that machine (e.g. http://<my-remote-machine>:<port>).

Overview: Tracing

Tracing the execution of LLM applications using Telemetry

Phoenix traces AI applications via OpenTelemetry and has first-class integrations with LlamaIndex, LangChain, OpenAI, and others.

LLM tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation to provide a detailed timeline of the request's execution.

Using Phoenix's tracing capabilities can provide important insights into the inner workings of your LLM application. By analyzing the collected trace data, you can identify and address various performance and operational issues and improve the overall reliability and efficiency of your system.

  • Application Latency: Identify and address slow invocations of LLMs, Retrievers, and other components within your application, enabling you to optimize performance and responsiveness.

  • Token Usage: Gain a detailed breakdown of token usage for your LLM calls, allowing you to identify and optimize the most expensive LLM invocations.

  • Runtime Exceptions: Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.

  • Retrieved Documents: Inspect the documents retrieved during a Retriever call, including the score and order in which they were returned to provide insight into the retrieval process.

  • Embeddings: Examine the embedding text used for retrieval and the underlying embedding model to allow you to validate and refine your embedding strategies.

  • LLM Parameters: Inspect the parameters used when calling an LLM, such as temperature and system prompts, to ensure optimal configuration and debugging.

  • Prompt Templates: Understand the prompt templates used during the prompting step and the variables that were applied, allowing you to fine-tune and improve your prompting strategies.

  • Tool Descriptions: View the descriptions and function signatures of the tools your LLM has been given access to in order to better understand and control your LLM’s capabilities.

  • LLM Function Calls: For LLMs with function call capabilities (e.g., OpenAI), you can inspect the function selection and function messages in the input to the LLM, further improving your ability to debug and optimize your application.

By using tracing in Phoenix, you can gain increased visibility into your LLM application, empowering you to identify and address performance bottlenecks, optimize resource utilization, and ensure the overall reliability and effectiveness of your system.

Next steps

Setup Projects

Log to a specific project

Phoenix uses projects to group traces. If left unspecified, all traces are sent to a default project.

Projects work by setting the project name as a Resource attribute on your traces. The Phoenix server uses the project name attribute to group traces into the appropriate project.

Switching projects in a notebook

Typically you want traces for an LLM app to all be grouped in one project. However, while working with Phoenix inside a notebook, we provide a utility to temporarily associate spans with different projects. You can use this to trace things like evaluations.

Annotating in the UI

How to annotate traces in the UI for analysis and dataset curation

Configuring Annotations

To annotate data in the UI, you will first want to set up a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create various different types of annotations: Categorical, Continuous, and Freeform.

Adding Annotations

Once you have annotations configured, you can associate annotations with the data that you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application. You can also take notes as you go by either clicking on the explain link or by adding your notes to the bottom messages UI. You can always come back and edit or delete your annotations. Annotations can be deleted from the table view under the Annotations tab.

Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.

Viewing Annotations

As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. was the annotation performed by a human, llm, or code), and so on. This can be particularly useful if you want to see if different annotators disagree.

Exporting Traces with specific Annotation Values

Once you have collected feedback in the form of annotations, you can filter your traces by the annotation values to narrow down to interesting samples (e.g. LLM spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for things like experimentation, fine-tuning, or building a human-aligned eval.

Exporting Annotated Spans

Span annotations can be an extremely valuable basis for improving your application. The Phoenix client provides useful ways to pull down spans and their associated annotations. This information can be used to:

  • build new LLM judges

  • form the basis for new datasets

  • help identify ideas for improving your application

Pulling Spans

If you only want the spans that contain a specific annotation, you can pass in a query that filters on annotation names, scores, or labels.

The queries can also filter by annotation scores and labels.

This spans dataframe can be used to pull associated annotations.

Instead of an input dataframe, you can also pass in a list of ids:

The annotations and spans dataframes can be easily joined to produce a one-row-per-annotation dataframe that can be used to analyze the annotations!

Capture Multimodal Traces

Phoenix supports displaying images that are included in LLM traces.

To view images in Phoenix

  1. Include either a base64 UTF-8 encoded image or an image url in the call made to your LLM

Example

You should see your image appear in Phoenix:

Overview: Datasets & Experiments

The velocity of AI application development is bottlenecked by quality evaluations because AI engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost. High quality evaluations are critical as they can help developers answer these types of questions with greater confidence.

Datasets

Datasets are integral to evaluation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are used to run experiments and evaluations to track improvements to your prompt, LLM, or other parts of your LLM application.

Experiments

In AI development, it's hard to understand how a change will affect performance. This breaks the dev flow, making iteration more guesswork than engineering.

Experiments and evaluations solve this, helping distill the indeterminism of LLMs into tangible feedback that helps you ship a more reliable product.

Specifically, good evals help you:

  • Understand whether an update is an improvement or a regression

  • Drill down into good / bad examples

  • Compare specific examples vs. prior runs

  • Avoid guesswork

Exporting Datasets

Exporting to CSV

Want to just use the contents of your dataset in another context? Simply click on the export to CSV button on the dataset page and you are good to go!

Exporting for Fine-Tuning

Fine-tuning lets you get more out of the models available by providing:

  • Higher quality results than prompting

  • Ability to train on more examples than can fit in a prompt

  • Token savings due to shorter prompts

  • Lower latency requests

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests. Phoenix natively exports OpenAI Fine-Tuning JSONL as long as the dataset contains compatible inputs and outputs.

Exporting OpenAI Evals

How to: Prompts

Guides on how to do prompt engineering with Phoenix

Getting Started

Prompt Management

Organize and manage prompts with Phoenix to streamline your development workflow

Prompt management is currently available on a feature branch only and will be released in the next major version.

Playground

Iterate on prompts and models in the prompt playground

Importing & Exporting Traces

  • Learn how to load a file of traces into Phoenix

  • Learn how to export trace data from Phoenix

How-to: Experiments

How to run experiments

How to use evaluators

Agent Path Convergence

When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway they took to get there. You want most of your runs to be consistent and not take unnecessary, frivolous, or wrong actions.

One way of doing this is to calculate convergence:

  1. Run your agent on a set of similar queries

  2. Record the number of steps taken for each

  3. Calculate the convergence score: avg(minimum steps taken / steps taken for this run)

This will give a convergence score of 0-1, with 1 being a perfect score.

Filter Spans to Export

Sometimes while instrumenting your application, you may want to filter out or modify certain spans before they are sent to Phoenix. For example, you may want to filter out spans that contain sensitive information or redundant information.

To do this, you can use a custom SpanProcessor and attach it to the OpenTelemetry TracerProvider.

How-to: Datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning, and more.

How to create datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning, and much more. Phoenix allows you to build datasets manually, programmatically, or from files.

Exporting datasets

Export datasets for offline analysis, evals, and fine-tuning.

Agent Planning

This template evaluates a plan generated by an agent. It uses heuristics to check whether the plan is valid, uses only available tools, and will accomplish the task at hand.

Prompt Template

Overview: Evals

Phoenix Evals come with:

  • Speed - Phoenix evals are designed for maximum speed and throughput. Evals run in batches and typically run 10x faster than calling the APIs directly.

Audio Emotion Detection

The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.

Template Details

The following is the structure of the EMOTION_PROMPT_TEMPLATE:

Template Module

The prompt and evaluation logic are part of the phoenix.evals.default_audio_templates module and are defined as:

  • EMOTION_AUDIO_RAILS: Output options for the evaluation template.

  • EMOTION_PROMPT_TEMPLATE: Prompt used for evaluating audio emotions.

Agent Reflection

Use this prompt template to evaluate an agent's final response. This is an optional step, which you can use as a gate to retry a set of actions if the response or state of the world is insufficient for the given task.

Read more:

Prompt Template

View the inner workings for your LLM Application

  • Use Phoenix's decorators to mark functions and code blocks.

  • Use automatic instrumentation to capture all calls made to supported frameworks.

  • Use base OpenTelemetry instrumentation. Supported in Python and TS / JS.

Applying the scientific method to building AI products - By Eugene Yan
Adding manual annotations to traces

Phoenix supports loading data that contains OpenInference traces. This allows you to load an existing dataframe of traces into your Phoenix instance.

Usually these will be traces you've previously saved using Save All Traces.

Iterate on prompts, ship prompts when they are tested
Use the playground to engineer the optimal prompt for your task

  • Prompt Management - Create, store, modify, and deploy prompts for interacting with LLMs

  • Prompt Playground - Play with prompts, models, invocation parameters and track your progress via tracing and experiments

  • Span Replay - Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.

  • Prompts in Code - Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.

Phoenix's Prompt Playground makes the process of iterating and testing prompts quick and easy. Phoenix's playground supports various AI providers (OpenAI, Anthropic, Gemini, Azure) as well as custom model endpoints, making it the ideal prompt IDE for you to build, experiment with, and evaluate prompts and models for your task.

Speed: Rapidly test variations in the prompt, model, invocation parameters, tools, and output format.

Reproducibility: All runs of the playground are recorded as traces and experiments, unlocking annotations and evaluation.

Datasets: Use dataset examples as a fixture to run a prompt variant through its paces and to evaluate it systematically.

Prompt Management: Load, edit, and save prompts directly within the playground.

Pull and push prompt changes via Phoenix's Python and TypeScript Clients

  • Create / Update prompts dynamically

  • Pull prompt templates by name, version, or tag

  • Format prompt templates with runtime variables and use them in your code. Native support for OpenAI, Anthropic, Gemini, Vercel AI SDK, and more. No proprietary client necessary.

  • Support for tool calling and response formats. Execute tools defined within the prompt. Phoenix prompts encompass more than just the text and messages.

Playground integrates with datasets and experiments to help you iterate and incrementally improve your prompts. Experiment runs are automatically recorded and available for subsequent evaluation to help you understand how changes to your prompts, LLM model, or invocation parameters affect performance.

Sometimes you may want to test a prompt and run evaluations on a given prompt. This can be particularly useful when custom manipulation is needed (e.g. you are trying to iterate on a system prompt on a variety of different chat messages). This tutorial is coming soon.

Prompts in Phoenix can be created, iterated on, versioned, tagged, and used either via the UI or our Python/TS SDKs. The UI option also includes our Prompt Playground, which allows you to compare prompt variations side-by-side in the Phoenix UI.

Replay LLM spans traced in your application directly in the playground

You can use cron to run evals client-side as your traces and spans are generated, augmenting your dataset with evaluations in an online manner. View the example in GitHub.

If you are set up, see Quickstarts to start using Phoenix in your preferred environment.

Phoenix Cloud provides free-to-use Phoenix instances that are preconfigured for you with 10GBs of storage space. Phoenix Cloud instances are a great starting point; however, if you need more storage or more control over your instance, self-hosting options could be a better fit.

See Self-Hosting.

Tracing is a helpful tool for understanding how your LLM application works. Phoenix offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework. Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks (LlamaIndex, LangChain, DSPy), SDKs (OpenAI, Bedrock, Mistral, Vertex), and languages (Python, JavaScript, etc.).

To get started, check out the Quickstart guide.

Read more about what traces are and how traces work.

Check out the How-To Guides for specific tutorials.

In the notebook, you can set the PHOENIX_PROJECT_NAME environment variable before adding instrumentation or running any of your code.

In python this would look like:
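import os

# Set before any instrumentation is initialized
os.environ["PHOENIX_PROJECT_NAME"] = "my-llm-app"

Here "my-llm-app" is an illustrative project name; substitute your own.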

Note that setting a project via an environment variable only works in a notebook and must be done BEFORE instrumentation is initialized. If you are using OpenInference Instrumentation, see the Server tab for how to set the project name in the Resource attributes.

Alternatively, you can set the project name in your register function call:
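from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")

Again, "my-llm-app" is an illustrative project name.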

If you are using Phoenix as a collector and running your application separately, you can set the project name in the Resource attributes for the trace provider.
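A minimal sketch using base OpenTelemetry and the OpenInference project-name resource attribute (the endpoint and project name shown are illustrative for a local Phoenix server):

from openinference.semconv.resource import ResourceAttributes
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Group all spans from this provider under the "my-llm-app" project
resource = Resource(attributes={ResourceAttributes.PROJECT_NAME: "my-llm-app"})
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)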

Annotation Types
  • Annotation type:
    - Categorical: predefined labels for selection (e.g. 👍 or 👎)
    - Continuous: a score across a specified range (e.g. confidence score 0-100)
    - Freeform: open-ended text comments (e.g. "correct")

  • Optimize the direction based on your goal:
    - Maximize: higher scores are better (e.g. confidence)
    - Minimize: lower scores are better (e.g. hallucinations)
    - None: direction optimization does not apply (e.g. tone)

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. OpenAI Evals offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals. Phoenix can natively export the OpenAI Evals format as JSONL so you can use it with OpenAI Evals. See https://github.com/openai/evals for details.

  • Configure AI Providers - how to configure API keys for OpenAI, Anthropic, Gemini, and more.

  • Create a prompt - how to create, update, and track prompt changes.

  • Test a prompt - how to test changes to a prompt in the playground and in the notebook.

  • Tag a prompt - how to mark certain prompt versions as ready for production.

  • Using a prompt - how to integrate prompts into your code and experiments.

  • Using the Playground - how to set up the playground and how to test prompt changes via datasets and experiments.

In this example, we're filtering out any spans that have the name "secret_span" by bypassing the on_start and on_end hooks of the inherited BatchSpanProcessor.

Notice that this logic can be extended to modify a span and redact sensitive information if preserving the span is preferred.
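A sketch of such a processor, assuming you are exporting over OTLP/HTTP to a local Phoenix server; the "secret_span" name and the endpoint are illustrative:

from typing import Optional

from opentelemetry.context import Context
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import ReadableSpan, Span, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

class FilteringSpanProcessor(BatchSpanProcessor):
    def _filter_condition(self, span: ReadableSpan) -> bool:
        # Drop spans named "secret_span"; this could instead inspect and
        # redact attributes if preserving the span is preferred.
        return span.name == "secret_span"

    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        if self._filter_condition(span):
            return
        super().on_start(span, parent_context)

    def on_end(self, span: ReadableSpan) -> None:
        if self._filter_condition(span):
            return
        super().on_end(span)

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    FilteringSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)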

  • Exporting to CSV - how to quickly download a dataset to use elsewhere.

  • Exporting to OpenAI Ft - want to fine-tune an LLM for better accuracy and cost? Export LLM examples for fine-tuning.

  • Exporting to OpenAI Evals - have some good examples to use for benchmarking of LLMs using OpenAI evals? Export to the OpenAI evals format.

The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using an LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.

  • Pre-built evals - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling. Learn more about pre-tested templates here. Each eval is pre-tested on a variety of eval models. Find the most up-to-date templates on GitHub.

  • Run evals on your own data - Phoenix Evals takes a dataframe as its primary input and output, making it easy to run evaluations on your own data - whether that's logs, traces, or datasets downloaded for benchmarking.

  • Built-in Explanations - All Phoenix evaluations include an explanation flag that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your eval.

  • Eval Models - Phoenix lets you configure which foundation model you'd like to use as a judge. This includes OpenAI, Anthropic, Gemini, and much more. See Eval Models.

This prompt template is heavily inspired by the paper "Self Reflection in LLM Agents".

# Phoenix Cloud: set these environment variables to connect to your instance
export PHOENIX_CLIENT_HEADERS="api_key=ENTER YOUR API KEY"
export PHOENIX_COLLECTOR_ENDPOINT="https://app.phoenix.arize.com"

# Notebook: launch a local Phoenix server
import phoenix as px

session = px.launch_app()

# Terminal: start the Phoenix server from the command line
phoenix serve
from phoenix.trace import using_project

# Switch project to run evals
with using_project("my-eval-project"):
    # all spans created within this context will be associated with
    # the "my-eval-project" project.
    # Run evaluations here...
    ...
from phoenix.client import Client

client = Client()

# Pull all spans from a project into a dataframe
spans = client.spans.get_spans_dataframe(
    project_identifier="default",  # you can also pass a project id
)
from phoenix.client import Client
from phoenix.client.types.span import SpanQuery

client = Client()

# Pull only spans that have a "correctness" annotation
query = SpanQuery().where("annotations['correctness']")

spans = client.spans.get_spans_dataframe(
    query=query,
    project_identifier="default",  # you can also pass a project id
)

from phoenix.client import Client
from phoenix.client.types.span import SpanQuery

client = Client()

# Filter by annotation score or label
query = SpanQuery().where("annotations['correctness'].score == 1")
# query = SpanQuery().where("annotations['correctness'].label == 'correct'")

spans = client.spans.get_spans_dataframe(
    query=query,
    project_identifier="default",  # you can also pass a project id
)
# Pull annotations associated with the spans dataframe
annotations = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans,
    project_identifier="default",
)

# Or pass a list of span ids instead of a dataframe
annotations = client.spans.get_span_annotations_dataframe(
    span_ids=list(spans.index),
    project_identifier="default",
)

# Join annotations with spans to get one row per annotation
annotations.join(spans, how="left")
pip install -q "arize-phoenix>=4.29.0" openinference-instrumentation-openai openai
# Check if PHOENIX_API_KEY is present in the environment variables.
# If it is, we'll use the cloud instance of Phoenix. If it's not, we'll start a local instance.
# A third option is to connect to a docker or locally hosted instance.
# See https://arize.com/docs/phoenix/setup/environments for more information.

# Launch Phoenix
import os
if "PHOENIX_API_KEY" in os.environ:
    os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

else:
    import phoenix as px

    px.launch_app().view()

# Connect to Phoenix
from phoenix.otel import register
tracer_provider = register()

# Instrument OpenAI calls in your application
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

# Make a call to OpenAI with an image provided
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)
# Assume all_outputs is a list of runs, where each run is the list of
# messages (the path) the agent took for that query
all_outputs = [
]

# Length of the shortest (optimal) path across all runs
optimal_path_length = min(len(output) for output in all_outputs)
ratios_sum = 0

for output in all_outputs:
    run_length = len(output)
    ratio = optimal_path_length / run_length
    ratios_sum += ratio

# Calculate the average ratio
if len(all_outputs) > 0:
    convergence = ratios_sum / len(all_outputs)
else:
    convergence = 0

print(f"The optimal path length is {optimal_path_length}")
print(f"The convergence is {convergence}")
You are an evaluation assistant. Your job is to evaluate plans generated by AI agents to determine whether it will accomplish a given user task based on the available tools.

Here is the data:
    [BEGIN DATA]
    ************
    [User task]: {task}
    ************
    [Tools]: {tool_definitions}
    ************
    [Plan]: {plan}
    [END DATA]

Here is the criteria for evaluation
1. Does the plan include only valid and applicable tools for the task?  
2. Are the tools used in the plan sufficient to accomplish the task?  
3. Will the plan, as outlined, successfully achieve the desired outcome?  
4. Is this the shortest and most efficient plan to accomplish the task?

Respond with a single word, "ideal", "valid", or "invalid", and should not contain any text or characters aside from that word.

"ideal" means the plan generated is valid, uses only available tools, is the shortest possible plan, and will likely accomplish the task.

"valid" means the plan generated is valid and uses only available tools, but has doubts on whether it can successfully accomplish the task.

"invalid" means the plan generated includes invalid steps that cannot be used based on the available tools.
You are an AI system designed to classify emotions in audio files.

### TASK:
Analyze the provided audio file and classify the primary emotion based on these characteristics:
- Tone: General tone of the speaker (e.g., cheerful, tense, calm).
- Pitch: Level and variability of the pitch (e.g., high, low, monotone).
- Pace: Speed of speech (e.g., fast, slow, steady).
- Volume: Loudness of the speech (e.g., loud, soft, moderate).
- Intensity: Emotional strength or expression (e.g., subdued, sharp, exaggerated).

The classified emotion must be one of the following:
['anger', 'happiness', 'excitement', 'sadness', 'neutral', 'frustration', 'fear', 'surprise', 'disgust', 'other']

IMPORTANT: Choose the most dominant emotion expressed in the audio. Neutral should only be used when no other emotion is clearly present; do your best to avoid this label.

************

Here is the audio to classify:

{audio}

RESPONSE FORMAT:

Provide a single word from the list above representing the detected emotion.

************

EXAMPLE RESPONSE: excitement

************

Analyze the audio and respond in this format.
You are an expert in {topic}. I will give you a user query. Your task is to reflect on your provided solution and whether it has solved the problem.
First, explain whether you believe the solution is correct or incorrect.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, create a list of detailed instructions to help you correctly solve this problem in the future if it is incorrect.

Be concise in your response; however, capture all of the essential information.

Here is the data:
    [BEGIN DATA]
    ************
    [User Query]: {user_query}
    ************
    [Tools]: {tool_definitions}
    ************
    [State]: {current_state}
    ************
    [Provided Solution]: {solution}
    [END DATA]

SQL Generation Eval

SQL Generation is a common approach to using an LLM. In many cases the goal is to take a human description of a query and generate SQL that matches that description.

Example of a Question: How many artists have names longer than 10 characters?

Example Query Generated:

SELECT COUNT(ArtistId)
FROM artists
WHERE LENGTH(Name) > 10

The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.

SQL Eval Template

SQL Evaluation Prompt:
-----------------------
You are tasked with determining if the SQL generated appropiately answers a given 
instruction taking into account its generated query and response.

Data:
-----
- [Instruction]: {question}
  This section contains the specific task or problem that the sql query is intended 
  to solve.

- [Reference Query]: {query_gen}
  This is the sql query submitted for evaluation. Analyze it in the context of the 
  provided instruction.

- [Provided Response]: {response}
  This is the response and/or conclusions made after running the sql query through 
  the database

Evaluation:
-----------
Your response should be a single word: either "correct" or "incorrect".
You must assume that the db exists and that columns are appropiately named.
You must take into account the response as additional information to determine the 
correctness.

Running an SQL Generation Eval

from phoenix.evals import (
    SQL_GEN_EVAL_PROMPT_RAILS_MAP,
    SQL_GEN_EVAL_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

rails = list(SQL_GEN_EVAL_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)
relevance_classifications = llm_classify(
    dataframe=df,
    template=SQL_GEN_EVAL_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

How to: Evals

Run evaluations via a job to visualize in the UI as traces stream in.

Evaluate traces captured in Phoenix and export results to the Phoenix UI.

Evaluate tasks with multiple inputs/outputs (ex: text, audio, image) using versatile evaluation tasks.

User Frustration

Teams that use conversation bots and assistants want to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back-and-forth exchange or an entire span to detect whether a user has become frustrated by the conversation.

User Frustration Eval Template

  You are given a conversation where between a user and an assistant.
  Here is the conversation:
  [BEGIN DATA]
  *****************
  Conversation:
  {conversation}
  *****************
  [END DATA]

  Examine the conversation and determine whether or not the user got frustrated from the experience.
  Frustration can range from midly frustrated to extremely frustrated. If the user seemed frustrated
  at the beginning of the conversation but seemed satisfied at the end, they should not be deemed
  as frustrated. Focus on how the user left the conversation.

  Your response must be a single word, either "frustrated" or "ok", and should not
  contain any text or characters aside from that word. "frustrated" means the user was left
  frustrated as a result of the conversation. "ok" means that the user did not get frustrated
  from the conversation.

The following code snippet shows how to use the above eval template:

from phoenix.evals import (
    USER_FRUSTRATION_PROMPT_RAILS_MAP,
    USER_FRUSTRATION_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails is used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned
rails = list(USER_FRUSTRATION_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=USER_FRUSTRATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

Prompt and Response (LLM)

How to import prompt and response data from a Large Language Model (LLM)

Dataframe

Below is a relevant subsection of the dataframe. The embedding of the prompt is also shown.

prompt                                            | embedding                      | response
who was the first person that walked on the moon  | [-0.0126, 0.0039, 0.0217, ...  | Neil Alden Armstrong
who was the 15th prime minister of australia      | [0.0351, 0.0632, -0.0609, ...  | Francis Michael Forde

Schema

primary_schema = Schema(
    prediction_id_column_name="id",
    prompt_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="prompt",
    ),
    response_column_names="response",
)

Inferences

Define the inferences by pairing the dataframe with the schema.

primary_inferences = px.Inferences(primary_dataframe, primary_schema)

Application

session = px.launch_app(primary_inferences)

Corpus Data

How to create Phoenix inferences and schemas for the corpus data

Inferences

Below is an example dataframe containing Wikipedia articles along with their embedding vectors.

id | text                                               | embedding
1  | Voyager 2 is a spacecraft used by NASA to expl...  | [-0.02785328, -0.04709944, 0.042922903, 0.0559...
2  | The Saturn Nebula is a planetary nebula in th...   | [0.03544901, 0.039175965, 0.014074919, -0.0307...
3  | Eris is a dwarf planet and a trans-Neptunian o...  | [0.05506449, 0.0031612846, -0.020452883, -0.02...

Schema

Below is an appropriate schema for the dataframe above. It specifies the id column and that the embedding belongs to the text column. Other columns, if they exist, will be detected automatically and need not be specified in the schema.

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

Inferences

Define the inferences by pairing the dataframe with the schema.

corpus_inferences = px.Inferences(corpus_dataframe, corpus_schema)

Application

session = px.launch_app(production_dataset, corpus=corpus_inferences)

How-to: Inferences

How to export your data for labeling, evaluation, or fine-tuning

Overview: Retrieval

Many LLM applications use a technique called Retrieval Augmented Generation. These applications retrieve data from their knowledge base to help the LLM accomplish tasks with the appropriate context.

However, these retrieval systems can still hallucinate or provide answers that are not relevant to the user's input query. We can evaluate retrieval systems by checking for:

  1. Are there certain types of questions the chatbot gets wrong more often?

  2. Are the documents that the system retrieves irrelevant? Do we have the right documents to answer the question?

  3. Does the response match the provided documents?

Phoenix supports retrieval troubleshooting and evaluation on both traces and inferences, but inferences are currently required to visualize your retrievals using a UMAP. See below for the differences.

Feature                                 | Traces & Spans    | Inferences
Troubleshooting for LLM applications    | ✅                | ✅
Follow the entirety of an LLM workflow  | ✅                | 🚫 support for spans only
Embeddings Visualizer                   | 🚧 on the roadmap | ✅

Retrieval (RAG)

How to import data for the Retrieval-Augmented Generation (RAG) use case

Dataframe

query                                              | embedding                       | retrieved_document_ids       | relevance_scores
who was the first person that walked on the moon   | [-0.0126, 0.0039, 0.0217, ...   | [7395, 567965, 323794, ...   | [11.30, 7.67, 5.85, ...
who was the 15th prime minister of australia       | [0.0351, 0.0632, -0.0609, ...   | [38906, 38909, 38912, ...    | [11.28, 9.10, 8.39, ...
why is amino group in aniline an ortho para di...  | [-0.0431, -0.0407, -0.0597, ... | [779579, 563725, 309367, ... | [-10.89, -10.90, -10.94, ...

Schema

Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.

primary_schema = Schema(
    prediction_id_column_name="id",
    prompt_column_names=RetrievalEmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="query",
        context_retrieval_ids_column_name="retrieved_document_ids",
        context_retrieval_scores_column_name="relevance_scores",
    )
)

Inferences

Define the inferences by pairing the dataframe with the schema.

primary_inferences = px.Inferences(primary_dataframe, primary_schema)

Application

session = px.launch_app(primary_inferences)

Arize Phoenix

AI Observability and Evaluation

Features

Quickstarts

Running Phoenix for the first time? Select a quickstart below.

Next Steps

Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.

Add instrumentation for popular packages and libraries such as OpenAI, LangGraph, Vercel AI SDK and more.

Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.

Quickstart: Tracing (Python)

Overview

Phoenix supports three main options to collect traces:

  1. Use Phoenix's decorators to mark functions and code blocks.

  2. Use automatic instrumentation to capture all calls made to supported frameworks.

  3. Use base OpenTelemetry instrumentation, supported in Python and TS / JS.

This example uses options 1 and 2.

Launch Phoenix

Connect to Phoenix

To collect traces from your application, you must configure an OpenTelemetry TracerProvider to send traces to Phoenix.
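For example, using the phoenix.otel register helper described later in this guide (the project name and endpoint are illustrative; the endpoint can be omitted if PHOENIX_COLLECTOR_ENDPOINT is already set):

from phoenix.otel import register

tracer_provider = register(
    project_name="my-llm-app",                   # illustrative project name
    endpoint="http://localhost:6006/v1/traces",  # assumes a locally running Phoenix
)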

Trace your own functions

Functions can be traced using decorators:
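# A minimal sketch; my_func stands in for any function in your application
tracer = tracer_provider.get_tracer(__name__)

@tracer.chain
def my_func(input: str) -> str:
    return "output"

my_func("hello")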

Input and output attributes are set automatically based on my_func's parameters and return.

Trace all calls made to a library

OpenInference libraries must be installed before calling the register function

View your Traces in Phoenix

You should now see traces in Phoenix!

Next Steps

Setup using Phoenix OTEL

phoenix.otel is a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults.

These defaults are aware of environment variables you may have set to configure Phoenix:

  • PHOENIX_COLLECTOR_ENDPOINT

  • PHOENIX_PROJECT_NAME

  • PHOENIX_CLIENT_HEADERS

  • PHOENIX_API_KEY

  • PHOENIX_GRPC_PORT

Quickstart: phoenix.otel.register

The phoenix.otel module provides a high-level register function to configure OpenTelemetry tracing by setting a global TracerProvider. The register function can also configure headers and whether or not to process spans one by one or by batch.

Phoenix Authentication

If the PHOENIX_API_KEY environment variable is set, register will automatically add an authorization header to each span payload.

Configuring the collector endpoint

There are two ways to configure the collector endpoint:

  • Using environment variables

  • Using the endpoint keyword argument

Using environment variables

If you're setting the PHOENIX_COLLECTOR_ENDPOINT environment variable, register will automatically try to send spans to your Phoenix server using gRPC.

Specifying the endpoint directly

When passing in the endpoint argument, you must specify the fully qualified endpoint. If the PHOENIX_GRPC_PORT environment variable is set, it will override the default gRPC port.

The HTTP transport protocol is inferred from the endpoint

The GRPC transport protocol is inferred from the endpoint

Additionally, the protocol argument can be used to enforce the OTLP transport protocol regardless of the endpoint. This might be useful in cases such as when the GRPC endpoint is bound to a different port than the default (4317). The valid protocols are: "http/protobuf", and "grpc".
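A sketch of the three cases (ports and paths shown are the defaults for a local Phoenix server; adjust them for your deployment):

from phoenix.otel import register

# HTTP transport, inferred from the OTLP/HTTP endpoint
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")

# gRPC transport, inferred from the gRPC endpoint
tracer_provider = register(endpoint="http://localhost:4317")

# Force a protocol explicitly, e.g. when gRPC is bound to a non-default port
tracer_provider = register(endpoint="http://localhost:9999", protocol="grpc")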

Additional configuration

register can be configured with different keyword arguments:

  • project_name: The Phoenix project name

    • or use PHOENIX_PROJECT_NAME env. var

  • headers: Headers to send along with each span payload

    • or use PHOENIX_CLIENT_HEADERS env. var

  • batch: Whether or not to process spans in batch
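Putting these options together, a call might look like the following sketch (the project name and API key are placeholders):

from phoenix.otel import register

tracer_provider = register(
    project_name="my-llm-app",               # or set PHOENIX_PROJECT_NAME
    headers={"api_key": "<your-api-key>"},   # or set PHOENIX_CLIENT_HEADERS
    batch=True,                              # process spans with a BatchSpanProcessor
)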

Instrumentation

Once you've connected your application to your Phoenix instance using phoenix.otel.register, you need to instrument your application. You have a few options to do this:

Instrument Prompt Templates and Prompt Variables

Mask Span Attributes

In some situations, you may need to modify the observability level of your tracing. For instance, you may want to keep sensitive information from being logged for security reasons, or you may want to limit the size of the base64 encoded images logged to reduce payload size.

The OpenInference Specification defines a set of environment variables you can configure to suit your observability needs. In addition, the OpenInference auto-instrumentors accept a trace config which allows you to set these values in code without having to set environment variables, if that's what you prefer.

The possible settings are:

To set up this configuration you can either:

  • Set environment variables as specified above

  • Define the configuration in code as shown below

  • Do nothing and fall back to the default values

  • Use a combination of the three; the order of precedence is:

    • Values set in the TraceConfig in code

    • Environment variables

    • default values

Below is an example of how to set these values in code using our OpenAI Python and JavaScript instrumentors, however, the config is respected by all of our auto-instrumentors.
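A Python sketch using the OpenAI instrumentor (the specific settings shown are illustrative; any TraceConfig field can be set the same way):

from openinference.instrumentation import TraceConfig
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

config = TraceConfig(
    hide_input_messages=True,        # don't record input message contents
    base64_image_max_length=10_000,  # truncate large base64-encoded images
)

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider, config=config)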

Quickstart: Prompts (UI)

Getting Started

Prompt playground can be accessed from the left navbar of Phoenix.

From here, you can directly prompt your model by modifying either the system or user prompt, and pressing the Run button on the top right.

Basic Example Use Case

Let's start by comparing a few different prompt variations. Add two additional prompts using the +Prompt button, and update the system and user prompts like so:

System prompt #1:

System prompt #2:

System prompt #3:

User prompt (use this for all three):

Your playground should look something like this:

Let's run it and compare results:

Creating a Prompt

Your prompt will be saved in the Prompts tab:

Now you're ready to see how that prompt performs over a larger dataset of examples.

Running over a dataset

Next, create a new dataset from the Datasets tab in Phoenix, and specify the input and output columns like so:

Now we can return to Prompt Playground, and this time choose our new dataset from the "Test over dataset" dropdown.

You can also load in your saved Prompt:

We'll also need to update our prompt to look for the {{input_article}} column in our dataset. After adding this in, be sure to save your prompt once more!

Now if we run our prompt(s), each row of the dataset will be run through each variation of our prompt.

And if you return to view your dataset, you'll see the details of that run saved as an experiment.

Updating a Prompt

You can now easily modify your prompt or compare different versions side-by-side. Let's say you've found a stronger version of the prompt. Save your updated prompt once again, and you'll see it added as a new version under your existing prompt:

You can also tag which version you've deemed ready for production, and view code to access your prompt in code further down the page.

Next Steps

Now you're ready to create, test, save, and iterate on your Prompts in Phoenix! Check out our other quickstarts to see how to use Prompts in code.

Using the Playground

General guidelines on how to use Phoenix's prompt playground

Setup

If successful, you should see the LLM output stream into the Output section of the UI.

Prompt Editor

Model Configuration

Every prompt instance can be configured to use a specific LLM and set of invocation parameters. Click on the model configuration button at the top of the prompt editor and configure your LLM of choice. Click on the "save as default" option to make your configuration sticky across playground sessions.

Comparing Prompts

The Prompt Playground offers the capability to compare multiple prompt variants directly within the playground. Simply click the + Compare button at the top of the first prompt to create duplicate instances. Each prompt variant manages its own independent template, model, and parameters. This allows you to quickly compare prompts (labeled A, B, C, and D in the UI) and run experiments to determine which prompt and model configuration is optimal for the given task.

Using Datasets with Prompts

Playground Traces

All invocations of an LLM via the playground are recorded for analysis, annotations, evaluations, and dataset curation.

If you simply run an LLM in the playground using the free-form inputs (i.e., not using a dataset), your spans will be recorded in a project aptly titled "playground".

If however you run a prompt over dataset examples, the outputs and spans from your playground runs will be captured as an experiment. Each experiment will be named according to the prompt you ran the experiment over.

View the inner workings for your LLM Application
Different types of annotations change the way human annotators provide feedback
Configure an annotation to guide how a user should input an annotation
Once annotations are configured, you can add them to your project to build out a custom annotation form
You can view the annotations by different users, llms, and annotators
Narrow down your data to areas that need more attention or refinement
How Datasets are used to test changes to your AI application

We are continually iterating on our templates; view the most up-to-date template .

(llm_classify)

(llm_generate)

We are continually iterating on our templates; view the most up-to-date template .

For the Retrieval-Augmented Generation (RAG) use case, see the section.

See for the Retrieval-Augmented Generation (RAG) use case where relevant documents are retrieved for the question before constructing the context for the LLM.

In , a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding; then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. Corpus inferences can be imported into Phoenix as shown below.

The launcher accepts the corpus dataset through the corpus= parameter.
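A sketch of how that might look with Phoenix's inferences API; the dataframe and column names below (corpus_df, query_inferences, id, text, text_vector) are illustrative assumptions to adapt to your own data:

    import phoenix as px

    # Illustrative corpus dataframe with document ids, raw text, and embeddings
    corpus_schema = px.Schema(
        id_column_name="id",
        document_column_names=px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        ),
    )
    corpus_inferences = px.Inferences(corpus_df, corpus_schema, name="corpus")

    # Pass the corpus to the launcher through the corpus= parameter
    session = px.launch_app(primary=query_inferences, corpus=corpus_inferences)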

Check out our to get started. Look at our to better understand how to troubleshoot and evaluate different kinds of retrieval systems. For a high level overview on evaluation, check out our .

In Retrieval-Augmented Generation (RAG), the retrieval step returns from a (proprietary) knowledge base (a.k.a. ) a list of documents relevant to the user query, then the generation step adds the retrieved documents to the prompt context to improve response accuracy of the Large Language Model (LLM). The IDs of the retrieval documents along with the relevance scores, if present, can be imported into Phoenix as follows.

Below shows only the relevant subsection of the dataframe. The retrieved_document_ids should match the ids in the data. Note that for each row, the list under the relevance_scores column has the same length as the one under the retrievals column, but it's not necessary for all retrieval lists to have the same length.
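A sketch of a schema wiring those columns up; the query column names (query, query_vector) are illustrative assumptions, while retrieved_document_ids and relevance_scores mirror the dataframe described above:

    import phoenix as px

    query_schema = px.Schema(
        prompt_column_names=px.RetrievalEmbeddingColumnNames(
            vector_column_name="query_vector",
            raw_data_column_name="query",
            context_retrieval_ids_column_name="retrieved_document_ids",
            context_retrieval_scores_column_name="relevance_scores",
        ),
    )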

Phoenix is an open-source observability tool designed for experimentation, evaluation, and troubleshooting of AI and LLM applications. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve. Phoenix is built by , the company behind the industry-leading AI observability platform, and a set of core contributors.

Phoenix works with OpenTelemetry and instrumentation. See for details.

Phoenix offers tools to workflow.

  • - Create, store, modify, and deploy prompts for interacting with LLMs

  • - Play with prompts, models, invocation parameters and track your progress via tracing and experiments

  • - Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.

  • - Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.

Use to mark functions and code blocks.

Use to capture all calls made to supported frameworks.

Use instrumentation. Supported in and , among many other languages.

  1. Sign up for an Arize Phoenix account at

  2. Grab your API key from the Keys option on the left bar.

  3. In your code, set your endpoint and API key:

Having trouble finding your endpoint? Check out

  1. Run Phoenix using Docker, local terminal, Kubernetes etc. For more information, .

  2. In your code, set your endpoint:

Having trouble finding your endpoint? Check out

Phoenix can also capture all calls made to supported libraries automatically. Just install the :

Explore tracing

View use cases to see

Using OpenInference auto-instrumentors. If you've used the auto_instrument flag above, then any instrumentor packages in your environment will be called automatically. For a full list of OpenInference packages, see

Using .

Using .

Instrumenting prompt templates and variables allows you to track and visualize prompt changes. These can also be combined with to measure the performance changes driven by each of your prompts.

We provide a using_prompt_template context manager to add a prompt template (including its version and variables) to the current OpenTelemetry Context. OpenInference will read this Context and pass the prompt template fields as span attributes, following the OpenInference . Its inputs must be of the following types (a usage sketch follows this list):

  • Template: non-empty string.

  • Version: non-empty string.

  • Variables: a dictionary with string keys. This dictionary will be serialized to JSON when saved to the OTEL Context and remain a JSON string when sent as a span attribute.
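For example, a minimal usage sketch (the template and variable values are illustrative):

    from openinference.instrumentation import using_prompt_template

    prompt_template = "Please describe the weather forecast for {city} on {date}"
    prompt_template_variables = {"city": "Johannesburg", "date": "July 11"}

    with using_prompt_template(
        template=prompt_template,
        variables=prompt_template_variables,
        version="v1.0",
    ):
        # LLM calls made inside this block will carry the prompt template attributes
        ...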

It can also be used as a decorator:
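A sketch of the decorator form, taking the same arguments as the context manager:

    from openinference.instrumentation import using_prompt_template

    @using_prompt_template(
        template="Please describe the weather forecast for {city} on {date}",
        variables={"city": "Johannesburg", "date": "July 11"},
        version="v1.0",
    )
    def call_llm():
        # Spans created inside this function include the prompt template attributes
        ...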

  • template - a string with templated variables ex. "hello {{name}}"

  • variables - an object with variable names and their values ex. {name: "world"}

  • version - a string version of the template ex. v1.0

All of these are optional. Application of variables to a template will typically happen before the call to an LLM and may not be picked up by auto-instrumentation, so adding this can be helpful to ensure you can see the templates and variables while troubleshooting.

Environment Variable Name | Effect | Type | Default

It looks like the second option produces the most concise summary. Go ahead and .

Prompt playground can be used to run a series of dataset rows through your prompts. To start off, we'll need a dataset. Phoenix has many options to ; to keep things simple here, we'll directly upload a CSV. Download the article summaries file linked below:

From here, you could to test its performance, or add complexity to your prompts by including different tools, output schemas, and models to test against.

To get started, you will first . In the playground view, create a valid prompt for the LLM and click Run in the top right (or use the mod + enter keyboard shortcut).

The prompt editor (typically on the left side of the screen) is where you define the . You select the template language (mustache or f-string) on the toolbar. Whenever you type a variable placeholder in the prompt (say {{question}} for mustache), the variable to fill will show up in the inputs section. Input variables can either be filled in by hand or via a dataset (where each row has key / value pairs for the input).

Phoenix lets you run a prompt (or multiple prompts) on a dataset. Simply containing the input variables you want to use in your prompt template. When you click Run, Phoenix will apply each configured prompt to every example in the dataset, invoking the LLM for all possible prompt-example combinations. The results of your playground runs will be tracked as an experiment under the loaded dataset (see ).

is a helpful tool for understanding how your LLM application works. Phoenix's open-source library offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework.

Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks, SDKs, and languages.

Phoenix is built to help you and understand their true performance. To accomplish this, Phoenix includes:

A standalone library to on your own datasets. This can be used either with the Phoenix library, or independently over your own data.

into the Phoenix dashboard. Phoenix is built to be agnostic, and so these evals can be generated using Phoenix's library, or an external library like , , or .

to attach human ground truth labels to your data in Phoenix.

let you test different versions of your application, store relevant traces for evaluation and analysis, and build robust evaluations into your development process.

to test and compare different iterations of your application

, or directly upload Datasets from code / CSV

, export them in fine-tuning format, or attach them to an Experiment.

We provide a setPromptTemplate function which allows you to set a template, version, and variables on context. You can use this utility in conjunction with to set the active context. OpenInference will then pick up these attributes and add them to any spans created within the context.with callback. The components of a prompt template are: