Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
Running Phoenix for the first time? Select a quickstart below.
Don't know which one to choose? Phoenix has two main data ingestion methods:
Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.
Learn about best practices, and how to get started with use case examples such as Q&A with Retrieval, Summarization, and Chatbots.
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
AutoGen is a new agent framework from Microsoft that allows for complex Agent creation. It is unique in its ability to create multiple agents that work together.
The AutoGen Agent framework allows creation of multiple agents and connection of those agents to work together to accomplish tasks.
The Phoenix support is simple in its first incarnation, but it allows for capturing all of the prompts and responses that occur under the framework between each agent.
The individual prompts and responses are captured directly through the OpenAI calls.
LLM observability is complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response.
Evaluation is a measure of how well the response answers the prompt.
There are several ways to evaluate LLMs:
You can collect the feedback directly from your users. This is the simplest approach, but users are often unwilling to provide feedback or simply forget to do so. Other challenges arise from implementing this at scale.
The other approach is to use an LLM to evaluate the quality of the response for a particular prompt. This is more scalable and very useful but comes with typical LLM setbacks.
For more complex or agentic workflows, it may not be obvious which call in a span or which span in your trace (a run through your entire use case) is causing the problem. You may need to repeat the evaluation process on several spans before you narrow down the problem.
This pillar is largely about diving deep into the system to isolate the issue you are investigating.
Prompt engineering is the cheapest, fastest, and often the highest-leverage way to improve the performance of your application. Often, LLM performance can be improved simply by comparing different prompt templates, or iterating on the one you have. Prompt analysis is an important component in troubleshooting your LLM's performance.
A common way to improve performance is to feed in more relevant information.
If you can retrieve more relevant information, your prompt improves automatically. Troubleshooting retrieval systems, however, is more complex. Are there queries that don’t have sufficient context? Should you add more context for these queries to get better answers? Or should you change your embeddings or chunking strategy?
Fine-tuning essentially generates a new model that is more aligned with your exact usage conditions. Fine-tuning is expensive, difficult, and may need to be done again as the underlying LLM or other conditions of your system change. It is a very powerful technique that requires much higher effort and complexity.
The toolset is designed to ingest inference data for LLM, CV, NLP, and tabular datasets as well as LLM traces. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues and insights, and easily export data to improve.
LLM Traces - Phoenix is used on top of trace data generated by LlamaIndex and LangChain. The general use case is to troubleshoot LLM applications with agentic workflows.
Inferences - Phoenix is used to troubleshoot models whose datasets can be expressed as DataFrames in Python, such as LLM applications built in Python workflows, CV, NLP, and tabular models.
Use the Phoenix Evals library to easily evaluate tasks such as hallucination, summarization, and retrieval relevance, or create your own custom template.
Get visibility into where your complex or agentic workflow broke, or find performance bottlenecks, across different span types with LLM Tracing.
Identify missing context in your knowledge base, and when irrelevant context is retrieved by visualizing query embeddings alongside knowledge base embeddings with RAG Analysis.
Compare and evaluate performance across model versions prior to deploying to production.
Connect teams and workflows, with continued analysis of production data from Arize in a notebook environment for fine tuning workflows.
Find clusters of problems using performance metrics or drift. Export clusters for retraining workflows.
Use the Embeddings Analyzer to surface data drift for computer vision, NLP, and tabular models.
- This helps you evaluate how well the response answers the prompt by using a separate evaluation LLM.
- This gives you visibility into where more complex or agentic workflows broke.
- Iterating on a prompt template can help improve LLM results.
- Improving the context that goes into the prompt can lead to better LLM responses.
- Fine-tuning generates a new model that is more aligned with your exact usage conditions for improved performance.
Learn more about the Phoenix LLM Evals library.
Learn more about LLM Traces support.
Learn about in Arize.
Learn more about with Phoenix.
Tracing the execution of LLM powered applications using OpenInference Traces
The rise of LangChain and LlamaIndex for LLM app development has enabled developers to move quickly in building applications powered by LLMs. The abstractions created by these frameworks can accelerate development, but also make it hard to debug the LLM app. Take the example below, where a RAG application can be written in a few lines of code but in reality has a very complex run tree.
LLM Traces and Observability lets us understand the system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?”
Phoenix's tracing module is the mechanism by which application code is instrumented, to help make a system observable.
Let's dive into the fundamental building block of traces: the span.
A span represents a unit of work or operation (think a span of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains a name, time-related data, structured log messages, and other metadata (that is, attributes) that provide information about the operation it tracks. A span for an LLM execution in JSON format is displayed below.
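Here is an illustrative example; the field names follow OpenInference conventions, but the values (and the exact set of attributes) are hypothetical and will vary by application:

```json
{
  "name": "llm",
  "span_kind": "LLM",
  "parent_id": "f0a5597f-273a-43e0-b96d-3c20eb9d1958",
  "start_time": "2023-09-07T12:54:47.293000Z",
  "end_time": "2023-09-07T12:54:49.322000Z",
  "status_code": "OK",
  "attributes": {
    "llm.model_name": "gpt-4",
    "llm.token_count.prompt": 312,
    "llm.token_count.completion": 52
  },
  "events": []
}
```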
Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.
A trace records the paths taken by requests (made by an application or end-user) as they propagate through multiple steps.
Without tracing, it is challenging to pinpoint the cause of performance problems in a system.
It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for LLM applications, which commonly have nondeterministic problems or are too complicated to reproduce locally.
Tracing makes debugging and understanding LLM applications less daunting by breaking down what happens within a request as it flows through a system.
A trace is made of one or more spans. The first span is the root span; each root span represents a request from start to finish. The spans underneath the root provide more in-depth context about what occurs during a request (or what steps make up a request).
When a span is created, it is created as one of the following: Chain, Retriever, Reranker, LLM, Embedding, Agent, or Tool.
CHAIN
A Chain is a starting point or a link between different LLM application steps. For example, a Chain span could be used to represent the beginning of a request to an LLM application or the glue code that passes context from a retriever to an LLM call.
RETRIEVER
A Retriever is a span that represents a data retrieval step. For example, a Retriever span could be used to represent a call to a vector store or a database.
RERANKER
A Reranker is a span that represents the reranking of a set of input documents. For example, a cross-encoder may be used to compute the input documents' relevance scores with respect to a user query, and the top K documents with the highest scores are then returned by the Reranker.
LLM
An LLM is a span that represents a call to an LLM. For example, an LLM span could be used to represent a call to OpenAI or Llama.
EMBEDDING
An Embedding is a span that represents a call to an LLM for an embedding. For example, an Embedding span could be used to represent a call to OpenAI to get an ada-002 embedding for retrieval.
TOOL
A Tool is a span that represents a call to an external tool such as a calculator or a weather API.
AGENT
An Agent is a span that encompasses calls to LLMs and Tools. An agent describes a reasoning block that acts on tools using the guidance of an LLM.
Attributes are key-value pairs that contain metadata that you can use to annotate a span to carry information about the operation it is tracking.
For example, if a span invokes an LLM, you can capture the model name, the invocation parameters, the token count, and so on.
Attributes have the following rules:
Keys must be non-null string values
The picture below shows a time series graph of the drift between two groups of vectors: the primary (typically production) vectors and the reference/baseline vectors. Phoenix uses Euclidean distance as the primary measure of embedding drift and helps you identify times when your dataset is diverging from a given reference baseline.
Moments of high Euclidean distance are an indication that the primary dataset is starting to drift from the reference dataset. As the primary dataset moves further away from the reference (both in angle and in magnitude), the Euclidean distance increases as well. For this reason, times of high Euclidean distance are a good starting point for trying to identify new anomalies and areas of drift.
In Phoenix, you can view the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at that particular time.
When two datasets are used to initialize Phoenix, the clusters are automatically ordered by drift. This means that clusters suffering from the highest amount of under-sampling (more points in the primary dataset than the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.
How to fly with Phoenix
In your Jupyter or Colab environment, run the following command to install.
Once installed, import Phoenix in your notebook with
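For example (a minimal sketch; in a notebook cell you can prefix the install command with !):

```python
# Install Phoenix (run in a terminal, or prefix with ! in a notebook cell):
#   pip install arize-phoenix

# Once installed, import Phoenix in your notebook:
import phoenix as px
```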
In order to be able to ask those questions of a system, the application must be properly instrumented. That is, the application code must emit signals such as traces and logs. An application is properly instrumented when developers don't need to add more instrumentation to troubleshoot an issue, because they already have all of the information they need.
LLM Traces and the accompanying OpenInference specification are designed to be a category of telemetry data used to understand the execution of LLMs and the surrounding application context, such as retrieval from vector stores and the usage of external tools such as search engines or APIs. It lets you understand the inner workings of the individual steps your application takes while also giving you visibility into how your system is running and performing as a whole.
Values must be a non-null string, boolean, floating point value, integer, or an array of these values. Additionally, there are Semantic Attributes, which are known naming conventions for metadata that is typically present in common operations. It's helpful to use semantic attribute naming wherever possible so that common kinds of metadata are standardized across systems. See the OpenInference specification for more information.
Want to learn more about OpenInference Tracing? It is an open-source specification that is continuously evolving. Check out the specification for details.
For each embedding described in the dataset(s), Phoenix serves an embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.
Note that when you are troubleshooting search and retrieval using a corpus dataset, the Euclidean distance of your queries to your knowledge base vectors is presented as query distance.
For an in-depth guide to Euclidean distance and embedding drift, check out the embedding drift concepts documentation.
Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called HDBSCAN. This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.
Phoenix projects the embeddings you provided into lower-dimensional space (3 dimensions) using a dimension reduction algorithm called UMAP (Uniform Manifold Approximation and Projection). This lets you view your embeddings in a visually understandable way. In addition to the point cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box, Phoenix lets you assign colors to the UMAP point cloud by dimension (features, tags, predictions, actuals), performance (correctness, which distinguishes true positives and true negatives from the incorrect predictions), and dataset (to highlight areas of drift). This helps you explore your point cloud from different perspectives depending on what you are looking for.
Note that the above only installs dependencies that are necessary to run the application. Phoenix also has an experimental sub-module where you can find LLM Evals and other bleeding-edge functionality.
For the Retrieval-Augmented Generation (RAG) use case, see the Retrieval section.
See the Retrieval documentation for the Retrieval-Augmented Generation (RAG) use case, where relevant documents are retrieved for the question before constructing the context for the LLM.
who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | Neil Alden Armstrong
who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | Francis Michael Forde
Most evaluation libraries do not follow the trustworthy benchmarking rigor necessary for production environments. Production LLM Evals need to benchmark both a model and a prompt template (the OpenAI model Evals, for instance, focus only on evaluating the model, which is a different use case).
Evaluation libraries are typically difficult to integrate across benchmarking, development, production, or the LangChain/LlamaIndex callback systems. Evals should process batches of data with optimal speed.
They also often force the use of chain abstractions (LangChain, for example, shouldn't be a prerequisite for obtaining evaluations for pipelines that don't use it).
Phoenix Evals are designed to run as fast as possible on batches of Eval data and to maximize the throughput and usage of your API key. The current Phoenix library is 10x faster in throughput than call-by-call approaches integrated into LLM app framework Evals.
Phoenix Evals are designed to run on dataframes, in Python pipelines, or in LangChain and LlamaIndex callbacks. Evals are also supported in Python pipelines for normal LLM deployments not using LlamaIndex or LangChain, and there is one-click support for LangChain and LlamaIndex.
Evals are supported on a span level for LangChain and LlamaIndex.
How to import data for the Retrieval-Augmented Generation (RAG) use case
who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | [7395, 567965, 323794, ... | [11.30, 7.67, 5.85, ...
who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | [38906, 38909, 38912, ... | [11.28, 9.10, 8.39, ...
why is amino group in aniline an ortho para di... | [-0.0431, -0.0407, -0.0597, ... | [779579, 563725, 309367, ... | [-10.89, -10.90, -10.94, ...
Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.
Define the dataset by pairing the dataframe with the schema.
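A sketch of what this can look like for the dataframe above; the RetrievalEmbeddingColumnNames helper and its argument names are assumptions based on Phoenix's retrieval schema, and the column names are illustrative:

```python
import phoenix as px

# Group the query text, its embedding, and the retrieved document IDs and scores
# under prompt_column_names.
query_schema = px.Schema(
    prompt_column_names=px.RetrievalEmbeddingColumnNames(
        raw_data_column_name="query",
        vector_column_name="embedding",
        context_retrieval_ids_column_name="retrieved_document_ids",
        context_retrieval_scores_column_name="relevance_scores",
    )
)

# Pair the dataframe with the schema to define the dataset.
query_ds = px.Dataset(dataframe=query_df, schema=query_schema, name="query")
```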
How to export your data for labeling, evaluation, or fine-tuning
Phoenix is designed to be a pre-production tool that can be used to find interesting or problematic data that can be used for various use-cases:
A subset of production data for re-labeling and training
A subset of data for fine-tuning an LLM
The easiest way to gather traces that have been collected by Phoenix is to directly pull a dataframe of the traces from your Phoenix session object.
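For example (a sketch using the active Phoenix session):

```python
import phoenix as px

# Pull all spans collected by the running session into a pandas dataframe.
trace_df = px.active_session().get_spans_dataframe()
```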
You can also directly get the spans from the tracer or callback:
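For instance, with a LangChain tracer (a sketch; tracer is assumed to be the OpenInferenceTracer attached to your application):

```python
# Retrieve the accumulated spans directly from the tracer.
spans = tracer.get_spans()
```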
Note that the above calls get_spans on a LangChain tracer, but the same exact method exists on the OpenInferenceCallback for LlamaIndex as well.
Embeddings can be extremely useful for fine-tuning. There are two ways to export your embeddings from the Phoenix UI.
To export a cluster (either selected via the lasso tool or via the cluster list on the right-hand panel), click on the export button on the top left of the bottom slide-out.
This LLM Eval detects if the output of a model is a hallucination based on contextual data.
This Eval is specifically designed for hallucinations relative to private or retrieved data: is the answer to a question a hallucination based on a set of contextual data?
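A sketch of running this Eval with the Phoenix evals module; the template and rails constant names may differ slightly across Phoenix versions, and df is assumed to contain the question, context, and answer columns the template expects:

```python
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# Classify each row of the dataframe as factual or hallucinated.
hallucination_eval = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```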
The above Eval shows how to use the hallucination template for Eval detection.
Benchmarking Chunk Size, K and Retrieval Approach
The advent of LLMs is causing a rethinking of the possible architectures of retrieval systems that have been around for decades.
The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data, empowering the LLM to know your data and respond based on the private data you fit into the context window.
As teams set up their retrieval systems, understanding performance and configuring the parameters around RAG (type of retrieval, chunk size, and K) is currently a guessing game for most teams.
The above picture shows a typical retrieval architecture designed for RAG, where there is a vector DB, an LLM, and an optional framework.
This section will go through a script that iterates through all possible parameterizations of setting up a retrieval system and uses Evals to understand the trade-offs.
This overview will run through the scripts in Phoenix for performance analysis of a RAG setup:
The scripts above power the included notebook.
In a typical retrieval flow, a user query is embedded and used to search a vector store for chunks of relevant data.
The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar but not usable to answer the question!
The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question of "Does the chunk of data contain relevant information to answer the question"?
The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.
The Evals generated on each chunk can then be used to generate more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at traditional search and retrieval metrics such as:
MRR
Precision @ K
NDCG
These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.
These metrics can be used overall, by cluster (UMAP), or on individual decisions, making them very powerful to track down problems from the simplest to the most complex.
Retrieval Evals just give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not give an indication of whether the final answer was correct.
The Q&A Evals work to give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares the most about and is one of the most important metrics.
The above Eval shows how the query, chunks, and answer are used to create an overall assessment of the entire system.
The above Q&A Eval shows how the query, chunk, and answer are used to generate a percent-incorrect metric for production evaluations.
The results from the runs will be available in the directory:
experiment_data/
Underneath experiment_data there are two sets of metrics:
The first set of results removes the cases where there are zero retrieved relevant documents. Some clients' test sets have a large number of questions that the documents cannot answer, which can skew the metrics a lot.
experiment_data/results_zero_removed
The second set of results is unfiltered and shows the raw metrics for every retrieval.
experiment_data/results_zero_not_removed
The above picture shows the results of benchmark sweeps across your retrieval system setup. The lower the percentage, the better the results. This is the Q&A Eval.
The above graphs show MRR results across a sweep of different chunk sizes.
How to create Phoenix datasets and schemas for the corpus data
Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
Below is an appropriate schema for the dataframe above. It specifies the id column and that the embedding belongs to the text. Other columns, if they exist, will be detected automatically and need not be specified by the schema.
Define the dataset by pairing the dataframe with the schema.
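A sketch of such a schema and dataset definition, assuming the id, text, and embedding column names shown above (the document_column_names grouping follows Phoenix's corpus schema):

```python
import phoenix as px

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

# Pair the corpus dataframe with the schema to define the corpus dataset.
corpus_ds = px.Dataset(dataframe=corpus_df, schema=corpus_schema, name="corpus")
```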
If you want to contribute to the cutting edge of LLM and ML Observability, you've come to the right place!
To get started, please check out the following:
In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.
A Core reviewer will review your PR in around one business day and provide feedback on any changes it requires to be approved. Once approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.
Your PR is now merged into Phoenix! We’ll shout out your contribution in the release notes.
Easily share data when you discover interesting insights so your data science team can perform further investigation or kickoff retraining workflows.
Oftentimes, the team that notices an issue in their model, for example a prompt/response LLM model, may not be the same team that continues the investigations or kicks off retraining workflows.
With a few lines of Python code, users can export this data into Phoenix for further analysis. This allows team members, such as data scientists, who may not have access to production data today, an easy way to access relevant production data for further analysis in an environment they are familiar with.
They can then easily augment and fine tune the data and verify improved performance, before deploying back to production.
Evaluating LLM outputs is best tackled by using a separate evaluation LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
Phoenix provides pretested Eval templates and convenience functions for a set of common Eval tasks. Learn more about pretested templates below. The library is split into high-level functions that make it easy to run Evals rigorously and building blocks that let you modify and customize templates.
The Phoenix team is dedicated to testing model and template combinations and is continually improving templates for optimized performance. Find the most up-to-date templates on GitHub.
In Retrieval-Augmented Generation (RAG), the retrieval step returns a list of documents relevant to the user query from a (proprietary) knowledge base (a.k.a. the corpus), then the generation step adds the retrieved documents to the prompt context to improve the response accuracy of the Large Language Model (LLM). The IDs of the retrieved documents, along with the relevance scores, if present, can be imported into Phoenix as follows.
Below shows only the relevant subsection of the dataframe. The retrieved_document_ids should match the ids in the corpus data. Note that for each row, the list under the relevance_scores column has the same length as the list under the retrievals column, but it's not necessary for all retrieval lists to have the same length.
A set of traces to run evals with or to share with a teammate
Notice that the get_spans_dataframe method supports a Python expression as an optional str parameter so you can filter down your data to specific traces you care about. For full details, consult the API reference.
To export all clusters of embeddings as a single dataframe (labeled by cluster), click the ... icon on the top right of the screen and click export. Your data will be available either as a Parquet file or back in your notebook via your session as a dataframe.
In Information Retrieval, a document is any piece of information the user may want to retrieve, e.g., a paragraph, an article, or a web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding, then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. A corpus dataset can be imported into Phoenix as shown below.
The launcher accepts the corpus dataset through the corpus= parameter.
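For example (a sketch, assuming query and corpus datasets defined as shown in the corpus documentation):

```python
import phoenix as px

# Launch Phoenix with the query dataset as primary and the knowledge base as the corpus.
session = px.launch_app(primary=query_ds, corpus=corpus_ds)
```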
We encourage you to start with an issue labeled with the good first issue tag on the GitHub issue board to get familiar with our codebase as a first-time contributor.
To submit your code, fork the repository, create a branch on your fork, and open a pull request once your work is ready for review.
To help connect teams and workflows, Phoenix enables continued analysis of production data from Arize in a notebook environment for fine-tuning workflows.
For example, a user may have noticed in Arize that this prompt template is not performing well.
There are two ways to export data out of Arize for further investigation:
The easiest way is to click the export button on the Embeddings and Datasets pages. This will produce a code snippet that you can copy into a Python environment where Phoenix is installed. The code snippet will include the date range you have selected in the platform, in addition to the datasets you have selected.
Users can also query for data directly using the Arize Python export client. We recommend doing this once you're more comfortable with the in-platform export functionality, as you will need to manually enter the date ranges and datasets you want to export.
Precision: 0.93, 0.89, 0.89, 1, 0.80
Recall: 0.72, 0.65, 0.80, 0.44, 0.95
F1: 0.82, 0.75, 0.84, 0.61, 0.87
id | text | embedding
1 | Voyager 2 is a spacecraft used by NASA to expl... | [-0.02785328, -0.04709944, 0.042922903, 0.0559...
2 | The Staturn Nebula is a planetary nebula in th... | [0.03544901, 0.039175965, 0.014074919, -0.0307...
3 | Eris is a dwarf planet and a trans-Neptunian o... | [0.05506449, 0.0031612846, -0.020452883, -0.02...
This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals, which are checks on chunks of data returned, this check is a system-level check of a correct Q&A.
question: This is the question the Q&A system is running against
sampled_answer: This is the answer from the Q&A system.
context: This is the context to be used to answer the question, and is what the Q&A Eval must use to check that the answer is correct
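A sketch of running this Eval; the constant names may differ slightly by Phoenix version, and df is assumed to contain the question, context, and sampled answer columns described above:

```python
from phoenix.experimental.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(QA_PROMPT_RAILS_MAP.values())

# Label each row as a correct or incorrect answer given the retrieved context.
qa_eval = llm_classify(
    dataframe=df,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```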
The above Eval uses the QA template for Q&A analysis on retrieved data.
Precision: 1, 0.99, 0.42, 1, 1.0
Recall: 0.92, 0.83, 1, 0.94, 0.64
F1: 0.96, 0.90, 0.59, 0.97, 0.78
The LLM Evals library is designed to support the building of any custom Eval templates.
Follow these steps to easily build your own Eval with Phoenix.
To do that, you must identify the metric best suited for your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?
Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the "ground truth" label so that we can measure performance of the LLM eval template. Often such labels come from human feedback.
Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above).
The Evals dataset is designed for easy benchmarking and comes with pre-set downloadable test datasets. The datasets are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.
Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
Now comes the core component that we are trying to benchmark and improve: the eval template.
You can adjust an existing template or build your own from scratch.
Be explicit about the following:
What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
To create a new template, all that is needed is to set the input string for the Eval function.
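A sketch of what this can look like; the template text and the dataframe column names are illustrative:

```python
from phoenix.experimental.evals import OpenAIModel, llm_classify

# A custom string template; the {query} and {document} variables are filled
# from the matching dataframe columns for each row.
MY_CUSTOM_TEMPLATE = """
You are evaluating whether a document is relevant to a question.
[BEGIN DATA]
[Question]: {query}
[Document]: {document}
[END DATA]
Respond with a single word, "relevant" or "irrelevant".
"""

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Run the custom template against the df dataframe.
custom_eval = llm_classify(
    dataframe=df,
    template=MY_CUSTOM_TEMPLATE,
    model=model,
    rails=["relevant", "irrelevant"],
)
```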
The above template shows an example of creating an easy-to-use string template. The Phoenix Eval templates support both strings and template objects.
The above example shows the use of the custom template on the df dataframe.
You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
This Eval checks the correctness and readability of the code from a code generation process. The template variables are:
query: The query is the coding question being asked
code: The code is the code that was returned.
The above shows how to use the code readability template.
Precision: 0.93, 0.76, 0.67, 0.77
Recall: 0.78, 0.93, 1, 0.94
F1: 0.85, 0.85, 0.81, 0.85
Instrument calls to the OpenAI Python Library
Phoenix currently supports calls to the ChatCompletion interface, but more are planned soon.
To view OpenInference traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
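For example (a minimal sketch):

```python
import phoenix as px

# Start a local Phoenix server that collects OpenInference traces.
session = px.launch_app()
```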
Once you have started a Phoenix server, you can instrument the openai Python library using the OpenAIInstrumentor class.
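A sketch of the instrumentation call; the module path is assumed from Phoenix's OpenAI tracing integration:

```python
from phoenix.trace.openai import OpenAIInstrumentor

# Patch the openai library so each ChatCompletion call emits spans to Phoenix.
OpenAIInstrumentor().instrument()
```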
All subsequent calls to the ChatCompletion interface will now report informational spans to Phoenix. These traces and spans are viewable within the Phoenix UI.
If you would like to save your traces to a file for later use, you can directly extract the traces from the tracer and dump them into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents of the file in via a TraceDataset.
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
Quickly explore Phoenix with concrete examples
Phoenix ships with a collection of examples so you can quickly try out the app on concrete use-cases. This guide shows you how to download, inspect, and launch the app with example datasets.
To see a list of datasets available for download, run
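One way to do this (a sketch):

```python
import phoenix as px

# Display the docstring, which lists the example datasets available for download.
help(px.load_example)
```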
This displays the docstring for the phoenix.load_example function, which contains a list of datasets available for download.
Choose the name of a dataset to download and pass it as an argument to phoenix.load_example. For example, run the following to download production and training data for our demo sentiment classification model:
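A sketch of the call; the dataset name below is assumed to be one of the available examples (check the docstring above for the exact names):

```python
datasets = px.load_example("sentiment_classification_language_drift")
```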
px.load_example returns your downloaded data in the form of an ExampleDatasets instance. After running the code above, you should see the following in your cell output.
Next, inspect the name, dataframe, and schema that define your primary dataset: first view the dataset's name, then your dataset's schema, and finally an overview of your dataset's underlying dataframe, as sketched below.
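A sketch of this inspection, assuming the returned ExampleDatasets instance exposes a primary attribute and that datasets carry name, schema, and dataframe properties:

```python
primary = datasets.primary

print(primary.name)       # the name of the dataset that appears in the UI
print(primary.schema)     # the schema describing the dataframe's columns
primary.dataframe.info()  # an overview of the underlying dataframe
```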
Launch Phoenix with
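for example (a sketch, reusing the datasets loaded above; the primary and reference attributes are assumptions):

```python
session = px.launch_app(datasets.primary, datasets.reference)
```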
Follow the instructions in the cell output to open the Phoenix UI in your notebook or in a separate browser tab.
How to define your dataset(s), launch a session, open the UI in your notebook or browser, and close your session when you're done
If you additionally have a dataframe ref_df and a matching ref_schema, you can define a dataset named "reference" with
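a call along these lines (a sketch using the px.Dataset constructor):

```python
ref_ds = px.Dataset(dataframe=ref_df, schema=ref_schema, name="reference")
```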
Use phoenix.launch_app to start your Phoenix session in the background. You can launch Phoenix with zero, one, or two datasets.
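For example, with both a primary and a reference dataset (a sketch; prim_ds is assumed to be defined as shown earlier):

```python
session = px.launch_app(primary=prim_ds, reference=ref_ds)
```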
You can view and interact with the Phoenix UI either directly in your notebook or in a separate browser tab or window.
In a notebook cell, run
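for instance (a sketch, where session is the object returned by px.launch_app):

```python
print(session.url)
```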
Copy and paste the output URL into a new browser tab or window.
In a notebook cell, run
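for instance (a sketch):

```python
session.view()
```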
The Phoenix UI will appear in an inline frame in the cell output.
When you're done using Phoenix, gracefully shut down your running background session with
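a call like the following:

```python
px.close_app()
```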
Yes, you can use either of the two methods below.
Install pyngrok on the remote machine using the command pip install pyngrok.
In the Jupyter notebook, after launching Phoenix, set its port number as the port parameter in the code below. Preferably use a default port for Phoenix so that you won't have to set up an ngrok tunnel every time for a new port; simply restarting Phoenix will work with the same ngrok URL.
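A sketch of the tunneling code, assuming pyngrok and a valid ngrok authtoken (the port value is illustrative and should match the port Phoenix was launched on):

```python
import getpass
from pyngrok import conf, ngrok

# Paste your ngrok authtoken when prompted.
conf.get_default().auth_token = getpass.getpass("Enter your ngrok authtoken: ")

port = 6006  # the port Phoenix was launched on
public_url = ngrok.connect(port).public_url
print(public_url)
```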
"Visit Site" using the newly printed public_url
and ignore warnings, if any.
An ngrok free account does not allow more than 3 tunnels over a single ngrok agent session. Tackle this error by checking the active URL tunnels using ngrok.get_tunnels() and closing the required URL tunnel using ngrok.disconnect(public_url).
This assumes you have already set up ssh on both the local machine and the remote server.
If you are accessing a remote Jupyter notebook from a local machine, you can also access the Phoenix app by forwarding a local port to the remote server via SSH. In this particular case of using Phoenix on a remote server, it is recommended that you use a default port for launching Phoenix, say DEFAULT_PHOENIX_PORT.
Launch the Phoenix app from the Jupyter notebook.
In a new terminal or command prompt, forward a local port of your choice from 49152 to 65535 (say 52362) using the command below. The remote user of the remote host must have sufficient port-forwarding/admin privileges.
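A sketch of the command; the local port, remote user, host, and Phoenix port are placeholders to fill in:

```
ssh -L 52362:localhost:<DEFAULT_PHOENIX_PORT> <remote-user>@<remote-host>
```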
If you are abruptly unable to access phoenix, check whether the ssh connection is still alive by inspecting the terminal. You can also try increasing the ssh timeout settings.
Simply run exit in the terminal/command prompt where you ran the port forwarding command.
This Eval helps evaluate the summarization results of a summarization task. The template variables are:
document: The document text to summarize
summary: The summary of the document
The above shows how to use the summarization Eval template.
Extract OpenInference inferences and traces to visualize and troubleshoot your LLM Application in Phoenix
Traces provide telemetry data about the execution of your LLM application. They are a great way to understand the internals of your LangChain application and to troubleshoot problems related to things like retrieval and tool execution.
To extract traces from your LangChain application, you will have to add Phoenix's OpenInference Tracer to your LangChain application. A tracer is a class that automatically accumulates traces (sometimes referred to as spans) as your application executes. The OpenInference Tracer is a tracer that is specifically designed to work with Phoenix and by default exports the traces to a locally running Phoenix server.
To view traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
Once you have started a Phoenix server, you can start your LangChain application with the OpenInference Tracer as a callback. There are two ways of adding the `tracer` to your LangChain application: by instrumenting all your chains in one go (recommended) or by adding the tracer as a callback to just the parts that you care about (not recommended).
By adding the tracer to the callbacks of LangChain, we've created a one-way data connection between your LLM application and Phoenix. This is because by default the OpenInferenceTracer uses an HTTPExporter to send traces to your locally running Phoenix server! In this scenario the Phoenix server is serving as a Collector of the spans that are exported from your LangChain application.
To view the traces in Phoenix, simply open the UI in your browser.
If you would like to save your traces to a file for later use, you can directly extract the traces from the tracer and dump them into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents of the file in via a TraceDataset.
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
For a fully working example of tracing with LangChain, check out our Colab notebook.
Phoenix supports visualizing LLM application inference data from a LangChain application. In particular you can use Phoenix's embeddings projection and clustering to troubleshoot retrieval-augmented generation. For a tutorial on how to extract embeddings and inferences from LangChain, check out the following notebook.
Meaning, Examples and How To Compute
Embeddings are vector representations of information (e.g., a list of floating point numbers). With embeddings, the distance between two vectors carries semantic meaning: small distances suggest high relatedness and large distances suggest low relatedness. Embeddings are everywhere in modern deep learning: transformers, recommendation engines, layers of deep neural networks, encoders, and decoders.
Embeddings are foundational to machine learning because:
Embeddings can represent various forms of data such as images, audio signals, and even large chunks of structured data.
They provide a common mathematical representation of your data
They compress data
They preserve relationships within your data
They are the output of deep learning layers providing comprehensible linear views into complex non-linear relationships learned by models
Embedding vectors are generally extracted from the activation values of one or many hidden layers of your model. In general, there are many ways of obtaining embedding vectors, including:
Word embeddings
Autoencoder Embeddings
Generative Adversarial Networks (GANs)
Pre-trained Embeddings
Once you have chosen a model to generate embeddings, the question is: how? Here are a few use-case-based examples. In each example you will notice that the embeddings are generated such that the resulting vector represents your input according to your use case.
If you are working on image classification, the model will take an image and classify it into a given set of categories. Each of our embedding vectors should be representative of the corresponding entire image input.
First, we need to use a feature_extractor that will take an image and prepare it for the large pre-trained image model.
Then, we pass the results from the feature_extractor to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Finally, since we want the embedding vector to represent the entire image, we will average across the second dimension, representing the areas of the image.
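Putting these steps together, a sketch using a Hugging Face vision transformer (the checkpoint name is illustrative, and image is assumed to be a PIL image you supply):

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "google/vit-base-patch16-224-in21k"  # illustrative checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Prepare the image for the pre-trained image model.
inputs = feature_extractor(images=image, return_tensors="pt")

# No gradients needed since we are not training the model.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden layer and average across the second (patch) dimension
# so the vector represents the entire image.
embedding = outputs.hidden_states[-1].mean(dim=1).squeeze().numpy()
```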
If you are working on NLP sequence classification (for example, sentiment classification), the model will take a piece of text and classify it into a given set of categories. Hence, your embedding vector must represent the entire piece of text.
For this example, let us assume we are working with a model from the BERT family.
First, we must use a tokenizer that will tokenize the text and prepare it for the pre-trained large language model (LLM).
Then, we pass the results from the tokenizer to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Finally, since we want the embedding vector to represent the entire piece of text for classification, we will use the vector associated with the classification token, [CLS], as our embedding vector.
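Putting these steps together, a sketch assuming a BERT-family checkpoint from Hugging Face (the checkpoint name is illustrative, and text is the input string):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize the text and prepare it for the pre-trained LLM.
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# No gradients needed since we are not training the model.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the [CLS] token (position 0) of the last hidden layer as the embedding
# for the entire piece of text.
embedding = outputs.hidden_states[-1][:, 0, :].squeeze().numpy()
```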
If you are working on NLP Named Entity Recognition (NER), the model will take a piece of text and classify some words within it into a given set of entities. Hence, each of your embedding vectors must represent a classified word or token.
For this example, let us assume we are working with a model from the BERT family.
First, we must use a tokenizer that will tokenize the text and prepare it for the pre-trained large language model (LLM).
Then, we pass the results from the tokenizer to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Further, since we want the embedding vector to represent any given token, we will use the vector associated with a specific token in the piece of text as our embedding vector. So, let token_index be the integer value that locates the token of interest in the list of tokens that result from passing the piece of text to the tokenizer, and let ex_index be the integer value that locates a given example in the batch. Then,
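continuing the previous sketch (reusing tokenizer, model, and outputs), the token-level embedding can be selected as follows; the index values are illustrative:

```python
ex_index = 0     # locates the example within the batch
token_index = 5  # locates the token of interest in the tokenized text

# Select the embedding for that specific token from the last hidden layer.
token_embedding = outputs.hidden_states[-1][ex_index, token_index, :].numpy()
```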
Learn how Phoenix fits into your ML stack and how to incorporate Phoenix into your workflows.
Phoenix is designed to run locally on a single server in conjunction with the Notebook.
Phoenix runs locally, close to your data, in an environment that interfaces to Notebook cells on the Notebook server. Designing Phoenix to run locally enables fast iteration on top of local data.
In order to use Phoenix:
Load data into a pandas dataframe (a single dataframe, or two dataframes: primary and reference)
Start Phoenix
Investigate problems
(Optional) Export data
Phoenix is typically started in a notebook from which a local Phoenix server is kicked off. Two approaches can be taken to the overall use of Phoenix:
Single Dataset
In the case of a team that only wants to investigate a single dataset for exploratory data analysis (EDA), a single dataset instantiation of Phoenix can be used. In this scenario, a team is normally analyzing the data in an exploratory manner and is not doing A/B comparisons.
Two Datasets
A common use case in ML is for teams to have two datasets they are comparing, such as training vs. production, model A vs. model B, or production time X vs. production time Y, just to name a few. In this scenario there exists a primary and a reference dataset. When using primary and reference datasets, Phoenix supports drift analysis, embedding drift, and many different A/B dataset comparisons.
Once instantiated, teams can dive into Phoenix on a feature by feature basis, analyzing performance and tracking down issues.
Once an issue is found, the cluster can be exported back into a dataframe for further analysis. Clusters can be used to create groups of similar data points for use downstream. These include:
Finding Similar Examples
Monitoring
Steering Vectors / Steering Prompts
The above picture shows the use of Phoenix with a cloud observability system (this is not required). In this example the cloud observability system allows the easy download (or synchronization) of data to the notebook, typically based on model, batch, environment, and time ranges. Normally this download is done to analyze data at the tail end of a troubleshooting workflow, or periodically to use the notebook environment to monitor your models.
Once in a notebook environment the downloaded data can power Observability workflows that are highly interactive. Phoenix can be used to find clusters of data problems and export those clusters back to the Observability platform for use in monitoring and active learning workflows.
Note: Data can also be downloaded from any data warehouse system for use in Phoenix without the requirement of a cloud ML observability solution.
In the first version of Phoenix it is assumed the data is available locally, but we've also designed it with some broader visions in mind. For example, Phoenix was designed with a stateless metrics engine as a first-class citizen, enabling any metrics checks to be run in any Python data pipeline.
The OpenAI Python library implements Python bindings for OpenAI's popular suite of models. Phoenix provides utilities to instrument calls to OpenAI's API, enabling deep observability into the behavior of an LLM application built on top of these models.
OpenInference Traces collect telemetry data about the execution of your LLM application. Consider using this instrumentation to understand how an OpenAI model is being called inside a complex system and to troubleshoot issues such as extraction and response synthesis. These traces can also help debug operational issues such as rate limits, authentication issues, or improperly set model parameters.
Have an OpenAI API you would like to see instrumented? Drop us a GitHub issue.
Phoenix also supports LlamaIndex and LangChain and has examples that you can take a look at as well.
For a conceptual overview of datasets, including an explanation of when to use a single dataset vs. primary and reference datasets, see the Phoenix Basics concepts documentation.
To define a dataset, you must load your data into a pandas dataframe and define a matching schema. If you have a dataframe prim_df and a matching prim_schema, you can define a dataset named "primary" with
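a call like the following (a sketch using the px.Dataset constructor):

```python
prim_ds = px.Dataset(dataframe=prim_df, schema=prim_schema, name="primary")
```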
See Corpus Data if you have corpus data for an Information Retrieval use case.
You can set the default port for Phoenix each time you launch the application from a Jupyter notebook with the optional port argument in the launch call.
Sign up on ngrok and verify your email. Find 'Your Authtoken' on the ngrok dashboard.
If successful, visit localhost:52362 (or whichever local port you forwarded) to access Phoenix locally.
Phoenix has first-class support for LangChain applications. This means that you can easily extract inferences and traces from your LangChain application and visualize them in Phoenix.
We recommend that you instrument your entire LangChain application to maximize visibility. To do this, we will use the LangChainInstrumentor to add the OpenInferenceTracer to every chain in your application.
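A sketch of the instrumentation; the module path is assumed from Phoenix's LangChain integration:

```python
from phoenix.trace.langchain import LangChainInstrumentor, OpenInferenceTracer

# One tracer instance is attached to every chain in the application.
tracer = OpenInferenceTracer()
LangChainInstrumentor(tracer).instrument()
```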
If you only want traces from parts of your application, you can pass in the tracer to the parts that you care about.
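For example, attaching the tracer to a single chain invocation (the chain and query variables are illustrative):

```python
# Only this call is traced; the rest of the application is left uninstrumented.
response = chain.run(query, callbacks=[tracer])
```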
Embeddings are used for a variety of machine learning problems. To learn more, check out our embeddings course.
Given the wide accessibility of pre-trained transformer models, we will focus on generating embeddings using them. These are models, such as BERT or GPT-x, that are trained on large datasets and can be fine-tuned for a specific task.
(Optional) Leverage embeddings and LLM eval generators
(Optional) Two dataframes: primary and reference
Phoenix currently requires pandas dataframes, which can be downloaded from an ML observability platform, a table, or a raw log file. The data is assumed to be formatted in the Open Inference format with a well-defined column structure, normally including a set of inputs/features, outputs/predictions, and ground truth.
The Phoenix library heavily uses embeddings as a method for data visualization and debugging. In order to use Phoenix with embeddings, they can either be generated using an SDK call or supplied by the user of the library. Phoenix supports embeddings for LLM, image, NLP, and tabular datasets.
Phoenix is designed to monitor, analyze, and troubleshoot issues on top of your model data, allowing for these workflows all within a notebook environment.
Precision: 0.79, 1, 1, 0.57, 0.75
Recall: 0.88, 0.1, 0.16, 0.7, 0.61
F1: 0.83, 0.18, 0.28, 0.63, 0.67
No Dataset - Run Phoenix in the background to collect OpenInference traces emitted by your instrumented LLM application.
Single Dataset - Analyze a single cohort of data, e.g., only training data. Check model performance and data quality, but not drift.
Primary and Reference Datasets - Compare cohorts of data, e.g., training vs. production. Analyze drift in addition to model performance and data quality.
Primary and Corpus Datasets - Compare a query dataset to a corpus dataset to analyze your retrieval-augmented generation applications.
Learn the foundational concepts of the Phoenix API and Application
This section introduces datasets and schemas, the starting concepts needed to use Phoenix.
A Phoenix dataset is an instance of phoenix.Dataset that contains three pieces of information:
- The data itself (a pandas dataframe)
- A schema describing the columns of your dataframe
- A dataset name that appears in the UI
For example, if you have a dataframe prod_df that is described by a schema prod_schema, you can define a dataset prod_ds with
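a call like this (a sketch using the px.Dataset constructor):

```python
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")
```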
If you launch Phoenix with this dataset, you will see a dataset named "production" in the UI.
You can launch Phoenix with zero, one, or two datasets.
With no datasets, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single dataset, Phoenix provides insights into model performance and data quality. With two datasets, Phoenix compares your datasets and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.
Your reference dataset provides a baseline against which to compare your primary dataset.
To compare two datasets with Phoenix, you must select one dataset as primary and one to serve as a reference. As the name suggests, your primary dataset contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference dataset, in contrast, is usually of secondary importance and serves as a baseline against which to compare your primary dataset.
Very often, your primary dataset will contain production data and your reference dataset will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference datasets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.
For example, if you have a dataframe containing Fisher's Iris data that looks like this:
7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor
5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa
6.3 | 3.3 | 4.7 | 1.6 | versicolor | versicolor
6.2 | 3.4 | 5.4 | 2.3 | virginica | setosa
5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica
your schema might look like this:
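A sketch of such a schema; the feature column names and the names of the two label columns (actual species on the left, predicted species on the right) are assumptions about the dataframe above:

```python
import phoenix as px

schema = px.Schema(
    feature_column_names=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    actual_label_column_name="actual_species",
    prediction_label_column_name="predicted_species",
)
```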
Usually one, sometimes two.
Each dataset needs a schema. If your primary and reference datasets have the same format, then you only need one schema. For example, if you have dataframes train_df and prod_df that share an identical format described by a schema named schema, then you can define datasets train_ds and prod_ds with
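calls like these (a sketch):

```python
train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
```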
Sometimes, you'll encounter scenarios where the formats of your primary and reference datasets differ. For example, you'll need two schemas if:
Your production data has timestamps indicating the time at which an inference was made, but your training data does not.
A new version of your model has a differing set of features from a previous version.
In cases like these, you'll need to define two schemas, one for each dataset. For example, if you have dataframes train_df and prod_df that are described by schemas train_schema and prod_schema, respectively, then you can define datasets train_ds and prod_ds with
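calls along these lines (a sketch):

```python
train_ds = px.Dataset(dataframe=train_df, schema=train_schema, name="training")
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")
```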
Phoenix runs as an application that can be viewed in a web browser tab or within your notebook as a cell. To launch the app, simply pass one or more datasets into the launch_app function:
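For example (a sketch, reusing the datasets defined above):

```python
session = px.launch_app(primary=prod_ds, reference=train_ds)
```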
The following are simple functions on top of the LLM Evals building blocks that are pre-tested with benchmark datasets.
The models are instantiated and usable in the LLM Eval function. The models are also directly callable with strings.
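For example (a sketch; the wrapper name follows the Phoenix evals module, and the model choice is illustrative):

```python
from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# The model object is passed to the Eval functions, and is also directly callable with a string.
print(model("Hello, world!"))
```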
GPT-4 ✔
GPT-3.5 Turbo ✔
GPT-3.5 Instruct ✔
Azure Hosted Open AI ✔
Palm 2 Vertex ✔
AWS Bedrock ✔
Litellm (coming soon)
Huggingface Llama7B (coming soon)
Anthropic (coming soon)
Cohere (coming soon)
The above diagram shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment is designed to enable testing of the Eval model and Eval template performance against a designed set of datasets.
The above approach allows us to compare models easily in an understandable format:
Precision: 0.94, 0.94
Recall: 0.75, 0.71
F1: 0.83, 0.81
The following shows the results of the toxicity Eval on a toxic dataset test to identify if the AI response is racist, biased, or toxic. The template variables are:
text: the text to be classified
The above is the use of the toxicity template.
Note: Palm is not useful for toxicity detection, as it always returns an empty string ("") for toxic inputs.
Precision: 0.91, 0.93, 0.95, no response for toxic input, 0.86
Recall: 0.91, 0.83, 0.79, no response for toxic input, 0.40
F1: 0.91, 0.87, 0.87, no response for toxic input, 0.54
But what if I don't have embeddings handy? That is not a problem: the model data can be analyzed using embeddings auto-generated for Phoenix.
We support generating embeddings for you for the following types of data:
CV - Computer Vision
NLP - Natural Language
Tabular Data - Pandas Dataframes
We extract the embeddings in the appropriate way depending on your use case, and we return them to you to include in your pandas dataframe, which you can then analyze using Phoenix.
Auto-Embeddings works end-to-end: you don't have to worry about formatting your inputs for the correct model. By simply passing your input, an embedding will come out as a result. We take care of everything in between.
If you want to use this functionality as part of our Python SDK, you need to install it with the extra dependencies using pip install arize[AutoEmbeddings].
You can get an updated table listing of supported models by running the line below.
Auto-Embeddings is designed to require minimal code from the user. We only require two steps:
Create the generator: you simply instantiate the generator using EmbeddingGenerator.from_use_case(), passing information about your use case, the model to use, and more options depending on the use case; see the examples below.
Let Arize generate your embeddings: obtain your embeddings column by calling generator.generate_embedding() and passing the column containing your inputs; see the examples below.
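A rough sketch of the two steps for an NLP use case, using the calls named above; the import path, use-case enum, model name, and column names are assumptions and may differ in the SDK:

```python
from arize.pandas.embeddings import EmbeddingGenerator, UseCases  # assumed import path

# Step 1: create the generator for your use case.
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,  # assumed enum value
    model_name="distilbert-base-uncased",           # illustrative model choice
)

# Step 2: generate the embeddings column from the column containing your inputs.
df["text_vector"] = generator.generate_embedding(df["text"])
```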
Arize expects the dataframe's index to be sorted and begin at 0. If you perform operations that might affect the index prior to generating embeddings, reset the index as follows:
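For example:

```python
df = df.reset_index(drop=True)
```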
This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.
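A sketch of running this Eval; the constant names may differ slightly by Phoenix version, and df is assumed to have the query and document columns the template expects:

```python
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# Label each retrieved chunk as relevant or irrelevant to its query.
relevance_eval = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```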
The above runs the RAG relevancy LLM template against the dataframe df.
Precision: 0.70, 0.42, 0.53, 0.79
Recall: 0.88, 1.0, 1, 0.22
F1: 0.78, 0.59, 0.69, 0.34
Using LLMs to extract structured data from unstructured text
OpenAI Functions
Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense: LLMs are great at inherently capturing the structure of language, so extracting that structure from text with LLM prompting is a low-cost, high-scale method of pulling relevant data out of unstructured text.
One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:
User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?
As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:
With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.
The ChatCompletion call to OpenAI would look like the following.
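A hedged sketch using the legacy openai Python client; the schema fields (location, budget, purpose) and the function name are illustrative assumptions for the trip-planning example, not a prescribed format:

```python
import openai

# Illustrative flattened schema for the trip-planning example.
parameters_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "Desired destination"},
        "budget": {"type": "string", "description": "e.g. budget-friendly, luxury"},
        "purpose": {"type": "string", "description": "e.g. family vacation"},
    },
    "required": ["location"],
}

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "I need a budget-friendly hotel in San Francisco close to the "
                "Golden Gate Bridge for a family vacation. What do you recommend?"
            ),
        }
    ],
    functions=[
        {
            "name": "record_travel_request",  # hypothetical function name
            "description": "Records the attributes of a travel request",
            "parameters": parameters_schema,
        }
    ],
    function_call={"name": "record_travel_request"},
)

# The extracted attributes arrive as a JSON string of function-call arguments.
print(response["choices"][0]["message"]["function_call"]["arguments"])
```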
You can use Phoenix spans and traces to inspect the invocation parameters of the function to:
verify the inputs to the model in the form of the user message
verify your request to OpenAI
verify that the corresponding generated outputs from the model match what's expected from the schema and are correct
Point level evaluation is a great starting point, but verifying correctness of extraction at scale or in a batch pipeline can be challenging and expensive. Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:
The diverse nature and format of source data.
The potential absence of a 'ground truth' for comparison.
The intricacies of context and meaning in extracted data.
Inspect the inner-workings of your LLM Application using OpenInference Traces
The easiest method of using Phoenix traces with LLM frameworks (or direct OpenAI API) is to stream the execution of your application to a locally running Phoenix server. The traces collected during execution can then be stored for later use for things like validation, evaluation, and fine-tuning.
In Memory: useful for debugging.
Cloud (coming soon): store traces in your cloud buckets as assets for later use
To get started with traces, you will first want to start a local Phoenix app.
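For example:

```python
import phoenix as px

# Launch the Phoenix app; it acts as a collector for traces emitted by your
# locally running LLM application.
session = px.launch_app()
print(session.url)  # open this URL in a browser to view the traces
```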
The above launches a Phoenix server that acts as a trace collector for any LLM application running locally.
Once you've executed a sufficient number of queries (or chats) to your application, you can view the details in the UI by refreshing the browser URL.
There are two ways to extract trace dataframes. The two ways for LangChain are described below.
In addition to launching phoenix on LlamaIndex and LangChain, teams can export trace data to a dataframe in order to run LLM Evals on the data.
Phoenix can be used to understand and troubleshoot your LLM application by surfacing:
Application latency - highlighting slow invocations of LLMs, Retrievers, etc.
Token Usage - Displays the breakdown of token usage with LLMs to surface your most expensive LLM calls
Runtime Exceptions - Critical runtime exceptions such as rate-limiting are captured as exception events.
Retrieved Documents - view all the documents retrieved during a retriever call and the score and order in which they were returned
Embeddings - view the embedding text used for retrieval and the underlying embedding model
LLM Parameters - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
Prompt Templates - Figure out what prompt template is used during the prompting step and what variables were used.
Tool Descriptions - view the description and function signature of the tools your LLM has been given access to
LLM Function Calls - if using OpenAI or another model with function calls, you can view the function selection and function messages in the input messages to the LLM.
Primary and Corpus Datasets
For comprehensive descriptions of phoenix.Dataset and phoenix.Schema, see the API reference.
For tips on creating your own Phoenix datasets and schemas, see the how-to guide.
A schema (a phoenix.Schema instance) that describes the format of your dataframe
The only difference for the corpus dataset is that it needs a separate schema, because it has a different set of columns compared to the model data. See the corpus data section for more details.
A Phoenix schema is an instance of phoenix.Schema that maps the columns of your dataframe to fields that Phoenix expects and understands. Use your schema to tell Phoenix what the data in your dataframe means.
Your training data has ground truth (what we call actuals in Phoenix nomenclature), but your production data does not.
A corpus dataset, containing documents for information retrieval, typically has a different set of columns than those found in the model data from either production or training, and requires a separate schema. Below is an example schema for a corpus dataset with three columns: the id, text, and embedding for each document in the corpus.
The application provides you with a landing page that is populated with your model's schema (e.g., the features, tags, predictions, and actuals). This gives you a statistical overview of your data as well as links into the views for analysis.
All eval templates are tested against golden datasets that are available as part of the LLM eval library's benchmark datasets and target precision of 70-90% and F1 of 70-85%.
We currently support a growing set of models for LLM Evals; please check out the section on evaluation models.
Phoenix supports any type of dense embedding generated for almost any type of data.
Generating embeddings is likely another problem to solve, on top of ensuring your model is performing properly. With our Python SDK, you can offload that task to the SDK, and we will generate the embeddings for you. We use large, pre-trained models that capture information from your inputs and encode it into embedding vectors.
Structured extraction is a place where it's simplest to work directly with the OpenAI SDK. OpenAI functions for structured data extraction recommend providing a JSON schema object in the form of parameters_schema (the desired fields for the structured data output).
To learn more about how to evaluate structured extraction applications, check out the example notebooks!
The traces can be collected and stored in the following ways:
Local File: Persistent and good for offline local development.
The launch_app command will print out a URL for you to view the Phoenix UI. You can access this URL again at any time via the active session.
Now that Phoenix is up and running, you can run a LlamaIndex or LangChain application, or just call the OpenAI API directly, and debug your application as the traces stream in.
If you are using llama-index>0.8.36, you will be able to instrument your application with LlamaIndex's observability.
Phoenix also supports datasets that contain trace data. This allows data from a running LangChain or LlamaIndex instance to be explored for analysis offline.
For full details on how to export trace data, see the export documentation.
For full details, check out the relevance Eval example.
Traces are a powerful way to troubleshoot and understand your application and can be leveraged to evaluate the quality of your application. For a full list of notebooks that illustrate this in full color, please check out the example notebooks.
See the for the full details as well as support for older versions of LlamaIndex
See the for details
Use Zero Datasets When:
You want to run Phoenix in the background to collect trace data from your instrumented LLM application.
Use a Single Dataset When:
You have only a single cohort of data, e.g., only training data.
You care about model performance and data quality, but not drift.
Use Two Datasets When:
You want to compare cohorts of data, e.g., training vs. production.
You care about drift in addition to model performance and data quality.
Retrieval Eval
Tested on:
MS Marco, WikiQA
Hallucination Eval
Tested on:
Hallucination QA Dataset, Hallucination RAG Dataset
Toxicity Eval
Tested on:
WikiToxic
Q&A Eval
Tested on:
WikiQA
Summarization Eval
Tested on:
GigaWorld, CNNDM, Xsum
Code Generation Eval
Tested on:
WikiSQL, HumanEval, CodeXGlu
Observability for all model types (LLM, NLP, CV, Tabular)
Phoenix Inferences allows you to observe the performance of your model through visualizing all the model’s inferences in one interactive UMAP view.
This powerful visualization can be leveraged during EDA to understand model drift, find low performing clusters, uncover retrieval issues, and export data for retraining / fine tuning.
The following Quickstart can be executed in a Jupyter notebook or Google Colab.
We will begin by logging just a training set. Then proceed to add a production set for comparison.
Use pip or conda to install arize-phoenix.
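For example (the conda-forge channel is assumed for the conda route):

```python
# Install Phoenix from a terminal or notebook cell:
#   pip install arize-phoenix
# or, with conda:
#   conda install -c conda-forge arize-phoenix
import phoenix as px  # verify the install by importing the package
```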
Phoenix visualizes data taken from a pandas dataframe, where each row of the dataframe encompasses all the information about each inference (including feature values, prediction, metadata, etc.).
Let’s begin by working with the training set for this model.
Download the dataset and load it into a Pandas dataframe.
Preview the dataframe with train_df.head() and note that each row contains all the data specific to this CV model for each inference.
Before we can log this dataset, we need to define a Schema object to describe this dataset.
The Schema object informs Phoenix of the fields that the columns of the dataframe should map to.
Here we define a Schema to describe our particular CV training set:
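A hedged sketch of what such a schema might look like; the column names below (prediction_ts, predicted_action, actual_action, image_vector, url) are illustrative assumptions and should be matched to your own dataframe:

```python
import phoenix as px

# Illustrative schema for a CV training set; column names are assumptions.
train_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)
```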
Important: The fields used in a Schema will vary depending on the model type that you are working with.
Wrap your train_df and schema train_schema into a Phoenix Dataset object:
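A minimal sketch, assuming the train_df and train_schema from the previous steps:

```python
# Wrap the dataframe and its schema into a Phoenix Dataset.
train_ds = px.Dataset(dataframe=train_df, schema=train_schema, name="training")
```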
We are now ready to launch Phoenix with our Dataset!
Here, we are passing train_ds as the primary dataset, as we are only visualizing one dataset (see Step 6 for adding additional datasets).
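A minimal sketch:

```python
# Launch Phoenix with the training dataset as the primary (and only) dataset.
session = px.launch_app(primary=train_ds)
```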
Running this will fire up a Phoenix visualization. Follow the instructions in the output to view Phoenix in a browser, or in-line in your notebook:
You are now ready to observe the training set of your model!
Optional - try the following exercises to familiarize yourself more with Phoenix:
We will continue on with our CV model example above, and add a set of production data from our model to our visualization.
This will allow us to analyze drift and conduct A/B comparisons of our production data against our training set.
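A hedged sketch of the production schema, reusing the illustrative column names assumed for the training schema above:

```python
import phoenix as px

# Production schema: same illustrative columns as the training schema,
# minus the ground truth (actual) label column.
prod_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)
```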
Note that this schema differs slightly from our train_schema above, as our prod_df does not have a ground truth column!
This time, we will include both train_ds and prod_ds when calling launch_app.
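For example (treating production as the primary dataset and training as the reference is one common choice):

```python
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")

# Compare production against the training baseline.
session = px.launch_app(primary=prod_ds, reference=train_ds)
```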
Once again, enter your Phoenix app with the new link generated by your session, e.g., http://127.0.0.1:6060/.
You are now ready to conduct comparative Root Cause Analysis!
Optional - try the following exercises to familiarize yourself more with Phoenix:
Once you have identified datapoints of interest, you can export this data directly from the Phoenix app for further analysis, or to incorporate these into downstream model retraining and finetuning flows.
Once your model is ready for production, you can add Arize to enable production-grade observability. Phoenix works in conjunction with Arize to enable end-to-end model development and observability.
With Arize, you will additionally benefit from:
Being able to publish and observe your models in real-time as inferences are being served, and/or via direct connectors from your table/storage solution
Scalable compute to handle billions of predictions
Ability to set up monitors & alerts
Production-grade observability
Integration with Phoenix for model iteration to observability
Enterprise-grade RBAC and SSO
Experiment with infinite permutations of model versions and filters
You have corpus data for an information retrieval use case. See the corpus data section.
For this Quickstart, we will show an example of visualizing the inferences from a computer vision model. See the example notebooks for all model types.
For examples of how schemas are defined for other model types (NLP, tabular, LLM-based applications), see the example notebooks.
Checkpoint A.
Note that Phoenix automatically generates clusters for you on your data using a clustering algorithm called HDBSCAN.
Discuss your answers in our Slack community!
In order to visualize drift, conduct A/B model comparisons, or, in the case of an information retrieval use case, compare inferences against a corpus, you will need to add a comparison dataset to your visualization.
Read more about comparison dataset Schemas here:
For more information, see
Checkpoint B.
Discuss your answers in our Slack community!
See more on exporting data:
Create your Arize account and see the full suite of features.
Read more about Embeddings Analysis:
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
Evals are LLM-powered functions that you can use to evaluate the output of your LLM or generative application
Evals are still experimental and must be installed via pip install arize-phoenix[experimental].
Class used to store and format prompt templates.
text (str): The raw prompt text used as a template.
delimiters (List[str]): List of characters used to locate the variables within the prompt template text. Defaults to ["{", "}"].
text (str): The raw prompt text used as a template.
variables (List[str]): The names of the variables that, once their values are substituted into the template, create the prompt text. These variable names are automatically detected from the template text using the delimiters passed when initializing the class (see the Usage section below).
Define a PromptTemplate by passing a text string and the delimiters to use to locate the variables. The default delimiters are { and }.
If the prompt template variables have been correctly located, you can access them as follows:
The PromptTemplate class can also understand any combination of delimiters. Following the example above, but getting creative with our delimiters:
Once you have a PromptTemplate class instantiated, you can make use of its format method to construct the prompt text resulting from substituting values into the variables. To do so, pass a dictionary mapping the variable names to their values:
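A minimal sketch of the full flow, assuming the class is importable from phoenix.experimental.evals (the template text is illustrative):

```python
from phoenix.experimental.evals import PromptTemplate

template_text = (
    "Classify the sentiment of the following review as {label_a} or {label_b}: {review}"
)

# Default delimiters are "{" and "}".
prompt_template = PromptTemplate(text=template_text)
print(prompt_template.variables)  # ['label_a', 'label_b', 'review']

# Substitute values into the variables with the format method.
prompt = prompt_template.format(
    {"label_a": "positive", "label_b": "negative", "review": "I loved this movie!"}
)
print(prompt)
```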
Note that once you initialize the PromptTemplate class, you don't need to worry about delimiters anymore; they are handled for you.
Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.
dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (PromptTemplate or str): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class instance
rails (List[str]): A list of strings representing the possible output classes of the model's predictions.
system_instruction (Optional[str]): An optional system message for models that support it
verbose (bool, optional): If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True): If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
provide_explanation (bool, default=False): If True, provides an explanation for each classification label. A column named explanation is added to the output dataframe. Currently, this is only available for models with function calling.
pandas.DataFrame: A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or "NOT_PARSABLE" if the model's output could not be parsed.
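A hedged usage sketch; the dataframe columns, template, and rails below are illustrative, and the import path assumes the experimental evals module:

```python
import pandas as pd
from phoenix.experimental.evals import OpenAIModel, llm_classify

# Illustrative dataframe: column names must match the template variables.
df = pd.DataFrame(
    {
        "query": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source observability library."],
    }
)

template = (
    "Given the question: {query}\n"
    "and the reference text: {reference}\n"
    "Answer whether the reference text is relevant to the question. "
    "Respond with a single word, either 'relevant' or 'irrelevant'."
)

labels_df = llm_classify(
    dataframe=df,
    template=template,
    model=OpenAIModel(model_name="gpt-4"),
    rails=["relevant", "irrelevant"],
    provide_explanation=True,
)
print(labels_df[["label", "explanation"]])
```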
Given a pandas dataframe containing queries and retrieved documents, classifies the relevance of each retrieved document to the corresponding query using an LLM.
dataframe (pd.DataFrame): A pandas dataframe containing queries and retrieved documents. If both query_column_name and reference_column_name are present in the input dataframe, those columns are used as inputs and should appear in the following format:
The entries of the query column must be strings.
The entries of the documents column must be lists of strings. Each list may contain an arbitrary number of document texts retrieved for the corresponding query.
If the input dataframe is lacking either query_column_name or reference_column_name but has query and retrieved document columns in OpenInference trace format named "attributes.input.value" and "attributes.retrieval.documents", respectively, then those columns are used as inputs and should appear in the following format:
The entries of the query column must be strings.
The entries of the document column must be lists of OpenInference document objects, each object being a dictionary that stores the document text under the key "document.content".
model (BaseEvalModel): The model used for evaluation.
template (Union[PromptTemplate, str], optional): The template used for evaluation.
rails (List[str], optional): A list of strings representing the possible output classes of the model's predictions.
query_column_name (str, optional): The name of the query column in the dataframe, which should also be a template variable.
reference_column_name (str, optional): The name of the document column in the dataframe, which should also be a template variable.
system_instruction (Optional[str], optional): An optional system message.
evaluations (List[List[str]]): A list of relevant and not relevant classifications. The "shape" of the list should mirror the "shape" of the retrieved documents column, in the sense that it has the same length as the input dataframe and each sub-list has the same length as the corresponding list in the retrieved documents column. The values in the sub-lists are either entries from the rails argument or "NOT_PARSABLE" in the case where the LLM output could not be parsed.
Generates text from a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses.
dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be used as an input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class.
system_instruction (Optional[str], optional): An optional system message.
generations (List[Optional[str]]): A list of strings representing the output of the model for each record
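A hedged usage sketch; the template and columns are illustrative, and the import path assumes the experimental evals module:

```python
import pandas as pd
from phoenix.experimental.evals import OpenAIModel, llm_generate

# Illustrative dataframe: each row provides values for the template variables.
df = pd.DataFrame({"topic": ["the weather", "cooking"]})

template = "Write a question about {topic} that has nothing to do with machine learning."

generations = llm_generate(
    dataframe=df,
    template=template,
    model=OpenAIModel(model_name="gpt-3.5-turbo"),
)
print(generations)  # one generated string (or None) per input row
```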
Explore the capabilities of Phoenix with notebook tutorials for concrete use-cases
Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.
Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.
Visualize your generative application's retrieval process to surface failed retrievals and to find topics not addressed by your knowledge base.
Explore lower-dimensional representations of your embedding data to identify clusters of high-drift and performance degradation.
Statistically analyze your structured data to perform A/B analysis, temporal drift analysis, and more.
How to connect to OpenInference-compliant data via LlamaIndex callbacks
Traces provide telemetry data about the execution of your LLM application. They are a great way to understand the internals of your LlamaIndex application and to troubleshoot problems related to things like retrieval and tool execution.
To extract traces from your LlamaIndex application, you will have to add Phoenix's OpenInferenceTraceCallback to your LlamaIndex application. A callback (in this case an OpenInference Tracer) is a class that automatically accumulates traces (sometimes referred to as spans) as your application executes. The OpenInference Tracer is specifically designed to work with Phoenix and by default exports the traces to a locally running Phoenix server.
To view traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
Once you have started a Phoenix server, you can start your LlamaIndex application with the OpenInferenceTraceCallback as a callback. To do this, you will have to add the callback to the initialization of your LlamaIndex application.
By adding the callback to the callback manager of LlamaIndex, we've created a one-way data connection between your LLM application and Phoenix. This is because, by default, the OpenInferenceTraceCallback uses an HTTPExporter to send traces to your locally running Phoenix server! In this scenario the Phoenix server is serving as a Collector of the spans that are exported from your LlamaIndex application.
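A hedged sketch of wiring up the callback, assuming a llama-index 0.8.x-style API and the phoenix.trace.llama_index module; exact import paths may vary by version, and the data directory and query are illustrative:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.callbacks import CallbackManager
from phoenix.trace.llama_index import OpenInferenceTraceCallbackHandler

# Attach the OpenInference callback so spans are exported to the local Phoenix server.
callback_handler = OpenInferenceTraceCallbackHandler()
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager(handlers=[callback_handler])
)

# Build a simple RAG query engine; every query now produces traces in Phoenix.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
```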
To view the traces in Phoenix, simply open the UI in your browser.
If you would like to save your traces to a file for later use, you can directly extract the traces from the callback. To do so, dump the traces from the tracer into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents in the file above via a TraceDataset
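The tracer's own dump helpers aren't shown here; as a rough, hedged alternative, you can persist the span dataframe from the running session and reload it into a TraceDataset later (column dtypes may need adjustment after the round trip):

```python
import pandas as pd
import phoenix as px

# Pull the collected spans out of the running session as a flat dataframe
# and persist them for later use.
spans_df = px.active_session().get_spans_dataframe()
spans_df.to_parquet("traces.parquet")

# Later: reload the saved spans and relaunch Phoenix with them.
saved_df = pd.read_parquet("traces.parquet")
px.launch_app(trace=px.TraceDataset(saved_df))
```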
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
For a fully working example of tracing with LlamaIndex, check out our Colab notebook.
With a few lines of code, you can mount the OpenInferenceCallback to your application.
If you are running the chatbot in a notebook, you can simply flush the callback buffers to dataframes. Phoenix natively supports parsing OpenInference so there is no need to define a schema for your dataset.
In a production setting, LlamaIndex application maintainers can log the data generated by their system by implementing and passing a custom callback to OpenInferenceCallbackHandler. The callback is of type Callable[List[QueryData]] that accepts a buffer of query data from the OpenInferenceCallbackHandler, persists the data (e.g., by uploading to cloud storage or sending to a data ingestion service), and flushes the buffer after data is persisted.
A reference implementation is included below that periodically writes data in OpenInference format to local Parquet files when the buffer exceeds a certain size.
⚠️ In a production setting, it's important to clear the buffer, otherwise, the callback handler will indefinitely accumulate data in memory and eventually cause your system to crash.
For a fully working example, check out our Colab notebook.
OpenInference is an open standard that encompasses model inference and LLM application tracing.
OpenInference is a specification that encompasses two data models:
The OpenInference data format is designed to provide an open, interoperable data format for model inference files. Our goal is for modern ML systems, such as model servers and ML observability platforms, to interface with each other using a common data format.
The goal of this is to define a specification for production inference logs that can be used on top of many file formats including Parquet, Avro, CSV and JSON. It will also support future formats such as Lance.
An inference store is a common approach to store model inferences, normally stored in a data lake or data warehouse.
NLP
Text Generative - Prompt and Response
Text Classification
NER Span Categorization
Tabular:
Regression
Classification
Classification + Score
Multi-Classification
Ranking
Multi-Output/Label
Time Series Forecasting
CV
Classification
Bounding Box
Segmentation
In an inference store the prediction ID is a unique identifier for a model prediction event. The prediction ID defines the inputs to the model, model outputs, latently linked ground truth (actuals), meta data (tags) and model internals (embeddings and/or SHAP).
In this section we will review a flat (non nested structure) prediction event, the following sections will cover how to handle nested structures.
A prediction event can represent a prompt response pair for LLMs where the conversation ID maintains the thread of conversation.
The core components of an inference event are the:
Model input (features/prompt)
Model output (prediction/response)
Ground truth (actuals or latent actuals)
Model ID
Model Version
Environment
Conversation ID
Additional data that may be contained include:
Metadata
SHAP values
Embeddings
Raw links to data
Bounding boxes
The fundamental storage unit in an inference store is an inference event. These events are stored in groups that are logically separated by model ID, model version and environment.
Environment describes where the model is running. For example, we use environments of training, validation/test, and production to describe the different places you run a model.
The production environment is commonly a streaming-like environment. It is streaming in the sense that a production dataset has no beginning or end. The data can be added to it continuously. In most production use cases data is added in small mini batches or real time event-by-event.
The training and validation environments are commonly used to send data in batches. These batches define a group of data for analysis purposes. It’s common in validation/test and training to have the timestamp be optional.
Note: historical backtesting data comparisons on time series data can require non-runtime settings for timestamp use for training and validation
The model ID is a unique human readable identifier for a model within a workspace - it completely separates the model data between logical instances.
The model version is a logical separator for metrics and analysis used to look at different builds of a model. A model version can capture common changes such as weight updates and feature additions.
Unlike Infra observability, the inference store needs some mutability. There needs to be some way in which ground truth is added or updated for a prediction event.
Ground truth is required in the data in order to analyze performance metrics such as precision, recall, AUC, LogLoss, and Accuracy.
Latent ground truth data may need to be “joined” to a prediction ID to enable performance visualization. In Phoenix, the library requires ground truth to be pre-joined to prediction data. In an ML Observability system such as Arize the joining of ground truth is typically done by the system itself.
The above image shows a common use case in ML Observability in which latent ground truth is received by a system and linked back to the original prediction based on a prediction ID.
In addition to ground truth, latent metadata is also required to be linked to a prediction ID. Latent metadata can be critical to analyze model results using additional data tags linked to the original prediction ID.
Examples of Metadata (Tags):
Loan default amount
Loan status
Revenue from conversion or click
Server region
Images bounding box, NLP NER, and Image segmentation
The above picture shows how a nested set of detections can occur for a single image in the prediction body with bounding boxes within the image itself.
A model may have multiple inputs with different embeddings and images for each generating a prediction class. An example might be an insurance claim event with multiple images and a single prediction estimate for the claim.
The above prediction shows hierarchical data. The current version of Phoenix is designed to ingest a flat structure so teams will need to flatten the above hierarchy. An example of flattening is below.
The example above shows an exploded representation of the hierarchical data.
OpenInference Tracing provides a detailed and holistic view of the operations happening within an LLM application. It offers a way to understand the "path" or journey a request takes from start to finish, helping with debugging, performance optimization, and ensuring the smooth flow of operations. Tracing takes advantage of two key components to instrument your code.
Tracer: Responsible for creating spans that contain information about various operations.
Trace Exporters: These are responsible for sending the generated traces to consumers which can be a standard output for debugging, or an OpenInference Collector such as Phoenix.
OpenInference traces are built on top of a unit of work called a span. A span keeps track of how long the execution of a given LLM application step takes and can also store important information about the step in the form of attributes. At a high level, a span has:
Span Context: Contains the trace ID (representing the trace the span belongs to) and the span's ID.
Attributes: Key-value pairs containing metadata to annotate a span. They provide insights about the operation being tracked. Semantic attributes offer standard naming conventions for common metadata.
Span Events: Structured log messages on a span, denoting a significant point in time during the span's duration.
Span Status: Attached to a span to denote its outcome as Unset, Ok, or Error.
Span Kind: Provides a hint on how to assemble the trace. Types include:
Chain: Represents the starting point or link between different LLM application steps.
Retriever: Represents a data retrieval step.
LLM: Represents a call to an LLM.
Embedding: Represents a call to an LLM for embedding.
Tool: Represents a call to an external tool.
Agent: Encompasses calls to LLMs and Tools, describing a reasoning block.
Helps answer questions such as: Are there queries that don’t have sufficient context? Should you add more context for these queries to get better answers? Or can you change your embeddings?
There are varying degrees to which we can evaluate retrieval systems.
Step 1: First, we care whether the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?
Step 2: Once we know there's an issue, we need metrics to trace where specifically it went wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?
Step 3: If retrieval is not the issue, we should check if we even have the right documents to answer the question.
Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.
When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.
Phoenix surfaces up clusters of similar queries that have poor feedback.
Phoenix can help you identify if there is context that is missing from your knowledge base. By visualizing query density, you can understand what topics you need to add additional documentation for in order to improve your chatbot's responses.
By setting the "primary" dataset as the user queries, and the "corpus" dataset as the context I have in my vector store, I can see if there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.
The first thing we need is to collect some samples from your vector store, to be able to compare against later. This lets us see if some sections are not being retrieved, or if some sections are getting a lot of traffic where you might want to beef up your context or documents in that area.
We also will be logging the prompt/response pairs from the deployed application.
LlamaIndex (GPT Index) is a data framework for your LLM application. It's a powerful framework by which you can build an application that leverages RAG (retrieval-augmented generation) to super-charge an LLM with your own data. RAG is an extremely powerful LLM application model because it lets you harness the power of LLMs such as OpenAI's GPT but tuned to your own data and use case.
However when building out a retrieval system, a lot can go wrong that can be detrimental to the user-experience of your question and answer system. Phoenix provides two different ways to gain insights into your LLM application: inference records and tracing.
To provide visibility into how your LLM app is performing, we built the OpenInferenceCallback. The OpenInferenceCallback captures the internals of the LLM app in buffers that conform to the OpenInference format. As your LlamaIndex application runs, the callback captures the timing, embeddings, documents, and other critical internals, and serializes the data to buffers that can be easily materialized as dataframes or as files such as Parquet. Phoenix can ingest OpenInference data natively, making it a seamless integration for analyzing your LLM-powered chatbot. To understand callbacks in detail, consult the LlamaIndex documentation.
For the full guidance on how to materialize your data in files, consult the callback documentation.
For an in-depth look at the OpenInference specification, please consult the spec.
OpenInference Tracing offers a comprehensive view of the inner workings of an LLM application. By breaking down the process into spans and categorizing each span, it offers a clear picture of the operations and their interrelations, making troubleshooting and optimization easier and more effective. For the full details of OpenInference tracing, please consult the specification.
Possibly the most common use case for creating an LLM application is to connect an LLM to proprietary data such as enterprise documents or video transcriptions. Applications such as these are often built on top of LLM frameworks such as LlamaIndex or LangChain, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot looking across a company's knowledge base/context to answer specific questions.
Phoenix can help uncover when irrelevant context is being retrieved using LLM Evals. You can look at a cluster's aggregate relevance metric with precision@k, NDCG, MRR, etc. to identify where to improve. You can also look at a single prompt/response pair and see the relevance of documents.
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster.
Example corpus record: id 1; text "Voyager 2 is a spacecraft used by NASA to expl..."; embedding [-0.02785328, -0.04709944, 0.042922903, 0.0559...].
Example query record: query "who was the first person that walked on the moon"; query embedding [-0.0126, 0.0039, 0.0217, ...]; retrieved document ids [7395, 567965, 323794, ...]; relevance scores [11.30, 7.67, 5.85, ...]; response "Neil Armstrong".
Tracing and Evaluating a LlamaIndex + OpenAI RAG Application
LlamaIndex
OpenAI
retrieval-augmented generation
Tracing and Evaluating a LlamaIndex OpenAI Agent
LlamaIndex
OpenAI
agents
function calling
Tracing and Evaluating a Structured Data Extraction Application with OpenAI Function Calling
OpenAI
structured data extraction
function calling
Tracing and Evaluating a LangChain + OpenAI RAG Application
LangChain
OpenAI
retrieval-augmented generation
Tracing and Evaluating a LangChain Agent
LangChain
OpenAI
agents
function calling
Tracing and Evaluating a LangChain + Vertex AI RAG Application
LangChain
Vertex AI
retrieval-augmented generation
Tracing and Evaluating a LangChain + Google PaLM RAG Application
LangChain
Google PaLM
retrieval-augmented generation
Evaluating Hallucinations
hallucinations
Evaluating Toxicity
toxicity
Evaluating Relevance of Retrieved Documents
document relevance
Evaluating Question-Answering
question-answering
Evaluating Summarization
summarization
Evaluating Code Readability
code readability
Evaluating and Improving Search and Retrieval Applications
LlamaIndex
retrieval-augmented generation
Evaluating and Improving Search and Retrieval Applications
LlamaIndex
Milvus
retrieval-augmented generation
Evaluating and Improving Search and Retrieval Applications
LangChain
Pinecone
retrieval-augmented generation
Active Learning for a Drifting Image Classification Model
image classification
fine-tuning
Root-Cause Analysis for a Drifting Sentiment Classification Model
NLP
sentiment classification
Troubleshooting an LLM Summarization Task
summarization
Collect Chats with GPT
LLMs
Find Clusters, Export, and Explore with GPT
LLMs
exploratory data analysis
Detecting Fraud with Tabular Embeddings
tabular data
anomaly detection
LangChain
LlamaIndex
Is this a bad response to the answer? Most relevant way to measure the application; hard to trace down specifically what to fix.
Is the retrieved context relevant? Directly measures the effectiveness of retrieval; requires additional LLM calls.
Is the knowledge base missing areas of user queries? Query density (drift), Phoenix generated; highlights groups of queries with large distance from context; identifies broad topics missing from the knowledge base, but not small gaps.
Evaluation model classes powering your LLM Evals
We currently support the following LLM providers:
To authenticate with OpenAI you will need, at a minimum, an API key. Our classes will look for it in your environment, or you can pass it via argument as shown above. In addition, you can choose the specific name of the model you want to use and its configuration parameters. The default values specified above are common default values from OpenAI. Quickly instantiate your model as follows:
To authenticate with VertexAI, you must pass either your credentials or a project, location pair. In the following example, we quickly instantiate the VertexAI model as follows:
To authenticate with AWS Bedrock, the following code is used to instantiate a session, and the session is then used with Phoenix Evals:
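Hedged instantiation sketches for the providers above; the parameter values (model name, project, region) are illustrative, and the import paths assume the experimental evals module:

```python
import boto3
from phoenix.experimental.evals import BedrockModel, OpenAIModel, VertexAIModel

# OpenAI: the API key is read from the environment (OPENAI_API_KEY) or passed explicitly.
openai_model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Vertex AI: pass credentials or a project/location pair.
vertex_model = VertexAIModel(project="my-gcp-project", location="us-central1")

# AWS Bedrock: instantiate a boto3 session and hand its runtime client to the model.
session = boto3.Session(region_name="us-east-1")
bedrock_client = session.client("bedrock-runtime")
bedrock_model = BedrockModel(client=bedrock_client)
```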
model.generate: if you want to run multiple prompts through the LLM, you can do so via the generate method.
model.agenerate: in addition, you can run multiple prompts through the LLM asynchronously via the agenerate method.
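A hedged sketch, assuming generate and agenerate accept a list of prompt strings:

```python
import asyncio

from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4")

# Single prompt: call the model directly with a string.
print(model("What is the capital of France?"))

# Multiple prompts, synchronously.
print(model.generate(["What is 2 + 2?", "Name a prime number."]))

# Multiple prompts, asynchronously.
print(asyncio.run(model.agenerate(["What is 2 + 2?", "Name a prime number."])))
```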
Our EvalModels also contain some methods that can help create evaluation applications:
model.get_tokens_from_text
model.get_text_from_tokens
model.max_context_size
Furthermore, LLM models have a limited number of tokens that they can pay attention to. We call this limit the context size or context window. You can access the context size of your model via the property max_context_size. In the following example, we used the model gpt-4-0613, whose context size is 8192 tokens.
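A hedged sketch of these helper methods and the max_context_size property:

```python
from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4-0613")

# Tokenize text and convert tokens back into text.
tokens = model.get_tokens_from_text("Hello, world!")
text = model.get_text_from_tokens(tokens)

# Inspect the model's context window (8192 tokens for gpt-4-0613).
print(model.max_context_size)
```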
Detailed descriptions of classes and methods related to Phoenix datasets and schemas
A dataset containing a split or cohort of data to be analyzed independently or compared to another cohort. Common examples include training, validation, test, or production datasets.
dataframe (pandas.DataFrame): The data to be analyzed or compared.
name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.
dataframe (pandas.DataFrame): The pandas dataframe of the dataset.
name (str): The name of the dataset.
Define a dataset ds from a pandas dataframe df and a schema object schema by running:
Alternatively, provide a name for the dataset that will appear in the application:
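A minimal sketch covering both forms:

```python
import phoenix as px

# Define a dataset from the dataframe and schema...
ds = px.Dataset(dataframe=df, schema=schema)

# ...or give it a name that will appear in the app.
ds = px.Dataset(dataframe=df, schema=schema, name="training")
```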
Assigns the columns of a pandas dataframe to the appropriate model dimensions (predictions, actuals, features, etc.). Each column of the dataframe should appear in the corresponding schema at most once.
timestamp_column_name (Optional[str]): The name of the dataframe's timestamp column, if one exists. Timestamp columns must be pandas Series with numeric, datetime or object dtypes.
If the timestamp column has numeric dtype (int or float), the entries of the column are interpreted as Unix timestamps, i.e., the number of seconds since midnight on January 1st, 1970.
If the column has datetime dtype and contains timezone-naive timestamps, Phoenix assumes those timestamps belong to the local timezone and converts them to UTC.
If the column has datetime dtype and contains timezone-aware timestamps, those timestamps are converted to UTC.
If the column has object dtype containing ISO 8601 formatted timestamp strings, those entries are converted to datetime dtype UTC timestamps; timezone-naive entries are assumed to belong to the local timezone.
feature_column_names (Optional[List[str]]): The names of the dataframe's feature columns, if any exist. If no feature column names are provided, all dataframe column names that are not included elsewhere in the schema and are not explicitly excluded in excluded_column_names are assumed to be features.
tag_column_names (Optional[List[str]]): The names of the dataframe's tag columns, if any exist. Tags, like features, are attributes that can be used for filtering records of the dataset while using the app. Unlike features, tags are not model inputs and are not used for computing metrics.
prediction_label_column_name (Optional[str]): The name of the dataframe's predicted label column, if one exists. Predicted labels are used for classification problems with categorical model output.
prediction_score_column_name (Optional[str]): The name of the dataframe's predicted score column, if one exists. Predicted scores are used for regression problems with continuous numerical model output.
actual_label_column_name (Optional[str]): The name of the dataframe's actual label column, if one exists. Actual (i.e., ground truth) labels are used for classification problems with categorical model output.
actual_score_column_name (Optional[str]): The name of the dataframe's actual score column, if one exists. Actual (i.e., ground truth) scores are used for regression problems with continuous numerical output.
excluded_column_names (Optional[List[str]]): The names of the dataframe columns to be excluded from the implicitly inferred list of feature column names. This field should only be used for implicit feature discovery, i.e., when feature_column_names is unused and the dataframe contains feature columns not explicitly included in the schema.
vector_column_name (str): The name of the dataframe column containing the embedding vector data. Each entry in the column must be a list, one-dimensional NumPy array, or pandas Series containing numeric values (floats or ints) and must have equal length to all the other entries in the column.
raw_data_column_name (Optional[str]): The name of the dataframe column containing the raw text associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes a piece of text, for example, in the context of NLP.
link_to_data_column_name (Optional[str]): The name of the dataframe column containing links to images associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes an image, for example, in the context of computer vision.
name (str): The name used to identify the dataset in the application. If not provided, a random name will be generated.
name (Optional[str]): The name used to identify the dataset in the application.
The code snippet below shows how to read data from a trace.jsonl file into a TraceDataset, and then pass the dataset to Phoenix through launch_app. Each line of the trace.jsonl file is a JSON string representing a span.
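A hedged sketch of that flow (column dtypes may need adjustment depending on how the file was produced):

```python
import pandas as pd
import phoenix as px

# Each line of trace.jsonl is a JSON string representing a span; read them
# into a flat dataframe.
spans_df = pd.read_json("trace.jsonl", lines=True)

# Wrap the spans in a TraceDataset and hand it to launch_app.
px.launch_app(trace=px.TraceDataset(spans_df))
```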
Detailed descriptions of classes and methods related to Phoenix sessions
Launches and returns a new Phoenix session.
host (Optional[str]): The host on which the server runs. It can also be set using the environment variable PHOENIX_HOST; otherwise it defaults to 127.0.0.1. Most users don't need to worry about this parameter.
port (Optional[int]): The port on which the server listens. It can also be set using the environment variable PHOENIX_PORT; otherwise it defaults to 6060. This parameter is useful if 6060 is already occupied by a separate application.
run_in_thread (bool): Whether the server should run in a Thread or Process. Defaults to True. This can be turned off if there is a problem starting a thread in a Jupyter Notebook.
default_umap_parameters (Optional[Dict[str, Union[int, float]]]): Default UMAP parameters to use when launching the point cloud, e.g., {"n_neighbors": 10, "n_samples": 5, "min_dist": 0.5}.
Returns the active Phoenix Session if one exists; otherwise, returns None.
Suppose you previously ran
Closes the running Phoenix session, if it exists.
The Phoenix server will continue running in the background until it is explicitly closed, even if the Jupyter server and kernel are stopped.
A session that maintains the state of the Phoenix app. Obtain the active session as follows.
view(height: int = 1000) -> IPython.display.IFrame
Displays the Phoenix UI for a running session within an inline frame in the notebook.
Parameters
height (int = 1000): The height in pixels of the inline frame element displaying the Phoenix UI within the notebook. Used to adjust the height of the inline frame to the desired height.
start_time (Optional[datetime]): A Python datetime object for filtering spans by time.
stop_time (Optional[datetime]): A Python datetime object for filtering spans by time.
root_spans_only (Optional[bool]): Whether to return only root spans, i.e., spans without parents. Defaults to False.
url (str): The URL of the running Phoenix session. Can be copied and pasted to open the Phoenix UI in a new browser tab or window.
exports (List[pandas.DataFrame]): A list of pandas dataframes containing exported data, sorted in chronological order. Exports of UMAP cluster data can be initiated in the clustering UI.
Open the Phoenix UI in an inline frame within your notebook with
You can adjust the height of the inline frame by passing the desired height (number of pixels) to the height parameter. For example, instead of the line above, run
to open an inline frame of height 1200 pixels.
As an alternative to an inline frame within your notebook, you can open the Phoenix UI in a new browser tab or window by running
and copying and pasting the URL.
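For example:

```python
import phoenix as px

session = px.active_session()  # or the object returned by px.launch_app()

session.view()             # inline frame in the notebook (default height 1000 px)
session.view(height=1200)  # taller inline frame
print(session.url)         # open in a separate browser tab or window
```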
Once a cluster or subset of your data is selected in the UI, it can be saved by clicking the "Export" button. You can then access your exported data in your notebook via the exports
property on your session
object, which returns a list of dataframes containing each export.
Exported dataframes are listed in chronological order. To access your most recent export, run
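For example:

```python
# session.exports is sorted chronologically; the most recent export is last.
latest_export_df = session.exports[-1]
latest_export_df.head()
```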
Get spans associated with calls to LLMs.
Get spans associated with calls to retrievers in a Retrieval Augmented Generation use case.
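A hedged sketch of span filtering; the span_kind values follow the span-kind list above, but the exact field names in the filter expression are assumptions:

```python
import phoenix as px

session = px.active_session()

# All spans collected so far.
all_spans_df = session.get_spans_dataframe()

# Spans for LLM calls only (filter_condition is a Python expression over span fields).
llm_spans_df = session.get_spans_dataframe("span_kind == 'LLM'")

# Spans for retriever calls in a RAG use case.
retriever_spans_df = session.get_spans_dataframe("span_kind == 'RETRIEVER'")
```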
PHOENIX_PORT
The port on which the server listens.
PHOENIX_HOST
The host on which the server listens.
Below is an example of how to set up the port parameter as an environment variable.
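For example:

```python
import os

# Set the port before launching the app; Phoenix will listen on it instead of 6060.
os.environ["PHOENIX_PORT"] = "6061"

import phoenix as px

px.launch_app()
```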
Retrieval Analyzer w/ Embeddings
Traces and Spans
Retrieval Analyzer w/ Embeddings
Traces and Spans
User feedback or
The code snippet below shows how to initialize OpenAIModel for Azure. Refer to the Azure documentation on how to obtain these values from your Azure deployment.
Find more about the functionality available in our EvalModels in the section.
In this section, we will showcase the methods and properties that our EvalModels have. First, instantiate your model from the list of supported models. Once you've instantiated your model, you can get responses from the LLM by simply calling the model and passing a text string.
schema (phoenix.Schema): A schema that assigns the columns of the dataframe to the appropriate model dimensions (features, predictions, actuals, etc.).
schema (phoenix.Schema): The schema of the dataset.
ds is then passed as the primary or reference argument to launch_app.
prediction_id_column_name (Optional[str]): The name of the dataframe's prediction ID column, if one exists. Prediction IDs are strings that uniquely identify each record in a Phoenix dataset (equivalently, each row in the dataframe). If no prediction ID column name is provided, Phoenix will automatically generate unique UUIDs for each record of the dataset upon initialization.
If no timestamp column is provided, each record in the dataset is assigned the current timestamp upon initialization.
prompt_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of a model's prompt embedding vector, prompt text, and optionally links to external resources.
response_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of a model's response embedding vector, response text, and optionally links to external resources.
embedding_feature_column_names (Optional[Dict[str, EmbeddingColumnNames]]): A dictionary mapping the name of each embedding feature to an instance of EmbeddingColumnNames if any embedding features exist; otherwise, None. Each instance of EmbeddingColumnNames associates one or more dataframe columns containing vector data, image links, or text with the same embedding feature. Note that the keys of the dictionary are user-specified names that appear in the Phoenix UI and do not refer to columns of the dataframe.
See the guide on how to create Phoenix datasets and schemas for examples.
A dataclass that associates one or more columns of a dataframe with an embedding feature. Instances of this class are only used as values in a dictionary passed to the embedding_feature_column_names field of phoenix.Schema.
See for recommendations on handling local image files.
See the guide on how to create Phoenix datasets and schemas for examples.
Wraps a dataframe that is a flattened representation of spans and traces. Note that it does not require a Schema. See the LLM Traces documentation on how to monitor your LLM application using traces. Because Phoenix can also receive traces from your LLM application directly in real time, TraceDataset is mostly used for loading trace data that has been previously saved to file.
dataframe (pandas.DataFrame): A dataframe, each row of which is a flattened representation of a span. See the LLM Traces documentation for more on traces and spans.
dataframe (pandas.DataFrame): A dataframe, each row of which is a flattened representation of a span. See the LLM Traces documentation for more on traces and spans.
All parameters are optional, and launch_app() launches a Phoenix session with no data that is always ready to receive trace data from your LLM applications in real time. See the LLM Traces documentation for more.
launch_app can accept one or two Dataset instances as arguments. If the app is launched with a single dataset, Phoenix provides model performance and data quality metrics, but not drift metrics. If the app is launched with two datasets, Phoenix provides drift metrics in addition to model performance and data quality metrics. When two datasets are provided, the reference dataset serves as a baseline against which to compare the primary dataset. Common examples of primary and reference datasets include production vs. training or challenger vs. champion.
primary (Optional[Dataset]): The dataset that is of primary interest as the subject of investigation or evaluation.
reference (Optional[Dataset]): If provided, the reference dataset serves as a baseline against which to compare the primary dataset.
corpus (Optional[Dataset]): If provided, the corpus dataset represents the corpus data from which documents are retrieved in a Retrieval-Augmented Generation (RAG) use case. See the corpus data guide for more on how to import this data and for more about the use case.
trace (Optional[TraceDataset]): If provided, a trace dataset containing spans. Phoenix can be started with or without a dataset and will always be able to receive traces in real time from your LLM application. See the LLM Traces documentation for more.
The newly launched session as an instance of Session.
Launch Phoenix as a collector of traces generated by your LLM applications. By default the collector listens on port 6060.
Launch Phoenix with primary and reference datasets prim_ds and ref_ds, both instances of Dataset, with:
Alternatively, launch Phoenix with a single dataset ds, an instance of Dataset, with:
Then session is an instance of Session that can be used to open the Phoenix UI in an inline frame within the notebook or in a separate browser tab or window.
without assigning the returned instance to a variable. If you later find that you need access to the running session object, run
Then session is an instance of Session that can be used to open the Phoenix UI in an inline frame within your notebook or in a separate browser tab or window.
Suppose you previously launched a Phoenix session with launch_app. You can close the running session with:
get_spans_dataframe -> pandas.DataFrame
Returns spans in a pandas.DataFrame. Filters can be applied. See the LLM Traces documentation for more about tracing your LLM application.
Parameters
filter_condition (Optional[str]): A Python expression for filtering spans. See below for examples.
Phoenix users should not instantiate their own phoenix.Session instances. They interact with this API only when an instance of the class is returned by launch_app or active_session.
Launch Phoenix with primary and reference datasets prim_ds and ref_ds, both instances of Dataset, with:
Alternatively, launch Phoenix with a single dataset ds, an instance of Dataset, with:
Get all available spans. See the LLM Traces documentation on how to trace your LLM applications.
Some settings of the Phoenix app can be configured through the environment variables below.
How to create Phoenix datasets and schemas for common data formats
This guide shows you how to define a Phoenix dataset using your own data.
Once you have a pandas dataframe df containing your data and a schema object describing the format of your dataframe, you can define your Phoenix dataset either by running
or by optionally providing a name for your dataset that will appear in the UI:
As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.
Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime objects that represent the time at which each inference was made in production.
timestamp | prediction score | predicted label | actual label
2023-03-01 02:02:19 | 0.91 | click | click
2023-02-17 23:45:48 | 0.37 | no_click | no_click
2023-01-30 15:30:03 | 0.54 | click | no_click
2023-02-03 19:56:09 | 0.74 | click | click
2023-02-24 04:23:43 | 0.37 | no_click | click
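A hedged sketch of the corresponding schema; the column names (timestamp, prediction_score, prediction_label, actual_label) are illustrative assumptions and should match your own dataframe:

```python
import phoenix as px

schema = px.Schema(
    timestamp_column_name="timestamp",
    prediction_score_column_name="prediction_score",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
)
```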
This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.
Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.
578 | Scammeds | 4300 | 62966 | RENT | 110 | 0 | 0 | 25 | male | not_fraud | fraud
507 | Schiller Ltd | 21000 | 52335 | RENT | 129 | 0 | 23 | 78 | female | not_fraud | not_fraud
656 | Kirlin and Sons | 18000 | 94995 | MORTGAGE | 31 | 0 | 0 | 54 | female | uncertain | uncertain
414 | Scammeds | 18000 | 32034 | LEASE | 81 | 2 | 0 | 34 | male | fraud | not_fraud
512 | Champlin and Sons | 20000 | 46005 | OWN | 148 | 1 | 0 | 49 | male | uncertain | uncertain
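A hedged sketch of a schema with explicit features and tags; beyond the FICO score, merchant ID, age, and gender mentioned above, the remaining column names are illustrative assumptions:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    feature_column_names=[
        "fico_score",
        "merchant_id",
        "loan_amount",
        "annual_income",
        "home_ownership",
        "num_credit_lines",
        "inquests_in_last_6_months",
        "months_since_last_delinquency",
    ],
    tag_column_names=["age", "gender"],
)
```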
If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names field of your schema set to its default value of None, in which case any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.
| target | predicted | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
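With a dataframe like the one above, a schema sketch that relies on implicit features might look like this (the label column names are illustrative):

```python
import phoenix as px

# feature_column_names is left at its default of None, so every column
# not referenced elsewhere in the schema is treated as a feature.
schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
)
```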
You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields; otherwise, Phoenix will assume that they are features.
| target | predicted | hospital | insurance_provider | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | Pacific Clinics | uninsured | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | Queens Hospital | Anthem Blue Cross | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | St. Francis Memorial Hospital | Blue Shield of CA | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | Pacific Clinics | Kaiser Permanente | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | CityMed | Anthem Blue Cross | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
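A schema sketch for this dataframe might look like the following; the label column names are illustrative, and the excluded columns come from the table above:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    # Prevent these columns from being implicitly inferred as features.
    excluded_column_names=[
        "hospital",
        "insurance_provider",
    ],
)
```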
Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames to associate multiple dataframe columns with the same embedding feature.
To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name field on px.EmbeddingColumnNames. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:
Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names is a dictionary.
The key of this dictionary, "transaction_embedding," is not a column of your dataframe but a name you choose for your embedding feature that appears in the UI.
The values of this dictionary are instances of px.EmbeddingColumnNames.
Each entry in the "embedding_vector" column is a list of length 4.
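A schema along these lines might look like the following sketch; the label and feature column names mirror the table below and are illustrative, while "transaction_embedding" and "embedding_vector" come from the description above:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    embedding_feature_column_names={
        # The key is the display name of the embedding feature, not a dataframe column.
        "transaction_embedding": px.EmbeddingColumnNames(
            vector_column_name="embedding_vector",
        ),
    },
)
```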
| predicted | target | embedding_vector | fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud | not_fraud | [-0.97, 3.98, -0.03, 2.92] | 604 | Leannon Ward | 22000 | 100781 | RENT | 108 | 0 | 0 |
| fraud | not_fraud | [3.20, 3.95, 2.81, -0.09] | 612 | Scammeds | 7500 | 116184 | MORTGAGE | 42 | 2 | 56 |
| not_fraud | not_fraud | [-0.49, -0.62, 0.08, 2.03] | 646 | Leannon Ward | 32000 | 73666 | RENT | 131 | 0 | 0 |
| not_fraud | not_fraud | [1.69, 0.01, -0.76, 3.64] | 560 | Kirlin and Sons | 19000 | 38589 | MORTGAGE | 131 | 0 | 0 |
| uncertain | uncertain | [1.46, 0.69, 3.26, -0.17] | 636 | Champlin and Sons | 10000 | 100251 | MORTGAGE | 10 | 0 | 3 |
To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length; otherwise, Phoenix will throw an error.
If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name field on px.EmbeddingColumnNames. The following example contains data for an image classification model that detects product defects on an assembly line.
| predicted | image | image_vector |
| --- | --- | --- |
| okay | https://www.example.com/image0.jpeg | [1.73, 2.67, 2.91, 1.79, 1.29] |
| defective | https://www.example.com/image1.jpeg | [2.18, -0.21, 0.87, 3.84, -0.97] |
| okay | https://www.example.com/image2.jpeg | [3.36, -0.62, 2.40, -0.94, 3.69] |
| defective | https://www.example.com/image3.jpeg | [2.77, 2.79, 3.36, 0.60, 3.10] |
| okay | https://www.example.com/image4.jpeg | [1.79, 2.06, 0.53, 3.58, 0.24] |
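For a dataframe like the one above, the embedding feature might be declared as in the following sketch; the column names (predicted, image, image_vector) and the feature name product_image are illustrative:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    embedding_feature_column_names={
        "product_image": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            # Link (URL or locally served path) to the image displayed in the app.
            link_to_data_column_name="image",
        ),
    },
)
```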
For local image data, we recommend the following steps to serve your images via a local HTTP server:
In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.
For example, suppose your HTTP server is running in a directory with the following contents:
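```
.
└── image-data
    └── example_image.jpeg
```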
Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.
If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name field on px.EmbeddingColumnNames. The embeddings below were generated by a sentiment classification model trained on product reviews.
| name | text | text_vector | category | sentiment |
| --- | --- | --- | --- | --- |
| Magic Lamp | Makes a great desk lamp! | [2.66, 0.89, 1.17, 2.21] | office | positive |
| Ergo Desk Chair | This chair is pretty comfortable, but I wish it had better back support. | [3.33, 1.14, 2.57, 2.88] | office | neutral |
| Cloud Nine Mattress | I've been sleeping like a baby since I bought this thing. | [2.5, 3.74, 0.04, -0.94] | bedroom | positive |
| Dr. Fresh's Spearmint Toothpaste | Avoid at all costs, it tastes like soap. | [1.78, -0.24, 1.37, 2.6] | personal_hygiene | negative |
| Ultra-Fuzzy Bath Mat | Cheap quality, began fraying at the edges after the first wash. | [2.71, 0.98, -0.22, 2.1] | bath | negative |
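For the dataframe above, the embedding feature might be declared as in the following sketch; the column names (text, text_vector) and the feature name product_review are illustrative:

```python
import phoenix as px

schema = px.Schema(
    embedding_feature_column_names={
        "product_review": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            # Raw text displayed alongside each embedding in the app.
            raw_data_column_name="text",
        ),
    },
)
```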
Sometimes it is useful to have more than one embedding feature. The example below shows a multi-modal application in which one embedding represents the textual description and another embedding represents the image associated with products on an e-commerce site.
| name | description | description_vector | image | image_vector |
| --- | --- | --- | --- | --- |
| Magic Lamp | Enjoy the most comfortable setting every time for working, studying, relaxing or getting ready to sleep. | [2.47, -0.01, -0.22, 0.93] | https://www.example.com/image0.jpeg | [2.42, 1.95, 0.81, 2.60, 0.27] |
| Ergo Desk Chair | The perfect mesh chair, meticulously developed to deliver maximum comfort and high quality. | [-0.25, 0.07, 2.90, 1.57] | https://www.example.com/image1.jpeg | [3.17, 2.75, 1.39, 0.44, 3.30] |
| Cloud Nine Mattress | Our Cloud Nine Mattress combines cool comfort with maximum affordability. | [1.36, -0.88, -0.45, 0.84] | https://www.example.com/image2.jpeg | [-0.22, 0.87, 1.10, -0.78, 1.25] |
| Dr. Fresh's Spearmint Toothpaste | Natural toothpaste helps remove surface stains for a brighter, whiter smile with anti-plaque formula | [-0.39, 1.29, 0.92, 2.51] | https://www.example.com/image3.jpeg | [1.95, 2.66, 3.97, 0.90, 2.86] |
| Ultra-Fuzzy Bath Mat | The bath mats are made up of 1.18-inch height premium thick, soft and fluffy microfiber, making it great for bathroom, vanity, and master bedroom. | [0.37, 3.22, 1.29, 0.65] | https://www.example.com/image4.jpeg | [0.77, 1.79, 0.52, 3.79, 0.47] |
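A schema with two embedding features might look like the following sketch; the feature names and column names mirror the table above and are illustrative:

```python
import phoenix as px

schema = px.Schema(
    embedding_feature_column_names={
        # One embedding feature for the textual description...
        "product_description": px.EmbeddingColumnNames(
            vector_column_name="description_vector",
            raw_data_column_name="description",
        ),
        # ...and another for the product image.
        "product_image": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)
```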
For a conceptual overview of the Phoenix API, including a high-level introduction to the notion of datasets and schemas, see the concepts section of the documentation.
For a comprehensive description of phoenix.Dataset and phoenix.Schema, see the API reference.
For a conceptual overview of embeddings, see the embeddings section of the documentation.
For a comprehensive description of px.EmbeddingColumnNames, see the API reference.
The features in this example are inferred to be the columns of the dataframe that do not appear in the schema.