Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
Running Phoenix for the first time? Select a quickstart below.
Don't know which one to choose? Phoenix has two main data ingestion methods:
Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.
Learn about best practices, and how to get started with use case examples such as Q&A with Retrieval, Summarization, and Chatbots.
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
AutoGen is a new agent framework from Microsoft that allows for complex Agent creation. It is unique in its ability to create multiple agents that work together.
The AutoGen Agent framework allows creation of multiple agents and connection of those agents to work together to accomplish tasks.
The Phoenix support is simple in its first incarnation, but it allows for capturing all of the prompts and responses that occur under the framework between each agent.
The individual prompts and responses are captured directly through the OpenAI calls.
LLM observability is complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response.
Evaluation is a measure of how well the response answers the prompt.
There are several ways to evaluate LLMs:
You can collect the feedback directly from your users. This is the simplest approach, but users are often unwilling to provide feedback or simply forget to do so. Other challenges arise from implementing this at scale.
The other approach is to use an LLM to evaluate the quality of the response for a particular prompt. This is more scalable and very useful but comes with typical LLM setbacks.
For more complex or agentic workflows, it may not be obvious which call in a span or which span in your trace (a run through your entire use case) is causing the problem. You may need to repeat the evaluation process on several spans before you narrow down the problem.
This pillar is largely about diving deep into the system to isolate the issue you are investigating.
Prompt engineering is the cheapest, fastest, and often the highest-leverage way to improve the performance of your application. Often, LLM performance can be improved simply by comparing different prompt templates, or iterating on the one you have. Prompt analysis is an important component in troubleshooting your LLM's performance.
A common way to improve performance is to feed in more relevant information.
If you can retrieve more relevant information, your prompt improves automatically. Troubleshooting retrieval systems, however, is more complex. Are there queries that don’t have sufficient context? Should you add more context for these queries to get better answers? Or should you change your embeddings or chunking strategy?
Fine-tuning essentially generates a new model that is more aligned with your exact usage conditions. Fine-tuning is expensive, difficult, and may need to be done again as the underlying LLM or other conditions of your system change. It is a very powerful technique that requires much higher effort and complexity.
The toolset is designed to ingest inference data for LLM, CV, NLP, and tabular datasets as well as LLM traces. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues and insights, and easily export data to improve.
LLM Traces - Phoenix is used on top of trace data generated by LlamaIndex and LangChain. The general use case is to troubleshoot LLM applications with agentic workflows.
Inferences - Phoenix is used to troubleshoot models whose datasets can be expressed as DataFrames in Python, such as LLM applications built in Python workflows, CV, NLP, and tabular models.
Use the Phoenix Evals library to easily evaluate tasks such as hallucination, summarization, and retrieval relevance, or create your own custom template.
Get visibility into where your complex or agentic workflow broke, or find performance bottlenecks, across different span types with LLM Tracing.
Identify missing context in your knowledge base, and when irrelevant context is retrieved by visualizing query embeddings alongside knowledge base embeddings with RAG Analysis.
Compare and evaluate performance across model versions prior to deploying to production.
Connect teams and workflows, with continued analysis of production data from Arize in a notebook environment for fine tuning workflows.
Find clusters of problems using performance metrics or drift. Export clusters for retraining workflows.
Use the Embeddings Analyzer to surface data drift for computer vision, NLP, and tabular models.
- This helps you evaluate how well the response answers the prompt by using a separate evaluation LLM.
- This gives you visibility into where more complex or agentic workflows broke.
- Iterating on a prompt template can help improve LLM results.
- Improving the context that goes into the prompt can lead to better LLM responses.
- Fine-tuning generates a new model that is more aligned with your exact usage conditions for improved performance.
Learn more about the Phoenix LLM Evals library.
Learn more about LLM Traces support.
Learn about in Arize.
Learn more about with Phoenix.
Tracing the execution of LLM powered applications using OpenInference Traces
The rise of LangChain and LlamaIndex for LLM app development has enabled developers to move quickly in building applications powered by LLMs. The abstractions created by these frameworks can accelerate development, but also make it hard to debug the LLM app. Take the example below, where a RAG application can be written in a few lines of code but in reality has a very complex run tree.
LLM Traces and Observability lets us understand the system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?”
Phoenix's tracing module is the mechanism by which application code is instrumented, to help make a system observable.
Let's dive into the fundamental building block of traces: the span.
A span represents a unit of work or operation (think a span of time). It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains a name, time-related data, structured log messages, and other metadata (that is, attributes) that provide information about the operation it tracks. A span for an LLM execution in JSON format is displayed below.
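Here is an illustrative example; the field names follow OpenInference conventions, but the values (and the exact set of attributes) are hypothetical and will vary by application:

```json
{
  "name": "llm",
  "span_kind": "LLM",
  "parent_id": "f0a5597f-273a-43e0-b96d-3c20eb9d1958",
  "start_time": "2023-09-07T12:54:47.293000Z",
  "end_time": "2023-09-07T12:54:49.322000Z",
  "status_code": "OK",
  "attributes": {
    "llm.model_name": "gpt-4",
    "llm.token_count.prompt": 312,
    "llm.token_count.completion": 52
  },
  "events": []
}
```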
Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.
A trace records the paths taken by requests (made by an application or end-user) as they propagate through multiple steps.
Without tracing, it is challenging to pinpoint the cause of performance problems in a system.
It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for LLM applications, which commonly have nondeterministic problems or are too complicated to reproduce locally.
Tracing makes debugging and understanding LLM applications less daunting by breaking down what happens within a request as it flows through a system.
A trace is made of one or more spans. The first span is the root span; each root span represents a request from start to finish. The spans underneath the root provide more in-depth context about what occurs during a request (or what steps make up a request).
When a span is created, it is created as one of the following: Chain, Retriever, Reranker, LLM, Embedding, Agent, or Tool.
CHAIN
A Chain is a starting point or a link between different LLM application steps. For example, a Chain span could be used to represent the beginning of a request to an LLM application or the glue code that passes context from a retriever to an LLM call.
RETRIEVER
A Retriever is a span that represents a data retrieval step. For example, a Retriever span could be used to represent a call to a vector store or a database.
RERANKER
A Reranker is a span that represents the reranking of a set of input documents. For example, a cross-encoder may be used to compute the input documents' relevance scores with respect to a user query, and the top K documents with the highest scores are then returned by the Reranker.
LLM
An LLM is a span that represents a call to an LLM. For example, an LLM span could be used to represent a call to OpenAI or Llama.
EMBEDDING
An Embedding is a span that represents a call to an LLM for an embedding. For example, an Embedding span could be used to represent a call to OpenAI to get an ada-002 embedding for retrieval.
TOOL
A Tool is a span that represents a call to an external tool such as a calculator or a weather API.
AGENT
An Agent is a span that encompasses calls to LLMs and Tools. An agent describes a reasoning block that acts on tools using the guidance of an LLM.
Attributes are key-value pairs that contain metadata that you can use to annotate a span to carry information about the operation it is tracking.
For example, if a span invokes an LLM, you can capture the model name, the invocation parameters, the token count, and so on.
Attributes have the following rules:
Keys must be non-null string values
The picture below shows a time series graph of the drift between two groups of vectors: the primary (typically production) vectors and the reference/baseline vectors. Phoenix uses Euclidean distance as the primary measure of embedding drift and helps you identify times when your dataset is diverging from a given reference baseline.
Moments of high Euclidean distance are an indication that the primary dataset is starting to drift from the reference dataset. As the primary dataset moves further away from the reference (both in angle and in magnitude), the Euclidean distance increases as well. For this reason, times of high Euclidean distance are a good starting point for trying to identify new anomalies and areas of drift.
In Phoenix, you can view the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at that particular time.
When two datasets are used to initialize Phoenix, the clusters are automatically ordered by drift. This means that clusters suffering from the highest amount of under-sampling (more points in the primary dataset than the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.
How to fly with Phoenix
In your Jupyter or Colab environment, run the following command to install.
Once installed, import Phoenix in your notebook with
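For example (a minimal sketch; in a notebook cell you can prefix the install command with !):

```python
# Install Phoenix (run in a terminal, or prefix with ! in a notebook cell):
#   pip install arize-phoenix

# Once installed, import Phoenix in your notebook:
import phoenix as px
```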
In order to be able to ask those questions of a system, the application must be properly instrumented. That is, the application code must emit signals such as traces and logs. An application is properly instrumented when developers don't need to add more instrumentation to troubleshoot an issue, because they already have all of the information they need.
LLM Traces and the accompanying OpenInference specification are designed to be a category of telemetry data used to understand the execution of LLMs and the surrounding application context, such as retrieval from vector stores and the usage of external tools such as search engines or APIs. It lets you understand the inner workings of the individual steps your application takes while also giving you visibility into how your system is running and performing as a whole.
Values must be a non-null string, boolean, floating point value, integer, or an array of these values. Additionally, there are Semantic Attributes, which are known naming conventions for metadata that is typically present in common operations. It's helpful to use semantic attribute naming wherever possible so that common kinds of metadata are standardized across systems. See the OpenInference specification for more information.
Want to learn more about OpenInference Tracing? It is an open-source specification that is continuously evolving. Check out the specification for details.
For each embedding described in the dataset(s), Phoenix serves an embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.
Note that when you are troubleshooting search and retrieval using a corpus dataset, the Euclidean distance of your queries to your knowledge base vectors is presented as query distance.
For an in-depth guide to Euclidean distance and embedding drift, check out the embedding drift concepts documentation.
Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called HDBSCAN. This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.
Phoenix projects the embeddings you provided into lower-dimensional space (3 dimensions) using a dimension reduction algorithm called UMAP (Uniform Manifold Approximation and Projection). This lets you view your embeddings in a visually understandable way. In addition to the point cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box, Phoenix lets you assign colors to the UMAP point cloud by dimension (features, tags, predictions, actuals), performance (correctness, which distinguishes true positives and true negatives from the incorrect predictions), and dataset (to highlight areas of drift). This helps you explore your point cloud from different perspectives depending on what you are looking for.
Note that the above only installs dependencies that are necessary to run the application. Phoenix also has an experimental sub-module where you can find LLM Evals and other bleeding-edge functionality.
For the Retrieval-Augmented Generation (RAG) use case, see the Retrieval section.
See the Retrieval documentation for the Retrieval-Augmented Generation (RAG) use case, where relevant documents are retrieved for the question before constructing the context for the LLM.
who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | Neil Alden Armstrong
who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | Francis Michael Forde
Most evaluation libraries do not follow the trustworthy benchmarking rigor necessary for production environments. Production LLM Evals need to benchmark both a model and a prompt template (the OpenAI model Evals, for instance, focus only on evaluating the model, which is a different use case).
Evaluation libraries are typically difficult to integrate across benchmarking, development, production, or the LangChain/LlamaIndex callback systems. Evals should process batches of data with optimal speed.
They also often force the use of chain abstractions (LangChain, for example, shouldn't be a prerequisite for obtaining evaluations for pipelines that don't use it).
Phoenix Evals are designed to run as fast as possible on batches of Eval data and to maximize the throughput and usage of your API key. The current Phoenix library is 10x faster in throughput than call-by-call approaches integrated into LLM app framework Evals.
Phoenix Evals are designed to run on dataframes, in Python pipelines, or in LangChain and LlamaIndex callbacks. Evals are also supported in Python pipelines for normal LLM deployments not using LlamaIndex or LangChain, and there is one-click support for LangChain and LlamaIndex.
Evals are supported on a span level for LangChain and LlamaIndex.
How to import data for the Retrieval-Augmented Generation (RAG) use case
who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | [7395, 567965, 323794, ... | [11.30, 7.67, 5.85, ...
who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | [38906, 38909, 38912, ... | [11.28, 9.10, 8.39, ...
why is amino group in aniline an ortho para di... | [-0.0431, -0.0407, -0.0597, ... | [779579, 563725, 309367, ... | [-10.89, -10.90, -10.94, ...
Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.
Define the dataset by pairing the dataframe with the schema.
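A sketch of what this can look like for the dataframe above; the RetrievalEmbeddingColumnNames helper and its argument names are assumptions based on Phoenix's retrieval schema, and the column names are illustrative:

```python
import phoenix as px

# Group the query text, its embedding, and the retrieved document IDs and scores
# under prompt_column_names.
query_schema = px.Schema(
    prompt_column_names=px.RetrievalEmbeddingColumnNames(
        raw_data_column_name="query",
        vector_column_name="embedding",
        context_retrieval_ids_column_name="retrieved_document_ids",
        context_retrieval_scores_column_name="relevance_scores",
    )
)

# Pair the dataframe with the schema to define the dataset.
query_ds = px.Dataset(dataframe=query_df, schema=query_schema, name="query")
```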
How to export your data for labeling, evaluation, or fine-tuning
Phoenix is designed to be a pre-production tool that can be used to find interesting or problematic data that can be used for various use-cases:
A subset of production data for re-labeling and training
A subset of data for fine-tuning an LLM
The easiest way to gather traces that have been collected by Phoenix is to directly pull a dataframe of the traces from your Phoenix session object.
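For example (a sketch using the active Phoenix session):

```python
import phoenix as px

# Pull all spans collected by the running session into a pandas dataframe.
trace_df = px.active_session().get_spans_dataframe()
```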
You can also directly get the spans from the tracer or callback:
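For instance, with a LangChain tracer (a sketch; tracer is assumed to be the OpenInferenceTracer attached to your application):

```python
# Retrieve the accumulated spans directly from the tracer.
spans = tracer.get_spans()
```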
Note that the above calls get_spans on a LangChain tracer, but the same exact method exists on the OpenInferenceCallback for LlamaIndex as well.
Embeddings can be extremely useful for fine-tuning. There are two ways to export your embeddings from the Phoenix UI.
To export a cluster (either selected via the lasso tool or via the cluster list on the right-hand panel), click on the export button on the top left of the bottom slide-out.
This LLM Eval detects if the output of a model is a hallucination based on contextual data.
This Eval is specifically designed for hallucinations relative to private or retrieved data: is the answer to a question a hallucination based on a set of contextual data?
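A sketch of running this Eval with the Phoenix evals module; the template and rails constant names may differ slightly across Phoenix versions, and df is assumed to contain the question, context, and answer columns the template expects:

```python
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# Classify each row of the dataframe as factual or hallucinated.
hallucination_eval = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```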
The above Eval shows how to use the hallucination template for Eval detection.
Benchmarking Chunk Size, K and Retrieval Approach
The advent of LLMs is causing a rethinking of the possible architectures of retrieval systems that have been around for decades.
The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data, empowering the LLM to know your data and respond based on the private data you fit into the context window.
As teams set up their retrieval systems, understanding performance and configuring the parameters around RAG (type of retrieval, chunk size, and K) is currently a guessing game for most teams.
The above picture shows a typical retrieval architecture designed for RAG, where there is a vector DB, an LLM, and an optional framework.
This section will go through a script that iterates through all possible parameterizations of setting up a retrieval system and uses Evals to understand the trade-offs.
This overview will run through the scripts in Phoenix for performance analysis of a RAG setup:
The scripts above power the included notebook.
In a typical retrieval flow, a user query is embedded and used to search a vector store for chunks of relevant data.
The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar but not usable to answer the question!
The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question of "Does the chunk of data contain relevant information to answer the question"?
The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.
The Evals generated on each chunk can then be used to generate more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at traditional search and retrieval metrics such as:
MRR
Precision @ K
NDCG
These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.
These metrics can be used overall, by cluster (UMAP), or on individual decisions, making them very powerful to track down problems from the simplest to the most complex.
Retrieval Evals just give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not give an indication of whether the final answer was correct.
The Q&A Evals work to give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares the most about and is one of the most important metrics.
The above Eval shows how the query, chunks, and answer are used to create an overall assessment of the entire system.
The above Q&A Eval shows how the query, chunk, and answer are used to generate a percent-incorrect metric for production evaluations.
The results from the runs will be available in the directory:
experiment_data/
Underneath experiment_data there are two sets of metrics:
The first set of results removes the cases where there are zero retrieved relevant documents. Some clients' test sets have a large number of questions that the documents cannot answer, which can skew the metrics a lot.
experiment_data/results_zero_removed
The second set of results is unfiltered and shows the raw metrics for every retrieval.
experiment_data/results_zero_not_removed
The above picture shows the results of benchmark sweeps across your retrieval system setup. The lower the percentage, the better the results. This is the Q&A Eval.
The above graphs show MRR results across a sweep of different chunk sizes.
How to create Phoenix datasets and schemas for the corpus data
Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
Below is an appropriate schema for the dataframe above. It specifies the id column and that the embedding belongs to the text. Other columns, if they exist, will be detected automatically and need not be specified by the schema.
Define the dataset by pairing the dataframe with the schema.
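A sketch of such a schema and dataset definition, assuming the id, text, and embedding column names shown above (the document_column_names grouping follows Phoenix's corpus schema):

```python
import phoenix as px

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

# Pair the corpus dataframe with the schema to define the corpus dataset.
corpus_ds = px.Dataset(dataframe=corpus_df, schema=corpus_schema, name="corpus")
```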
If you want to contribute to the cutting edge of LLM and ML Observability, you've come to the right place!
To get started, please check out the following:
In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.
A Core reviewer will review your PR in around one business day and provide feedback on any changes it requires to be approved. Once approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.
Your PR is now merged into Phoenix! We’ll shout out your contribution in the release notes.
Easily share data when you discover interesting insights so your data science team can perform further investigation or kickoff retraining workflows.
Oftentimes, the team that notices an issue in their model, for example a prompt/response LLM model, may not be the same team that continues the investigations or kicks off retraining workflows.
With a few lines of Python code, users can export this data into Phoenix for further analysis. This allows team members, such as data scientists, who may not have access to production data today, an easy way to access relevant production data for further analysis in an environment they are familiar with.
They can then easily augment and fine tune the data and verify improved performance, before deploying back to production.
Evaluating LLM outputs is best tackled by using a separate evaluation LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
Phoenix provides pretested Eval templates and convenience functions for a set of common Eval tasks. Learn more about pretested templates below. The library is split into high-level functions that make it easy to run Evals rigorously and building blocks that let you modify and customize templates.
The Phoenix team is dedicated to testing model and template combinations and is continually improving templates for optimized performance. Find the most up-to-date templates on GitHub.
In Retrieval-Augmented Generation (RAG), the retrieval step returns a list of documents relevant to the user query from a (proprietary) knowledge base (a.k.a. the corpus), then the generation step adds the retrieved documents to the prompt context to improve the response accuracy of the Large Language Model (LLM). The IDs of the retrieved documents, along with the relevance scores, if present, can be imported into Phoenix as follows.
Below shows only the relevant subsection of the dataframe. The retrieved_document_ids should match the ids in the corpus data. Note that for each row, the list under the relevance_scores column has the same length as the list under the retrievals column, but it's not necessary for all retrieval lists to have the same length.
A set of traces to run evals with or to share with a teammate
Notice that the get_spans_dataframe method supports a Python expression as an optional str parameter so you can filter down your data to specific traces you care about. For full details, consult the API reference.
To export all clusters of embeddings as a single dataframe (labeled by cluster), click the ... icon on the top right of the screen and click export. Your data will be available either as a Parquet file or back in your notebook via your session as a dataframe.
In Information Retrieval, a document is any piece of information the user may want to retrieve, e.g., a paragraph, an article, or a web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding, then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. A corpus dataset can be imported into Phoenix as shown below.
The launcher accepts the corpus dataset through the corpus= parameter.
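For example (a sketch, assuming query and corpus datasets defined as shown in the corpus documentation):

```python
import phoenix as px

# Launch Phoenix with the query dataset as primary and the knowledge base as the corpus.
session = px.launch_app(primary=query_ds, corpus=corpus_ds)
```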
We encourage you to start with an issue labeled with the good first issue tag on the GitHub issue board to get familiar with our codebase as a first-time contributor.
To submit your code, fork the repository, create a branch on your fork, and open a pull request once your work is ready for review.
To help connect teams and workflows, Phoenix enables continued analysis of production data from Arize in a notebook environment for fine-tuning workflows.
For example, a user may have noticed in Arize that this prompt template is not performing well.
There are two ways to export data out of Arize for further investigation:
The easiest way is to click the export button on the Embeddings and Datasets pages. This will produce a code snippet that you can copy into a Python environment where Phoenix is installed. The code snippet will include the date range you have selected in the platform, in addition to the datasets you have selected.
Users can also query for data directly using the Arize Python export client. We recommend doing this once you're more comfortable with the in-platform export functionality, as you will need to manually enter the date ranges and datasets you want to export.
Precision: 0.93, 0.89, 0.89, 1, 0.80
Recall: 0.72, 0.65, 0.80, 0.44, 0.95
F1: 0.82, 0.75, 0.84, 0.61, 0.87
id | text | embedding
1 | Voyager 2 is a spacecraft used by NASA to expl... | [-0.02785328, -0.04709944, 0.042922903, 0.0559...
2 | The Staturn Nebula is a planetary nebula in th... | [0.03544901, 0.039175965, 0.014074919, -0.0307...
3 | Eris is a dwarf planet and a trans-Neptunian o... | [0.05506449, 0.0031612846, -0.020452883, -0.02...
This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals, which are checks on chunks of data returned, this check is a system-level check of a correct Q&A.
question: This is the question the Q&A system is running against
sampled_answer: This is the answer from the Q&A system.
context: This is the context to be used to answer the question, and is what the Q&A Eval must use to check that the answer is correct
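A sketch of running this Eval; the constant names may differ slightly by Phoenix version, and df is assumed to contain the question, context, and sampled answer columns described above:

```python
from phoenix.experimental.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(QA_PROMPT_RAILS_MAP.values())

# Label each row as a correct or incorrect answer given the retrieved context.
qa_eval = llm_classify(
    dataframe=df,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```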
The above Eval uses the QA template for Q&A analysis on retrieved data.
Precision: 1, 0.99, 0.42, 1, 1.0
Recall: 0.92, 0.83, 1, 0.94, 0.64
F1: 0.96, 0.90, 0.59, 0.97, 0.78
The LLM Evals library is designed to support the building of any custom Eval templates.
Follow these steps to easily build your own Eval with Phoenix.
To do that, you must identify the metric best suited for your use case. Can you use a pre-existing template, or do you need to evaluate something unique to your use case?
Then, you need the golden dataset. This should be representative of the type of data you expect the LLM eval to see. The golden dataset should have the "ground truth" label so that we can measure performance of the LLM eval template. Often such labels come from human feedback.
Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above).
The Evals dataset is designed for easy benchmarking and comes with pre-set downloadable test datasets. The datasets are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.
Then you need to decide which LLM you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
Now comes the core component that we are trying to benchmark and improve: the eval template.
You can adjust an existing template or build your own from scratch.
Be explicit about the following:
What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
To create a new template, all that is needed is to set the input string for the Eval function.
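A sketch of what this can look like; the template text and the dataframe column names are illustrative:

```python
from phoenix.experimental.evals import OpenAIModel, llm_classify

# A custom string template; the {query} and {document} variables are filled
# from the matching dataframe columns for each row.
MY_CUSTOM_TEMPLATE = """
You are evaluating whether a document is relevant to a question.
[BEGIN DATA]
[Question]: {query}
[Document]: {document}
[END DATA]
Respond with a single word, "relevant" or "irrelevant".
"""

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Run the custom template against the df dataframe.
custom_eval = llm_classify(
    dataframe=df,
    template=MY_CUSTOM_TEMPLATE,
    model=model,
    rails=["relevant", "irrelevant"],
)
```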
The above template shows an example of creating an easy-to-use string template. The Phoenix Eval templates support both strings and template objects.
The above example shows the use of the custom template on the df dataframe.
You now need to run the eval across your golden dataset. Then you can generate metrics (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
This Eval checks the correctness and readability of the code from a code generation process. The template variables are:
query: The query is the coding question being asked
code: The code is the code that was returned.
The above shows how to use the code readability template.
Precision: 0.93, 0.76, 0.67, 0.77
Recall: 0.78, 0.93, 1, 0.94
F1: 0.85, 0.85, 0.81, 0.85
Instrument calls to the OpenAI Python Library
Phoenix currently supports calls to the ChatCompletion interface, but more are planned soon.
To view OpenInference traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
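For example (a minimal sketch):

```python
import phoenix as px

# Start a local Phoenix server that collects OpenInference traces.
session = px.launch_app()
```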
Once you have started a Phoenix server, you can instrument the openai Python library using the OpenAIInstrumentor class.
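A sketch of the instrumentation call; the module path is assumed from Phoenix's OpenAI tracing integration:

```python
from phoenix.trace.openai import OpenAIInstrumentor

# Patch the openai library so each ChatCompletion call emits spans to Phoenix.
OpenAIInstrumentor().instrument()
```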
All subsequent calls to the ChatCompletion interface will now report informational spans to Phoenix. These traces and spans are viewable within the Phoenix UI.
If you would like to save your traces to a file for later use, you can directly extract the traces from the tracer and dump them into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents of the file in via a TraceDataset.
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
Quickly explore Phoenix with concrete examples
Phoenix ships with a collection of examples so you can quickly try out the app on concrete use-cases. This guide shows you how to download, inspect, and launch the app with example datasets.
To see a list of datasets available for download, run
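One way to do this (a sketch):

```python
import phoenix as px

# Display the docstring, which lists the example datasets available for download.
help(px.load_example)
```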
This displays the docstring for the phoenix.load_example function, which contains a list of datasets available for download.
Choose the name of a dataset to download and pass it as an argument to phoenix.load_example. For example, run the following to download production and training data for our demo sentiment classification model:
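A sketch of the call; the dataset name below is assumed to be one of the available examples (check the docstring above for the exact names):

```python
datasets = px.load_example("sentiment_classification_language_drift")
```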
px.load_example returns your downloaded data in the form of an ExampleDatasets instance. After running the code above, you should see the following in your cell output.
Next, inspect the name, dataframe, and schema that define your primary dataset: first view the dataset's name, then your dataset's schema, and finally an overview of your dataset's underlying dataframe, as sketched below.
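A sketch of this inspection, assuming the returned ExampleDatasets instance exposes a primary attribute and that datasets carry name, schema, and dataframe properties:

```python
primary = datasets.primary

print(primary.name)       # the name of the dataset that appears in the UI
print(primary.schema)     # the schema describing the dataframe's columns
primary.dataframe.info()  # an overview of the underlying dataframe
```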
Launch Phoenix with
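for example (a sketch, reusing the datasets loaded above; the primary and reference attributes are assumptions):

```python
session = px.launch_app(datasets.primary, datasets.reference)
```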
Follow the instructions in the cell output to open the Phoenix UI in your notebook or in a separate browser tab.
How to define your dataset(s), launch a session, open the UI in your notebook or browser, and close your session when you're done
If you additionally have a dataframe ref_df and a matching ref_schema, you can define a dataset named "reference" with
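a call along these lines (a sketch using the px.Dataset constructor):

```python
ref_ds = px.Dataset(dataframe=ref_df, schema=ref_schema, name="reference")
```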
Use phoenix.launch_app to start your Phoenix session in the background. You can launch Phoenix with zero, one, or two datasets.
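For example, with both a primary and a reference dataset (a sketch; prim_ds is assumed to be defined as shown earlier):

```python
session = px.launch_app(primary=prim_ds, reference=ref_ds)
```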
You can view and interact with the Phoenix UI either directly in your notebook or in a separate browser tab or window.
In a notebook cell, run
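for instance (a sketch, where session is the object returned by px.launch_app):

```python
print(session.url)
```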
Copy and paste the output URL into a new browser tab or window.
In a notebook cell, run
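for instance (a sketch):

```python
session.view()
```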
The Phoenix UI will appear in an inline frame in the cell output.
When you're done using Phoenix, gracefully shut down your running background session with
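a call like the following:

```python
px.close_app()
```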
Yes, you can use either of the two methods below.
Install pyngrok on the remote machine using the command pip install pyngrok.
In the Jupyter notebook, after launching Phoenix, set its port number as the port parameter in the code below. Preferably use a default port for Phoenix so that you won't have to set up an ngrok tunnel every time for a new port; simply restarting Phoenix will work with the same ngrok URL.
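A sketch of the tunneling code, assuming pyngrok and a valid ngrok authtoken (the port value is illustrative and should match the port Phoenix was launched on):

```python
import getpass
from pyngrok import conf, ngrok

# Paste your ngrok authtoken when prompted.
conf.get_default().auth_token = getpass.getpass("Enter your ngrok authtoken: ")

port = 6006  # the port Phoenix was launched on
public_url = ngrok.connect(port).public_url
print(public_url)
```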
"Visit Site" using the newly printed public_url
and ignore warnings, if any.
An ngrok free account does not allow more than 3 tunnels over a single ngrok agent session. Tackle this error by checking the active URL tunnels using ngrok.get_tunnels() and closing the required URL tunnel using ngrok.disconnect(public_url).
This assumes you have already set up ssh on both the local machine and the remote server.
If you are accessing a remote Jupyter notebook from a local machine, you can also access the Phoenix app by forwarding a local port to the remote server via SSH. In this particular case of using Phoenix on a remote server, it is recommended that you use a default port for launching Phoenix, say DEFAULT_PHOENIX_PORT.
Launch the Phoenix app from the Jupyter notebook.
In a new terminal or command prompt, forward a local port of your choice from 49152 to 65535 (say 52362) using the command below. The remote user of the remote host must have sufficient port-forwarding/admin privileges.
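A sketch of the command; the local port, remote user, host, and Phoenix port are placeholders to fill in:

```
ssh -L 52362:localhost:<DEFAULT_PHOENIX_PORT> <remote-user>@<remote-host>
```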
If you are abruptly unable to access phoenix, check whether the ssh connection is still alive by inspecting the terminal. You can also try increasing the ssh timeout settings.
Simply run exit in the terminal/command prompt where you ran the port forwarding command.
This Eval helps evaluate the summarization results of a summarization task. The template variables are:
document: The document text to summarize
summary: The summary of the document
The above shows how to use the summarization Eval template.
Extract OpenInference inferences and traces to visualize and troubleshoot your LLM Application in Phoenix
Traces provide telemetry data about the execution of your LLM application. They are a great way to understand the internals of your LangChain application and to troubleshoot problems related to things like retrieval and tool execution.
To extract traces from your LangChain application, you will have to add Phoenix's OpenInference Tracer to your LangChain application. A tracer is a class that automatically accumulates traces (sometimes referred to as spans) as your application executes. The OpenInference Tracer is a tracer that is specifically designed to work with Phoenix and by default exports the traces to a locally running Phoenix server.
To view traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
Once you have started a Phoenix server, you can start your LangChain application with the OpenInference Tracer as a callback. There are two ways of adding the `tracer` to your LangChain application: by instrumenting all your chains in one go (recommended) or by adding the tracer as a callback to just the parts that you care about (not recommended).
By adding the tracer to the callbacks of LangChain, we've created a one-way data connection between your LLM application and Phoenix. This is because by default the OpenInferenceTracer uses an HTTPExporter to send traces to your locally running Phoenix server! In this scenario the Phoenix server is serving as a Collector of the spans that are exported from your LangChain application.
To view the traces in Phoenix, simply open the UI in your browser.
If you would like to save your traces to a file for later use, you can directly extract the traces from the tracer and dump them into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents of the file in via a TraceDataset.
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
For a fully working example of tracing with LangChain, check out our Colab notebook.
Phoenix supports visualizing LLM application inference data from a LangChain application. In particular you can use Phoenix's embeddings projection and clustering to troubleshoot retrieval-augmented generation. For a tutorial on how to extract embeddings and inferences from LangChain, check out the following notebook.
Meaning, Examples and How To Compute
Embeddings are vector representations of information (e.g., a list of floating point numbers). With embeddings, the distance between two vectors carries semantic meaning: small distances suggest high relatedness and large distances suggest low relatedness. Embeddings are everywhere in modern deep learning: transformers, recommendation engines, layers of deep neural networks, encoders, and decoders.
Embeddings are foundational to machine learning because:
Embeddings can represent various forms of data such as images, audio signals, and even large chunks of structured data.
They provide a common mathematical representation of your data
They compress data
They preserve relationships within your data
They are the output of deep learning layers providing comprehensible linear views into complex non-linear relationships learned by models
Embedding vectors are generally extracted from the activation values of one or many hidden layers of your model. In general, there are many ways of obtaining embedding vectors, including:
Word embeddings
Autoencoder Embeddings
Generative Adversarial Networks (GANs)
Pre-trained Embeddings
Once you have chosen a model to generate embeddings, the question is: how? Here are a few use-case-based examples. In each example you will notice that the embeddings are generated such that the resulting vector represents your input according to your use case.
If you are working on image classification, the model will take an image and classify it into a given set of categories. Each of our embedding vectors should be representative of the corresponding entire image input.
First, we need to use a feature_extractor that will take an image and prepare it for the large pre-trained image model.
Then, we pass the results from the feature_extractor to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Finally, since we want the embedding vector to represent the entire image, we will average across the second dimension, representing the areas of the image.
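Putting these steps together, a sketch using a Hugging Face vision transformer (the checkpoint name is illustrative, and image is assumed to be a PIL image you supply):

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "google/vit-base-patch16-224-in21k"  # illustrative checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Prepare the image for the pre-trained image model.
inputs = feature_extractor(images=image, return_tensors="pt")

# No gradients needed since we are not training the model.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden layer and average across the second (patch) dimension
# so the vector represents the entire image.
embedding = outputs.hidden_states[-1].mean(dim=1).squeeze().numpy()
```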
If you are working on NLP sequence classification (for example, sentiment classification), the model will take a piece of text and classify it into a given set of categories. Hence, your embedding vector must represent the entire piece of text.
For this example, let us assume we are working with a model from the BERT family.
First, we must use a tokenizer that will tokenize the text and prepare it for the pre-trained large language model (LLM).
Then, we pass the results from the tokenizer to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Finally, since we want the embedding vector to represent the entire piece of text for classification, we will use the vector associated with the classification token, [CLS], as our embedding vector.
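Putting these steps together, a sketch assuming a BERT-family checkpoint from Hugging Face (the checkpoint name is illustrative, and text is the input string):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize the text and prepare it for the pre-trained LLM.
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# No gradients needed since we are not training the model.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the [CLS] token (position 0) of the last hidden layer as the embedding
# for the entire piece of text.
embedding = outputs.hidden_states[-1][:, 0, :].squeeze().numpy()
```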
If you are working on NLP Named Entity Recognition (NER), the model will take a piece of text and classify some words within it into a given set of entities. Hence, each of your embedding vectors must represent a classified word or token.
For this example, let us assume we are working with a model from the BERT family.
First, we must use a tokenizer that will tokenize the text and prepare it for the pre-trained large language model (LLM).
Then, we pass the results from the tokenizer to our model. In PyTorch, we use torch.no_grad() since we don't need to compute the gradients for backward propagation; we are not training the model in this example.
It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.
Further, since we want the embedding vector to represent any given token, we will use the vector associated with a specific token in the piece of text as our embedding vector. So, let token_index be the integer value that locates the token of interest in the list of tokens that result from passing the piece of text to the tokenizer, and let ex_index be the integer value that locates a given example in the batch. Then,
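continuing the previous sketch (reusing tokenizer, model, and outputs), the token-level embedding can be selected as follows; the index values are illustrative:

```python
ex_index = 0     # locates the example within the batch
token_index = 5  # locates the token of interest in the tokenized text

# Select the embedding for that specific token from the last hidden layer.
token_embedding = outputs.hidden_states[-1][ex_index, token_index, :].numpy()
```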
Learn how Phoenix fits into your ML stack and how to incorporate Phoenix into your workflows.
Phoenix is designed to run locally on a single server in conjunction with the Notebook.
Phoenix runs locally, close to your data, in an environment that interfaces to Notebook cells on the Notebook server. Designing Phoenix to run locally enables fast iteration on top of local data.
In order to use Phoenix:
Load data into a pandas dataframe (a single dataframe, or two dataframes: primary and reference)
Start Phoenix
Investigate problems
(Optional) Export data
Phoenix is typically started in a notebook from which a local Phoenix server is kicked off. Two approaches can be taken to the overall use of Phoenix:
Single Dataset
In the case of a team that only wants to investigate a single dataset for exploratory data analysis (EDA), a single dataset instantiation of Phoenix can be used. In this scenario, a team is normally analyzing the data in an exploratory manner and is not doing A/B comparisons.
Two Datasets
A common use case in ML is for teams to have two datasets they are comparing, such as training vs. production, model A vs. model B, or production time X vs. production time Y, just to name a few. In this scenario there exists a primary and a reference dataset. When using primary and reference datasets, Phoenix supports drift analysis, embedding drift, and many different A/B dataset comparisons.
Once instantiated, teams can dive into Phoenix on a feature by feature basis, analyzing performance and tracking down issues.
Once an issue is found, the cluster can be exported back into a dataframe for further analysis. Clusters can be used to create groups of similar data points for use downstream. These include:
Finding Similar Examples
Monitoring
Steering Vectors / Steering Prompts
The above picture shows the use of Phoenix with a cloud observability system (this is not required). In this example the cloud observability system allows the easy download (or synchronization) of data to the notebook, typically based on model, batch, environment, and time ranges. Normally this download is done to analyze data at the tail end of a troubleshooting workflow, or periodically to use the notebook environment to monitor your models.
Once in a notebook environment the downloaded data can power Observability workflows that are highly interactive. Phoenix can be used to find clusters of data problems and export those clusters back to the Observability platform for use in monitoring and active learning workflows.
Note: Data can also be downloaded from any data warehouse system for use in Phoenix without the requirement of a cloud ML observability solution.
In the first version of Phoenix it is assumed the data is available locally, but we've also designed it with some broader visions in mind. For example, Phoenix was designed with a stateless metrics engine as a first-class citizen, enabling any metrics checks to be run in any Python data pipeline.
The OpenAI Python library implements Python bindings for OpenAI's popular suite of models. Phoenix provides utilities to instrument calls to OpenAI's API, enabling deep observability into the behavior of an LLM application built on top of these models.
OpenInference Traces collect telemetry data about the execution of your LLM application. Consider using this instrumentation to understand how an OpenAI model is being called inside a complex system and to troubleshoot issues such as extraction and response synthesis. These traces can also help debug operational issues such as rate limits, authentication issues, or improperly set model parameters.
Have an OpenAI API you would like to see instrumented? Drop us a GitHub issue.
Phoenix also supports LlamaIndex and LangChain and has examples that you can take a look at as well.
For a conceptual overview of datasets, including an explanation of when to use a single dataset vs. primary and reference datasets, see the Phoenix Basics concepts documentation.
To define a dataset, you must load your data into a pandas dataframe and define a matching schema. If you have a dataframe prim_df and a matching prim_schema, you can define a dataset named "primary" with
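a call like the following (a sketch using the px.Dataset constructor):

```python
prim_ds = px.Dataset(dataframe=prim_df, schema=prim_schema, name="primary")
```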
See Corpus Data if you have corpus data for an Information Retrieval use case.
You can set the default port for Phoenix each time you launch the application from a Jupyter notebook with the optional port argument in the launch call.
Sign up on ngrok and verify your email. Find 'Your Authtoken' on the ngrok dashboard.
If successful, visit localhost:52362 (or whichever local port you forwarded) to access Phoenix locally.
Phoenix has first-class support for LangChain applications. This means that you can easily extract inferences and traces from your LangChain application and visualize them in Phoenix.
We recommend that you instrument your entire LangChain application to maximize visibility. To do this, we will use the LangChainInstrumentor to add the OpenInferenceTracer to every chain in your application.
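A sketch of the instrumentation; the module path is assumed from Phoenix's LangChain integration:

```python
from phoenix.trace.langchain import LangChainInstrumentor, OpenInferenceTracer

# One tracer instance is attached to every chain in the application.
tracer = OpenInferenceTracer()
LangChainInstrumentor(tracer).instrument()
```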
If you only want traces from parts of your application, you can pass in the tracer to the parts that you care about.
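For example, attaching the tracer to a single chain invocation (the chain and query variables are illustrative):

```python
# Only this call is traced; the rest of the application is left uninstrumented.
response = chain.run(query, callbacks=[tracer])
```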
Embeddings are used for a variety of machine learning problems. To learn more, check out our embeddings course.
Given the wide accessibility of pre-trained transformer models, we will focus on generating embeddings using them. These are models, such as BERT or GPT-x, that are trained on large datasets and can be fine-tuned for a specific task.
(Optional) Leverage embeddings and LLM eval generators
(Optional) Two dataframes: primary and reference
Phoenix currently requires pandas dataframes, which can be downloaded from an ML observability platform, a table, or a raw log file. The data is assumed to be formatted in the Open Inference format with a well-defined column structure, normally including a set of inputs/features, outputs/predictions, and ground truth.
The Phoenix library heavily uses embeddings as a method for data visualization and debugging. In order to use Phoenix with embeddings, they can either be generated using an SDK call or supplied by the user of the library. Phoenix supports embeddings for LLM, image, NLP, and tabular datasets.
Phoenix is designed to monitor, analyze, and troubleshoot issues on top of your model data, allowing for these workflows all within a notebook environment.
Precision: 0.79, 1, 1, 0.57, 0.75
Recall: 0.88, 0.1, 0.16, 0.7, 0.61
F1: 0.83, 0.18, 0.28, 0.63, 0.67
No Dataset - Run Phoenix in the background to collect OpenInference traces emitted by your instrumented LLM application.
Single Dataset - Analyze a single cohort of data, e.g., only training data. Check model performance and data quality, but not drift.
Primary and Reference Datasets - Compare cohorts of data, e.g., training vs. production. Analyze drift in addition to model performance and data quality.
Primary and Corpus Datasets - Compare a query dataset to a corpus dataset to analyze your retrieval-augmented generation applications.
Learn the foundational concepts of the Phoenix API and Application
This section introduces datasets and schemas, the starting concepts needed to use Phoenix.
A Phoenix dataset is an instance of phoenix.Dataset that contains three pieces of information:
- The data itself (a pandas dataframe)
- A schema describing the columns of your dataframe
- A dataset name that appears in the UI
For example, if you have a dataframe prod_df that is described by a schema prod_schema, you can define a dataset prod_ds with
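a call like this (a sketch using the px.Dataset constructor):

```python
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")
```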
If you launch Phoenix with this dataset, you will see a dataset named "production" in the UI.
You can launch Phoenix with zero, one, or two datasets.
With no datasets, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single dataset, Phoenix provides insights into model performance and data quality. With two datasets, Phoenix compares your datasets and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.
Your reference dataset provides a baseline against which to compare your primary dataset.
To compare two datasets with Phoenix, you must select one dataset as primary and one to serve as a reference. As the name suggests, your primary dataset contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference dataset, in contrast, is usually of secondary importance and serves as a baseline against which to compare your primary dataset.
Very often, your primary dataset will contain production data and your reference dataset will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference datasets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.
For example, if you have a dataframe containing Fisher's Iris data that looks like this:
7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor
5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa
6.3 | 3.3 | 4.7 | 1.6 | versicolor | versicolor
6.2 | 3.4 | 5.4 | 2.3 | virginica | setosa
5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica
your schema might look like this:
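A sketch of such a schema; the feature column names and the names of the two label columns (actual species on the left, predicted species on the right) are assumptions about the dataframe above:

```python
import phoenix as px

schema = px.Schema(
    feature_column_names=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    actual_label_column_name="actual_species",
    prediction_label_column_name="predicted_species",
)
```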
Usually one, sometimes two.
Each dataset needs a schema. If your primary and reference datasets have the same format, then you only need one schema. For example, if you have dataframes train_df and prod_df that share an identical format described by a schema named schema, then you can define datasets train_ds and prod_ds with
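calls like these (a sketch):

```python
train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")
prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name="production")
```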
Sometimes, you'll encounter scenarios where the formats of your primary and reference datasets differ. For example, you'll need two schemas if:
Your production data has timestamps indicating the time at which an inference was made, but your training data does not.
A new version of your model has a differing set of features from a previous version.
In cases like these, you'll need to define two schemas, one for each dataset. For example, if you have dataframes train_df and prod_df that are described by schemas train_schema and prod_schema, respectively, then you can define datasets train_ds and prod_ds with
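calls along these lines (a sketch):

```python
train_ds = px.Dataset(dataframe=train_df, schema=train_schema, name="training")
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")
```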
Phoenix runs as an application that can be viewed in a web browser tab or within your notebook as a cell. To launch the app, simply pass one or more datasets into the launch_app function:
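For example (a sketch, reusing the datasets defined above):

```python
session = px.launch_app(primary=prod_ds, reference=train_ds)
```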
The following are simple functions on top of the LLM Evals building blocks that are pre-tested with benchmark datasets.
The models are instantiated and usable in the LLM Eval function. The models are also directly callable with strings.
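For example (a sketch; the wrapper name follows the Phoenix evals module, and the model choice is illustrative):

```python
from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# The model object is passed to the Eval functions, and is also directly callable with a string.
print(model("Hello, world!"))
```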
GPT-4 ✔
GPT-3.5 Turbo ✔
GPT-3.5 Instruct ✔
Azure Hosted Open AI ✔
Palm 2 Vertex ✔
AWS Bedrock ✔
Litellm (coming soon)
Huggingface Llama7B (coming soon)
Anthropic (coming soon)
Cohere (coming soon)
The above diagram shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment is designed to enable testing of the Eval model and Eval template performance against a designed set of datasets.
The above approach allows us to compare models easily in an understandable format:
Precision: 0.94, 0.94
Recall: 0.75, 0.71
F1: 0.83, 0.81
The following shows the results of the toxicity Eval on a toxic dataset test to identify if the AI response is racist, biased, or toxic. The template variables are:
text: the text to be classified
The above is the use of the toxicity template.
Note: Palm is not useful for toxicity detection, as it always returns an empty string ("") for toxic inputs.
Precision: 0.91, 0.93, 0.95, no response for toxic input, 0.86
Recall: 0.91, 0.83, 0.79, no response for toxic input, 0.40
F1: 0.91, 0.87, 0.87, no response for toxic input, 0.54
But what if I don't have embeddings handy? That is not a problem: the model data can be analyzed using embeddings auto-generated for Phoenix.
We support generating embeddings for you for the following types of data:
CV - Computer Vision
NLP - Natural Language
Tabular Data - Pandas Dataframes
We extract the embeddings in the appropriate way depending on your use case, and we return them to you to include in your pandas dataframe, which you can then analyze using Phoenix.
Auto-Embeddings works end-to-end: you don't have to worry about formatting your inputs for the correct model. By simply passing your input, an embedding will come out as a result. We take care of everything in between.
If you want to use this functionality as part of our Python SDK, you need to install it with the extra dependencies using pip install arize[AutoEmbeddings].
You can get an updated table listing of supported models by running the line below.
Auto-Embeddings is designed to require minimal code from the user. We only require two steps:
Create the generator: you simply instantiate the generator using EmbeddingGenerator.from_use_case(), passing information about your use case, the model to use, and more options depending on the use case; see the examples below.
Let Arize generate your embeddings: obtain your embeddings column by calling generator.generate_embedding() and passing the column containing your inputs; see the examples below.
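A rough sketch of the two steps for an NLP use case, using the calls named above; the import path, use-case enum, model name, and column names are assumptions and may differ in the SDK:

```python
from arize.pandas.embeddings import EmbeddingGenerator, UseCases  # assumed import path

# Step 1: create the generator for your use case.
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,  # assumed enum value
    model_name="distilbert-base-uncased",           # illustrative model choice
)

# Step 2: generate the embeddings column from the column containing your inputs.
df["text_vector"] = generator.generate_embedding(df["text"])
```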
Arize expects the dataframe's index to be sorted and begin at 0. If you perform operations that might affect the index prior to generating embeddings, reset the index as follows:
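For example:

```python
df = df.reset_index(drop=True)
```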
This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.
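A sketch of running this Eval; the constant names may differ slightly by Phoenix version, and df is assumed to have the query and document columns the template expects:

```python
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# Label each retrieved chunk as relevant or irrelevant to its query.
relevance_eval = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```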
The above runs the RAG relevancy LLM template against the dataframe df.
Precision: 0.70, 0.42, 0.53, 0.79
Recall: 0.88, 1.0, 1, 0.22
F1: 0.78, 0.59, 0.69, 0.34
Using LLMs to extract structured data from unstructured text
OpenAI Functions
Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense: LLMs are great at inherently capturing the structure of language, so extracting that structure from text with LLM prompting is a low-cost, high-scale method of pulling relevant data out of unstructured text.
One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:
User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?
As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:
With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.
The ChatCompletion call to OpenAI would look like the following.
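A hedged sketch using the legacy openai Python client; the schema fields (location, budget, purpose) and the function name are illustrative assumptions for the trip-planning example, not a prescribed format:

```python
import openai

# Illustrative flattened schema for the trip-planning example.
parameters_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "Desired destination"},
        "budget": {"type": "string", "description": "e.g. budget-friendly, luxury"},
        "purpose": {"type": "string", "description": "e.g. family vacation"},
    },
    "required": ["location"],
}

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "I need a budget-friendly hotel in San Francisco close to the "
                "Golden Gate Bridge for a family vacation. What do you recommend?"
            ),
        }
    ],
    functions=[
        {
            "name": "record_travel_request",  # hypothetical function name
            "description": "Records the attributes of a travel request",
            "parameters": parameters_schema,
        }
    ],
    function_call={"name": "record_travel_request"},
)

# The extracted attributes arrive as a JSON string of function-call arguments.
print(response["choices"][0]["message"]["function_call"]["arguments"])
```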
You can use Phoenix spans and traces to inspect the invocation parameters of the function to:
verify the inputs to the model in the form of the user message
verify your request to OpenAI
verify that the corresponding generated outputs from the model match what's expected from the schema and are correct
Point level evaluation is a great starting point, but verifying correctness of extraction at scale or in a batch pipeline can be challenging and expensive. Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:
The diverse nature and format of source data.
The potential absence of a 'ground truth' for comparison.
The intricacies of context and meaning in extracted data.
Inspect the inner-workings of your LLM Application using OpenInference Traces
The easiest method of using Phoenix traces with LLM frameworks (or direct OpenAI API) is to stream the execution of your application to a locally running Phoenix server. The traces collected during execution can then be stored for later use for things like validation, evaluation, and fine-tuning.
In Memory: useful for debugging.
Cloud (coming soon): store traces in your cloud buckets as assets for later use
To get started with traces, you will first want to start a local Phoenix app.
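For example:

```python
import phoenix as px

# Launch the Phoenix app; it acts as a collector for traces emitted by your
# locally running LLM application.
session = px.launch_app()
print(session.url)  # open this URL in a browser to view the traces
```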
The above launches a Phoenix server that acts as a trace collector for any LLM application running locally.
Once you've executed a sufficient number of queries (or chats) to your application, you can view the details in the UI by refreshing the browser URL.
There are two ways to extract trace dataframes. The two ways for LangChain are described below.
In addition to launching phoenix on LlamaIndex and LangChain, teams can export trace data to a dataframe in order to run LLM Evals on the data.
Phoenix can be used to understand and troubleshoot your LLM application by surfacing:
Application latency - highlighting slow invocations of LLMs, Retrievers, etc.
Token Usage - Displays the breakdown of token usage with LLMs to surface your most expensive LLM calls
Runtime Exceptions - Critical runtime exceptions such as rate-limiting are captured as exception events.
Retrieved Documents - view all the documents retrieved during a retriever call and the score and order in which they were returned
Embeddings - view the embedding text used for retrieval and the underlying embedding model
LLM Parameters - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
Prompt Templates - Figure out what prompt template is used during the prompting step and what variables were used.
Tool Descriptions - view the description and function signature of the tools your LLM has been given access to
LLM Function Calls - if using OpenAI or another model with function calls, you can view the function selection and function messages in the input messages to the LLM.
Primary and Corpus Datasets
For comprehensive descriptions of phoenix.Dataset and phoenix.Schema, see the API reference.
For tips on creating your own Phoenix datasets and schemas, see the how-to guide.
A schema (a phoenix.Schema instance) that describes the format of your dataframe
The only difference for the corpus dataset is that it needs a separate schema, because it has a different set of columns compared to the model data. See the corpus data section for more details.
A Phoenix schema is an instance of phoenix.Schema that maps the columns of your dataframe to fields that Phoenix expects and understands. Use your schema to tell Phoenix what the data in your dataframe means.
Your training data has ground truth (what we call actuals in Phoenix nomenclature), but your production data does not.
A corpus dataset, containing documents for information retrieval, typically has a different set of columns than those found in the model data from either production or training, and requires a separate schema. Below is an example schema for a corpus dataset with three columns: the id, text, and embedding for each document in the corpus.
The application provides you with a landing page that is populated with your model's schema (e.g., the features, tags, predictions, and actuals). This gives you a statistical overview of your data as well as links into the views for analysis.
All eval templates are tested against golden datasets that are available as part of the LLM eval library's benchmark datasets and target precision of 70-90% and F1 of 70-85%.
We currently support a growing set of models for LLM Evals; please check out the section on evaluation models.
Phoenix supports any type of dense embedding generated for almost any type of data.
Generating embeddings is likely another problem to solve, on top of ensuring your model is performing properly. With our Python SDK, you can offload that task to the SDK, and we will generate the embeddings for you. We use large, pre-trained models that capture information from your inputs and encode it into embedding vectors.
Structured extraction is a place where it's simplest to work directly with the OpenAI SDK. OpenAI functions for structured data extraction recommend providing a JSON schema object in the form of parameters_schema (the desired fields for the structured data output).
To learn more about how to evaluate structured extraction applications, check out the example notebooks!
The traces can be collected and stored in the following ways:
Local File: Persistent and good for offline local development.
The launch_app command will print out a URL for you to view the Phoenix UI. You can access this URL again at any time via the active session.
Now that Phoenix is up and running, you can run a LlamaIndex or LangChain application, or just call the OpenAI API directly, and debug your application as the traces stream in.
If you are using llama-index>0.8.36, you will be able to instrument your application with LlamaIndex's observability.
Phoenix also supports datasets that contain trace data. This allows data from a running LangChain or LlamaIndex instance to be explored for analysis offline.
For full details on how to export trace data, see the export documentation.
For full details, check out the relevance Eval example.
Traces are a powerful way to troubleshoot and understand your application and can be leveraged to evaluate the quality of your application. For a full list of notebooks that illustrate this in full color, please check out the example notebooks.
See the for the full details as well as support for older versions of LlamaIndex
See the for details
Use Zero Datasets When:
You want to run Phoenix in the background to collect trace data from your instrumented LLM application.
Use a Single Dataset When:
You have only a single cohort of data, e.g., only training data.
You care about model performance and data quality, but not drift.
Use Two Datasets When:
You want to compare cohorts of data, e.g., training vs. production.
You care about drift in addition to model performance and data quality.
Retrieval Eval
Tested on:
MS Marco, WikiQA
Hallucination Eval
Tested on:
Hallucination QA Dataset, Hallucination RAG Dataset
Toxicity Eval
Tested on:
WikiToxic
Q&A Eval
Tested on:
WikiQA
Summarization Eval
Tested on:
GigaWorld, CNNDM, Xsum
Code Generation Eval
Tested on:
WikiSQL, HumanEval, CodeXGlu
Observability for all model types (LLM, NLP, CV, Tabular)
Phoenix Inferences allows you to observe the performance of your model through visualizing all the model’s inferences in one interactive UMAP view.
This powerful visualization can be leveraged during EDA to understand model drift, find low performing clusters, uncover retrieval issues, and export data for retraining / fine tuning.
The following Quickstart can be executed in a Jupyter notebook or Google Colab.
We will begin by logging just a training set. Then proceed to add a production set for comparison.
Use pip or conda to install arize-phoenix.
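For example (the conda-forge channel is assumed for the conda route):

```python
# Install Phoenix from a terminal or notebook cell:
#   pip install arize-phoenix
# or, with conda:
#   conda install -c conda-forge arize-phoenix
import phoenix as px  # verify the install by importing the package
```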
Phoenix visualizes data taken from a pandas dataframe, where each row of the dataframe encompasses all the information about each inference (including feature values, prediction, metadata, etc.).
Let’s begin by working with the training set for this model.
Download the dataset and load it into a Pandas dataframe.
Preview the dataframe with train_df.head() and note that each row contains all the data specific to this CV model for each inference.
Before we can log this dataset, we need to define a Schema object to describe this dataset.
The Schema object informs Phoenix of the fields that the columns of the dataframe should map to.
Here we define a Schema to describe our particular CV training set:
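A hedged sketch of what such a schema might look like; the column names below (prediction_ts, predicted_action, actual_action, image_vector, url) are illustrative assumptions and should be matched to your own dataframe:

```python
import phoenix as px

# Illustrative schema for a CV training set; column names are assumptions.
train_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)
```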
Important: The fields used in a Schema will vary depending on the model type that you are working with.
Wrap your train_df and schema train_schema into a Phoenix Dataset object:
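A minimal sketch, assuming the train_df and train_schema from the previous steps:

```python
# Wrap the dataframe and its schema into a Phoenix Dataset.
train_ds = px.Dataset(dataframe=train_df, schema=train_schema, name="training")
```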
We are now ready to launch Phoenix with our Dataset!
Here, we are passing train_ds as the primary dataset, as we are only visualizing one dataset (see Step 6 for adding additional datasets).
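A minimal sketch:

```python
# Launch Phoenix with the training dataset as the primary (and only) dataset.
session = px.launch_app(primary=train_ds)
```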
Running this will fire up a Phoenix visualization. Follow the instructions in the output to view Phoenix in a browser, or in-line in your notebook:
You are now ready to observe the training set of your model!
Optional - try the following exercises to familiarize yourself more with Phoenix:
We will continue on with our CV model example above, and add a set of production data from our model to our visualization.
This will allow us to analyze drift and conduct A/B comparisons of our production data against our training set.
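A hedged sketch of the production schema, reusing the illustrative column names assumed for the training schema above:

```python
import phoenix as px

# Production schema: same illustrative columns as the training schema,
# minus the ground truth (actual) label column.
prod_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)
```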
Note that this schema differs slightly from our train_schema above, as our prod_df does not have a ground truth column!
This time, we will include both train_ds and prod_ds when calling launch_app.
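For example (treating production as the primary dataset and training as the reference is one common choice):

```python
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")

# Compare production against the training baseline.
session = px.launch_app(primary=prod_ds, reference=train_ds)
```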
Once again, enter your Phoenix app with the new link generated by your session, e.g., http://127.0.0.1:6060/.
You are now ready to conduct comparative Root Cause Analysis!
Optional - try the following exercises to familiarize yourself more with Phoenix:
Once you have identified datapoints of interest, you can export this data directly from the Phoenix app for further analysis, or to incorporate these into downstream model retraining and finetuning flows.
Once your model is ready for production, you can add Arize to enable production-grade observability. Phoenix works in conjunction with Arize to enable end-to-end model development and observability.
With Arize, you will additionally benefit from:
Being able to publish and observe your models in real-time as inferences are being served, and/or via direct connectors from your table/storage solution
Scalable compute to handle billions of predictions
Ability to set up monitors & alerts
Production-grade observability
Integration with Phoenix for model iteration to observability
Enterprise-grade RBAC and SSO
Experiment with infinite permutations of model versions and filters
You have corpus data for an information retrieval use case. See the corpus data section.
For this Quickstart, we will show an example of visualizing the inferences from a computer vision model. See the example notebooks for all model types.
For examples of how schemas are defined for other model types (NLP, tabular, LLM-based applications), see the example notebooks.
Checkpoint A.
Note that Phoenix automatically generates clusters for you on your data using a clustering algorithm called HDBSCAN.
Discuss your answers in our Slack community!
In order to visualize drift, conduct A/B model comparisons, or, in the case of an information retrieval use case, compare inferences against a corpus, you will need to add a comparison dataset to your visualization.
Read more about comparison dataset Schemas here:
For more information, see
Checkpoint B.
Discuss your answers in our Slack community!
See more on exporting data:
Create your Arize account and see the full suite of features.
Read more about Embeddings Analysis:
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.
Evals are LLM-powered functions that you can use to evaluate the output of your LLM or generative application
Evals are still experimental and must be installed via pip install arize-phoenix[experimental].
Class used to store and format prompt templates.
text (str): The raw prompt text used as a template.
delimiters (List[str]): List of characters used to locate the variables within the prompt template text. Defaults to ["{", "}"].
text (str): The raw prompt text used as a template.
variables (List[str]): The names of the variables that, once their values are substituted into the template, create the prompt text. These variable names are automatically detected from the template text using the delimiters passed when initializing the class (see the Usage section below).
Define a PromptTemplate by passing a text string and the delimiters to use to locate the variables. The default delimiters are { and }.
If the prompt template variables have been correctly located, you can access them as follows:
The PromptTemplate class can also understand any combination of delimiters. Following the example above, but getting creative with our delimiters:
Once you have a PromptTemplate class instantiated, you can make use of its format method to construct the prompt text resulting from substituting values into the variables. To do so, pass a dictionary mapping the variable names to their values:
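A minimal sketch of the full flow, assuming the class is importable from phoenix.experimental.evals (the template text is illustrative):

```python
from phoenix.experimental.evals import PromptTemplate

template_text = (
    "Classify the sentiment of the following review as {label_a} or {label_b}: {review}"
)

# Default delimiters are "{" and "}".
prompt_template = PromptTemplate(text=template_text)
print(prompt_template.variables)  # ['label_a', 'label_b', 'review']

# Substitute values into the variables with the format method.
prompt = prompt_template.format(
    {"label_a": "positive", "label_b": "negative", "review": "I loved this movie!"}
)
print(prompt)
```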
Note that once you initialize the PromptTemplate class, you don't need to worry about delimiters anymore; they are handled for you.
Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.
dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (PromptTemplate or str): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class instance
rails (List[str]): A list of strings representing the possible output classes of the model's predictions.
system_instruction (Optional[str]): An optional system message for models that support it
verbose (bool, optional): If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True): If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
provide_explanation (bool, default=False): If True, provides an explanation for each classification label. A column named explanation is added to the output dataframe. Currently, this is only available for models with function calling.
pandas.DataFrame: A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or "NOT_PARSABLE" if the model's output could not be parsed.
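A hedged usage sketch; the dataframe columns, template, and rails below are illustrative, and the import path assumes the experimental evals module:

```python
import pandas as pd
from phoenix.experimental.evals import OpenAIModel, llm_classify

# Illustrative dataframe: column names must match the template variables.
df = pd.DataFrame(
    {
        "query": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source observability library."],
    }
)

template = (
    "Given the question: {query}\n"
    "and the reference text: {reference}\n"
    "Answer whether the reference text is relevant to the question. "
    "Respond with a single word, either 'relevant' or 'irrelevant'."
)

labels_df = llm_classify(
    dataframe=df,
    template=template,
    model=OpenAIModel(model_name="gpt-4"),
    rails=["relevant", "irrelevant"],
    provide_explanation=True,
)
print(labels_df[["label", "explanation"]])
```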
Given a pandas dataframe containing queries and retrieved documents, classifies the relevance of each retrieved document to the corresponding query using an LLM.
dataframe (pd.DataFrame): A pandas dataframe containing queries and retrieved documents. If both query_column_name and reference_column_name are present in the input dataframe, those columns are used as inputs and should appear in the following format:
The entries of the query column must be strings.
The entries of the documents column must be lists of strings. Each list may contain an arbitrary number of document texts retrieved for the corresponding query.
If the input dataframe is lacking either query_column_name or reference_column_name but has query and retrieved document columns in OpenInference trace format named "attributes.input.value" and "attributes.retrieval.documents", respectively, then those columns are used as inputs and should appear in the following format:
The entries of the query column must be strings.
The entries of the document column must be lists of OpenInference document objects, each object being a dictionary that stores the document text under the key "document.content".
model (BaseEvalModel): The model used for evaluation.
template (Union[PromptTemplate, str], optional): The template used for evaluation.
rails (List[str], optional): A list of strings representing the possible output classes of the model's predictions.
query_column_name (str, optional): The name of the query column in the dataframe, which should also be a template variable.
reference_column_name (str, optional): The name of the document column in the dataframe, which should also be a template variable.
system_instruction (Optional[str], optional): An optional system message.
evaluations (List[List[str]]): A list of relevant and not relevant classifications. The "shape" of the list should mirror the "shape" of the retrieved documents column, in the sense that it has the same length as the input dataframe and each sub-list has the same length as the corresponding list in the retrieved documents column. The values in the sub-lists are either entries from the rails argument or "NOT_PARSABLE" in the case where the LLM output could not be parsed.
Generates text from a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses.
dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be used as an input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class.
system_instruction (Optional[str], optional): An optional system message.
generations (List[Optional[str]]): A list of strings representing the output of the model for each record
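A hedged usage sketch; the template and columns are illustrative, and the import path assumes the experimental evals module:

```python
import pandas as pd
from phoenix.experimental.evals import OpenAIModel, llm_generate

# Illustrative dataframe: each row provides values for the template variables.
df = pd.DataFrame({"topic": ["the weather", "cooking"]})

template = "Write a question about {topic} that has nothing to do with machine learning."

generations = llm_generate(
    dataframe=df,
    template=template,
    model=OpenAIModel(model_name="gpt-3.5-turbo"),
)
print(generations)  # one generated string (or None) per input row
```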
Explore the capabilities of Phoenix with notebook tutorials for concrete use-cases
Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.
Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.
Visualize your generative application's retrieval process to surface failed retrievals and to find topics not addressed by your knowledge base.
Explore lower-dimensional representations of your embedding data to identify clusters of high-drift and performance degradation.
Statistically analyze your structured data to perform A/B analysis, temporal drift analysis, and more.
How to connect to OpenInference-compliant data via LlamaIndex callbacks
Traces provide telemetry data about the execution of your LLM application. They are a great way to understand the internals of your LlamaIndex application and to troubleshoot problems related to things like retrieval and tool execution.
To extract traces from your LlamaIndex application, you will have to add Phoenix's OpenInferenceTraceCallback to your LlamaIndex application. A callback (in this case an OpenInference Tracer) is a class that automatically accumulates traces (sometimes referred to as spans) as your application executes. The OpenInference Tracer is specifically designed to work with Phoenix and by default exports the traces to a locally running Phoenix server.
To view traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
Once you have started a Phoenix server, you can start your LlamaIndex application with the OpenInferenceTraceCallback as a callback. To do this, you will have to add the callback to the initialization of your LlamaIndex application.
By adding the callback to the callback manager of LlamaIndex, we've created a one-way data connection between your LLM application and Phoenix. This is because, by default, the OpenInferenceTraceCallback uses an HTTPExporter to send traces to your locally running Phoenix server! In this scenario the Phoenix server is serving as a Collector of the spans that are exported from your LlamaIndex application.
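A hedged sketch of wiring up the callback, assuming a llama-index 0.8.x-style API and the phoenix.trace.llama_index module; exact import paths may vary by version, and the data directory and query are illustrative:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.callbacks import CallbackManager
from phoenix.trace.llama_index import OpenInferenceTraceCallbackHandler

# Attach the OpenInference callback so spans are exported to the local Phoenix server.
callback_handler = OpenInferenceTraceCallbackHandler()
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager(handlers=[callback_handler])
)

# Build a simple RAG query engine; every query now produces traces in Phoenix.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
```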
To view the traces in Phoenix, simply open the UI in your browser.
If you would like to save your traces to a file for later use, you can directly extract the traces from the callback. To do so, dump the traces from the tracer into a file (we recommend jsonl for readability).
Now you can save this file for later inspection. To launch the app with the file generated above, simply pass the contents in the file above via a TraceDataset
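The tracer's own dump helpers aren't shown here; as a rough, hedged alternative, you can persist the span dataframe from the running session and reload it into a TraceDataset later (column dtypes may need adjustment after the round trip):

```python
import pandas as pd
import phoenix as px

# Pull the collected spans out of the running session as a flat dataframe
# and persist them for later use.
spans_df = px.active_session().get_spans_dataframe()
spans_df.to_parquet("traces.parquet")

# Later: reload the saved spans and relaunch Phoenix with them.
saved_df = pd.read_parquet("traces.parquet")
px.launch_app(trace=px.TraceDataset(saved_df))
```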
In this way, you can use files as a means to store and communicate interesting traces that you may want to use to share with a team or to use later down the line to fine-tune an LLM or model.
For a fully working example of tracing with LlamaIndex, check out our Colab notebook.
With a few lines of code, you can mount the OpenInferenceCallback to your application.
If you are running the chatbot in a notebook, you can simply flush the callback buffers to dataframes. Phoenix natively supports parsing OpenInference so there is no need to define a schema for your dataset.
In a production setting, LlamaIndex application maintainers can log the data generated by their system by implementing and passing a custom callback to OpenInferenceCallbackHandler. The callback is of type Callable[List[QueryData]] that accepts a buffer of query data from the OpenInferenceCallbackHandler, persists the data (e.g., by uploading to cloud storage or sending to a data ingestion service), and flushes the buffer after data is persisted.
A reference implementation is included below that periodically writes data in OpenInference format to local Parquet files when the buffer exceeds a certain size.
⚠️ In a production setting, it's important to clear the buffer, otherwise, the callback handler will indefinitely accumulate data in memory and eventually cause your system to crash.
For a fully working example, check out our Colab notebook.
OpenInference is an open standard that encompasses model inference and LLM application tracing.
OpenInference is a specification that encompasses two data models:
The OpenInference data format is designed to provide an open, interoperable data format for model inference files. Our goal is for modern ML systems, such as model servers and ML observability platforms, to interface with each other using a common data format.
The goal of this is to define a specification for production inference logs that can be used on top of many file formats including Parquet, Avro, CSV and JSON. It will also support future formats such as Lance.
An inference store is a common approach to store model inferences, normally stored in a data lake or data warehouse.
NLP
Text Generative - Prompt and Response
Text Classification
NER Span Categorization
Tabular:
Regression
Classification
Classification + Score
Multi-Classification
Ranking
Multi-Output/Label
Time Series Forecasting
CV
Classification
Bounding Box
Segmentation
In an inference store the prediction ID is a unique identifier for a model prediction event. The prediction ID defines the inputs to the model, model outputs, latently linked ground truth (actuals), meta data (tags) and model internals (embeddings and/or SHAP).
In this section we will review a flat (non nested structure) prediction event, the following sections will cover how to handle nested structures.
A prediction event can represent a prompt response pair for LLMs where the conversation ID maintains the thread of conversation.
The core components of an inference event are the:
Model input (features/prompt)
Model output (prediction/response)
Ground truth (actuals or latent actuals)
Model ID
Model Version
Environment
Conversation ID
Additional data that may be contained include:
Metadata
SHAP values
Embeddings
Raw links to data
Bounding boxes
The fundamental storage unit in an inference store is an inference event. These events are stored in groups that are logically separated by model ID, model version and environment.
Environment describes where the model is running. For example, we use environments of training, validation/test, and production to describe the different places you run a model.
The production environment is commonly a streaming-like environment. It is streaming in the sense that a production dataset has no beginning or end. The data can be added to it continuously. In most production use cases data is added in small mini batches or real time event-by-event.
The training and validation environments are commonly used to send data in batches. These batches define a group of data for analysis purposes. It’s common in validation/test and training to have the timestamp be optional.
Note: historical backtesting data comparisons on time series data can require non-runtime settings for timestamp use for training and validation
The model ID is a unique human readable identifier for a model within a workspace - it completely separates the model data between logical instances.
The model version is a logical separator for metrics and analysis used to look at different builds of a model. A model version can capture common changes such as weight updates and feature additions.
Unlike Infra observability, the inference store needs some mutability. There needs to be some way in which ground truth is added or updated for a prediction event.
Ground truth is required in the data in order to analyze performance metrics such as precision, recall, AUC, LogLoss, and Accuracy.
Latent ground truth data may need to be “joined” to a prediction ID to enable performance visualization. In Phoenix, the library requires ground truth to be pre-joined to prediction data. In an ML Observability system such as Arize the joining of ground truth is typically done by the system itself.
The above image shows a common use case in ML Observability in which latent ground truth is received by a system and linked back to the original prediction based on a prediction ID.
In addition to ground truth, latent metadata is also required to be linked to a prediction ID. Latent metadata can be critical to analyze model results using additional data tags linked to the original prediction ID.
Examples of Metadata (Tags):
Loan default amount
Loan status
Revenue from conversion or click
Server region
Images bounding box, NLP NER, and Image segmentation
The above picture shows how a nested set of detections can occur for a single image in the prediction body with bounding boxes within the image itself.
A model may have multiple inputs with different embeddings and images for each generating a prediction class. An example might be an insurance claim event with multiple images and a single prediction estimate for the claim.
The above prediction shows hierarchical data. The current version of Phoenix is designed to ingest a flat structure so teams will need to flatten the above hierarchy. An example of flattening is below.
The example above shows an exploded representation of the hierarchical data.
OpenInference Tracing provides a detailed and holistic view of the operations happening within an LLM application. It offers a way to understand the "path" or journey a request takes from start to finish, helping with debugging, performance optimization, and ensuring the smooth flow of operations. Tracing takes advantage of two key components to instrument your code.
Tracer: Responsible for creating spans that contain information about various operations.
Trace Exporters: These are responsible for sending the generated traces to consumers which can be a standard output for debugging, or an OpenInference Collector such as Phoenix.
OpenInference traces are built on top of a unit of work called a span. A span keeps track of how long the execution of a given LLM application step takes and can also store important information about the step in the form of attributes. At a high level, a span has:
Span Context: Contains the trace ID (representing the trace the span belongs to) and the span's ID.
Attributes: Key-value pairs containing metadata to annotate a span. They provide insights about the operation being tracked. Semantic attributes offer standard naming conventions for common metadata.
Span Events: Structured log messages on a span, denoting a significant point in time during the span's duration.
Span Status: Attached to a span to denote its outcome as Unset, Ok, or Error.
Span Kind: Provides a hint on how to assemble the trace. Types include:
Chain: Represents the starting point or link between different LLM application steps.
Retriever: Represents a data retrieval step.
LLM: Represents a call to an LLM.
Embedding: Represents a call to an LLM for embedding.
Tool: Represents a call to an external tool.
Agent: Encompasses calls to LLMs and Tools, describing a reasoning block.
Helps answer questions such as: Are there queries that don’t have sufficient context? Should you add more context for these queries to get better answers? Or can you change your embeddings?
There are varying degrees to which we can evaluate retrieval systems.
Step 1: First, we care whether the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?
Step 2: Once we know there's an issue, we need metrics to trace where specifically it went wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?
Step 3: If retrieval is not the issue, we should check if we even have the right documents to answer the question.
Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.
When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.
Phoenix surfaces up clusters of similar queries that have poor feedback.
Phoenix can help you identify if there is context that is missing from your knowledge base. By visualizing query density, you can understand what topics you need to add additional documentation for in order to improve your chatbot's responses.
By setting the "primary" dataset as the user queries, and the "corpus" dataset as the context I have in my vector store, I can see if there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.
The first thing we need is to collect some samples from your vector store, to be able to compare against later. This lets us see if some sections are not being retrieved, or if some sections are getting a lot of traffic where you might want to beef up your context or documents in that area.
We also will be logging the prompt/response pairs from the deployed application.
LlamaIndex (GPT Index) is a data framework for your LLM application. It's a powerful framework by which you can build an application that leverages RAG (retrieval-augmented generation) to super-charge an LLM with your own data. RAG is an extremely powerful LLM application model because it lets you harness the power of LLMs such as OpenAI's GPT but tuned to your own data and use case.
However when building out a retrieval system, a lot can go wrong that can be detrimental to the user-experience of your question and answer system. Phoenix provides two different ways to gain insights into your LLM application: inference records and tracing.
To provide visibility into how your LLM app is performing, we built the OpenInferenceCallback. The OpenInferenceCallback captures the internals of the LLM app in buffers that conform to the OpenInference format. As your LlamaIndex application runs, the callback captures the timing, embeddings, documents, and other critical internals, and serializes the data to buffers that can be easily materialized as dataframes or as files such as Parquet. Phoenix can ingest OpenInference data natively, making it a seamless integration for analyzing your LLM-powered chatbot. To understand callbacks in detail, consult the LlamaIndex documentation.
For the full guidance on how to materialize your data in files, consult the callback documentation.
For an in-depth look at the OpenInference specification, please consult the spec.
OpenInference Tracing offers a comprehensive view of the inner workings of an LLM application. By breaking down the process into spans and categorizing each span, it offers a clear picture of the operations and their interrelations, making troubleshooting and optimization easier and more effective. For the full details of OpenInference tracing, please consult the specification.
Possibly the most common use case for creating an LLM application is to connect an LLM to proprietary data such as enterprise documents or video transcriptions. Applications such as these are often built on top of LLM frameworks such as LlamaIndex or LangChain, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot looking across a company's knowledge base/context to answer specific questions.
Phoenix can help uncover when irrelevant context is being retrieved using LLM Evals. You can look at a cluster's aggregate relevance metric with precision@k, NDCG, MRR, etc. to identify where to improve. You can also look at a single prompt/response pair and see the relevance of documents.
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster.
Example corpus record: id 1; text "Voyager 2 is a spacecraft used by NASA to expl..."; embedding [-0.02785328, -0.04709944, 0.042922903, 0.0559...].
Example query record: query "who was the first person that walked on the moon"; query embedding [-0.0126, 0.0039, 0.0217, ...]; retrieved document ids [7395, 567965, 323794, ...]; relevance scores [11.30, 7.67, 5.85, ...]; response "Neil Armstrong".
Tracing and Evaluating a LlamaIndex + OpenAI RAG Application
LlamaIndex
OpenAI
retrieval-augmented generation
Tracing and Evaluating a LlamaIndex OpenAI Agent
LlamaIndex
OpenAI
agents
function calling
Tracing and Evaluating a Structured Data Extraction Application with OpenAI Function Calling
OpenAI
structured data extraction
function calling
Tracing and Evaluating a LangChain + OpenAI RAG Application
LangChain
OpenAI
retrieval-augmented generation
Tracing and Evaluating a LangChain Agent
LangChain
OpenAI
agents
function calling
Tracing and Evaluating a LangChain + Vertex AI RAG Application
LangChain
Vertex AI
retrieval-augmented generation
Tracing and Evaluating a LangChain + Google PaLM RAG Application
LangChain
Google PaLM
retrieval-augmented generation
Evaluating Hallucinations
hallucinations
Evaluating Toxicity
toxicity
Evaluating Relevance of Retrieved Documents
document relevance
Evaluating Question-Answering
question-answering
Evaluating Summarization
summarization
Evaluating Code Readability
code readability
Evaluating and Improving Search and Retrieval Applications
LlamaIndex
retrieval-augmented generation
Evaluating and Improving Search and Retrieval Applications
LlamaIndex
Milvus
retrieval-augmented generation
Evaluating and Improving Search and Retrieval Applications
LangChain
Pinecone
retrieval-augmented generation
Active Learning for a Drifting Image Classification Model
image classification
fine-tuning
Root-Cause Analysis for a Drifting Sentiment Classification Model
NLP
sentiment classification
Troubleshooting an LLM Summarization Task
summarization
Collect Chats with GPT
LLMs
Find Clusters, Export, and Explore with GPT
LLMs
exploratory data analysis
Detecting Fraud with Tabular Embeddings
tabular data
anomaly detection
LangChain
LlamaIndex
Is this a bad response to the answer? Most relevant way to measure the application; hard to trace down specifically what to fix.
Is the retrieved context relevant? Directly measures the effectiveness of retrieval; requires additional LLM calls.
Is the knowledge base missing areas of user queries? Query density (drift), Phoenix generated; highlights groups of queries with large distance from context; identifies broad topics missing from the knowledge base, but not small gaps.
Evaluation model classes powering your LLM Evals
We currently support the following LLM providers:
To authenticate with OpenAI you will need, at a minimum, an API key. Our classes will look for it in your environment, or you can pass it via argument as shown above. In addition, you can choose the specific name of the model you want to use and its configuration parameters. The default values specified above are common default values from OpenAI. Quickly instantiate your model as follows:
To authenticate with VertexAI, you must pass either your credentials or a project, location pair. In the following example, we quickly instantiate the VertexAI model as follows:
To authenticate with AWS Bedrock, the following code is used to instantiate a session, and the session is then used with Phoenix Evals:
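Hedged instantiation sketches for the providers above; the parameter values (model name, project, region) are illustrative, and the import paths assume the experimental evals module:

```python
import boto3
from phoenix.experimental.evals import BedrockModel, OpenAIModel, VertexAIModel

# OpenAI: the API key is read from the environment (OPENAI_API_KEY) or passed explicitly.
openai_model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Vertex AI: pass credentials or a project/location pair.
vertex_model = VertexAIModel(project="my-gcp-project", location="us-central1")

# AWS Bedrock: instantiate a boto3 session and hand its runtime client to the model.
session = boto3.Session(region_name="us-east-1")
bedrock_client = session.client("bedrock-runtime")
bedrock_model = BedrockModel(client=bedrock_client)
```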
model.generate: if you want to run multiple prompts through the LLM, you can do so via the generate method.
model.agenerate: in addition, you can run multiple prompts through the LLM asynchronously via the agenerate method.
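A hedged sketch, assuming generate and agenerate accept a list of prompt strings:

```python
import asyncio

from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4")

# Single prompt: call the model directly with a string.
print(model("What is the capital of France?"))

# Multiple prompts, synchronously.
print(model.generate(["What is 2 + 2?", "Name a prime number."]))

# Multiple prompts, asynchronously.
print(asyncio.run(model.agenerate(["What is 2 + 2?", "Name a prime number."])))
```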
Our EvalModels also contain some methods that can help create evaluation applications:
model.get_tokens_from_text
model.get_text_from_tokens
model.max_context_size
Furthermore, LLM models have a limited number of tokens that they can pay attention to. We call this limit the context size or context window. You can access the context size of your model via the property max_context_size. In the following example, we used the model gpt-4-0613, whose context size is 8192 tokens.
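A hedged sketch of these helper methods and the max_context_size property:

```python
from phoenix.experimental.evals import OpenAIModel

model = OpenAIModel(model_name="gpt-4-0613")

# Tokenize text and convert tokens back into text.
tokens = model.get_tokens_from_text("Hello, world!")
text = model.get_text_from_tokens(tokens)

# Inspect the model's context window (8192 tokens for gpt-4-0613).
print(model.max_context_size)
```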
Detailed descriptions of classes and methods related to Phoenix datasets and schemas
A dataset containing a split or cohort of data to be analyzed independently or compared to another cohort. Common examples include training, validation, test, or production datasets.
dataframe (pandas.DataFrame): The data to be analyzed or compared.
name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.
dataframe (pandas.DataFrame): The pandas dataframe of the dataset.
name (str): The name of the dataset.
Define a dataset ds from a pandas dataframe df and a schema object schema by running:
Alternatively, provide a name for the dataset that will appear in the application:
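A minimal sketch covering both forms:

```python
import phoenix as px

# Define a dataset from the dataframe and schema...
ds = px.Dataset(dataframe=df, schema=schema)

# ...or give it a name that will appear in the app.
ds = px.Dataset(dataframe=df, schema=schema, name="training")
```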
Assigns the columns of a pandas dataframe to the appropriate model dimensions (predictions, actuals, features, etc.). Each column of the dataframe should appear in the corresponding schema at most once.
timestamp_column_name (Optional[str]): The name of the dataframe's timestamp column, if one exists. Timestamp columns must be pandas Series with numeric, datetime or object dtypes.
If the timestamp column has numeric dtype (int or float), the entries of the column are interpreted as Unix timestamps, i.e., the number of seconds since midnight on January 1st, 1970.
If the column has datetime dtype and contains timezone-naive timestamps, Phoenix assumes those timestamps belong to the local timezone and converts them to UTC.
If the column has datetime dtype and contains timezone-aware timestamps, those timestamps are converted to UTC.
If the column has object dtype containing ISO 8601 formatted timestamp strings, those entries are converted to datetime dtype UTC timestamps; timezone-naive entries are assumed to belong to the local timezone.
feature_column_names (Optional[List[str]]): The names of the dataframe's feature columns, if any exist. If no feature column names are provided, all dataframe column names that are not included elsewhere in the schema and are not explicitly excluded in excluded_column_names are assumed to be features.
tag_column_names (Optional[List[str]]): The names of the dataframe's tag columns, if any exist. Tags, like features, are attributes that can be used for filtering records of the dataset while using the app. Unlike features, tags are not model inputs and are not used for computing metrics.
prediction_label_column_name (Optional[str]): The name of the dataframe's predicted label column, if one exists. Predicted labels are used for classification problems with categorical model output.
prediction_score_column_name (Optional[str]): The name of the dataframe's predicted score column, if one exists. Predicted scores are used for regression problems with continuous numerical model output.
actual_label_column_name (Optional[str]): The name of the dataframe's actual label column, if one exists. Actual (i.e., ground truth) labels are used for classification problems with categorical model output.
actual_score_column_name (Optional[str]): The name of the dataframe's actual score column, if one exists. Actual (i.e., ground truth) scores are used for regression problems with continuous numerical output.
excluded_column_names (Optional[List[str]]): The names of the dataframe columns to be excluded from the implicitly inferred list of feature column names. This field should only be used for implicit feature discovery, i.e., when feature_column_names is unused and the dataframe contains feature columns not explicitly included in the schema.
vector_column_name (str): The name of the dataframe column containing the embedding vector data. Each entry in the column must be a list, one-dimensional NumPy array, or pandas Series containing numeric values (floats or ints) and must have equal length to all the other entries in the column.
raw_data_column_name (Optional[str]): The name of the dataframe column containing the raw text associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes a piece of text, for example, in the context of NLP.
link_to_data_column_name (Optional[str]): The name of the dataframe column containing links to images associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes an image, for example, in the context of computer vision.
name (str): The name used to identify the dataset in the application. If not provided, a random name will be generated.
name (Optional[str]): The name used to identify the dataset in the application.
The code snippet below shows how to read data from a trace.jsonl file into a TraceDataset, and then pass the dataset to Phoenix through launch_app. Each line of the trace.jsonl file is a JSON string representing a span.
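A hedged sketch of that flow (column dtypes may need adjustment depending on how the file was produced):

```python
import pandas as pd
import phoenix as px

# Each line of trace.jsonl is a JSON string representing a span; read them
# into a flat dataframe.
spans_df = pd.read_json("trace.jsonl", lines=True)

# Wrap the spans in a TraceDataset and hand it to launch_app.
px.launch_app(trace=px.TraceDataset(spans_df))
```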
Detailed descriptions of classes and methods related to Phoenix sessions
Launches and returns a new Phoenix session.
host (Optional[str]): The host on which the server runs. It can also be set using the environment variable PHOENIX_HOST; otherwise it defaults to 127.0.0.1. Most users don't need to worry about this parameter.
port (Optional[int]): The port on which the server listens. It can also be set using the environment variable PHOENIX_PORT; otherwise it defaults to 6060. This parameter is useful if 6060 is already occupied by a separate application.
run_in_thread (bool): Whether the server should run in a Thread or Process. Defaults to True. This can be turned off if there is a problem starting a thread in a Jupyter Notebook.
default_umap_parameters (Optional[Dict[str, Union[int, float]]]): Default UMAP parameters to use when launching the point cloud, e.g., {"n_neighbors": 10, "n_samples": 5, "min_dist": 0.5}.
Returns the active Phoenix Session if one exists; otherwise, returns None.
Suppose you previously ran
Closes the running Phoenix session, if it exists.
The Phoenix server will continue running in the background until it is explicitly closed, even if the Jupyter server and kernel are stopped.
A session that maintains the state of the Phoenix app. Obtain the active session as follows.
view(height: int = 1000) -> IPython.display.IFrame
Displays the Phoenix UI for a running session within an inline frame in the notebook.
Parameters
height (int = 1000): The height in pixels of the inline frame element displaying the Phoenix UI within the notebook. Used to adjust the height of the inline frame to the desired height.
start_time (Optional[datetime]): A Python datetime object for filtering spans by time.
stop_time (Optional[datetime]): A Python datetime object for filtering spans by time.
root_spans_only (Optional[bool]): Whether to return only root spans, i.e., spans without parents. Defaults to False.
url (str): The URL of the running Phoenix session. Can be copied and pasted to open the Phoenix UI in a new browser tab or window.
exports (List[pandas.DataFrame]): A list of pandas dataframes containing exported data, sorted in chronological order. Exports of UMAP cluster data can be initiated in the clustering UI.
Open the Phoenix UI in an inline frame within your notebook with
You can adjust the height of the inline frame by passing the desired height (number of pixels) to the height parameter. For example, instead of the line above, run
to open an inline frame of height 1200 pixels.
As an alternative to an inline frame within your notebook, you can open the Phoenix UI in a new browser tab or window by running
and copying and pasting the URL.
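For example:

```python
import phoenix as px

session = px.active_session()  # or the object returned by px.launch_app()

session.view()             # inline frame in the notebook (default height 1000 px)
session.view(height=1200)  # taller inline frame
print(session.url)         # open in a separate browser tab or window
```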
Once a cluster or subset of your data is selected in the UI, it can be saved by clicking the "Export" button. You can then access your exported data in your notebook via the exports
property on your session
object, which returns a list of dataframes containing each export.
Exported dataframes are listed in chronological order. To access your most recent export, run
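For example:

```python
# session.exports is sorted chronologically; the most recent export is last.
latest_export_df = session.exports[-1]
latest_export_df.head()
```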
Get spans associated with calls to LLMs.
Get spans associated with calls to retrievers in a Retrieval Augmented Generation use case.
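A hedged sketch of span filtering; the span_kind values follow the span-kind list above, but the exact field names in the filter expression are assumptions:

```python
import phoenix as px

session = px.active_session()

# All spans collected so far.
all_spans_df = session.get_spans_dataframe()

# Spans for LLM calls only (filter_condition is a Python expression over span fields).
llm_spans_df = session.get_spans_dataframe("span_kind == 'LLM'")

# Spans for retriever calls in a RAG use case.
retriever_spans_df = session.get_spans_dataframe("span_kind == 'RETRIEVER'")
```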
PHOENIX_PORT
The port on which the server listens.
PHOENIX_HOST
The host on which the server listens.
Below is an example of how to set up the port parameter as an environment variable.
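For example:

```python
import os

# Set the port before launching the app; Phoenix will listen on it instead of 6060.
os.environ["PHOENIX_PORT"] = "6061"

import phoenix as px

px.launch_app()
```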
Retrieval Analyzer w/ Embeddings
Traces and Spans
Retrieval Analyzer w/ Embeddings
Traces and Spans
User feedback or
The code snippet below shows how to initialize OpenAIModel for Azure. Refer to the Azure documentation on how to obtain these values from your Azure deployment.
Find more about the functionality available in our EvalModels in the section.
In this section, we will showcase the methods and properties that our EvalModels have. First, instantiate your model from the list of supported models. Once you've instantiated your model, you can get responses from the LLM by simply calling the model and passing a text string.
schema (phoenix.Schema): A schema that assigns the columns of the dataframe to the appropriate model dimensions (features, predictions, actuals, etc.).
schema (phoenix.Schema): The schema of the dataset.
ds is then passed as the primary or reference argument to launch_app.
prediction_id_column_name (Optional[str]): The name of the dataframe's prediction ID column, if one exists. Prediction IDs are strings that uniquely identify each record in a Phoenix dataset (equivalently, each row in the dataframe). If no prediction ID column name is provided, Phoenix will automatically generate unique UUIDs for each record of the dataset upon initialization.
If no timestamp column is provided, each record in the dataset is assigned the current timestamp upon initialization.
prompt_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of a model's prompt embedding vector, prompt text, and optionally links to external resources.
response_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of a model's response embedding vector, response text, and optionally links to external resources.
embedding_feature_column_names (Optional[Dict[str, EmbeddingColumnNames]]): A dictionary mapping the name of each embedding feature to an instance of EmbeddingColumnNames if any embedding features exist; otherwise, None. Each instance of EmbeddingColumnNames associates one or more dataframe columns containing vector data, image links, or text with the same embedding feature. Note that the keys of the dictionary are user-specified names that appear in the Phoenix UI and do not refer to columns of the dataframe.
See the guide on how to create Phoenix datasets and schemas for examples.
A dataclass that associates one or more columns of a dataframe with an embedding feature. Instances of this class are only used as values in a dictionary passed to the embedding_feature_column_names field of phoenix.Schema.
See for recommendations on handling local image files.
See the guide on how to create Phoenix datasets and schemas for examples.
Wraps a dataframe that is a flattened representation of spans and traces. Note that it does not require a Schema. See the LLM Traces documentation on how to monitor your LLM application using traces. Because Phoenix can also receive traces from your LLM application directly in real time, TraceDataset is mostly used for loading trace data that has been previously saved to file.
dataframe (pandas.DataFrame): A dataframe, each row of which is a flattened representation of a span. See the LLM Traces documentation for more on traces and spans.
dataframe (pandas.DataFrame): A dataframe, each row of which is a flattened representation of a span. See the LLM Traces documentation for more on traces and spans.
All parameters are optional, and launch_app() launches a Phoenix session with no data that is always ready to receive trace data from your LLM applications in real time. See the LLM Traces documentation for more.
launch_app can accept one or two Dataset instances as arguments. If the app is launched with a single dataset, Phoenix provides model performance and data quality metrics, but not drift metrics. If the app is launched with two datasets, Phoenix provides drift metrics in addition to model performance and data quality metrics. When two datasets are provided, the reference dataset serves as a baseline against which to compare the primary dataset. Common examples of primary and reference datasets include production vs. training or challenger vs. champion.
primary (Optional[Dataset]): The dataset that is of primary interest as the subject of investigation or evaluation.
reference (Optional[Dataset]): If provided, the reference dataset serves as a baseline against which to compare the primary dataset.
corpus (Optional[Dataset]): If provided, the corpus dataset represents the corpus data from which documents are retrieved in a Retrieval-Augmented Generation (RAG) use case. See the corpus data guide for more on how to import this data and for more about the use case.
trace (Optional[TraceDataset]): If provided, a trace dataset containing spans. Phoenix can be started with or without a dataset and will always be able to receive traces in real time from your LLM application. See the LLM Traces documentation for more.
The newly launched session as an instance of Session.
Launch Phoenix as a collector of traces generated by your LLM applications. By default the collector listens on port 6060.
Launch Phoenix with primary and reference datasets prim_ds and ref_ds, both instances of Dataset, with:
Alternatively, launch Phoenix with a single dataset ds, an instance of Dataset, with:
Then session is an instance of Session that can be used to open the Phoenix UI in an inline frame within the notebook or in a separate browser tab or window.
without assigning the returned instance to a variable. If you later find that you need access to the running session object, run
Then session is an instance of Session that can be used to open the Phoenix UI in an inline frame within your notebook or in a separate browser tab or window.
Suppose you previously launched a Phoenix session with launch_app. You can close the running session with:
get_spans_dataframe -> pandas.DataFrame
Returns spans in a pandas.DataFrame. Filters can be applied. See the LLM Traces documentation for more about tracing your LLM application.
Parameters
filter_condition (Optional[str]): A Python expression for filtering spans. See below for examples.
Phoenix users should not instantiate their own phoenix.Session instances. They interact with this API only when an instance of the class is returned by launch_app or active_session.
Launch Phoenix with primary and reference datasets prim_ds and ref_ds, both instances of Dataset, with:
Alternatively, launch Phoenix with a single dataset ds, an instance of Dataset, with:
Get all available spans. See the LLM Traces documentation on how to trace your LLM applications.
Some settings of the Phoenix app can be configured through the environment variables below.
How to create Phoenix datasets and schemas for common data formats
This guide shows you how to define a Phoenix dataset using your own data.
Once you have a pandas dataframe df containing your data and a schema object describing the format of your dataframe, you can define your Phoenix dataset either by running
or by optionally providing a name for your dataset that will appear in the UI:
As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.
Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime objects that represent the time at which each inference was made in production.
timestamp | prediction score | predicted label | actual label
2023-03-01 02:02:19 | 0.91 | click | click
2023-02-17 23:45:48 | 0.37 | no_click | no_click
2023-01-30 15:30:03 | 0.54 | click | no_click
2023-02-03 19:56:09 | 0.74 | click | click
2023-02-24 04:23:43 | 0.37 | no_click | click
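A hedged sketch of the corresponding schema; the column names (timestamp, prediction_score, prediction_label, actual_label) are illustrative assumptions and should match your own dataframe:

```python
import phoenix as px

schema = px.Schema(
    timestamp_column_name="timestamp",
    prediction_score_column_name="prediction_score",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
)
```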
This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.
Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.
578 | Scammeds | 4300 | 62966 | RENT | 110 | 0 | 0 | 25 | male | not_fraud | fraud
507 | Schiller Ltd | 21000 | 52335 | RENT | 129 | 0 | 23 | 78 | female | not_fraud | not_fraud
656 | Kirlin and Sons | 18000 | 94995 | MORTGAGE | 31 | 0 | 0 | 54 | female | uncertain | uncertain
414 | Scammeds | 18000 | 32034 | LEASE | 81 | 2 | 0 | 34 | male | fraud | not_fraud
512 | Champlin and Sons | 20000 | 46005 | OWN | 148 | 1 | 0 | 49 | male | uncertain | uncertain
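A hedged sketch of a schema with explicit features and tags; beyond the FICO score, merchant ID, age, and gender mentioned above, the remaining column names are illustrative assumptions:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    feature_column_names=[
        "fico_score",
        "merchant_id",
        "loan_amount",
        "annual_income",
        "home_ownership",
        "num_credit_lines",
        "inquests_in_last_6_months",
        "months_since_last_delinquency",
    ],
    tag_column_names=["age", "gender"],
)
```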
If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names field of your schema set to its default value of None, in which case any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.
| target | predicted | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
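With a dataframe like the one above, a schema sketch that relies on implicit features might look like this (the label column names are illustrative):

```python
import phoenix as px

# feature_column_names is left at its default of None, so every column
# not referenced elsewhere in the schema is treated as a feature.
schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
)
```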
You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields; otherwise, Phoenix will assume that they are features.
| target | predicted | hospital | insurance_provider | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | radius_error | texture_error | perimeter_error | area_error | smoothness_error | compactness_error | concavity_error | concave_points_error | symmetry_error | fractal_dimension_error | worst_radius | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | Pacific Clinics | uninsured | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | Queens Hospital | Anthem Blue Cross | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | St. Francis Memorial Hospital | Blue Shield of CA | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | Pacific Clinics | Kaiser Permanente | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | CityMed | Anthem Blue Cross | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
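A schema sketch for this dataframe might look like the following; the label column names are illustrative, and the excluded columns come from the table above:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    # Prevent these columns from being implicitly inferred as features.
    excluded_column_names=[
        "hospital",
        "insurance_provider",
    ],
)
```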
Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames to associate multiple dataframe columns with the same embedding feature.
To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name field on px.EmbeddingColumnNames. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:
Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names is a dictionary.
The key of this dictionary, "transaction_embedding," is not a column of your dataframe but a name you choose for your embedding feature that appears in the UI.
The values of this dictionary are instances of px.EmbeddingColumnNames.
Each entry in the "embedding_vector" column is a list of length 4.
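A schema along these lines might look like the following sketch; the label and feature column names mirror the table below and are illustrative, while "transaction_embedding" and "embedding_vector" come from the description above:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    embedding_feature_column_names={
        # The key is the display name of the embedding feature, not a dataframe column.
        "transaction_embedding": px.EmbeddingColumnNames(
            vector_column_name="embedding_vector",
        ),
    },
)
```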
| predicted | target | embedding_vector | fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud | not_fraud | [-0.97, 3.98, -0.03, 2.92] | 604 | Leannon Ward | 22000 | 100781 | RENT | 108 | 0 | 0 |
| fraud | not_fraud | [3.20, 3.95, 2.81, -0.09] | 612 | Scammeds | 7500 | 116184 | MORTGAGE | 42 | 2 | 56 |
| not_fraud | not_fraud | [-0.49, -0.62, 0.08, 2.03] | 646 | Leannon Ward | 32000 | 73666 | RENT | 131 | 0 | 0 |
| not_fraud | not_fraud | [1.69, 0.01, -0.76, 3.64] | 560 | Kirlin and Sons | 19000 | 38589 | MORTGAGE | 131 | 0 | 0 |
| uncertain | uncertain | [1.46, 0.69, 3.26, -0.17] | 636 | Champlin and Sons | 10000 | 100251 | MORTGAGE | 10 | 0 | 3 |
To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length; otherwise, Phoenix will throw an error.
If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name field on px.EmbeddingColumnNames. The following example contains data for an image classification model that detects product defects on an assembly line.
| predicted | image | image_vector |
| --- | --- | --- |
| okay | https://www.example.com/image0.jpeg | [1.73, 2.67, 2.91, 1.79, 1.29] |
| defective | https://www.example.com/image1.jpeg | [2.18, -0.21, 0.87, 3.84, -0.97] |
| okay | https://www.example.com/image2.jpeg | [3.36, -0.62, 2.40, -0.94, 3.69] |
| defective | https://www.example.com/image3.jpeg | [2.77, 2.79, 3.36, 0.60, 3.10] |
| okay | https://www.example.com/image4.jpeg | [1.79, 2.06, 0.53, 3.58, 0.24] |
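For a dataframe like the one above, the embedding feature might be declared as in the following sketch; the column names (predicted, image, image_vector) and the feature name product_image are illustrative:

```python
import phoenix as px

schema = px.Schema(
    prediction_label_column_name="predicted",
    embedding_feature_column_names={
        "product_image": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            # Link (URL or locally served path) to the image displayed in the app.
            link_to_data_column_name="image",
        ),
    },
)
```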
For local image data, we recommend the following steps to serve your images via a local HTTP server:
In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.
For example, suppose your HTTP server is running in a directory with the following contents:
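```
.
└── image-data
    └── example_image.jpeg
```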
Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.
If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name field on px.EmbeddingColumnNames. The embeddings below were generated by a sentiment classification model trained on product reviews.
| name | text | text_vector | category | sentiment |
| --- | --- | --- | --- | --- |
| Magic Lamp | Makes a great desk lamp! | [2.66, 0.89, 1.17, 2.21] | office | positive |
| Ergo Desk Chair | This chair is pretty comfortable, but I wish it had better back support. | [3.33, 1.14, 2.57, 2.88] | office | neutral |
| Cloud Nine Mattress | I've been sleeping like a baby since I bought this thing. | [2.5, 3.74, 0.04, -0.94] | bedroom | positive |
| Dr. Fresh's Spearmint Toothpaste | Avoid at all costs, it tastes like soap. | [1.78, -0.24, 1.37, 2.6] | personal_hygiene | negative |
| Ultra-Fuzzy Bath Mat | Cheap quality, began fraying at the edges after the first wash. | [2.71, 0.98, -0.22, 2.1] | bath | negative |
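For the dataframe above, the embedding feature might be declared as in the following sketch; the column names (text, text_vector) and the feature name product_review are illustrative:

```python
import phoenix as px

schema = px.Schema(
    embedding_feature_column_names={
        "product_review": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            # Raw text displayed alongside each embedding in the app.
            raw_data_column_name="text",
        ),
    },
)
```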
Sometimes it is useful to have more than one embedding feature. The example below shows a multi-modal application in which one embedding represents the textual description and another embedding represents the image associated with products on an e-commerce site.
| name | description | description_vector | image | image_vector |
| --- | --- | --- | --- | --- |
| Magic Lamp | Enjoy the most comfortable setting every time for working, studying, relaxing or getting ready to sleep. | [2.47, -0.01, -0.22, 0.93] | https://www.example.com/image0.jpeg | [2.42, 1.95, 0.81, 2.60, 0.27] |
| Ergo Desk Chair | The perfect mesh chair, meticulously developed to deliver maximum comfort and high quality. | [-0.25, 0.07, 2.90, 1.57] | https://www.example.com/image1.jpeg | [3.17, 2.75, 1.39, 0.44, 3.30] |
| Cloud Nine Mattress | Our Cloud Nine Mattress combines cool comfort with maximum affordability. | [1.36, -0.88, -0.45, 0.84] | https://www.example.com/image2.jpeg | [-0.22, 0.87, 1.10, -0.78, 1.25] |
| Dr. Fresh's Spearmint Toothpaste | Natural toothpaste helps remove surface stains for a brighter, whiter smile with anti-plaque formula | [-0.39, 1.29, 0.92, 2.51] | https://www.example.com/image3.jpeg | [1.95, 2.66, 3.97, 0.90, 2.86] |
| Ultra-Fuzzy Bath Mat | The bath mats are made up of 1.18-inch height premium thick, soft and fluffy microfiber, making it great for bathroom, vanity, and master bedroom. | [0.37, 3.22, 1.29, 0.65] | https://www.example.com/image4.jpeg | [0.77, 1.79, 0.52, 3.79, 0.47] |
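A schema with two embedding features might look like the following sketch; the feature names and column names mirror the table above and are illustrative:

```python
import phoenix as px

schema = px.Schema(
    embedding_feature_column_names={
        # One embedding feature for the textual description...
        "product_description": px.EmbeddingColumnNames(
            vector_column_name="description_vector",
            raw_data_column_name="description",
        ),
        # ...and another for the product image.
        "product_image": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)
```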
For a conceptual overview of the Phoenix API, including a high-level introduction to the notion of datasets and schemas, see the concepts section of the documentation.
For a comprehensive description of phoenix.Dataset and phoenix.Schema, see the API reference.
For a conceptual overview of embeddings, see the embeddings section of the documentation.
For a comprehensive description of px.EmbeddingColumnNames, see the API reference.
The features in this example are inferred to be the columns of the dataframe that do not appear in the schema.