1 of 100

English

Arize Phoenix

AI Observability and Evaluation

Features

Quickstarts

Running Phoenix for the first time? Select a quickstart below.

Next Steps

Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.

Add instrumentation for popular packages and libraries such as OpenAI, LangGraph, Vercel AI SDK and more.

Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.

Quickstarts

Not sure where to start? Try a quickstart:

Phoenix runs seamlessly in notebooks, containers, Phoenix Cloud, or from the terminal.

Environments

The Phoenix app can be run in various environments such as Colab and SageMaker notebooks, as well as be served via the terminal or a docker container.

Phoenix Cloud

If you're using Phoenix Cloud, be sure to set the proper environment variables to connect to your instance:

export PHOENIX_CLIENT_HEADERS = "api_key=ENTER YOUR API KEY"
export PHOENIX_COLLECTOR_ENDPOINT = "https://app.phoenix.arize.com"

Container

Notebooks

To start phoenix in a notebook environment, run:

import phoenix as px

session = px.launch_app()

This will start a local Phoenix server. You can initialize the phoenix server with various kinds of data (traces, inferences).

By default, Phoenix does not persist your data when run in a notebook.

Terminal

If you want to start a phoenix server to collect traces, you can also run phoenix directly from the command line:

phoenix serve

This will start the phoenix server on port 6006. If you are running your instrumented notebook or application on the same machine, traces should automatically be exported to http://127.0.0.1:6006 so no additional configuration is needed. However if the server is running remotely, you will have to modify the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to that machine (e.g. http://<my-remote-machine>:<port>)

Tracing

Overview: Tracing

Tracing the execution of LLM applications using Telemetry

Phoenix traces AI applications, via OpenTelemetry and has first-class integrations with LlamaIndex, Langchain, OpenAI, and others.

LLM tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation to provide a detailed timeline of the request's execution.

Using Phoenix's tracing capabilities can provide important insights into the inner workings of your LLM application. By analyzing the collected trace data, you can identify and address various performance and operational issues and improve the overall reliability and efficiency of your system.

Application Latency: Identify and address slow invocations of LLMs, Retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
Token Usage: Gain a detailed breakdown of token usage for your LLM calls, allowing you to identify and optimize the most expensive LLM invocations.
Runtime Exceptions: Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
Retrieved Documents: Inspect the documents retrieved during a Retriever call, including the score and order in which they were returned to provide insight into the retrieval process.
Embeddings: Examine the embedding text used for retrieval and the underlying embedding model to allow you to validate and refine your embedding strategies.
LLM Parameters: Inspect the parameters used when calling an LLM, such as temperature and system prompts, to ensure optimal configuration and debugging.
Prompt Templates: Understand the prompt templates used during the prompting step and the variables that were applied, allowing you to fine-tune and improve your prompting strategies.
Tool Descriptions: View the descriptions and function signatures of the tools your LLM has been given access to in order to better understand and control your LLM’s capabilities.
LLM Function Calls: For LLMs with function call capabilities (e.g., OpenAI), you can inspect the function selection and function messages in the input to the LLM, further improving your ability to debug and optimize your application.

By using tracing in Phoenix, you can gain increased visibility into your LLM application, empowering you to identify and address performance bottlenecks, optimize resource utilization, and ensure the overall reliability and effectiveness of your system.

Next steps

Quickstart: Tracing

Phoenix supports three main options to collect traces:

Quickstarts

Explore a Demo Trace

Features: Tracing

Tracing is a critical part of AI Observability and should be used both in production and development

Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.

This section contains details on Tracing features:

Projects
Annotations
Sessions

Projects

Use projects to organize your LLM traces

Projects provide organizational structure for your AI applications, allowing you to logically separate your observability data. This separation is essential for maintaining clarity and focus.

With Projects, you can:

Segregate traces by environment (development, staging, production)
Isolate different applications or use cases
Track separate experiments without cross-contamination
Maintain dedicated evaluation spaces for specific initiatives
Create team-specific workspaces for collaborative analysis

Projects act as containers that keep related traces and conversations together while preventing them from interfering with unrelated work. This organization becomes increasingly valuable as you scale - allowing you to easily switch between contexts without losing your place or mixing data.

The Project structure also enables comparative analysis across different implementations, models, or time periods. You can run parallel versions of your application in separate projects, then analyze the differences to identify improvements or regressions.

Explore a Demo Project

Annotations

In order to improve your LLM application iteratively, it's vital to collect feedback, annotate data during human review, as well as to establish an evaluation pipeline so that you can monitor your application. In Phoenix we capture this type of feedback in the form of annotations.

Phoenix gives you the ability to annotate traces with feedback from the UI, your application, or wherever you would like to perform evaluation. Phoenix's annotation model is simple yet powerful - given an entity such as a span that is collected, you can assign a label and/or a score to that entity.

Navigate to the Feedback tab in this demo trace to see how LLM-based evaluations appear in Phoenix:

Next Steps

Configure Annotation Configs to guide human annotations.
Learn how to log annotations via the client from your app or in a notebook

Sessions

Track and analyze multi-turn conversations

Sessions enable tracking and organizing related traces across multi-turn conversations with your AI application. When building conversational AI, maintaining context between interactions is critical - Sessions make this possible from an observability perspective.

With Sessions in Phoenix, you can:

Track the entire history of a conversation in a single thread
View conversations in a chatbot-like UI showing inputs and outputs of each turn
Search through sessions to find specific interactions
Track token usage and latency per conversation

This feature is particularly valuable for applications where context builds over time, like chatbots, virtual assistants, or any other multi-turn interaction. By tagging spans with a consistent session ID, you create a connected view that reveals how your application performs across an entire user journey.

Next Steps

Check out how to Setup Sessions

Setup Tracing

Setup using Phoenix OTEL

Learn how to use the phoenix.otel library

Setup using base OTEL

Learn how you can use basic OpenTelemetry to instrument your application.

Using Phoenix Decorators

Learn how to use Phoenix's decorators to easily instrument specific methods or code blocks in your application.

Setup Tracing (TS)

Setup tracing for your TypeScript application.

Setup Projects

Learn about Projects in Phoenix, and how to use them.

Setup Sessions

Understand Sessions and how they can be used to group user conversations.

Setup Projects

Log to a specific project

Phoenix uses projects to group traces. If left unspecified, all traces are sent to a default project.

Projects work by setting something called the Resource attributes (as seen in the OTEL example above). The phoenix server uses the project name attribute to group traces into the appropriate project.

Switching projects in a notebook

Typically you want traces for an LLM app to all be grouped in one project. However, while working with Phoenix inside a notebook, we provide a utility to temporarily associate spans with different projects. You can use this to trace things like evaluations.

from phoenix.trace import using_project

# Switch project to run evals
with using_project("my-eval-project"):
    # all spans created within this context will be associated with
    # the "my-eval-project" project.
    # Run evaluations here...

Add Metadata

Tracing can be augmented and customized by adding Metadata. Metadata includes your own custom attributes, user ids, session ids, prompt templates, and more.

Add Attributes, Metadata, Users

Learn how to add custom metadata and attributes to your traces

Instrument Prompt Templates and Prompt Variables

Learn how to define custom prompt templates and variables in your tracing.

Instrument Prompt Templates and Prompt Variables

Annotate Traces

Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:

Track performance over time
Identify areas for improvement
Compare different model versions or prompts
Gather data for fine-tuning or retraining
Provide stakeholders with concrete metrics on system effectiveness

Phoenix allows you to annotate traces through the Client, the REST API, or the UI.

Guides

Annotating in the UI

How to annotate traces in the UI for analysis and dataset curation

Configuring Annotations

To annotate data in the UI, you first will want to setup a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create various different types of annotations: Categorical, Continuous, and Freeform.

Adding Annotations

Once you have annotations configured, you can associate annotations to the data that you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application. You can also take notes as you go by either clicking on the explain link or by adding your notes to the bottom messages UI. You can always come back and edit / and delete your annotations. Annotations can be deleted from the table view under the Annotations tab.

Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.

Viewing Annotations

As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. was the annotation performed by a human, llm, or code), and so on. This can be particularly useful if you want to see if different annotators disagree.

Exporting Traces with specific Annotation Values

Once you have collected feedback in the form of annotations, you can filter your traces by the annotation values to narrow down to interesting samples (e.x. llm spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for things like experimentation, fine-tuning, or building a human-aligned eval.

Importing & Exporting Traces

Import Existing Traces

Learn how to load a file of traces into Phoenix

Export Data & Query Spans

Learn how to export trace data from Phoenix

Import Existing Traces

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.

Importing Traces to an Existing Phoenix Instance

Launching a new Phoenix Instance with Saved Traces

You can also launch a temporary version of Phoenix in your local notebook to quickly view the traces. But be warned, this Phoenix instance will only last as long as your notebook environment is runing

Exporting Annotated Spans

Span annotations can be an extremely valuable basis for improving your application. The Phoenix client provides useful ways to pull down spans and their associated annotations. This information can be used to:

build new LLM judges
form the basis for new datasets
help identify ideas for improving your application

Pulling Spans

If you only want the spans that contain a specific annotation, you can pass in a query that filters on annotation names, scores, or labels.

The queries can also filter by annotation scores and labels.

This spans dataframe can be used to pull associated annotations.

Instead of an input dataframe, you can also pass in a list of ids:

The annotations and spans dataframes can be easily joined to produce a one-row-per-annotation dataframe that can be used to analyze the annotations!

Advanced

Mask Span Attributes

Learn how to block PII from logging to Phoenix

Suppress Tracing

Learn how to selectively block or turn off tracing

Filter Spans to Export

Learn how to send only certain spans to Phoenix

Capture Multimodal Traces

Learn how to trace images

Suppress Tracing

How to turn off tracing

Tracing can be paused temporarily or disabled permanently.

Pause tracing using context manager

If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.

from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Code running inside this block doesn't generate traces.
    # For example, running LLM evals here won't generate additional traces.
    ...
# Tracing will resume outside the block.
...

Uninstrument the auto-instrumentors permanently

Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below is the examples for LangChain, LlamaIndex and OpenAI, respectively.

LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
# etc.

Filter Spans to Export

Sometimes while instrumenting your application, you may want to filter out or modify certain spans from being sent to Phoenix. For example, you may want to filter out spans that are that contain sensitive information or contain redundant information.

To do this, you can use a custom SpanProcessor and attach it to the OpenTelemetry TracerProvider.

Capture Multimodal Traces

Phoenix supports displaying images that are included in LLM traces.

To view images in Phoenix

Include either a base64 UTF-8 encoded image or an image url in the call made to your LLM

Example

pip install -q "arize-phoenix>=4.29.0" openinference-instrumentation-openai openai

# Check if PHOENIX_API_KEY is present in the environment variables.
# If it is, we'll use the cloud instance of Phoenix. If it's not, we'll start a local instance.
# A third option is to connect to a docker or locally hosted instance.
# See https://arize.com/docs/phoenix/setup/environments for more information.

# Launch Phoenix
import os
if "PHOENIX_API_KEY" in os.environ:
    os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

else:
    import phoenix as px

    px.launch_app().view()

# Connect to Phoenix
from phoenix.otel import register
tracer_provider = register()

# Instrument OpenAI calls in your application
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

# Make a call to OpenAI with an image provided
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

You should see your image appear in Phoenix:

Prompt Engineering

Overview: Prompts

Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse, consistency, and experiment with variations across different models and inputs.

Unlike traditional software, AI applications are non-deterministic and depend on natural language to provide context and guide model output. The pieces of natural language and associated model parameters embedded in your program are known as “prompts.”

Optimizing your prompts is typically the highest-leverage way to improve the behavior of your application, but “prompt engineering” comes with its own set of challenges. You want to be confident that changes to your prompts have the intended effect and don’t introduce regressions.

Prompt Engineering Features

Phoenix offers a comprehensive suite of features to streamline your prompt engineering workflow.

Explore Demo Prompts

Prompt Management

Version and track changes made to prompt templates

Key benefits of prompt management include:

Reusability: Store and load prompts across different use cases.
Versioning: Track changes over time to ensure that the best performing version is deployed for use in your application.
Collaboration: Share prompts with others to maintain consistency and facilitate iteration.

To learn how to get started with prompt management, see Create a prompt

Prompt Playground

Span Replay

Replay LLM spans traced in your application directly in the playground

Have you ever wanted to go back into a multi-step LLM chain and just replay one step to see if you could get a better outcome? Well you can with Phoenix's Span Replay. LLM spans that are stored within Phoenix can be loaded into the Prompt Playground and replayed. Replaying spans inside of Playground enables you to debug and improve the performance of your LLM systems by comparing LLM provider outputs, tweaking model parameters, changing prompt text, and more.

Chat completions generated inside of Playground are automatically instrumented, and the recorded spans are immediately available to be replayed inside of Playground.

Prompts in Code

Pull and push prompt changes via Phoenix's Python and TypeScript Clients

Using Phoenix as a backend, Prompts can be managed and manipulated via code by using our Python or TypeScript SDKs.

With the Phoenix Client SDK you can:

Quickstart: Prompts

Quickstarts

How to: Prompts

Guides on how to do prompt engineering with Phoenix

Getting Started

Prompt Management

Organize and manage prompts with Phoenix to streamline your development workflow

Prompt management is currently available on a feature branch only and will be released in the next major version.

Playground

Iterate on prompts and models in the prompt playground

Using the Playground

General guidelines on how to use Phoenix's prompt playground

Setup

To first get started, you will first Configure AI Providers. In the playground view, create a valid prompt for the LLM and click Run on the top right (or the mod + enter)

If successful you should see the LLM output stream out in the Output section of the UI.

Prompt Editor

Model Configuration

Every prompt instance can be configured to use a specific LLM and set of invocation parameters. Click on the model configuration button at the top of the prompt editor and configure your LLM of choice. Click on the "save as default" option to make your configuration sticky across playground sessions.

Comparing Prompts

The Prompt Playground offers the capability to compare multiple prompt variants directly within the playground. Simply click the + Compare button at the top of the first prompt to create duplicate instances. Each prompt variant manages its own independent template, model, and parameters. This allows you to quickly compare prompts (labeled A, B, C, and D in the UI) and run experiments to determine which prompt and model configuration is optimal for the given task.

Using Datasets with Prompts

Playground Traces

All invocations of an LLM via the playground is recorded for analysis, annotations, evaluations, and dataset curation.

If you simply run an LLM in the playground using the free form inputs (e.g. not using a dataset), Your spans will be recorded in a project aptly titled "playground".

If however you run a prompt over dataset examples, the outputs and spans from your playground runs will be captured as an experiment. Each experiment will be named according to the prompt you ran the experiment over.

Test a prompt

Testing your prompts before you ship them is vital to deploying reliable AI applications

Testing in the Playground

Testing a prompt in the playground

The Playground is a fast and efficient way to refine prompt variations. You can load previous prompts and validate their performance by applying different variables.

Each single-run test in the Playground is recorded as a span in the Playground project, allowing you to revisit and analyze LLM invocations later. These spans can be added to datasets or reloaded for further testing.

Testing a prompt over a dataset

The ideal way to test a prompt is to construct a golden dataset where the dataset examples contains the variables to be applied to the prompt in the inputs and the outputs contains the ideal answer you want from the LLM. This way you can run a given prompt over N number of examples all at once and compare the synthesized answers against the golden answers.

Testing prompt variations side-by-side

Prompt Playground supports side-by-side comparisons of multiple prompt variants. Click + Compare to add a new variant. Whether using Span Replay or testing prompts over a Dataset, the Playground processes inputs through each variant and displays the results for easy comparison.

Testing a prompt using code

Datasets & Experiments

Overview: Datasets & Experiments

The velocity of AI application development is bottlenecked by quality evaluations because AI engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost. High quality evaluations are critical as they can help developers answer these types of questions with greater confidence.

Datasets

Datasets are integral to evaluation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are used to run experiments and evaluations to track improvements to your prompt, LLM, or other parts of your LLM application.

Experiments

In AI development, it's hard to understand how a change will affect performance. This breaks the dev flow, making iteration more guesswork than engineering.

Experiments and evaluations solve this, helping distill the indeterminism of LLMs into tangible feedback that helps you ship more reliable product.

Specifically, good evals help you:

Understand whether an update is an improvement or a regression
Drill down into good / bad examples
Compare specific examples vs. prior runs
Avoid guesswork

How-to: Datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning,

How to create datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning, and much more. Phoenix allows you to build datasets manually, programmatically, or from files.

Exporting datasets

Export datasets for offline analysis, evals, and fine-tuning.

Exporting to CSV - how to quickly download a dataset to use elsewhere

Exporting Datasets

Exporting to CSV

Want to just use the contents of your dataset in another context? Simply click on the export to CSV button on the dataset page and you are good to go!

Exporting for Fine-Tuning

Fine-tuning lets you get more out of the models available by providing:

Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests. Phoenix natively exports OpenAI Fine-Tuning JSONL as long as the dataset contains compatible inputs and outputs.

Exporting OpenAI Evals

How-to: Experiments

How to run experiments

How to use evaluators

Evaluation

Overview: Evals

Phoenix Evals come with:

Speed - Phoenix evals are designed for maximum speed and throughput. Evals run in batches and typically run 10x faster than calling the APIs directly.

How to: Evals

Run evaluations via a job to visualize in the UI as traces stream in.

Evaluate traces captured in Phoenix and export results to the Phoenix UI.

Evaluate tasks with multiple inputs/outputs (ex: text, audio, image) using versatile evaluation tasks.

User Frustration

Teams that are using conversation bots and assistants desire to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back and forth or an entire span to detect whether a user has become frustrated by the conversation.

User Frustration Eval Template

  You are given a conversation where between a user and an assistant.
  Here is the conversation:
  [BEGIN DATA]
  *****************
  Conversation:
  {conversation}
  *****************
  [END DATA]

  Examine the conversation and determine whether or not the user got frustrated from the experience.
  Frustration can range from midly frustrated to extremely frustrated. If the user seemed frustrated
  at the beginning of the conversation but seemed satisfied at the end, they should not be deemed
  as frustrated. Focus on how the user left the conversation.

  Your response must be a single word, either "frustrated" or "ok", and should not
  contain any text or characters aside from that word. "frustrated" means the user was left
  frustrated as a result of the conversation. "ok" means that the user did not get frustrated
  from the conversation.

The following is an example of code snippet showing how to use the eval above template:

from phoenix.evals import (
    USER_FRUSTRATION_PROMPT_RAILS_MAP,
    USER_FRUSTRATION_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails is used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned
rails = list(USER_FRUSTRATION_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=USER_FRUSTRATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

SQL Generation Eval

SQL Generation is a common approach to using an LLM. In many cases the goal is to take a human description of the query and generate matching SQL to the human description.

Example of a Question: How many artists have names longer than 10 characters?

Example Query Generated:

SELECT COUNT(ArtistId) \nFROM artists \nWHERE LENGTH(Name) > 10

The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.

SQL Eval Template

SQL Evaluation Prompt:
-----------------------
You are tasked with determining if the SQL generated appropiately answers a given 
instruction taking into account its generated query and response.

Data:
-----
- [Instruction]: {question}
  This section contains the specific task or problem that the sql query is intended 
  to solve.

- [Reference Query]: {query_gen}
  This is the sql query submitted for evaluation. Analyze it in the context of the 
  provided instruction.

- [Provided Response]: {response}
  This is the response and/or conclusions made after running the sql query through 
  the database

Evaluation:
-----------
Your response should be a single word: either "correct" or "incorrect".
You must assume that the db exists and that columns are appropiately named.
You must take into account the response as additional information to determine the 
correctness.

Running an SQL Generation Eval

rails = list(SQL_GEN_EVAL_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)
relevance_classifications = llm_classify(
    dataframe=df,
    template=SQL_GEN_EVAL_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Agent Path Convergence

When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway it took to get there. You want most of your runs to be consistent and not take unnecessarily frivolous or wrong actions.

One way of doing this is to calculate convergence:

Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score: avg(minimum steps taken / steps taken for this run)

This will give a convergence score of 0-1, with 1 being a perfect score.

Agent Planning

This template evaluates a plan generated by an agent. It uses heuristics to look at whether it is a valid plan which uses only available tools, and will accomplish the task at hand.

Prompt Template

Agent Reflection

Use this prompt template to evaluate an agent's final response. This is an optional step, which you can use as a gate to retry a set of actions if the response or state of the world is insufficient for the given task.

You are an expert in {topic}. I will give you a user query. Your task is to reflect on your provided solution and whether it has solved the problem.
First, explain whether you believe the solution is correct or incorrect.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, create a list of detailed instructions to help you correctly solve this problem in the future if it is incorrect.

Be concise in your response; however, capture all of the essential information.

Here is the data:
    [BEGIN DATA]
    ************
    [User Query]: {user_query}
    ************
    [Tools]: {tool_definitions}
    ************
    [State]: {current_state}
    ************
    [Provided Solution]: {solution}
    [END DATA]

Audio Emotion Detection

The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.

Template Details

The following is the structure of the EMOTION_PROMPT_TEMPLATE:

Template Module

The prompt and evaluation logic are part of the phoenix.evals.default_audio_templates module and are defined as:

EMOTION_AUDIO_RAILS: Output options for the evaluation template.
EMOTION_PROMPT_TEMPLATE: Prompt used for evaluating audio emotions.

Online Evals

This example:

Continuously queries a LangChain application to send new traces and spans to your Phoenix session
Queries new spans once per minute and runs evals, including:
- Hallucination
- Q&A Correctness
- Relevance
Logs evaluations back to Phoenix so they appear in the UI

The evaluation script is run as a cron job, enabling you to adjust the frequency of the evaluation job:

The above script can be run periodically to augment Evals in Phoenix.

Retrieval

Overview: Retrieval

Many LLM applications use a technique called Retrieval Augmented Generation. These applications retrieve data from their knowledge base to help the LLM accomplish tasks with the appropriate context.

However, these retrieval systems can still hallucinate or provide answers that are not relevant to the user's input query. We can evaluate retrieval systems by checking for:

Are there certain types of questions the chatbot gets wrong more often?
Are the documents that the system retrieves irrelevant? Do we have the right documents to answer the question?
Does the response match the provided documents?

Phoenix supports retrievals troubleshooting and evaluation on both traces and inferences, but inferences are currently required to visualize your retrievals using a UMAP. See below on the differences.

inferences

How-to: Inferences

How to export your data for labeling, evaluation, or fine-tuning

Prompt and Response (LLM)

How to import prompt and response from Large Large Model (LLM)

Dataframe

Below shows a relevant subsection of the dataframe. The embedding of the prompt is also shown.

Schema

Inferences

Define the inferences by pairing the dataframe with the schema.

Application

Retrieval (RAG)

How to import data for the Retrieval-Augmented Generation (RAG) use case

Dataframe

Schema

Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.

Inferences

Define the inferences by pairing the dataframe with the schema.

Application

Corpus Data

How to create Phoenix inferences and schemas for the corpus data

Inferences

Below is an example dataframe containing Wikipedia articles along with its embedding vector.

Schema

Below is an appropriate schema for the dataframe above. It specifies the id column and that embedding belongs to text. Other columns, if exist, will be detected automatically, and need not be specified by the schema.

Inferences

Define the inferences by pairing the dataframe with the schema.

Application

Run Experiments

The following are the key steps of running an experiment illustrated by simple example.

Setup

Make sure you have Phoenix and the instrumentors needed for the experiment setup. For this example we will use the OpenAI instrumentor to trace the LLM calls.

pip install arize-phoenix openinference-instrumentation-openai openai

Run Experiments

The key steps of running an experiment are:

Define/upload a Dataset (e.g. a dataframe)
- Each record of the dataset is called an Example
Define a task
- A task is a function that takes each Example and returns an output
Define Evaluators
- An Evaluator is a function evaluates the output for each Example
Run the experiment

We'll start by launching the Phoenix app.

import phoenix as px

px.launch_app()

Load a Dataset

A dataset can be as simple as a list of strings inside a dataframe. More sophisticated datasets can be also extracted from traces based on actual production data. Here we just have a small list of questions that we want to ask an LLM about the NBA games:

Create pandas dataframe

import pandas as pd

df = pd.DataFrame(
    {
        "question": [
            "Which team won the most games?",
            "Which team won the most games in 2015?",
            "Who led the league in 3 point shots?",
        ]
    }
)

The dataframe can be sent to Phoenix via the Client. input_keys and output_keys are column names of the dataframe, representing the input/output to the task in question. Here we have just questions, so we left the outputs blank:

Upload dataset to Phoenix

import phoenix as px

dataset = px.Client().upload_dataset(
    dataframe=df,
    input_keys=["question"],
    output_keys=[],
    dataset_name="nba-questions",
)

Each row of the dataset is called an Example.

Create a Task

A task is any function/process that returns a JSON serializable output. Task can also be an async function, but we used sync function here for simplicity. If the task is a function of one argument, then that argument will be bound to the input field of the dataset example.

def task(x):
    return ...

For our example here, we'll ask an LLM to build SQL queries based on our question, which we'll run on a database and obtain a set of results:

Set Up Database

import duckdb
from datasets import load_dataset

data = load_dataset("suzyanil/nba-data")["train"]
conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())

Set Up Prompt and LLM

from textwrap import dedent

import openai

client = openai.Client()
columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")

LLM_MODEL = "gpt-4o"

columns_str = ",".join(column["column_name"] + ": " + column["column_type"] for column in columns)
system_prompt = dedent(f"""
You are a SQL expert, and you are given a single table named nba with the following columns:
{columns_str}\n
Write a SQL query corresponding to the user's
request. Return just the query text, with no formatting (backticks, markdown, etc.).""")


def generate_query(question):
    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


def text2sql(question):
    results = error = None
    try:
        query = generate_query(question)
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)
    return {"query": query, "results": results, "error": error}

Define task as a Function

Recall that each row of the dataset is encapsulated as Example object. Recall that the input keys were defined when we uploaded the dataset:

def task(x):
    return text2sql(x["question"])

More complex task inputs

More complex tasks can use additional information. These values can be accessed by defining a task function with specific parameter names which are bound to special values associated with the dataset example:

Parameter name

Description

Example

input

example input

def task(input): ...

expected

example output

def task(expected): ...

reference

alias for expected

def task(reference): ...

metadata

example metadata

def task(metadata): ...

example

Example object

def task(example): ...

A task can be defined as a sync or async function that takes any number of the above argument names in any order!

Define Evaluators

An evaluator is any function that takes the task output and return an assessment. Here we'll simply check if the queries succeeded in obtaining any result from the database:

def no_error(output) -> bool:
    return not bool(output.get("error"))


def has_results(output) -> bool:
    return bool(output.get("results"))

Run an Experiment

Instrument OpenAI

Instrumenting the LLM will also give us the spans and traces that will be linked to the experiment, and can be examine in the Phoenix UI:

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Run the Task and Evaluators

Running an experiment is as easy as calling run_experiment with the components we defined above. The results of the experiment will be show up in Phoenix:

from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, task=task, evaluators=[no_error, has_results])

Add More Evaluations

If you want to attach more evaluations to the same experiment after the fact, you can do so with `evaluate_experiment`.

from phoenix.experiments import evaluate_experiment

evaluators = [
    # add evaluators here
]
experiment = evaluate_experiment(experiment, evaluators)

If you no longer have access to the original experiment object, you can retrieve it from Phoenix using the get_experiment client method.

from phoenix.experiments import evaluate_experiment
import phoenix as px

experiment_id = "experiment-id" # set your experiment ID here
experiment = px.Client().get_experiment(experiment_id=experiment_id)
evaluators = [
    # add evaluators here
]
experiment = evaluate_experiment(experiment, evaluators)

Dry Run

Sometimes we may want to do a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. run_experiment() and evaluate_experiment() both are equipped with a dry_run= parameter for this purpose: it executes the task and evaluators on a small subset without sending data to the Phoenix server. Setting dry_run=True selects one sample from the dataset, and setting it to a number, e.g. dry_run=3, selects multiple. The sampling is also deterministic, so you can keep re-running it for debugging purposes.

Build a Multimodal Eval

Multimodal Templates

Multimodal evaluation templates enable users to evaluate tasks involving multiple input or output modalities, such as text, audio, or images. These templates provide a structured framework for constructing evaluation prompts, allowing LLMs to assess the quality, correctness, or relevance of outputs across diverse use cases.

The flexibility of multimodal templates makes them applicable to a wide range of scenarios, such as:

Evaluating emotional tone in audio inputs, such as detecting user frustration or anger.
Assessing the quality of image captioning tasks.
Judging tasks that combine image and text inputs to produce contextualized outputs.

These examples illustrate how multimodal templates can be applied, but their versatility supports a broad spectrum of evaluation tasks tailored to specific user needs.

ClassificationTemplate

ClassificationTemplate is a class used to create evaluation prompts that are more complex than a simple string for classification tasks. We can also build prompts that consist of multiple message parts. We may include text, audio, or images in these messages, enabling us to construct multimodal evals if the LLM supports multimodal inputs.

By defining a ClassificationTemplate we can construct multi-part and multimodal evaluation templates by combining multiple PromptPartTemplateobjects.

from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartTemplate,
)

An evaluation prompt can consist of multiple PromptPartTemplate objects
Each PromptPartTemplate can have a different content type
Combine multiple PromptPartTemplate with templating variables to evaluate audio or image inputs

Structure of a ClassificationTemplate

A ClassificationTemplate consists of the following components:

Rails: These are the allowed classification labels for this evaluation task
Template: A list of PromptPartTemplate objects specifying the structure of the evaluation input. Each PromptPartTemplate includes:
- content_type: The type of content (e.g., TEXT, AUDIO, IMAGE).
- template: The string or object defining the content for that part.
Explanation_Template (optional): This is a separate structure used to generate explanations if explanations are enabled via llm_classify. If not enabled, this component is ignored.

Example: Intent Classification in Audio

The following example demonstrates how to create a ClassificationTemplate for an intent detection eval for a voice application:

from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartContentType,
    PromptPartTemplate,
)

# Define valid classification labels (rails)
TONE_EMOTION_RAILS = ["positive", "neutral", "negative"]

# Create the classification template
template = ClassificationTemplate(
    rails=TONE_EMOTION_RAILS,  # Specify the valid output labels
    template=[
        # Prompt part 1: Task description
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template="""
            You are a helpful AI bot that checks for the tone of the audio.
            Analyze the audio file and determine the tone (e.g., positive, neutral, negative).
            Your evaluation should provide a multiclass label from the following options: ['positive', 'neutral', 'negative'].
            
            Here is the audio:
            """,
        ),
        # Prompt part 2: Insert the audio data
        PromptPartTemplate(
            content_type=PromptPartContentType.AUDIO,
            template="{audio}",  # Placeholder for the audio content
        ),
        # Prompt part 3: Define the response format
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template="""
            Your response must be a string, either positive, neutral, or negative, and should not contain any text or characters aside from that.
            """,
        ),
    ],
)

Adapting to Different Modalities

The flexibility of ClassificationTemplate allows users to adapt it for various modalities, such as:

Image Inputs: Replace PromptPartContentType.AUDIO with PromptPartContentType.IMAGE and update the templates accordingly.
Mixed Modalities: Combine TEXT, AUDIO, and IMAGE for multimodal tasks requiring contextualized inputs.

Running the Evaluation with `llm_classify`

The llm_classify function can be used to run multimodal evaluations. This function supports input in the following formats:

DataFrame: A DataFrame containing audio or image URLs, base64-encoded strings, and any additional data required for the evaluation.
List: A collection of data items (e.g., audio or image URLs, list of base64 encoded strings).

Key Considerations for Input Data

Public Links: If the data contains URLs for audio or image inputs, they must be publicly accessible for OpenAI to process them directly.
Base64-Encoding: For private or local data, users must encode audio or image files as base64 strings and pass them to the function.
Data Processor (optional): If links are not public and require transformation (e.g., base64 encoding), a data processor can be passed directly to llm_classify to handle the conversion in parallel, ensuring secure and efficient processing.

Using a Data Processor

A data processor enables efficient parallel processing of private or raw data into the required format.

Requirements

Consistent Input/Output: Input and output types should match, e.g., a series to a series for DataFrame processing.
Link Handling: Fetch data from provided links (e.g., cloud storage) and encode it in base64.
Column Consistency: The processed data must align with the columns referenced in the template.

Note: The data processor processes individual rows or items at a time:

When using a DataFrame, the processor should handle one series (row) at a time.
When using a list, the processor should handle one string (item) at a time.

Example: Processing Audio Links

The following is an example of a data processor that fetches audio from Google Cloud Storage, encodes it as base64, and assigns it to the appropriate column:

async def async_fetch_gcloud_data(row: pd.Series) -> pd.Series:
    """
    Fetches data from Google Cloud Storage and returns the content as a base64-encoded string.
    """
    token = None
    try:
        # Fetch the Google Cloud access token
        output = await asyncio.create_subprocess_exec(
            "gcloud", "auth", "print-access-token",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await output.communicate()
        if output.returncode != 0:
            raise RuntimeError(f"Error: {stderr.decode().strip()}")
        token = stdout.decode().strip()
        if not token:
            raise ValueError("Failed to retrieve a valid access token.")
    except Exception as e:
        raise RuntimeError(f"Unexpected error: {str(e)}")

    headers = {"Authorization": f"Bearer {token}"}
    url = row["attributes.input.audio.url"]
    if url.startswith("gs://"):
        url = url.replace("gs://", "https://storage.googleapis.com/")

    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            response.raise_for_status()
            content = await response.read()

    row["audio"] = base64.b64encode(content).decode("utf-8")
    return row

If your data is already base64-encoded, you can skip that step.

Performing the Evaluation

To run an evaluation, use the llm_classify function.

from phoenix.evals.classify import llm_classify

# Run the evaluation
results = llm_classify(
    model=model,
    data=df,
    data_processor=async_fetch_gcloud_data,  # Optional, for private links
    template=EMOTION_PROMPT_TEMPLATE,
    rails=EMOTION_AUDIO_RAILS,
    provide_explanation=True,  # Enable explanations
)

Build an Eval

This guide shows you how to build and improve an LLM as a Judge Eval from scratch.

Before you begin:

You'll need two things to build your own LLM Eval:

A dataset to evaluate
A template prompt to use as the evaluation prompt on each row of data.

The dataset can have any columns you like, and the template can be structured however you like. The only requirement is that the dataset has all the columns your template uses.

We have two examples of templates below: CATEGORICAL_TEMPLATE and SCORE_TEMPLATE. The first must be used alongside a dataset with columns query and reference. The second must be used with a dataset that includes a column called context.

Feel free to set up your template however you'd like to match your dataset.

Preparing your data

trace_df = px.Client(endpoint="http://127.0.0.1:6006").get_spans_dataframe()

If your eval should have categorical outputs, use llm_classify.

If your eval should have numeric outputs, use llm_generate.

Categorical - llm_classify

The llm_classify function is designed for classification support both Binary and Multi-Class. The llm_classify function ensures that the output is clean and is either one of the "classes" or "UNPARSABLE"

A binary template looks like the following with only two values "irrelevant" and "relevant" that are expected from the LLM output:

CATEGORICAL_TEMPLATE = ''' You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {query}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "irrelevant",
and should not contain any text or characters aside from that word.
"irrelevant" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question. '''

The categorical template defines the expected output of the LLM, and the rails define the classes expected from the LLM:

irrelevant
relevant

from phoenix.evals import (
    llm_classify,
    OpenAIModel # see https://arize.com/docs/phoenix/evaluation/evaluation-models
    # for a full list of supported models
)

# The rails is used to hold the output to specific values based on the template
# It will remove text such as ",,," or "..."
# Will ensure the binary value expected from the template is returned
rails = ["irrelevant", "relevant"]
#MultiClass would be rails = ["irrelevant", "relevant", "semi-relevant"]
relevance_classifications = llm_classify(
    dataframe=<YOUR_DATAFRAME_GOES_HERE>,
    template=CATEGORICAL_TEMPLATE,
    model=OpenAIModel('gpt-4o', api_key=''),
    rails=rails
)

Snap to Rails Function

The classify uses a snap_to_rails function that searches the output string of the LLM for the classes in the classification list. It handles cases where no class is available, both classes are available or the string is a substring of the other class such as irrelevant and relevant.

#Rails examples
#Removes extra information and maps to class
llm_output_string = "The answer is relevant...!"
> "relevant"

#Removes "." and capitalization from LLM output and maps to class
llm_output_string = "Irrelevant."
>"irrelevant"

#No class in resposne
llm_output_string = "I am not sure!"
>"UNPARSABLE"

#Both classes in response
llm_output_string = "The answer is relevant i think, or maybe irrelevant...!"
>"UNPARSABLE"

A common use case is mapping the class to a 1 or 0 numeric value.

Numeric - llm_generate

The Phoenix library does support numeric score Evals if you would like to use them. A template for a score Eval looks like the following:

SCORE_TEMPLATE = """
You are a helpful AI bot that checks for grammatical, spelling and typing errors
in a document context. You are going to return a continous score for the
document based on the percent of grammatical and typing errors. The score should be
between 10 and 1. A score of 1 will be no grammatical errors in any word,
a score of 2 will be 20% of words have errors, a 5 score will be 50% errors,
a score of 7 is 70%, and a 10 score will be all words in the context have a
grammatical errors.

The following is the document context.

#CONTEXT
{context}
#ENDCONTEXT

#QUESTION
Please return a score between 10 and 1.
You will return no other text or language besides the score. Only return the score.
Please return in a format that is "the score is: 10" or "the score is: 1"
"""

We use the more generic llm_generate function that can be used for almost any complex eval that doesn't fit into the categorical type.

from phoenix.evals import (
    llm_generate,
    OpenAIModel # see https://arize.com/docs/phoenix/evaluation/evaluation-models
    # for a full list of supported models
)

test_results = llm_generate(
    dataframe=<YOUR_DATAFRAME_GOES_HERE>,
    template=SCORE_TEMPLATE,
    model=OpenAIModel(model='gpt-4o', api_key=''),
    verbose=True,
    # Callback function that will be called for each row of the dataframe
    output_parser=numeric_score_eval,
    # These two flags will add the prompt / response to the returned dataframe
    include_prompt=True,
    include_response=True,
)

def numeric_score_eval(output, row_index):
    # This is the function that will be called for each row of the 
    # dataframe after the eval is run
    row = df.iloc[row_index]
    score = self.find_score(output)

    return {"score": score}

def find_score(self, output):
    # Regular expression pattern
    # It looks for 'score is', followed by any characters (.*?), and then a float or integer
    pattern = r"score is.*?([+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)"

    match = re.search(pattern, output, re.IGNORECASE)
    if match:
        # Extract and return the number
        return float(match.group(1))
    else:
        return None

The above is an example of how to run a score based Evaluation.

Logging Evaluations to Phoenix

In order for the results to show in Phoenix, make sure your test_results dataframe has a column context.span_id with the corresponding span id. This value comes from Phoenix when you export traces from the platform. If you've brought in your own dataframe to evaluate, this section does not apply.

Log Evals to Phoenix

Use the following method to log the results of either the llm_classify or llm_generate calls to Phoenix:

from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Your Eval Display Name", dataframe=test_results)
)

This method will show aggregate results in Phoenix.

Improving your Custom Eval

At this point, you've constructed a custom Eval, but you have no understanding of how accurate that Eval is. To test your eval, you can use the same techniques that you use to iterate and improve on your application.

Start with a labeled ground truth set of data. Each input would be a row of your dataframe of examples, and each labeled output would be the correct judge label
Test your eval on that labeled set of examples, and compare to the ground truth to calculate F1, precision, and recall scores. For an example of this, see Hallucinations
Tweak your prompt and retest. See https://github.com/Arize-ai/phoenix/blob/docs/docs/evaluation/how-to-evals/broken-reference/README.md for an example of how to do this in an automated way.

Add Attributes, Metadata, Users

Using context to customize spans

Supported Context Attributes include:

Session ID* Unique identifier for a session
User ID* Unique identifier for a user.
Metadata Metadata associated with a span.
Tags* List of tags to give the span a category.
Prompt Template
- Template Used to generate prompts as Python f-strings.
- Version The version of the prompt template.
- Variables key-value pairs applied to the prompt template.

Install Core Instrumentation Package

Specifying a session

from openinference.instrumentation import using_session

with using_session(session_id="my-session-id"):
    # Calls within this block will generate spans with the attributes:
    # "session.id" = "my-session-id"
    ...

It can also be used as a decorator:

@using_session(session_id="my-session-id")
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "session.id" = "my-session-id"
    ...

import { context } from "@opentelemetry/api"
import { setSession } from "@openinference-core"

context.with(
  setSession(context.active(), { sessionId: "session-id" }),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "session.id" = "session-id"
  }
)

Specifying users

from openinference.instrumentation import using_user
with using_user("my-user-id"):
    # Calls within this block will generate spans with the attributes:
    # "user.id" = "my-user-id"
    ...

It can also be used as a decorator:

@using_user("my-user-id")
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "user.id" = "my-user-id"
    ...

import { context } from "@opentelemetry/api"
import { setUser } from "@openinference-core"

context.with(
  setUser(context.active(), { userId: "user-id" }),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "user.id" = "user-id"
  }
)

Specifying Metadata

from openinference.instrumentation import using_metadata
metadata = {
    "key-1": value_1,
    "key-2": value_2,
    ...
}
with using_metadata(metadata):
    # Calls within this block will generate spans with the attributes:
    # "metadata" = "{\"key-1\": value_1, \"key-2\": value_2, ... }" # JSON serialized
    ...

It can also be used as a decorator:

@using_metadata(metadata)
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "metadata" = "{\"key-1\": value_1, \"key-2\": value_2, ... }" # JSON serialized
    ...

import { context } from "@opentelemetry/api"
import { setMetadata } from "@openinference-core"

context.with(
  setMetadata(context.active(), { key1: "value1", key2: "value2" }),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "metadata" = '{"key1": "value1", "key2": "value2"}'
  }
)

Specifying Tags

from openinference.instrumentation import using_tags
tags = ["tag_1", "tag_2", ...]
with using_tags(tags):
    # Calls within this block will generate spans with the attributes:
    # "tag.tags" = "["tag_1","tag_2",...]"
    ...

It can also be used as a decorator:

@using_tags(tags)
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "tag.tags" = "["tag_1","tag_2",...]"
    ...

import { context } from "@opentelemetry/api"
import { setTags } from "@openinference-core"

context.with(
  setTags(context.active(), ["value1", "value2"]),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "tag.tags" = '["value1", "value2"]'
  }
)

Customizing Attributes

from openinference.instrumentation import using_attributes
tags = ["tag_1", "tag_2", ...]
metadata = {
    "key-1": value_1,
    "key-2": value_2,
    ...
}
prompt_template = "Please describe the weather forecast for {city} on {date}"
prompt_template_variables = {"city": "Johannesburg", "date":"July 11"}
prompt_template_version = "v1.0"
with using_attributes(
    session_id="my-session-id",
    user_id="my-user-id",
    metadata=metadata,
    tags=tags,
    prompt_template=prompt_template,
    prompt_template_version=prompt_template_version,
    prompt_template_variables=prompt_template_variables,
):
    # Calls within this block will generate spans with the attributes:
    # "session.id" = "my-session-id"
    # "user.id" = "my-user-id"
    # "metadata" = "{\"key-1\": value_1, \"key-2\": value_2, ... }" # JSON serialized
    # "tag.tags" = "["tag_1","tag_2",...]"
    # "llm.prompt_template.template" = "Please describe the weather forecast for {city} on {date}"
    # "llm.prompt_template.variables" = "{\"city\": \"Johannesburg\", \"date\": \"July 11\"}" # JSON serialized
    # "llm.prompt_template.version " = "v1.0"
    ...

The previous example is equivalent to doing the following, making using_attributes a very convenient tool for the more complex settings.

with (
    using_session("my-session-id"),
    using_user("my-user-id"),
    using_metadata(metadata),
    using_tags(tags),
    using_prompt_template(
        template=prompt_template,
        version=prompt_template_version,
        variables=prompt_template_variables,
    ),
):
    # Calls within this block will generate spans with the attributes:
    # "session.id" = "my-session-id"
    # "user.id" = "my-user-id"
    # "metadata" = "{\"key-1\": value_1, \"key-2\": value_2, ... }" # JSON serialized
    # "tag.tags" = "["tag_1","tag_2",...]"
    # "llm.prompt_template.template" = "Please describe the weather forecast for {city} on {date}"
    # "llm.prompt_template.variables" = "{\"city\": \"Johannesburg\", \"date\": \"July 11\"}" # JSON serialized
    # "llm.prompt_template.version " = "v1.0"
    ...

It can also be used as a decorator:

@using_attributes(
    session_id="my-session-id",
    user_id="my-user-id",
    metadata=metadata,
    tags=tags,
    prompt_template=prompt_template,
    prompt_template_version=prompt_template_version,
    prompt_template_variables=prompt_template_variables,
)
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "session.id" = "my-session-id"
    # "user.id" = "my-user-id"
    # "metadata" = "{\"key-1\": value_1, \"key-2\": value_2, ... }" # JSON serialized
    # "tag.tags" = "["tag_1","tag_2",...]"
    # "llm.prompt_template.template" = "Please describe the weather forecast for {city} on {date}"
    # "llm.prompt_template.variables" = "{\"city\": \"Johannesburg\", \"date\": \"July 11\"}" # JSON serialized
    # "llm.prompt_template.version " = "v1.0"
    ...

import { context } from "@opentelemetry/api"
import { setAttributes } from "@openinference-core"

context.with(
  setAttributes(context.active(), { myAttribute: "test" }),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "myAttribute" = "test"
  }
)

You can also use multiple setters at the same time to propagate multiple attributes to the span below. Since each setter function returns a new context, they can be used together as follows.

import { context } from "@opentelemetry/api"
import { setAttributes } from "@openinference-core"

context.with(
  setAttributes(
    setSession(context.active(), { sessionId: "session-id"}),
    { myAttribute: "test" }
  ),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "myAttribute" = "test"
      // "session.id" = "session-id"
  }
)

import { context } from "@opentelemetry/api"
import { setAttributes } from "@openinference-core"
import { SemanticConventions } from "@arizeai/openinference-semantic-conventions";


context.with(
  setAttributes(
    { [SemanticConventions.SESSION_ID: "session-id" }
  ),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "session.id" = "session-id"
  }
)

Span Processing

The tutorials and code snippets in these docs default to the SimpleSpanProcessor. A SimpleSpanProcessor processes and exports spans as they are created. This means that if you create 5 spans, each will be processed and exported before the next span is created in code. This can be helpful in scenarios where you do not want to risk losing a batch, or if you’re experimenting with OpenTelemetry in development. However, it also comes with potentially significant overhead, especially if spans are being exported over a network - each time a call to create a span is made, it would be processed and sent over a network before your app’s execution could continue.

The BatchSpanProcessor processes spans in batches before they are exported. This is usually the right processor to use for an application in production but it does mean spans may take some time to show up in Phoenix.

In production we recommend the BatchSpanProcessor over SimpleSpanProcessor when deployed and the SimpleSpanProcessor when developing.

BatchSpanProcessor Example - Using arize-phoenix-otel library

from phoenix.otel import register

# configure the Phoenix tracer for batch processing
tracer_provider = register(
  project_name="my-llm-app", # Default is 'default'
  batch=True, # Default is 'False'
)

BatchSpanProcessor Example - Using OTel library

from opentelemetry.sdk.trace.export import SimpleSpanProcessor, BatchSpanProcessor

tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint)))

Eval Models

Evaluation model classes powering your LLM Evals

Supported LLM Providers

We currently support the following LLM providers under phoenix.evals:

LLM Wrappers

OpenAIModel

Need to install the extra dependencies openai>=1.0.0

class OpenAIModel:
    api_key: Optional[str] = field(repr=False, default=None)
    """Your OpenAI key. If not provided, will be read from the environment variable"""
    organization: Optional[str] = field(repr=False, default=None)
    """
    The organization to use for the OpenAI API. If not provided, will default
    to what's configured in OpenAI
    """
    base_url: Optional[str] = field(repr=False, default=None)
    """
    An optional base URL to use for the OpenAI API. If not provided, will default
    to what's configured in OpenAI
    """
    model: str = "gpt-4"
    """Model name to use. In of azure, this is the deployment name such as gpt-35-instant"""
    temperature: float = 0.0
    """What sampling temperature to use."""
    max_tokens: int = 256
    """The maximum number of tokens to generate in the completion.
    -1 returns as many tokens as possible given the prompt and
    the models maximal context size."""
    top_p: float = 1
    """Total probability mass of tokens to consider at each step."""
    frequency_penalty: float = 0
    """Penalizes repeated tokens according to frequency."""
    presence_penalty: float = 0
    """Penalizes repeated tokens."""
    n: int = 1
    """How many completions to generate for each prompt."""
    model_kwargs: Dict[str, Any] = field(default_factory=dict)
    """Holds any model parameters valid for `create` call not explicitly specified."""
    batch_size: int = 20
    """Batch size to use when passing multiple documents to generate."""
    request_timeout: Optional[Union[float, Tuple[float, float]]] = None
    """Timeout for requests to OpenAI completion API. Default is 600 seconds."""

All models newer than GPT 3.5 Turbo are tested regularly. If you're using an older model than that, you may run into deprecated API parameters.

To authenticate with OpenAI you will need, at a minimum, an API key. The model class will look for it in your environment, or you can pass it via argument as shown above. In addition, you can choose the specific name of the model you want to use and its configuration parameters. The default values specified above are common default values from OpenAI. Quickly instantiate your model as follows:

model = OpenAI()
model("Hello there, this is a test if you are working?")
# Output: "Hello! I'm working perfectly. How can I assist you today?"

Azure OpenAI

The code snippet below shows how to initialize OpenAIModel for Azure:

model = OpenAIModel(
    model="gpt-35-turbo-16k",
    azure_endpoint="https://arize-internal-llm.openai.azure.com/",
    api_version="2023-09-15-preview",
)

Note that the model param is actually the engine of your deployment. You may get a DeploymentNotFound error if this parameter is not correct. You can find your engine param in the Azure OpenAI playground. \

Azure OpenAI supports specific options:

api_version: str = field(default=None)
"""
The verion of the API that is provisioned
https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
"""
azure_endpoint: Optional[str] = field(default=None)
"""
The endpoint to use for azure openai. Available in the azure portal.
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
"""
azure_deployment: Optional[str] = field(default=None)
azure_ad_token: Optional[str] = field(default=None)
azure_ad_token_provider: Optional[Callable[[], str]] = field(default=None)

Find more about the functionality available in our EvalModels in the Usage section.

VertexAI

Need to install the extra dependencygoogle-cloud-aiplatform>=1.33.0

class VertexAIModel:
    project: Optional[str] = None
    location: Optional[str] = None
    credentials: Optional["Credentials"] = None
    model: str = "text-bison"
    tuned_model: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 256
    top_p: float = 0.95
    top_k: int = 40

To authenticate with VertexAI, you must pass either your credentials or a project, location pair. In the following example, we quickly instantiate the VertexAI model as follows:

project = "my-project-id"
location = "us-central1" # as an example
model = VertexAIModel(project=project, location=location)
model("Hello there, this is a tesst if you are working?")
# Output: "Hello world, I am working!"

GeminiModel

class GeminiModel:
    project: Optional[str] = None
    location: Optional[str] = None
    credentials: Optional["Credentials"] = None
    model: str = "gemini-pro"
    default_concurrency: int = 5
    temperature: float = 0.0
    max_tokens: int = 256
    top_p: float = 1
    top_k: int = 32

Similar to VertexAIModel above for authentication

AnthropicModel

class AnthropicModel(BaseModel):
    model: str = "claude-2.1"
    """The model name to use."""
    temperature: float = 0.0
    """What sampling temperature to use."""
    max_tokens: int = 256
    """The maximum number of tokens to generate in the completion."""
    top_p: float = 1
    """Total probability mass of tokens to consider at each step."""
    top_k: int = 256
    """The cutoff where the model no longer selects the words."""
    stop_sequences: List[str] = field(default_factory=list)
    """If the model encounters a stop sequence, it stops generating further tokens."""
    extra_parameters: Dict[str, Any] = field(default_factory=dict)
    """Any extra parameters to add to the request body (e.g., countPenalty for a21 models)"""
    max_content_size: Optional[int] = None
    """If you're using a fine-tuned model, set this to the maximum content size"""

BedrockModel

class BedrockModel:
    model_id: str = "anthropic.claude-v2"
    """The model name to use."""
    temperature: float = 0.0
    """What sampling temperature to use."""
    max_tokens: int = 256
    """The maximum number of tokens to generate in the completion."""
    top_p: float = 1
    """Total probability mass of tokens to consider at each step."""
    top_k: int = 256
    """The cutoff where the model no longer selects the words"""
    stop_sequences: List[str] = field(default_factory=list)
    """If the model encounters a stop sequence, it stops generating further tokens. """
    session: Any = None
    """A bedrock session. If provided, a new bedrock client will be created using this session."""
    client = None
    """The bedrock session client. If unset, a new one is created with boto3."""
    max_content_size: Optional[int] = None
    """If you're using a fine-tuned model, set this to the maximum content size"""
    extra_parameters: Dict[str, Any] = field(default_factory=dict)
    """Any extra parameters to add to the request body (e.g., countPenalty for a21 models)"""

To Authenticate, the following code is used to instantiate a session and the session is used with Phoenix Evals

import boto3

# Create a Boto3 session
session = boto3.session.Session(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    region_name='us-east-1'  # change to your preferred AWS region
)

#If you need to assume a role
# Creating an STS client
sts_client = session.client('sts')

# (optional - if needed) Assuming a role
response = sts_client.assume_role(
    RoleArn="arn:aws:iam::......",
    RoleSessionName="AssumeRoleSession1",
    #(optional) if MFA Required
    SerialNumber='arn:aws:iam::...',
    #Insert current token, needs to be run within x seconds of generation
    TokenCode='PERIODIC_TOKEN'
)

# Your temporary credentials will be available in the response dictionary
temporary_credentials = response['Credentials']

# Creating a new Boto3 session with the temporary credentials
assumed_role_session = boto3.Session(
    aws_access_key_id=temporary_credentials['AccessKeyId'],
    aws_secret_access_key=temporary_credentials['SecretAccessKey'],
    aws_session_token=temporary_credentials['SessionToken'],
    region_name='us-east-1'
)

client_bedrock = assumed_role_session.client("bedrock-runtime")
# Arize Model Object - Bedrock ClaudV2 by default
model = BedrockModel(client=client_bedrock)

MistralAIModel

Need to install extra dependency mistralai

```python
class MistralAIModel(BaseModel):
    model: str = "mistral-large-latest"
    temperature: float = 0
    top_p: Optional[float] = None
    random_seed: Optional[int] = None
    response_format: Optional[Dict[str, str]] = None
    safe_mode: bool = False
    safe_prompt: bool = False

LiteLLMModel

Need to install the extra dependency litellm>=1.0.3

class LiteLLMModel(BaseEvalModel):
    model: str = "gpt-3.5-turbo"
    """The model name to use."""
    temperature: float = 0.0
    """What sampling temperature to use."""
    max_tokens: int = 256
    """The maximum number of tokens to generate in the completion."""
    top_p: float = 1
    """Total probability mass of tokens to consider at each step."""
    num_retries: int = 6
    """Maximum number to retry a model if an RateLimitError, OpenAIError, or
    ServiceUnavailableError occurs."""
    request_timeout: int = 60
    """Maximum number of seconds to wait when retrying."""
    model_kwargs: Dict[str, Any] = field(default_factory=dict)
    """Model specific params"""

Here is an example of how to initialize LiteLLMModel for llama3 using ollama.

import os

from phoenix.evals import LiteLLMModel

os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"

model = LiteLLMModel(model="ollama/llama3")

Usage

In this section, we will showcase the methods and properties that our EvalModels have. First, instantiate your model from theSupported LLM Providers. Once you've instantiated your model, you can get responses from the LLM by simply calling the model and passing a text string.

# model = Instantiate your model here
model("Hello there, how are you?")
# Output: "As an artificial intelligence, I don't have feelings, 
#          but I'm here and ready to assist you. How can I help you today?"

Evals API Reference

Evals are LLM-powered functions that you can use to evaluate the output of your LLM or generative application

phoenix.evals.run_evals

Evaluates a pandas dataframe using a set of user-specified evaluators that assess each row for relevance of retrieved documents, hallucinations, toxicity, etc. Outputs a list of dataframes, one for each evaluator, that contain the labels, scores, and optional explanations from the corresponding evaluator applied to the input dataframe.

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents an individual record to be evaluated. Each evaluator uses an LLM and an evaluation prompt template to assess the rows of the dataframe, and those template variables must appear as column names in the dataframe.
- HallucinationEvaluator: Evaluates whether a response (stored under an "output" column) is a hallucination given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).
- RelevanceEvaluator: Evaluates whether a retrieved document (stored under a "reference" column) is relevant or irrelevant to the corresponding query (stored under an "input" column).
- ToxicityEvaluator: Evaluates whether a string (stored under an "input" column) contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
- QAEvaluator: Evaluates whether a response (stored under an "output" column) is correct or incorrect given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).
- SummarizationEvaluator: Evaluates whether a summary (stored under an "output" column) provides an accurate synopsis of an input document (stored under an "input" column).
- SQLEvaluator: Evaluates whether a generated SQL query (stored under the "query_gen" column) and a response (stored under the "response" column) appropriately answer a question (stored under the "question" column).
provide_explanation (bool, optional): If true, each output dataframe will contain an explanation column containing the LLM's reasoning for each evaluation.
use_function_calling_if_available (bool, optional): If true, function calling is used (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional): If true, prints detailed information such as model invocation parameters, retries on failed requests, etc.
concurrency (int, optional): The number of concurrent workers if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.

Returns

List[pandas.DataFrame]: A list of dataframes, one for each evaluator, all of which have the same number of rows as the input dataframe.

Usage

To use run_evals, you must first wrangle your LLM application data into a pandas dataframe either manually or by querying and exporting the spans collected by your Phoenix session. Once your dataframe is wrangled into the appropriate format, you can instantiate your evaluators by passing the model to be used during evaluation.

Run your evaluations by passing your dataframe and your list of desired evaluators.

Assuming your dataframe contains the "input", "reference", and "output" columns required by HallucinationEvaluator and QAEvaluator, your output dataframes should contain the results of the corresponding evaluator applied to the input dataframe, including columns for labels (e.g., "factual" or "hallucinated"), scores (e.g., 0 for factual labels, 1 for hallucinated labels), and explanations. If your dataframe was exported from your Phoenix session, you can then ingest the evaluations using phoenix.log_evaluations so that the evals will be visible as annotations inside Phoenix.

phoenix.evals.PromptTemplate

Class used to store and format prompt templates.

Parameters

text (str): The raw prompt text used as a template.
delimiters (List[str]): List of characters used to locate the variables within the prompt template text. Defaults to ["{", "}"].

Attributes

text (str): The raw prompt text used as a template.
variables (List[str]): The names of the variables that, once their values are substituted into the template, create the prompt text. These variable names are automatically detected from the template text using the delimiters passed when initializing the class (see Usage section below).

Usage

Define a PromptTemplate by passing a text string and the delimiters to use to locate the variables. The default delimiters are { and }.

If the prompt template variables have been correctly located, you can access them as follows:

The PromptTemplate class can also understand any combination of delimiters. Following the example above, but getting creative with our delimiters:

Once you have a PromptTemplate class instantiated, you can make use of its format method to construct the prompt text resulting from substituting values into the variables. To do so, a dictionary mapping the variable names to the values is passed:

Note that once you initialize the PromptTemplate class, you don't need to worry about delimiters anymore, it will be handled for you.

phoenix.evals.llm_classify

Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (ClassificationTemplate, or str): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class instance
rails (List[str]): A list of strings representing the possible output classes of the model's predictions.
system_instruction (Optional[str]): An optional system message for modals that support it
verbose (bool, optional): If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True): If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.

Returns

pandas.DataFrame: A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or "NOT_PARSABLE" if the model's output could not be parsed.

phoenix.evals.llm_generate

Generates a text using a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be used as in input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class.
system_instruction (Optional[str], optional): An optional system message.
output_parser (Callable[[str, int], Dict[str, Any]], optional): An optional function that takes each generated response and response index and parses it to a dictionary. The keys of the dictionary should correspond to the column names of the output dataframe. If None, the output dataframe will have a single column named "output". Default None.

Returns

generations_dataframe (pandas.DataFrame): A dataframe where each row represents the generated output

Usage

Below we show how you can use llm_generate to use an llm to generate synthetic data. In this example, we use the llm_generate function to generate the capitals of countries but llm_generate can be used to generate any type of data such as synthetic questions, irrelevant responses, and so on.

llm_generate also supports an output parser so you can use this to generate data in a structured format. For example, if you want to generate data in JSON format, you ca prompt for a JSON object and then parse the output using the json library.

Setup Tracing (TS)

Getting Started

Instrumentation is the act of adding observability code to an app yourself.

If you’re instrumenting an app, you need to use the OpenTelemetry SDK for your language. You’ll then use the SDK to initialize OpenTelemetry and the API to instrument your code. This will emit telemetry from your app, and any library you installed that also comes with instrumentation.

Now lets walk through instrumenting, and then tracing, a sample express application.

instrumentation setup

Dependencies

Install OpenTelemetry API packages:

# npm, pnpm, yarn, etc
npm install @opentelemetry/semantic-conventions @opentelemetry/api @opentelemetry/instrumentation @opentelemetry/resources @opentelemetry/sdk-trace-base @opentelemetry/sdk-trace-node @opentelemetry/exporter-trace-otlp-proto

Install OpenInference instrumentation packages. Below is an example of adding instrumentation for OpenAI as well as the semantic conventions for OpenInference.

# npm, pnpm, yarn, etc
npm install openai @arizeai/openinference-instrumentation-openai @arizeai/openinference-semantic-conventions

Traces

Initialize Tracing

If a TracerProvider is not created, the OpenTelemetry APIs for tracing will use a no-op implementation and fail to generate data. As explained next, create an instrumentation.ts (or instrumentation.js) file to include all of the provider initialization code in Node.

Node.js

Create instrumentation.ts (or instrumentation.js) to contain all the provider initialization code:

// instrumentation.ts
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
import { SEMRESATTRS_PROJECT_NAME } from "@arizeai/openinference-semantic-conventions";
import OpenAI from "openai";

// For troubleshooting, set the log level to DiagLogLevel.DEBUG
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

const tracerProvider = new NodeTracerProvider({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "openai-service",
    // Project name in Phoenix, defaults to "default"
    [SEMRESATTRS_PROJECT_NAME]: "openai-service",
  }),
  spanProcessors: [
    // BatchSpanProcessor will flush spans in batches after some time,
    // this is recommended in production. For development or testing purposes
    // you may try SimpleSpanProcessor for instant span flushing to the Phoenix UI.
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: `http://localhost:6006/v1/traces`,
        // (optional) if connecting to Phoenix Cloud
        // headers: { "api_key": process.env.PHOENIX_API_KEY },
        // (optional) if connecting to self-hosted Phoenix with Authentication enabled
        // headers: { "Authorization": `Bearer ${process.env.PHOENIX_API_KEY}` }
      })
    ),
  ],
});
tracerProvider.register();

const instrumentation = new OpenAIInstrumentation();
instrumentation.manuallyInstrument(OpenAI);

registerInstrumentations({
  instrumentations: [instrumentation],
});

console.log("👀 OpenInference initialized");

This basic setup has will instrument chat completions via native calls to the OpenAI client.

Picking the right span processor

In our instrumentation.ts file above, we use the BatchSpanProcessor. The BatchSpanProcessor processes spans in batches before they are exported. This is usually the right processor to use for an application.

In contrast, the SimpleSpanProcessor processes spans as they are created. This means that if you create 5 spans, each will be processed and exported before the next span is created in code. This can be helpful in scenarios where you do not want to risk losing a batch, or if you’re experimenting with OpenTelemetry in development. However, it also comes with potentially significant overhead, especially if spans are being exported over a network - each time a call to create a span is made, it would be processed and sent over a network before your app’s execution could continue.

In most cases, stick with BatchSpanProcessor over SimpleSpanProcessor.

Tracing instrumented libraries

Now that you have configured a tracer provider, and instrumented the openai package, lets see how we can generate traces for a sample application.

First, install the dependencies required for our sample app.

# npm, pnpm, yarn, etc
npm install express

Next, create an app.ts (or app.js ) file, that hosts a simple express server for executing OpenAI chat completions.

// app.ts
import express from "express";
import OpenAI from "openai";

const PORT: number = parseInt(process.env.PORT || "8080");
const app = express();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.get("/chat", async (req, res) => {
  const message = req.query.message;
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: "user", content: message }],
    model: "gpt-4o",
  });
  res.send(chatCompletion.choices[0].message.content);
});

app.listen(PORT, () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Then, we will start our application, loading the instrumentation.ts file before app.ts so that our instrumentation code can instrument openai .

# node v23
node --require ./instrumentation.ts app.ts

We are using Node v23 above as this allows us to execute TypeScript code without a transpilation step. OpenTelemetry and OpenInference support Node versions from v18 onwards, and we are flexible with projects configured using CommonJS or ESM module syntaxes.

Finally, we can execute a request against our server

curl "http://localhost:8080/chat?message=write%20me%20a%20haiku"

After a few moments, a new project openai-service will appear in the Phoenix UI, along with the trace generated by our OpenAI chat completion!

Advanced: Manually Tracing

Acquiring a tracer

Anywhere in your application where you write manual tracing code should call getTracer to acquire a tracer. For example:

import opentelemetry from '@opentelemetry/api';
//...

const tracer = opentelemetry.trace.getTracer(
  'instrumentation-scope-name',
  'instrumentation-scope-version',
);

// You can now use a 'tracer' to do tracing!

It’s generally recommended to call getTracer in your app when you need it rather than exporting the tracer instance to the rest of your app. This helps avoid trickier application load issues when other required dependencies are involved.

Below is an example of acquiring a tracer within application scope.

// app.ts
import { trace } from '@opentelemetry/api';
import express from 'express';
import OpenAI from "openai";

const tracer = trace.getTracer('llm-server', '0.1.0');

const PORT: number = parseInt(process.env.PORT || "8080");
const app = express();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.get("/chat", async (req, res) => {
  const message = req.query.message;
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: "user", content: message }],
    model: "gpt-4o",
  });
  res.send(chatCompletion.choices[0].message.content);
});

app.listen(PORT, () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Create spans

The API of OpenTelemetry JavaScript exposes two methods that allow you to create spans:

In most cases you want to use the latter (tracer.startActiveSpan), as it takes care of setting the span and its context active.

The code below illustrates how to create an active span.

import { trace, Span } from "@opentelemetry/api";
import { SpanKind } from "@opentelemetry/api";
import {
    SemanticConventions,
    OpenInferenceSpanKind,
} from "@arizeai/openinference-semantic-conventions";

export function chat(message: string) {
    // Create a span. A span must be closed.
    return tracer.startActiveSpan(
        "chat",
        (span: Span) => {
            span.setAttributes({
                [SemanticConventions.OPENINFERENCE_SPAN_KIND]: OpenInferenceSpanKind.chain,
                [SemanticConventions.INPUT_VALUE]: message,
            });
            let chatCompletion = await openai.chat.completions.create({
                messages: [{ role: "user", content: message }],
                model: "gpt-3.5-turbo",
            });
            span.setAttributes({
                attributes: {
                    [SemanticConventions.OUTPUT_VALUE]: chatCompletion.choices[0].message,
                },
            });
            // Be sure to end the span!
            span.end();
            return result;
        }
    );
}

The above instrumented code can now be pasted in the /chat handler. You should now be able to see spans emitted from your app.

Start your app as follows, and then send it requests by visiting http://localhost:8080/chat?message="how long is a pencil" with your browser or curl.

ts-node --require ./instrumentation.ts app.ts

After a while, you should see the spans printed in the console by the ConsoleSpanExporter, something like this:

{
  "traceId": "6cc927a05e7f573e63f806a2e9bb7da8",
  "parentId": undefined,
  "name": "chat",
  "id": "117d98e8add5dc80",
  "kind": 0,
  "timestamp": 1688386291908349,
  "duration": 501,
  "attributes": {
    "openinference.span.kind": "chain"
    "input.value": "how long is a pencil"
  },
  "status": { "code": 0 },
  "events": [],
  "links": []
}

Get the current span

const activeSpan = opentelemetry.trace.getActiveSpan();

// do something with the active span, optionally ending it if that is appropriate for your use case.

Get a span from context

const ctx = getContextFromSomewhere();
const span = opentelemetry.trace.getSpan(ctx);

// do something with the acquired span, optionally ending it if that is appropriate for your use case.

Attributes

function chat(message: string, user: User) {
  return tracer.startActiveSpan(`chat:${i}`, (span: Span) => {
    const result = Math.floor(Math.random() * (max - min) + min);

    // Add an attribute to the span
    span.setAttribute('mycompany.userid', user.id);

    span.end();
    return result;
  });
}

You can also add attributes to a span as it’s created:

tracer.startActiveSpan(
  'app.new-span',
  { attributes: { attribute1: 'value1' } },
  (span) => {
    // do some work...

    span.end();
  },
);

function chat(session: Session) {
  return tracer.startActiveSpan(
    'chat',
    { attributes: { 'mycompany.sessionid': session.id } },
    (span: Span) => {
      /* ... */
    },
  );
}

Semantic Attributes

First add both semantic conventions as a dependency to your application:

npm install --save @opentelemetry/semantic-conventions @arizeai/openinfernece-semantic-conventions

Add the following to the top of your application file:

import { SemanticAttributes } from 'arizeai/openinfernece-semantic-conventions';

Finally, you can update your file to include semantic attributes:

const doWork = () => {
  tracer.startActiveSpan('app.doWork', (span) => {
    span.setAttribute(SemanticAttributes.INPUT_VALUE, 'work input');
    // Do some work...

    span.end();
  });
};

Span events

span.addEvent('Doing something');

const result = doWork();

While Phoenix captures these, they are currently not displayed in the UI. Contact us if you would like to support!

span.addEvent('some log', {
  'log.severity': 'error',
  'log.message': 'Data not found',
  'request.id': requestId,
});

Span Status

The status can be set at any time before the span is finished.

import opentelemetry, { SpanStatusCode } from '@opentelemetry/api';

// ...

tracer.startActiveSpan('app.doWork', (span) => {
  for (let i = 0; i <= Math.floor(Math.random() * 40000000); i += 1) {
    if (i > 10000) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: 'Error',
      });
    }
  }

  span.end();
});

Recording exceptions

import opentelemetry, { SpanStatusCode } from '@opentelemetry/api';

// ...

try {
  doWork();
} catch (ex) {
  span.recordException(ex);
  span.setStatus({ code: SpanStatusCode.ERROR });
}

Using `sdk-trace-base` and manually propagating span context

In some cases, you may not be able to use either the Node.js SDK nor the Web SDK. The biggest difference, aside from initialization code, is that you’ll have to manually set spans as active in the current context to be able to create nested spans.

Initializing tracing with sdk-trace-base

Initializing tracing is similar to how you’d do it with Node.js or the Web SDK.

import opentelemetry from '@opentelemetry/api';
import {
  BasicTracerProvider,
  BatchSpanProcessor,
  ConsoleSpanExporter,
} from '@opentelemetry/sdk-trace-base';

const provider = new BasicTracerProvider();

// Configure span processor to send spans to the exporter
provider.addSpanProcessor(new BatchSpanProcessor(new ConsoleSpanExporter()));
provider.register();

// This is what we'll access in all instrumentation code
const tracer = opentelemetry.trace.getTracer('example-basic-tracer-node');

Like the other examples in this document, this exports a tracer you can use throughout the app.

Creating nested spans with sdk-trace-base

To create nested spans, you need to set whatever the currently-created span is as the active span in the current context. Don’t bother using startActiveSpan because it won’t do this for you.

const mainWork = () => {
  const parentSpan = tracer.startSpan('main');

  for (let i = 0; i < 3; i += 1) {
    doWork(parentSpan, i);
  }

  // Be sure to end the parent span!
  parentSpan.end();
};

const doWork = (parent, i) => {
  // To create a child span, we need to mark the current (parent) span as the active span
  // in the context, then use the resulting context to create a child span.
  const ctx = opentelemetry.trace.setSpan(
    opentelemetry.context.active(),
    parent,
  );
  const span = tracer.startSpan(`doWork:${i}`, undefined, ctx);

  // simulate some random work.
  for (let i = 0; i <= Math.floor(Math.random() * 40000000); i += 1) {
    // empty
  }

  // Make sure to end this child span! If you don't,
  // it will continue to track work beyond 'doWork'!
  span.end();
};

All other APIs behave the same when you use sdk-trace-base compared with the Node.js SDKs.

Using Phoenix Decorators

As part of the OpenInference library, Phoenix provides helpful abstractions to make manual instrumentation easier.

OpenInference OTEL Tracing

This documentation provides a guide on using OpenInference OTEL tracing decorators and methods for instrumenting functions, chains, agents, and tools using OpenTelemetry.

These tools can be combined with, or used in place of, OpenTelemetry instrumentation code. They are designed to simplify the instrumentation process.

If you'd prefer to use pure OTEL instead, see Setup using base OTEL

Installation

Ensure you have OpenInference and OpenTelemetry installed:

pip install openinference-semantic-conventions opentelemetry-api opentelemetry-sdk

Setting Up Tracing

You can configure the tracer using either TracerProvider from openinference.instrumentation or using phoenix.otel.register.

from phoenix.otel import register

tracer_provider = register(protocol="http/protobuf", project_name="your project name")
tracer = tracer_provider.get_tracer(__name__)

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

from openinference.instrumentation import TracerProvider
from openinference.semconv.resource import ResourceAttributes

endpoint = "http://127.0.0.1:6006/v1/traces"
resource = Resource(attributes={ResourceAttributes.PROJECT_NAME: "openinference-tracer"})
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
tracer = tracer_provider.get_tracer(__name__)

Using your Tracer

Your tracer object can now be used in two primary ways:

1. As a decorator to trace entire functions

@tracer.chain
def my_func(input: str) -> str:
    return "output"

This entire function will appear as a Span in Phoenix. Input and output attributes in Phoenix will be set automatically based on my_func's parameters and return. The status attribute will also be set automatically.

2. As a with clause to trace specific code blocks

with tracer.start_as_current_span(
    "my-span-name",
    openinference_span_kind="chain",
) as span:
    span.set_input("input")
    span.set_output("output")
    span.set_status(Status(StatusCode.OK))

The code within this clause will be captured as a Span in Phoenix. Here the input, output, and status must be set manually.

This approach is useful when you need only a portion of a method to be captured as a Span.

OpenInference Span Kinds

OpenInference Span Kinds denote the possible types of spans you might capture, and will be rendered different in the Phoenix UI.

The possible values are:\

Span Kind

Use

CHAIN

General logic operations, functions, or code blocks

LLM

Making LLM calls

TOOL

Completing tool calls

RETRIEVER

Retrieving documents

EMBEDDING

Generating embeddings

AGENT

Agent invokations - typically a top level or near top level span

RERANKER

Reranking retrieved context

UNKNOWN

Unknown

GUARDRAIL

Guardrail checks

EVALUATOR

Evaluators - typically only use by Phoenix when automatically tracing evaluation and experiment calls

Chains

Using Context Managers

with tracer.start_as_current_span(
    "chain-span-with-plain-text-io",
    openinference_span_kind="chain",
) as span:
    span.set_input("input")
    span.set_output("output")
    span.set_status(Status(StatusCode.OK))

Using Decorators

@tracer.chain
def decorated_chain_with_plain_text_output(input: str) -> str:
    return "output"

decorated_chain_with_plain_text_output("input")

Using JSON Output

@tracer.chain
def decorated_chain_with_json_output(input: str) -> Dict[str, Any]:
    return {"output": "output"}

decorated_chain_with_json_output("input")

Overriding Span Name

@tracer.chain(name="decorated-chain-with-overriden-name")
def this_name_should_be_overriden(input: str) -> Dict[str, Any]:
    return {"output": "output"}

this_name_should_be_overriden("input")

Agents

Using Context Managers

with tracer.start_as_current_span(
    "agent-span-with-plain-text-io",
    openinference_span_kind="agent",
) as span:
    span.set_input("input")
    span.set_output("output")
    span.set_status(Status(StatusCode.OK))

Using Decorators

@tracer.agent
def decorated_agent(input: str) -> str:
    return "output"

decorated_agent("input")

Tools

Using Context Managers

with tracer.start_as_current_span(
    "tool-span",
    openinference_span_kind="tool",
) as span:
    span.set_input("input")
    span.set_output("output")
    span.set_tool(
        name="tool-name",
        description="tool-description",
        parameters={"input": "input"},
    )
    span.set_status(Status(StatusCode.OK))

Using Decorators

@tracer.tool
def decorated_tool(input1: str, input2: int) -> None:
    """
    tool-description
    """

decorated_tool("input1", 1)

Overriding Tool Name

@tracer.tool(
    name="decorated-tool-with-overriden-name",
    description="overriden-tool-description",
)
def this_tool_name_should_be_overriden(input1: str, input2: int) -> None:
    """
    this tool description should be overriden
    """

this_tool_name_should_be_overriden("input1", 1)

LLMs

Like other span kinds, LLM spans can be instrumented either via a context manager or via a decorator pattern. It's also possible to directly patch client methods.

While this guide uses the OpenAI Python client for illustration, in practice, you should use the OpenInference auto-instrumentors for OpenAI whenever possible and resort to manual instrumentation for LLM spans only as a last resort.

To run the snippets in this section, set your OPENAI_API_KEY environment variable.

Context Manager

from openai import OpenAI
from opentelemetry.trace import Status, StatusCode

openai_client = OpenAI()

messages = [{"role": "user", "content": "Hello, world!"}]
with tracer.start_as_current_span("llm_span", openinference_span_kind="llm") as span:
    span.set_input(messages)
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=messages,
        )
    except Exception as error:
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR))
    else:
        span.set_output(response)
        span.set_status(Status(StatusCode.OK))

Decorator

from typing import List

from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam

openai_client = OpenAI()


@tracer.llm
def invoke_llm(
    messages: List[ChatCompletionMessageParam],
) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    message = response.choices[0].message
    return message.content or ""


invoke_llm([{"role": "user", "content": "Hello, world!"}])

This decorator pattern above works for sync functions, async coroutine functions, sync generator functions, and async generator functions. Here's an example with an async generator.

from typing import AsyncGenerator, List

from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionMessageParam

openai_async_client = AsyncOpenAI()


@tracer.llm
async def stream_llm_responses(
    messages: List[ChatCompletionMessageParam],
) -> AsyncGenerator[str, None]:
    stream = await openai_async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# invoke inside of an async context
async for token in stream_llm_responses([{"role": "user", "content": "Hello, world!"}]):
    print(token, end="")

Method Patch

It's also possible to directly patch methods on a client. This is useful if you want to transparently use the client in your application with instrumentation logic localized in one place.

from openai import OpenAI

openai_client = OpenAI()

# patch the create method
wrapper = tracer.llm
openai_client.chat.completions.create = wrapper(openai_client.chat.completions.create)

# invoke the patched method normally
openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

The snippets above produce LLM spans with input and output values, but don't offer rich UI for messages, tools, invocation parameters, etc. In order to manually instrument LLM spans with these features, users can define their own functions to wrangle the input and output of their LLM calls into OpenInference format. The openinference-instrumentation library contains helper functions that produce valid OpenInference attributes for LLM spans:

get_llm_attributes
get_input_attributes
get_output_attributes

For OpenAI, these functions might look like this:

from typing import Any, Dict, List, Optional, Union

from openai.types.chat import (
    ChatCompletion,
    ChatCompletionMessage,
    ChatCompletionMessageParam,
    ChatCompletionToolParam,
)
from opentelemetry.util.types import AttributeValue

import openinference.instrumentation as oi
from openinference.instrumentation import (
    get_input_attributes,
    get_llm_attributes,
    get_output_attributes,
)


def process_input(
    messages: List[ChatCompletionMessageParam],
    model: str,
    temperature: Optional[float] = None,
    tools: Optional[List[ChatCompletionToolParam]] = None,
    **kwargs: Any,
) -> Dict[str, AttributeValue]:
    oi_messages = [convert_openai_message_to_oi_message(message) for message in messages]
    oi_tools = [convert_openai_tool_param_to_oi_tool(tool) for tool in tools or []]
    return {
        **get_input_attributes(
            {
                "messages": messages,
                "model": model,
                "temperature": temperature,
                "tools": tools,
                **kwargs,
            }
        ),
        **get_llm_attributes(
            provider="openai",
            system="openai",
            model_name=model,
            input_messages=oi_messages,
            invocation_parameters={"temperature": temperature},
            tools=oi_tools,
        ),
    }


def convert_openai_message_to_oi_message(
    message_param: Union[ChatCompletionMessageParam, ChatCompletionMessage],
) -> oi.Message:
    if isinstance(message_param, ChatCompletionMessage):
        role: str = message_param.role
        oi_message = oi.Message(role=role)
        if isinstance(content := message_param.content, str):
            oi_message["content"] = content
        if message_param.tool_calls is not None:
            oi_tool_calls: List[oi.ToolCall] = []
            for tool_call in message_param.tool_calls:
                function = tool_call.function
                oi_tool_calls.append(
                    oi.ToolCall(
                        id=tool_call.id,
                        function=oi.ToolCallFunction(
                            name=function.name,
                            arguments=function.arguments,
                        ),
                    )
                )
            oi_message["tool_calls"] = oi_tool_calls
        return oi_message

    role = message_param["role"]
    assert isinstance(message_param["content"], str)
    content = message_param["content"]
    return oi.Message(role=role, content=content)


def convert_openai_tool_param_to_oi_tool(tool_param: ChatCompletionToolParam) -> oi.Tool:
    assert tool_param["type"] == "function"
    return oi.Tool(json_schema=dict(tool_param))


def process_output(response: ChatCompletion) -> Dict[str, AttributeValue]:
    message = response.choices[0].message
    role = message.role
    oi_message = oi.Message(role=role)
    if isinstance(message.content, str):
        oi_message["content"] = message.content
    if isinstance(message.tool_calls, list):
        oi_tool_calls: List[oi.ToolCall] = []
        for tool_call in message.tool_calls:
            tool_call_id = tool_call.id
            function_name = tool_call.function.name
            function_arguments = tool_call.function.arguments
            oi_tool_calls.append(
                oi.ToolCall(
                    id=tool_call_id,
                    function=oi.ToolCallFunction(
                        name=function_name,
                        arguments=function_arguments,
                    ),
                )
            )
        oi_message["tool_calls"] = oi_tool_calls
    output_messages = [oi_message]
    token_usage = response.usage
    oi_token_count: Optional[oi.TokenCount] = None
    if token_usage is not None:
        prompt_tokens = token_usage.prompt_tokens
        completion_tokens = token_usage.completion_tokens
        oi_token_count = oi.TokenCount(
            prompt=prompt_tokens,
            completion=completion_tokens,
        )
    return {
        **get_llm_attributes(
            output_messages=output_messages,
            token_count=oi_token_count,
        ),
        **get_output_attributes(response),
    }

Context Manager

When using a context manager to create LLM spans, these functions can be used to wrangle inputs and outputs.

import json

from openai import OpenAI
from openai.types.chat import (
    ChatCompletionMessage,
    ChatCompletionMessageParam,
    ChatCompletionToolMessageParam,
    ChatCompletionToolParam,
    ChatCompletionUserMessageParam,
)
from opentelemetry.trace import Status, StatusCode

openai_client = OpenAI()


@tracer.tool
def get_weather(city: str) -> str:
    # make an call to a weather API here
    return "sunny"


messages: List[Union[ChatCompletionMessage, ChatCompletionMessageParam]] = [
    ChatCompletionUserMessageParam(
        role="user",
        content="What's the weather like in San Francisco?",
    )
]
temperature = 0.5
invocation_parameters = {"temperature": temperature}
tools: List[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "finds the weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'London'",
                    }
                },
                "required": ["city"],
            },
        },
    },
]

with tracer.start_as_current_span(
    "llm_tool_call",
    attributes=process_input(
        messages=messages,
        invocation_parameters={"temperature": temperature},
        model="gpt-4",
    ),
    openinference_span_kind="llm",
) as span:
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=temperature,
            tools=tools,
        )
    except Exception as error:
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR))
    else:
        span.set_attributes(process_output(response))
        span.set_status(Status(StatusCode.OK))

output_message = response.choices[0].message
tool_calls = output_message.tool_calls
assert tool_calls and len(tool_calls) == 1
tool_call = tool_calls[0]
city = json.loads(tool_call.function.arguments)["city"]
weather = get_weather(city)
messages.append(output_message)
messages.append(
    ChatCompletionToolMessageParam(
        content=weather,
        role="tool",
        tool_call_id=tool_call.id,
    )
)

with tracer.start_as_current_span(
    "tool_call_response",
    attributes=process_input(
        messages=messages,
        invocation_parameters={"temperature": temperature},
        model="gpt-4",
    ),
    openinference_span_kind="llm",
) as span:
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=temperature,
        )
    except Exception as error:
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR))
    else:
        span.set_attributes(process_output(response))
        span.set_status(Status(StatusCode.OK))

Decorator

When using the tracer.llm decorator, these functions are passed via the process_input and process_output parameters and should satisfy the following:

The input signature of process_input should exactly match the input signature of the decorated function.
The input signature of process_output has a single argument, the output of the decorated function. This argument accepts the returned value when the decorated function is a sync or async function, or a list of yielded values when the decorated function is a sync or async generator function.
Both process_input and process_output should output a dictionary mapping attribute names to values.

from openai import NOT_GIVEN, OpenAI
from openai.types.chat import ChatCompletion

openai_client = OpenAI()


@tracer.llm(
    process_input=process_input,
    process_output=process_output,
)
def invoke_llm(
    messages: List[ChatCompletionMessageParam],
    model: str,
    temperature: Optional[float] = None,
    tools: Optional[List[ChatCompletionToolParam]] = None,
) -> ChatCompletion:
    response: ChatCompletion = openai_client.chat.completions.create(
        messages=messages,
        model=model,
        tools=tools or NOT_GIVEN,
        temperature=temperature,
    )
    return response


invoke_llm(
    messages=[{"role": "user", "content": "Hello, world!"}],
    temperature=0.5,
    model="gpt-4",
)

When decorating a generator function, process_output should accept a single argument, a list of the values yielded by the decorated function.

from typing import Dict, List, Optional

from openai.types.chat import ChatCompletionChunk
from opentelemetry.util.types import AttributeValue

import openinference.instrumentation as oi
from openinference.instrumentation import (
    get_llm_attributes,
    get_output_attributes,
)


def process_generator_output(
    outputs: List[ChatCompletionChunk],
) -> Dict[str, AttributeValue]:
    role: Optional[str] = None
    content = ""
    oi_token_count = oi.TokenCount()
    for chunk in outputs:
        if choices := chunk.choices:
            assert len(choices) == 1
            delta = choices[0].delta
            if isinstance(delta.content, str):
                content += delta.content
            if isinstance(delta.role, str):
                role = delta.role
        if (usage := chunk.usage) is not None:
            if (prompt_tokens := usage.prompt_tokens) is not None:
                oi_token_count["prompt"] = prompt_tokens
            if (completion_tokens := usage.completion_tokens) is not None:
                oi_token_count["completion"] = completion_tokens
    oi_messages = []
    if role and content:
        oi_messages.append(oi.Message(role=role, content=content))
    return {
        **get_llm_attributes(
            output_messages=oi_messages,
            token_count=oi_token_count,
        ),
        **get_output_attributes(content),
    }

Then the decoration is the same as before.

from typing import AsyncGenerator

from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionChunk

openai_async_client = AsyncOpenAI()


@tracer.llm(
    process_input=process_input,  # same as before
    process_output=process_generator_output,
)
async def stream_llm_response(
    messages: List[ChatCompletionMessageParam],
    model: str,
    temperature: Optional[float] = None,
) -> AsyncGenerator[ChatCompletionChunk, None]:
    async for chunk in await openai_async_client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=temperature,
        stream=True,
    ):
        yield chunk


async for chunk in stream_llm_response(
    messages=[{"role": "user", "content": "Hello, world!"}],
    temperature=0.5,
    model="gpt-4",
):
    print(chunk)

Method Patch

As before, it's possible to directly patch the method on the client. Just ensure that the input signatures of process_input and the patched method match.

from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam

openai_client = OpenAI()

# patch the create method
wrapper = tracer.llm(
    process_input=process_input,
    process_output=process_output,
)
openai_client.chat.completions.create = wrapper(openai_client.chat.completions.create)

# invoke the patched method normally
openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

Additional Features

Suppress Tracing

with suppress_tracing():
    # this trace will not be recorded
    with tracer.start_as_current_span(
        "THIS-SPAN-SHOULD-NOT-BE-TRACED",
        openinference_span_kind="chain",
    ) as span:
        span.set_input("input")
        span.set_output("output")
        span.set_status(Status(StatusCode.OK))

Using Context Attributes

with using_attributes(session_id="123"):
    # this trace has session id "123"
    with tracer.start_as_current_span(
        "chain-span-with-context-attributes",
        openinference_span_kind="chain",
    ) as span:
        span.set_input("input")
        span.set_output("output")
        span.set_status(Status(StatusCode.OK))

Adding Images to your Traces

OpenInference includes message types that can be useful in composing text and image or other file inputs and outputs:

import openinference.instrumentation as oi

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
text = "describe the weather in this image"
content = [
        {"type": "text", "text": text},
        {
            "type": "image_url",
            "image_url": {"url": image_url, "detail": "low"},
        },
    ]

image = oi.Image(url=image_url)
contents = [
    oi.TextMessageContent(
        type="text",
        text=text,
    ),
    oi.ImageMessageContent(
        type="image",
        image=image,
    ),
]
messages = [
    oi.Message(
        role="user",
        contents=contents,
    )
]

with tracer.start_as_current_span(
    "my-span-name",
    openinference_span_kind="llm",
    attributes=oi.get_llm_attributes(input_messages=messages)
) as span:
    span.set_input(text)
    
    # Call your LLM here
    response = "This is a test response"

    span.set_output(response)
    print(response.content)

Import Your Data

How to create Phoenix inferences and schemas for common data formats

This guide shows you how to define Phoenix inferences using your own data.

Once you have a pandas dataframe df containing your data and a schema object describing the format of your dataframe, you can define your Phoenix dataset either by running

ds = px.Inferences(df, schema)

or by optionally providing a name for your dataset that will appear in the UI:

ds = px.Inferences(df, schema, name="training")

As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.

Predictions and Actuals

Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime objects that represent the time at which each inference was made in production.

Dataframe

timestamp

prediction_score

prediction

target

2023-03-01 02:02:19

0.91

click

2023-02-17 23:45:48

0.37

no_click

2023-01-30 15:30:03

0.54

click

no_click

2023-02-03 19:56:09

0.74

click

2023-02-24 04:23:43

0.37

no_click

click

Schema

schema = px.Schema(
    timestamp_column_name="timestamp",
    prediction_score_column_name="prediction_score",
    prediction_label_column_name="prediction",
    actual_label_column_name="target",
)

This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.

Features and Tags

Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.

Dataframe

fico_score

merchant_id

loan_amount

annual_income

home_ownership

num_credit_lines

inquests_in_last_6_months

months_since_last_delinquency

age

gender

predicted

target

578

Scammeds

4300

62966

RENT

110

male

not_fraud

fraud

507

Schiller Ltd

21000

52335

RENT

129

female

not_fraud

656

Kirlin and Sons

18000

94995

MORTGAGE

female

uncertain

414

Scammeds

18000

32034

LEASE

male

fraud

not_fraud

512

Champlin and Sons

20000

46005

OWN

148

male

uncertain

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    feature_column_names=[
        "fico_score",
        "merchant_id",
        "loan_amount",
        "annual_income",
        "home_ownership",
        "num_credit_lines",
        "inquests_in_last_6_months",
        "months_since_last_delinquency",
    ],
    tag_column_names=[
        "age",
        "gender",
    ],
)

Implicit Features

If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names field of your schema set to its default value of None, in which case, any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.

Dataframe

target

predicted

mean radius

mean texture

mean perimeter

mean area

mean smoothness

mean compactness

mean concavity

mean concave points

mean symmetry

mean fractal dimension

radius error

texture error

perimeter error

area error

smoothness error

compactness error

concavity error

concave points error

symmetry error

fractal dimension error

worst radius

worst texture

worst perimeter

worst area

worst smoothness

worst compactness

worst concavity

worst concave points

worst symmetry

worst fractal dimension

malignant

benign

15.49

19.97

102.40

744.7

0.11600

0.15620

0.18910

0.09113

0.1929

0.06744

0.6470

1.3310

4.675

66.91

0.007269

0.02928

0.04972

0.01639

0.01852

0.004232

21.20

29.41

142.10

1359.0

0.1681

0.3913

0.55530

0.21210

0.3187

0.10190

malignant

17.01

20.26

109.70

904.3

0.08772

0.07304

0.06950

0.05390

0.2026

0.05223

0.5858

0.8554

4.106

68.46

0.005038

0.01503

0.01946

0.01123

0.02294

0.002581

19.80

25.05

130.00

1210.0

0.1111

0.1486

0.19320

0.10960

0.3275

0.06469

malignant

17.99

10.38

122.80

1001.0

0.11840

0.27760

0.30010

0.14710

0.2419

0.07871

1.0950

0.9053

8.589

153.40

0.006399

0.04904

0.05373

0.01587

0.03003

0.006193

25.38

17.33

184.60

2019.0

0.1622

0.6656

0.71190

0.26540

0.4601

0.11890

benign

14.53

13.98

93.86

644.2

0.10990

0.09242

0.06895

0.06495

0.1650

0.06121

0.3060

0.7213

2.143

25.70

0.006133

0.01251

0.01615

0.01136

0.02207

0.003563

15.80

16.93

103.10

749.9

0.1347

0.1478

0.13730

0.10690

0.2606

0.07810

benign

10.26

14.71

66.20

321.6

0.09882

0.09159

0.03581

0.02037

0.1633

0.07005

0.3380

2.5090

2.394

19.33

0.017360

0.04671

0.02611

0.01296

0.03675

0.006758

10.88

19.48

70.89

357.1

0.1360

0.1636

0.07162

0.04074

0.2434

0.08488

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
)

Excluded Columns

You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields, otherwise, Phoenix will assume that they are features.

Dataframe

target

predicted

hospital

insurance_provider

mean radius

mean texture

mean perimeter

mean area

mean smoothness

mean compactness

mean concavity

mean concave points

mean symmetry

mean fractal dimension

radius error

texture error

perimeter error

area error

smoothness error

compactness error

concavity error

concave points error

symmetry error

fractal dimension error

worst radius

worst texture

worst perimeter

worst area

worst smoothness

worst compactness

worst concavity

worst concave points

worst symmetry

worst fractal dimension

malignant

benign

Pacific Clinics

uninsured

15.49

19.97

102.40

744.7

0.11600

0.15620

0.18910

0.09113

0.1929

0.06744

0.6470

1.3310

4.675

66.91

0.007269

0.02928

0.04972

0.01639

0.01852

0.004232

21.20

29.41

142.10

1359.0

0.1681

0.3913

0.55530

0.21210

0.3187

0.10190

malignant

Queens Hospital

Anthem Blue Cross

17.01

20.26

109.70

904.3

0.08772

0.07304

0.06950

0.05390

0.2026

0.05223

0.5858

0.8554

4.106

68.46

0.005038

0.01503

0.01946

0.01123

0.02294

0.002581

19.80

25.05

130.00

1210.0

0.1111

0.1486

0.19320

0.10960

0.3275

0.06469

malignant

St. Francis Memorial Hospital

Blue Shield of CA

17.99

10.38

122.80

1001.0

0.11840

0.27760

0.30010

0.14710

0.2419

0.07871

1.0950

0.9053

8.589

153.40

0.006399

0.04904

0.05373

0.01587

0.03003

0.006193

25.38

17.33

184.60

2019.0

0.1622

0.6656

0.71190

0.26540

0.4601

0.11890

benign

Pacific Clinics

Kaiser Permanente

14.53

13.98

93.86

644.2

0.10990

0.09242

0.06895

0.06495

0.1650

0.06121

0.3060

0.7213

2.143

25.70

0.006133

0.01251

0.01615

0.01136

0.02207

0.003563

15.80

16.93

103.10

749.9

0.1347

0.1478

0.13730

0.10690

0.2606

0.07810

benign

CityMed

Anthem Blue Cross

10.26

14.71

66.20

321.6

0.09882

0.09159

0.03581

0.02037

0.1633

0.07005

0.3380

2.5090

2.394

19.33

0.017360

0.04671

0.02611

0.01296

0.03675

0.006758

10.88

19.48

70.89

357.1

0.1360

0.1636

0.07162

0.04074

0.2434

0.08488

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    excluded_column_names=[
        "hospital",
        "insurance_provider",
    ],
)

Embedding Features

Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames to associate multiple dataframe columns with the same embedding feature.

The example in this section contain low-dimensional embeddings for the sake of easy viewing. Your embeddings in practice will typically have much higher dimension.

Embedding Vectors

To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name field on px.EmbeddingColumnNames. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:

Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names is a dictionary.
The key of this dictionary, "transaction_embedding," is not a column of your dataframe but is name you choose for your embedding feature that appears in the UI.
The values of this dictionary are instances of px.EmbeddingColumnNames.
Each entry in the "embedding_vector" column is a list of length 4.

Dataframe

predicted

target

embedding_vector

fico_score

merchant_id

loan_amount

annual_income

home_ownership

num_credit_lines

inquests_in_last_6_months

months_since_last_delinquency

fraud

not_fraud

[-0.97, 3.98, -0.03, 2.92]

604

Leannon Ward

22000

100781

RENT

108

fraud

not_fraud

[3.20, 3.95, 2.81, -0.09]

612

Scammeds

7500

116184

MORTGAGE

not_fraud

[-0.49, -0.62, 0.08, 2.03]

646

Leannon Ward

32000

73666

RENT

131

not_fraud

[1.69, 0.01, -0.76, 3.64]

560

Kirlin and Sons

19000

38589

MORTGAGE

131

uncertain

[1.46, 0.69, 3.26, -0.17]

636

Champlin and Sons

10000

100251

MORTGAGE

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    embedding_feature_column_names={
        "transaction_embeddings": px.EmbeddingColumnNames(
            vector_column_name="embedding_vector"
        ),
    },
)

To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length, otherwise, Phoenix will throw an error.

Embeddings of Images

If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name field on px.EmbeddingColumnNames. The following example contains data for an image classification model that detects product defects on an assembly line.

Dataframe

defective

image

image_vector

okay

https://www.example.com/image0.jpeg

[1.73, 2.67, 2.91, 1.79, 1.29]

defective

https://www.example.com/image1.jpeg

[2.18, -0.21, 0.87, 3.84, -0.97]

okay

https://www.example.com/image2.jpeg

[3.36, -0.62, 2.40, -0.94, 3.69]

defective

https://www.example.com/image3.jpeg

[2.77, 2.79, 3.36, 0.60, 3.10]

okay

https://www.example.com/image4.jpeg

[1.79, 2.06, 0.53, 3.58, 0.24]

Schema

schema = px.Schema(
    actual_label_column_name="defective",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)

Local Images

For local image data, we recommend the following steps to serve your images via a local HTTP server:

In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.

For example, suppose your HTTP server is running in a directory with the following contents:

.
└── image-data
    └── example_image.jpeg

Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.

Embeddings of Text

If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name field on px.EmbeddingColumnNames. The embeddings below were generated by a sentiment classification model trained on product reviews.

Dataframe

name

text

text_vector

Schema

schema = px.Schema(
    actual_label_column_name="sentiment",
    feature_column_names=[
        "category",
    ],
    tag_column_names=[
        "name",
    ],
    embedding_feature_column_names={
        "product_review_embeddings": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        ),
    },
)

Multiple Embedding Features

Sometimes it is useful to have more than one embedding feature. The example below shows a multi-modal application in which one embedding represents the textual description and another embedding represents the image associated with products on an e-commerce site.

Dataframe

name

description

description_vector

image

image_vector

Magic Lamp

Enjoy the most comfortable setting every time for working, studying, relaxing or getting ready to sleep.

[2.47, -0.01, -0.22, 0.93]

https://www.example.com/image0.jpeg

[2.42, 1.95, 0.81, 2.60, 0.27]

Ergo Desk Chair

The perfect mesh chair, meticulously developed to deliver maximum comfort and high quality.

[-0.25, 0.07, 2.90, 1.57]

https://www.example.com/image1.jpeg

[3.17, 2.75, 1.39, 0.44, 3.30]

Cloud Nine Mattress

Our Cloud Nine Mattress combines cool comfort with maximum affordability.

[1.36, -0.88, -0.45, 0.84]

https://www.example.com/image2.jpeg

[-0.22, 0.87, 1.10, -0.78, 1.25]

Dr. Fresh's Spearmint Toothpaste

Natural toothpaste helps remove surface stains for a brighter, whiter smile with anti-plaque formula

[-0.39, 1.29, 0.92, 2.51]

https://www.example.com/image3.jpeg

[1.95, 2.66, 3.97, 0.90, 2.86]

Ultra-Fuzzy Bath Mat

The bath mats are made up of 1.18-inch height premium thick, soft and fluffy microfiber, making it great for bathroom, vanity, and master bedroom.

[0.37, 3.22, 1.29, 0.65]

https://www.example.com/image4.jpeg

[0.77, 1.79, 0.52, 3.79, 0.47]

Schema

schema = px.Schema(
    tag_column_names=["name"],
    embedding_feature_column_names={
        "description_embedding": px.EmbeddingColumnNames(
            vector_column_name="description_vector",
            raw_data_column_name="description",
        ),
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)

Distinct embedding features may have embedding vectors of differing length. The text embeddings in the above example have length 4 while the image embeddings have length 5.