1 of 100

English

Arize Phoenix

AI Observability and Evaluation

Phoenix is an open-source observability tool designed for experimentation, evaluation, and troubleshooting of AI and LLM applications. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve. Phoenix is built by Arize AI, the company behind the industry-leading AI observability platform, and a set of core contributors.

Phoenix works with OpenTelemetry and OpenInference instrumentation. See Integrations for details.

Features

Tracing is a helpful tool for understanding how your LLM application works. Phoenix's open-source library offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework.

Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks (LlamaIndex, LangChain, DSPy), SDKs (OpenAI, Bedrock, Mistral, Vertex), and Languages. (Python, Javascript, etc.)

Phoenix is built to help you evaluate your application and understand their true performance. To accomplish this, Phoenix includes:

A standalone library to run LLM-based evaluations on your own datasets. This can be used either with the Phoenix library, or independently over your own data.
Direct integration of LLM-based and code-based evaluators into the Phoenix dashboard. Phoenix is built to be agnostic, and so these evals can be generated using Phoenix's library, or an external library like Ragas, Deepeval, or Cleanlab.
Human annotation capabilities to attach human ground truth labels to your data in Phoenix.

Phoenix offers tools to streamline your prompt engineering workflow.

Prompt Management - Create, store, modify, and deploy prompts for interacting with LLMs
Prompt Playground - Play with prompts, models, invocation parameters and track your progress via tracing and experiments
Span Replay - Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
Prompts in Code - Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.

Phoenix Datasets & Experiments let you test different versions of your application, store relevant traces for evaluation and analysis, and build robust evaluations into your development process.

Run Experiments to test and compare different iterations of your application
Collect relevant traces into a Dataset, or directly upload Datasets from code / CSV
Run Datasets through Prompt Playground, export them in fine-tuning format, or attach them to an Experiment.

Quickstarts

Running Phoenix for the first time? Select a quickstart below.

Next Steps

Try our Tutorials

Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.

Add Integrations

Add instrumentation for popular packages and libraries such as OpenAI, LangGraph, Vercel AI SDK and more.

Community

Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.

Quickstarts

Not sure where to start? Try a quickstart:

Phoenix runs seamlessly in notebooks, containers, Phoenix Cloud, or from the terminal.

Learn more about Environment options.

User Guide

Phoenix is a comprehensive platform designed to enable observability across every layer of an LLM-based system, empowering teams to build, optimize, and maintain high-quality applications and agents efficiently.

🛠️ Develop

During the development phase, Phoenix offers essential tools for debugging, experimentation, evaluation, prompt tracking, and search and retrieval.

Traces for Debugging

Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.

Quickstart: Tracing

Experimentation

Leverage experiments to measure prompt and model performance. Typically during this early stage, you'll focus on gathering a robust set of test cases and evaluation metrics to test initial iterations of your application. Experiments at this stage may resemble unit tests, as they're geared towards ensuring your application performs correctly.

Run Experiments

Evaluation

Either as a part of experiments or a standalone feature, evaluations help you understand how your app is performing at a granular level. Typical evaluations might be correctness evals compared against a ground truth data set, or LLM-as-a-judge evals to detect hallucinations or relevant RAG output.

Quickstart: Evals

Prompt Engineering

Prompt engineering is critical to how a model behaves. While there are other methods such as fine-tuning to change behavior, prompt engineering is the simplest way to get started and often has the best ROI.

Overview: Prompts

Instrument prompt and prompt variable collection to associate iterations of your app with the performance measured through evals and experiments. Phoenix tracks prompt templates, variables, and versions during execution to help you identify improvements and degradations.

Instrument Prompt Templates and Prompt Variables

Search & Retrieval Embeddings Visualizer

Phoenix's search and retrieval optimization tools include an embeddings visualizer that helps teams understand how their data is being represented and clustered. This visual insight can guide decisions on indexing strategies, similarity measures, and data organization to improve the relevance and efficiency of search results.

Quickstart: Inferences

🧪 Testing/Staging

In the testing and staging environment, Phoenix supports comprehensive evaluation, benchmarking, and data curation. Traces, experimentation, prompt tracking, and embedding visualizer remain important in the testing and staging phase, helping teams identify and resolve issues before deployment.

Iterate via Experiments

With a stable set of test cases and evaluations defined, you can now easily iterate on your application and view performance changes in Phoenix right away. Swap out models, prompts, or pipeline logic, and run your experiment to immediately see the impact on performance.

Run Experiments

Evals Testing

Phoenix's flexible evaluation framework supports thorough testing of LLM outputs. Teams can define custom metrics, collect user feedback, and leverage separate LLMs for automated assessment. Phoenix offers tools for analyzing evaluation results, identifying trends, and tracking improvements over time.

Quickstart: Evals

Curate Data

Phoenix assists in curating high-quality data for testing and fine-tuning. It provides tools for data exploration, cleaning, and labeling, enabling teams to curate representative data that covers a wide range of use cases and edge conditions.

Quickstart: Datasets & Experiments

Guardrails

Add guardrails to your application to prevent malicious and erroneous inputs and outputs. Guardrails will be visualized in Phoenix, and can be attached to spans and traces in the same fashion as evaluation metrics.

https://github.com/Arize-ai/phoenix/blob/docs/docs/broken-reference/README.md

🚀 Production

In production, Phoenix works hand-in-hand with Arize, which focuses on the production side of the LLM lifecycle. The integration ensures a smooth transition from development to production, with consistent tooling and metrics across both platforms.

Traces in Production

Phoenix and Arize use the same collector frameworks in development and production. This allows teams to monitor latency, token usage, and other performance metrics, setting up alerts when thresholds are exceeded.

Quickstart: Tracing

Evals for Production

Phoenix's evaluation framework can be used to generate ongoing assessments of LLM performance in production. Arize complements this with online evaluations, enabling teams to set up alerts if evaluation metrics, such as hallucination rates, go beyond acceptable thresholds.

How to: Evals

Fine-tuning

Phoenix and Arize together help teams identify data points for fine-tuning based on production performance and user feedback. This targeted approach ensures that fine-tuning efforts are directed towards the most impactful areas, maximizing the return on investment.

Phoenix, in collaboration with Arize, empowers teams to build, optimize, and maintain high-quality LLM applications throughout the entire lifecycle. By providing a comprehensive observability platform and seamless integration with production monitoring tools, Phoenix and Arize enable teams to deliver exceptional LLM-driven experiences with confidence and efficiency.

Fine-Tuning

Environments

The Phoenix app can be run in various environments such as Colab and SageMaker notebooks, as well as be served via the terminal or a docker container.

If you are set up, see Quickstarts to start using Phoenix in your preferred environment.

Phoenix Cloud

Phoenix Cloud provides free-to-use Phoenix instances that are preconfigured for you with 10GBs of storage space. Phoenix Cloud instances are a great starting point, however if you need more storage or more control over your instance, self-hosting options could be a better fit.

If you're using Phoenix Cloud, be sure to set the proper environment variables to connect to your instance:

import os

# Add Phoenix API Key for tracing
os.environ["PHOENIX_API_KEY"] = "ADD YOUR PHOENIX API KEY"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "ADD YOUR PHOENIX HOSTNAME"

# If you created your Phoenix Cloud instance before June 24th, 2025,
# you also need to set the API key as a header
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

Container

See Self-Hosting.

Notebooks

To start phoenix in a notebook environment, run:

import phoenix as px

session = px.launch_app()

This will start a local Phoenix server. You can initialize the phoenix server with various kinds of data (traces, inferences).

By default, Phoenix does not persist your data when run in a notebook.

Terminal

If you want to start a phoenix server to collect traces, you can also run phoenix directly from the command line:

phoenix serve

This will start the phoenix server on port 6006. If you are running your instrumented notebook or application on the same machine, traces should automatically be exported to http://127.0.0.1:6006 so no additional configuration is needed. However if the server is running remotely, you will have to modify the environment variable PHOENIX_COLLECTOR_ENDPOINT to point to that machine (e.g. http://<my-remote-machine>:<port>)

Production Guide

Moving your application to production: steps for reliability and scale

Moving your Phoenix deployment from development to production requires additional configuration for reliability, performance, and scalability. This page outlines the key steps and considerations to prepare your Phoenix instance for production workloads.

Configuration

Enable Batch Processing

Turn on the batch processor for spans, metrics, and logs. Batching improves data compression and reduces the number of outgoing connections required to transmit data efficiently. This is critical for stable ingestion at higher volumes.

The batch processor supports:

Size-based batching (batch emits when a max number of items is reached)
Time-based batching (batch emits after a configurable timeout)

Use gRPC Transport

Switch your exporters to use gRPC wherever possible to maximize payload compression and reduce network overhead in production environments.

Scaling Facilities

Plan for scaling resources to match your workload, including:

Memory scaling for high-cardinality workloads or long retention windows.
Disk scaling for log and trace ingestion, especially if retaining high volumes.
Horizontal scaling if your deployment needs to handle increased concurrency.

Enable Database Backups

Ensure that automated backups are enabled for your Postgres instance. This protects your data and allows recovery in the event of failures or data corruption.

Resourcing Guidelines

Depending on your workload, you might need to provision your Phoenix instance with varying memory and data resources.

Memory Sizing

Memory requirements depend on several factors:

Ingestion volume: Higher volumes of traces and logs increase memory needs for processing and indexing.
Variety of labels and attributes: Workloads with many unique labels and attributes require additional memory for tracking and querying.
Retention settings: Longer retention windows increase memory requirements for in-memory caching and indexing.

Monitor memory usage under expected production load and adjust resources to maintain your application performance.

Database Sizing

For production and scalable deployments, Phoenix supports PostgreSQL. The database size will depend on:

Ingestion rate: Higher data ingestion will increase storage usage.
Retention periods: Longer data retention requires additional storage capacity.
Variety of labels and attributes: Workloads with many unique values consume more database space for indexing and storage.

Regularly monitor disk utilization to plan for scaling and ensure stable, reliable operation.

Backups

A solid backup plan protects your data and supports disaster recovery. Implement a Postgres backup strategy that considers:

Backup frequency: How often backups occur.
Backup methods: Such as point-in-time recovery (PITR) and full backups.
Test restores: Regularly verify backups by restoring data.

Tracing

Overview: Tracing

Tracing the execution of LLM applications using Telemetry

Phoenix traces AI applications, via OpenTelemetry and has first-class integrations with LlamaIndex, LangChain, OpenAI, and others.

LLM tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation to provide a detailed timeline of the request's execution.

Tracing is a helpful tool for understanding how your LLM application works. Phoenix offers comprehensive tracing capabilities that are not tied to any specific LLM vendor or framework. Phoenix accepts traces over the OpenTelemetry protocol (OTLP) and supports first-class instrumentation for a variety of frameworks ( LlamaIndex, LangChain, DSPy), SDKs (OpenAI, Bedrock, Mistral, Vertex), and languages. (Python, JavaScript, etc.)

Using Phoenix's tracing capabilities can provide important insights into the inner workings of your LLM application. By analyzing the collected trace data, you can identify and address various performance and operational issues and improve the overall reliability and efficiency of your system.

Application Latency: Identify and address slow invocations of LLMs, Retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
Token Usage: Gain a detailed breakdown of token usage for your LLM calls, allowing you to identify and optimize the most expensive LLM invocations.
Runtime Exceptions: Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
Retrieved Documents: Inspect the documents retrieved during a Retriever call, including the score and order in which they were returned to provide insight into the retrieval process.
Embeddings: Examine the embedding text used for retrieval and the underlying embedding model to allow you to validate and refine your embedding strategies.
LLM Parameters: Inspect the parameters used when calling an LLM, such as temperature and system prompts, to ensure optimal configuration and debugging.
Prompt Templates: Understand the prompt templates used during the prompting step and the variables that were applied, allowing you to fine-tune and improve your prompting strategies.
Tool Descriptions: View the descriptions and function signatures of the tools your LLM has been given access to in order to better understand and control your LLM’s capabilities.
LLM Function Calls: For LLMs with function call capabilities (e.g., OpenAI), you can inspect the function selection and function messages in the input to the LLM, further improving your ability to debug and optimize your application.

By using tracing in Phoenix, you can gain increased visibility into your LLM application, empowering you to identify and address performance bottlenecks, optimize resource utilization, and ensure the overall reliability and effectiveness of your system.

Next steps

To get started, check out the Quickstart guide.
Read more about and .
Check out the How-To Guides for specific tutorials.

Quickstart: Tracing

Phoenix supports three main options to collect traces:

Use to mark functions and code blocks.
Use to capture all calls made to supported frameworks.
Use instrumentation. Supported in and

Quickstarts

Explore a Demo Trace

Quickstart: Tracing (Python)

Overview

Phoenix supports three main options to collect traces:

Use Phoenix's decorators to mark functions and code blocks.
Use automatic instrumentation to capture all calls made to supported frameworks.
Use base OpenTelemetry instrumentation. Supported in Python and TS / JS, among many other languages.

This example uses options 1 and 2.

Launch Phoenix

Connect to Phoenix

To collect traces from your application, you must configure an OpenTelemetry TracerProvider to send traces to Phoenix.

pip install arize-phoenix-otel

from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(
  project_name="my-llm-app", # Default is 'default'
  auto_instrument=True, # See 'Trace all calls made to a library' below
)
tracer = tracer_provider.get_tracer(__name__)

Trace your own functions

Functions can be traced using decorators:

@tracer.chain
def my_func(input: str) -> str:
    return "output"

Input and output attributes are set automatically based on my_func's parameters and return.

Trace all calls made to a library

Phoenix can also capture all calls made to supported libraries automatically. Just install the respective OpenInference library:

pip install openinference-instrumentation-openai

OpenInference libraries must be installed before calling the register function

# Add OpenAI API Key
import os
import openai

os.environ["OPENAI_API_KEY"] = "ADD YOUR OPENAI API KEY"

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku."}],
)
print(response.choices[0].message.content)

View your Traces in Phoenix

You should now see traces in Phoenix!

Next Steps

Explore tracing integrations
Customize tracing
View use cases to see end-to-end examples

Features: Tracing

Tracing is a critical part of AI Observability and should be used both in production and development

This section contains details on Tracing features:

Projects
Annotations
Sessions

Projects

Use projects to organize your LLM traces

Projects provide organizational structure for your AI applications, allowing you to logically separate your observability data. This separation is essential for maintaining clarity and focus.

With Projects, you can:

Segregate traces by environment (development, staging, production)
Isolate different applications or use cases
Track separate experiments without cross-contamination
Maintain dedicated evaluation spaces for specific initiatives
Create team-specific workspaces for collaborative analysis

Projects act as containers that keep related traces and conversations together while preventing them from interfering with unrelated work. This organization becomes increasingly valuable as you scale - allowing you to easily switch between contexts without losing your place or mixing data.

The Project structure also enables comparative analysis across different implementations, models, or time periods. You can run parallel versions of your application in separate projects, then analyze the differences to identify improvements or regressions.

Explore a Demo Project

Annotations

In order to improve your LLM application iteratively, it's vital to collect feedback, annotate data during human review, as well as to establish an evaluation pipeline so that you can monitor your application. In Phoenix we capture this type of feedback in the form of annotations.

Phoenix gives you the ability to annotate traces with feedback from the UI, your application, or wherever you would like to perform evaluation. Phoenix's annotation model is simple yet powerful - given an entity such as a span that is collected, you can assign a label and/or a score to that entity.

Next Steps

Learn more about the concepts:
Configure Annotation Configs to guide human annotations.
How to run
Learn how to log annotations via the client from your app or in a notebook

Sessions

Track and analyze multi-turn conversations

Sessions enable tracking and organizing related traces across multi-turn conversations with your AI application. When building conversational AI, maintaining context between interactions is critical - Sessions make this possible from an observability perspective.

With Sessions in Phoenix, you can:

Track the entire history of a conversation in a single thread
View conversations in a chatbot-like UI showing inputs and outputs of each turn
Search through sessions to find specific interactions
Track token usage and latency per conversation

This feature is particularly valuable for applications where context builds over time, like chatbots, virtual assistants, or any other multi-turn interaction. By tagging spans with a consistent session ID, you create a connected view that reveals how your application performs across an entire user journey.

Next Steps

Check out how to

Setup using Phoenix OTEL

phoenix.otel is a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults.

pip install arize-phoenix-otel

These defaults are aware of environment variables you may have set to configure Phoenix:

PHOENIX_COLLECTOR_ENDPOINT
PHOENIX_PROJECT_NAME
PHOENIX_CLIENT_HEADERS
PHOENIX_API_KEY
PHOENIX_GRPC_PORT

Quickstart: `phoenix.otel.register`

The phoenix.otel module provides a high-level register function to configure OpenTelemetry tracing by setting a global TracerProvider. The register function can also configure headers and whether or not to process spans one by one or by batch.

from phoenix.otel import register
tracer_provider = register(
    project_name="default", # sets a project name for spans
    batch=True, # uses a batch span processor
    auto_instrument=True, # uses all installed OpenInference instrumentors
)

Phoenix Authentication

If the PHOENIX_API_KEY environment variable is set, register will automatically add an authorization header to each span payload.

Configuring the collector endpoint

There are two ways to configure the collector endpoint:

Using environment variables
Using the endpoint keyword argument

Using environment variables

If you're setting the PHOENIX_COLLECTOR_ENDPOINT environment variable, register will automatically try to send spans to your Phoenix server using gRPC.

# export PHOENIX_COLLECTOR_ENDPOINT=https://your-phoenix.com:6006

from phoenix.otel import register

# sends traces to https://your-phoenix.com:4317
tracer_provider = register()

# export PHOENIX_COLLECTOR_ENDPOINT=https://your-phoenix.com:6006

from phoenix.otel import register

# sends traces to https://your-phoenix.com/v1/traces
tracer_provider = register(
    protocol="http/protobuf",
)

Specifying the `endpoint` directly

When passing in the endpoint argument, you must specify the fully qualified endpoint. If the PHOENIX_GRPC_PORT environment variable is set, it will override the default gRPC port.

The HTTP transport protocol is inferred from the endpoint

from phoenix.otel import register
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")

The GRPC transport protocol is inferred from the endpoint

from phoenix.otel import register
tracer_provider = register(endpoint="http://localhost:4317")

Additionally, the protocol argument can be used to enforce the OTLP transport protocol regardless of the endpoint. This might be useful in cases such as when the GRPC endpoint is bound to a different port than the default (4317). The valid protocols are: "http/protobuf", and "grpc".

from phoenix.otel import register
tracer_provider = register(
    endpoint="http://localhost:9999",
    protocol="grpc", # use "http/protobuf" for http transport
)

Additional configuration

register can be configured with different keyword arguments:

project_name: The Phoenix project name
- or use PHOENIX_PROJECT_NAME env. var
headers: Headers to send along with each span payload
- or use PHOENIX_CLIENT_HEADERS env. var
batch: Whether or not to process spans in batch

from phoenix.otel import register
tracer_provider = register(
    project_name="otel-test",
    headers={"Authorization": "Bearer TOKEN"},
    batch=True,
)

Instrumentation

Once you've connected your application to your Phoenix instance using phoenix.otel.register, you need to instrument your application. You have a few options to do this:

Using OpenInference auto-instrumentors. If you've used the auto_instrument flag above, then any instrumentor packages in your environment will be called automatically. For a full list of OpenInference packages, see https://arize.com/docs/phoenix/integrations
Using Phoenix Decorators.
Using Base OTEL.

Setup Projects

Log to a specific project

Phoenix uses projects to group traces. If left unspecified, all traces are sent to a default project.

In the notebook, you can set the PHOENIX_PROJECT_NAME environment variable before adding instrumentation or running any of your code.

In python this would look like:

Note that setting a project via an environment variable only works in a notebook and must be done BEFORE instrumentation is initialized. If you are using OpenInference Instrumentation, see the Server tab for how to set the project name in the Resource attributes.

Alternatively, you can set the project name in your register function call:

If you are using Phoenix as a collector and running your application separately, you can set the project name in the Resource attributes for the trace provider.

Projects work by setting something called the Resource attributes (as seen in the OTEL example above). The phoenix server uses the project name attribute to group traces into the appropriate project.

Switching projects in a notebook

Typically you want traces for an LLM app to all be grouped in one project. However, while working with Phoenix inside a notebook, we provide a utility to temporarily associate spans with different projects. You can use this to trace things like evaluations.

Add Metadata

Tracing can be augmented and customized by adding Metadata. Metadata includes your own custom attributes, user ids, session ids, prompt templates, and more.

Learn how to add custom metadata and attributes to your traces

Learn how to define custom prompt templates and variables in your tracing.

Instrument Prompt Templates and Prompt Variables

Instrumenting prompt templates and variables allows you to track and visualize prompt changes. These can also be combined with Experiments to measure the performance changes driven by each of your prompts.

We provide a using_prompt_template context manager to add a prompt template (including its version and variables) to the current OpenTelemetry Context. OpenInference auto-instrumentors will read this Context and pass the prompt template fields as span attributes, following the OpenInference semantic conventions. Its inputs must be of the following type:

Template: non-empty string.
Version: non-empty string.
Variables: a dictionary with string keys. This dictionary will be serialized to JSON when saved to the OTEL Context and remain a JSON string when sent as a span attribute.

It can also be used as a decorator:

@using_prompt_template(
    template=prompt_template,
    variables=prompt_template_variables,
    version="v1.0",
)
def call_fn(*args, **kwargs):
    # Calls within this function will generate spans with the attributes:
    # "llm.prompt_template.template" = "Please describe the weather forecast for {city} on {date}"
    # "llm.prompt_template.version" = "v1.0"
    # "llm.prompt_template.variables" = "{\"city\": \"Johannesburg\", \"date\": \"July 11\"}" # JSON serialized
    ...

We provide a setPromptTemplate function which allows you to set a template, version, and variables on context. You can use this utility in conjunction with context.with to set the active context. OpenInference auto instrumentations will then pick up these attributes and add them to any spans created within the context.with callback. The components of a prompt template are:

template - a string with templated variables ex. "hello {{name}}"
variables - an object with variable names and their values ex. {name: "world"}
version - a string version of the template ex. v1.0

All of these are optional. Application of variables to a template will typically happen before the call to an llm and may not be picked up by auto instrumentation. So, this can be helpful to add to ensure you can see the templates and variables while troubleshooting.

import { context } from "@opentelemetry/api"
import { setPromptTemplate } from "@openinference-core"

context.with(
  setPromptTemplate(
    context.active(),
    { 
      template: "hello {{name}}",
      variables: { name: "world" },
      version: "v1.0"
    }
  ),
  () => {
      // Calls within this block will generate spans with the attributes:
      // "llm.prompt_template.template" = "hello {{name}}"
      // "llm.prompt_template.version" = "v1.0"
      // "llm.prompt_template.variables" = '{ "name": "world" }'
  }
)

Annotate Traces

Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:

Track performance over time
Identify areas for improvement
Compare different model versions or prompts
Gather data for fine-tuning or retraining
Provide stakeholders with concrete metrics on system effectiveness

Phoenix allows you to annotate traces through the Client, the REST API, or the UI.

Guides

To learn how to configure annotations and to annotate through the UI, see Annotating in the UI
To learn how to add human labels to your traces, either manually or programmatically, see Annotating via the Client
To learn how to evaluate traces captured in Phoenix, see Running Evals on Traces
To learn how to upload your own evaluation labels into Phoenix, see Log Evaluation Results

For more background on the concept of annotations, see Annotations

Annotating in the UI

How to annotate traces in the UI for analysis and dataset curation

Configuring Annotations

To annotate data in the UI, you first will want to setup a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create various different types of annotations: Categorical, Continuous, and Freeform.

Annotation Types

Annotation Type: - Categorical: Predefined labels for selection. (e.x. 👍 or 👎) - Continuous: a score across a specified range. (e.g. confidence score 0-100) - Freeform: Open-ended text comments. (e.g. "correct")

Optimize the direction based on your goal: - Maximize: higher scores are better. (e.g. confidence) - Minimize: lower scores are better. (e.g. hallucinations) - None: direction optimization does not apply. (e.g. tone)

Adding Annotations

Once you have annotations configured, you can associate annotations to the data that you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application. You can also take notes as you go by either clicking on the explain link or by adding your notes to the bottom messages UI. You can always come back and edit / and delete your annotations. Annotations can be deleted from the table view under the Annotations tab.

Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.

Viewing Annotations

As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. was the annotation performed by a human, llm, or code), and so on. This can be particularly useful if you want to see if different annotators disagree.

Exporting Traces with specific Annotation Values

Once you have collected feedback in the form of annotations, you can filter your traces by the annotation values to narrow down to interesting samples (e.x. llm spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for things like experimentation, fine-tuning, or building a human-aligned eval.

Annotating Auto-Instrumented Spans

Use the capture_span_context context manager to annotate auto-instrumented spans

Assumes you are using openinference-instrumentation>=0.1.34

When working with spans that are automatically instrumented via OpenInference in your LLM applications, you often need to capture span contexts to apply feedback or annotations. The capture_span_context context manager provides a convenient way to capture all OpenInference spans within its scope, making it easier to apply feedback to specific spans in downstream operations.

The capture_span_context context manager allows you to:

Capture all spans created within a specific code block
Retrieve span contexts for later use in feedback systems
Maintain a clean separation between span creation and annotation logic
Apply feedback to spans without needing to track span IDs manually

Usage

You can use the captured span contexts to implement custom feedback logic. The captured span contexts integrate seamlessly with Phoenix's annotation system:

from openinference.instrumentation import capture_span_context
from opentelemetry.trace.span import format_span_id
from phoenix.client import Client

client = Client()

def process_llm_request_with_feedback(prompt: str):
    with capture_span_context() as capture:
        # Make LLM call (auto-instrumented)
        response = llm.invoke("Generate a summary")
        # Get user feedback (simulated)
        user_feedback = get_user_feedback(response)
        
        # Method 1: Get span ID using get_last_span_id (most recent span)
        last_span_id = capture.get_last_span_id()
        # Apply feedback to the most recent span
        if last_span_id:
            client.annotations.add_span_annotation(
                annotation_name="user_feedback",
                annotator_kind="HUMAN",
                span_id=last_span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )
        
        # Method 2: Get all captured span contexts and iterate
        span_contexts = capture.get_span_contexts()
        # Apply feedback to all captured spans
        for span_context in span_contexts:
            # Convert span context to span ID for annotation
            span_id = format_span_id(span_context.span_id)
            
            # Add annotation to Phoenix
            client.annotations.add_span_annotation(
                annotation_name="user_feedback_all",
                annotator_kind="HUMAN",
                span_id=span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )

Working with Multiple Span Types

You can filter spans based on their attributes:

with capture_span_context() as capture:
    # Make LLM call (auto-instrumented)
    response = llm.invoke("Generate a summary")
    
    span_contexts = capture.get_span_contexts()
    
    # Filter for specific span types
    llm_spans = [
        ctx for ctx in span_contexts 
        if hasattr(ctx, 'attributes')
    ]
    
    # Apply different feedback logic to different span types
    for span_context in llm_spans:
        apply_llm_feedback(span_context)

Resources

OpenInference

Running Evals on Traces

How to use an LLM judge to label and score your application

This guide will walk you through the process of evaluating traces captured in Phoenix, and exporting the results to the Phoenix UI.

This process is similar to the , but instead of creating your own dataset or using an existing external one, you'll export a trace dataset from Phoenix and log the evaluation results to Phoenix.

Install dependencies & Set environment variables

Connect to Phoenix

Note: if you're self-hosting Phoenix, swap your collector endpoint variable in the snippet below, and remove the Phoenix Client Headers variable.

Now that we have Phoenix configured, we can register that instance with OpenTelemetry, which will allow us to collect traces from our application here.

Prepare trace dataset

For the sake of making this guide fully runnable, we'll briefly generate some traces and track them in Phoenix. Typically, you would have already captured traces in Phoenix and would skip to "Download trace dataset from Phoenix"

Download trace dataset from Phoenix

Generate evaluations

Now that we have our trace dataset, we can generate evaluations for each trace. Evaluations can be generated in many different ways. Ultimately, we want to end up with a set of labels and/or scores for our traces.

You can generate evaluations using:

Plain code
Phoenix's
Your own
Other evaluation packages

As long as you format your evaluation results properly, you can upload them to Phoenix and visualize them in the UI.

Let's start with a simple example of generating evaluations using plain code. OpenAI has a habit of repeating jokes, so we'll generate evaluations to label whether a joke is a repeat of a previous joke.

We now have a DataFrame with a column for whether each joke is a repeat of a previous joke. Let's upload this to Phoenix.

Upload evaluations to Phoenix

Our evals_df has a column for the span_id and a column for the evaluation result. The span_id is what allows us to connect the evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named "label" and "score" to display in the UI.

You should now see evaluations in the Phoenix UI!

From here you can continue collecting and evaluating traces, or move on to one of these other guides:

If you're interested in more complex evaluation and evaluators, start with
If you're ready to start testing your application in a more rigorous manner, check out

Log Evaluation Results

This guide shows how LLM evaluation results in dataframes can be sent to Phoenix.

An evaluation must have a name (e.g. "Q&A Correctness") and its DataFrame must contain identifiers for the subject of evaluation, e.g. a span or a document (more on that below), and values under either the score, label, or explanation columns. See for more information.

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

Span Evaluations

A dataframe of span evaluations would look similar like the table below. It must contain span_id as an index or as a column. Once ingested, Phoenix uses the span_id to associate the evaluation with its target span.

span_id

label

score

explanation

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Q&A Correctness".

Document Evaluations

A dataframe of document evaluations would look something like the table below. It must contain span_id and document_position as either indices or columns. document_position is the document's (zero-based) index in the span's list of retrieved documents. Once ingested, Phoenix uses the span_id and document_position to associate the evaluation with its target span and document.

span_id

document_position

label

score

explanation

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Relevance".

Logging Multiple Evaluation DataFrames

Multiple sets of Evaluations can be logged by the same px.Client().log_evaluations() function call.

Specifying A Project for the Evaluations

By default the client will push traces to the project specified in the PHOENIX_PROJECT_NAME environment variable or to the default project. If you want to specify the destination project explicitly, you can pass the project name as a parameter.

Importing & Exporting Traces

Learn how to load a file of traces into Phoenix

Learn how to export trace data from Phoenix

Import Existing Traces

Phoenix supports loading data that contains OpenInference traces. This allows you to load an existing dataframe of traces into your Phoenix instance.

Usually these will be traces you've previously saved using Save All Traces.

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

import os

os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key=..."
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.

Importing Traces to an Existing Phoenix Instance

import phoenix as px

# Re-launch the app using trace data
px.launch_app(trace=px.TraceDataset(df))

# Load traces into an existing Phoenix instance
px.Client().log_traces(trace_dataset=px.TraceDataset(df))

# Load traces into an existing Phoenix instance from a local file
px.launch_app(trace=px.TraceDataset.load('f7733fda-6ad6-4427-a803-55ad2182b662', directory="/my_saved_traces/"))

Launching a new Phoenix Instance with Saved Traces

You can also launch a temporary version of Phoenix in your local notebook to quickly view the traces. But be warned, this Phoenix instance will only last as long as your notebook environment is runing

# Load traces from a dataframe
px.launch_app(trace=px.TraceDataset.load(my_traces))

# Load traces from a local file
px.launch_app(trace=px.TraceDataset.load('f7733fda-6ad6-4427-a803-55ad2182b662', directory="/my_saved_traces/"))

Exporting Annotated Spans

Span annotations can be an extremely valuable basis for improving your application. The Phoenix client provides useful ways to pull down spans and their associated annotations. This information can be used to:

build new LLM judges
form the basis for new datasets
help identify ideas for improving your application

Pulling Spans

If you only want the spans that contain a specific annotation, you can pass in a query that filters on annotation names, scores, or labels.

The queries can also filter by annotation scores and labels.

This spans dataframe can be used to pull associated annotations.

Instead of an input dataframe, you can also pass in a list of ids:

The annotations and spans dataframes can be easily joined to produce a one-row-per-annotation dataframe that can be used to analyze the annotations!

Advanced

Learn how to block PII from logging to Phoenix

Learn how to selectively block or turn off tracing

Learn how to send only certain spans to Phoenix

Learn how to trace images

Mask Span Attributes

In some situations, you may need to modify the observability level of your tracing. For instance, you may want to keep sensitive information from being logged for security reasons, or you may want to limit the size of the base64 encoded images logged to reduced payload size.

The OpenInference Specification defines a set of environment variables you can configure to suit your observability needs. In addition, the OpenInference auto-instrumentors accept a trace config which allows you to set these value in code without having to set environment variables, if that's what you prefer

The possible settings are:

Environment Variable Name

Effect

Type

Default

OPENINFERENCE_HIDE_INPUTS

Hides input value, all input messages & embedding input text

bool

False

OPENINFERENCE_HIDE_OUTPUTS

Hides output value & all output messages

bool

False

OPENINFERENCE_HIDE_INPUT_MESSAGES

Hides all input messages & embedding input text

bool

False

OPENINFERENCE_HIDE_OUTPUT_MESSAGES

Hides all output messages

bool

False

PENINFERENCE_HIDE_INPUT_IMAGES

Hides images from input messages

bool

False

OPENINFERENCE_HIDE_INPUT_TEXT

Hides text from input messages & input embeddings

bool

False

OPENINFERENCE_HIDE_OUTPUT_TEXT

Hides text from output messages

bool

False

OPENINFERENCE_HIDE_EMBEDDING_VECTORS

Hides returned embedding vectors

bool

False

OPENINFERENCE_HIDE_LLM_INVOCATION_PARAMETERS

Hides LLM invocation parameters

bool

False

OPENINFERENCE_HIDE_LLM_PROMPTS

Hides LLM prompts span attributes

bool

False

OPENINFERENCE_BASE64_IMAGE_MAX_LENGTH

Limits characters of a base64 encoding of an image

int

32,000

To set up this configuration you can either:

Set environment variables as specified above
Define the configuration in code as shown below
Do nothing and fall back to the default values
Use a combination of the three, the order of precedence is:
- Values set in the TraceConfig in code
- Environment variables
- default values

Below is an example of how to set these values in code using our OpenAI Python and JavaScript instrumentors, however, the config is respected by all of our auto-instrumentors.

from openinference.instrumentation import TraceConfig
config = TraceConfig(        
    hide_inputs=...,
    hide_outputs=...,
    hide_input_messages=...,
    hide_output_messages=...,
    hide_input_images=...,
    hide_input_text=...,
    hide_output_text=...,
    hide_embedding_vectors=...,
    hide_llm_invocation_parameters=...,
    hide_llm_prompts=...,
    base64_image_max_length=...,
)

from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(
    tracer_provider=tracer_provider,
    config=config,
)

/**
 * Everything left out of here will fallback to
 * environment variables then defaults
 */
const traceConfig = { hideInputs: true } 

const instrumentation = new OpenAIInstrumentation({ traceConfig })

Suppress Tracing

How to turn off tracing

Tracing can be paused temporarily or disabled permanently.

Pause tracing using context manager

If there is a section of your code for which tracing is not desired, e.g. the document chunking process, it can be put inside the suppress_tracing context manager as shown below.

from phoenix.trace import suppress_tracing

with suppress_tracing():
    # Code running inside this block doesn't generate traces.
    # For example, running LLM evals here won't generate additional traces.
    ...
# Tracing will resume outside the block.
...

Uninstrument the auto-instrumentors permanently

Calling .uninstrument() on the auto-instrumentors will remove tracing permanently. Below is the examples for LangChain, LlamaIndex and OpenAI, respectively.

LangChainInstrumentor().uninstrument()
LlamaIndexInstrumentor().uninstrument()
OpenAIInstrumentor().uninstrument()
# etc.

Filter Spans to Export

Sometimes while instrumenting your application, you may want to filter out or modify certain spans from being sent to Phoenix. For example, you may want to filter out spans that are that contain sensitive information or contain redundant information.

To do this, you can use a custom SpanProcessor and attach it to the OpenTelemetry TracerProvider.

In this example, we're filtering out any spans that have the name "secret_span" by bypassing the on_start and on_end hooks of the inherited BatchSpanProcessor.

Notice that this logic can be extended to modify a span and redact sensitive information if preserving the span is preferred.

Capture Multimodal Traces

Phoenix supports displaying images that are included in LLM traces.

To view images in Phoenix

Include either a base64 UTF-8 encoded image or an image url in the call made to your LLM

Example

You should see your image appear in Phoenix:

Prompt Engineering

Overview: Prompts

Prompt management allows you to create, store, and modify prompts for interacting with LLMs. By managing prompts systematically, you can improve reuse, consistency, and experiment with variations across different models and inputs.

Unlike traditional software, AI applications are non-deterministic and depend on natural language to provide context and guide model output. The pieces of natural language and associated model parameters embedded in your program are known as “prompts.”

Optimizing your prompts is typically the highest-leverage way to improve the behavior of your application, but “prompt engineering” comes with its own set of challenges. You want to be confident that changes to your prompts have the intended effect and don’t introduce regressions.

To get started, jump to .

Prompt Engineering Features

Phoenix offers a comprehensive suite of features to streamline your prompt engineering workflow.

- Create, store, modify, and deploy prompts for interacting with LLMs
- Play with prompts, models, invocation parameters and track your progress via tracing and experiments
- Replay the invocation of an LLM. Whether it's an LLM step in an LLM workflow or a router query, you can step into the LLM invocation and see if any modifications to the invocation would have yielded a better outcome.
- Phoenix offers client SDKs to keep your prompts in sync across different applications and environments.

Explore Demo Prompts

Prompt Management

Version and track changes made to prompt templates

Key benefits of prompt management include:

Reusability: Store and load prompts across different use cases.
Versioning: Track changes over time to ensure that the best performing version is deployed for use in your application.
Collaboration: Share prompts with others to maintain consistency and facilitate iteration.

To learn how to get started with prompt management, see Create a prompt

Prompt Playground

Phoenix's Prompt Playground makes the process of iterating and testing prompts quick and easy. Phoenix's playground supports (OpenAI, Anthropic, Gemini, Azure) as well as custom model endpoints, making it the ideal prompt IDE for you to build experiment and evaluate prompts and models for your task.

Speed: Rapidly test variations in the , model, invocation parameters, , and output format.
Reproducibility: All runs of the playground are , unlocking annotations and evaluation.
Datasets: Use as a fixture to run a prompt variant through its paces and to evaluate it systematically.
Prompt Management: directly within the playground.

To learn more on how to use the playground, see

Span Replay

Replay LLM spans traced in your application directly in the playground

Have you ever wanted to go back into a multi-step LLM chain and just replay one step to see if you could get a better outcome? Well you can with Phoenix's Span Replay. LLM spans that are stored within Phoenix can be loaded into the Prompt Playground and replayed. Replaying spans inside of Playground enables you to debug and improve the performance of your LLM systems by comparing LLM provider outputs, tweaking model parameters, changing prompt text, and more.

Chat completions generated inside of Playground are automatically instrumented, and the recorded spans are immediately available to be replayed inside of Playground.

Prompts in Code

Pull and push prompt changes via Phoenix's Python and TypeScript Clients

Using Phoenix as a backend, Prompts can be managed and manipulated via code by using our Python or TypeScript SDKs.

With the Phoenix Client SDK you can:

prompts dynamically
templates by name, version, or tag
templates with runtime variables and use them in your code. Native support for OpenAI, Anthropic, Gemini, Vercel AI SDK, and more. No propriatry client necessary.
Support for and . Execute tools defined within the prompt. Phoenix prompts encompasses more than just the text and messages.

To learn more about managing Prompts in code, see

Quickstart: Prompts

Prompts in Phoenix can be created, iterated on, versioned, tagged, and used either via the UI or our Python/TS SDKs. The UI option also includes our Prompt Playground, which allows you to compare prompt variations side-by-side in the Phoenix UI.

Quickstarts

Quickstart: Prompts (Python)

This guide will show you how to setup and use Prompts through Phoenix's Python SDK

Installation

Start out by installing the Phoenix library:

Connect to Phoenix

You'll need to specify your Phoenix endpoint before you can interact with the Client. The easiest way to do this is through an environment variable.

Creating a Prompt

Now you can create a prompt. In this example, you'll create a summarization Prompt.

Prompts in Phoenix have names, as well as multiple versions. When you create your prompt, you'll define its name. Then, each time you update your prompt, that will create a new version of the prompt under the same name.

Your prompt will now appear in your Phoenix dashboard:

Retrieving a Prompt

You can retrieve a prompt by name, tag, or version:

Using a Prompt

To use a prompt, call the prompt.format()function. Any {{ variables }} in the prompt can be set by passing in a dictionary of values.

Updating a Prompt

To update a prompt with a new version, simply call the create function using the existing prompt name:

The new version will appear in your Phoenix dashboard:

Congratulations! You can now create, update, access and use prompts using the Phoenix SDK!

Next Steps

From here, check out:

How to use your prompts in
Prompt iteration

How to: Prompts

Guides on how to do prompt engineering with Phoenix

Getting Started

Configure AI Providers - how to configure API keys for OpenAI, Anthropic, Gemini, and more.

Prompt Management

Organize and manage prompts with Phoenix to streamline your development workflow

Prompt management is currently available on a feature branch only and will be released in the next major version.

Create a prompt - how to create, update, and track prompt changes
Test a prompt - how to test changes to a prompt in the playground and in the notebook
Tag a prompt - how to mark certain prompt versions as ready for production
Using a prompt - how to integrate prompts into your code and experiments

Playground

Iterate on prompts and models in the prompt playground

Using the Playground - how to setup the playground and how to test prompt changes via datasets and experiments.

Configure AI Providers

Phoenix natively integrates with OpenAI, Azure OpenAI, Anthropic, and Google AI Studio (gemini) to make it easy to test changes to your prompts. In addition to the above, since many AI providers (deepseek, ollama) can be used directly with the OpenAI client, you can talk to any OpenAI compatible LLM provider.

Credentials

To securely provide your API keys, you have two options. One is to store them in your browser in local storage. Alternatively, you can set them as environment variables on the server side. If both are set at the same time, the credential set in the browser will take precedence.

Option 1: Store API Keys in the Browser

API keys can be entered in the playground application via the API Keys dropdown menu. This option stores API keys in the browser. Simply navigate to to settings and set your API keys.

Option 2: Set Environment Variables on Server Side

Available on self-hosted Phoenix

If the following variables are set in the server environment, they'll be used at API invocation time.

Provider

Environment Variable

Platform Link

OpenAI

OPENAI_API_KEY

Azure OpenAI

AZURE_OPENAI_API_KEY
AZURE_OPENAI_ENDPOINT
OPENAI_API_VERSION

Anthropic

ANTHROPIC_API_KEY

Gemini

GEMINI_API_KEY or GOOGLE_API_KEY

For Azure, you can also set the following server-side environment variables: AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_FEDERATED_TOKEN_FILE to use WorkloadIdentityCredential.

Using OpenAI Compatible LLMs

Option 1: Configure the base URL in the prompt playground

Since you can configure the base URL for the OpenAI client, you can use the prompt playground with a variety of OpenAI Client compatible LLMs such as Ollama, DeepSeek, and more.\

If you are using an LLM provider, you will have to set the OpenAI api key to that provider's api key for it to work.

OpenAI Client compatible providers Include

Provider

Base URL

Docs

DeepSeek

Ollama

Option 2: Server side configuration of the OpenAI base URL

Optionally, the server can be configured with the OPENAI_BASE_URL environment variable to change target any OpenAI compatible REST API.

For app.phoenix.arize.com, this may fail due to security reasons. In that case, you'd see a Connection Error appear.

If there is a LLM endpoint you would like to use, reach out to mailto://phoenix-support@arize.com

Using the Playground

General guidelines on how to use Phoenix's prompt playground

Setup

To first get started, you will first . In the playground view, create a valid prompt for the LLM and click Run on the top right (or the mod + enter)

If successful you should see the LLM output stream out in the Output section of the UI.

Prompt Editor

The prompt editor (typically on the left side of the screen) is where you define the . You select the template language (mustache or f-string) on the toolbar. Whenever you type a variable placeholder in the prompt (say {{question}} for mustache), the variable to fill will show up in the inputs section. Input variables must either be filled in by hand or can be filled in via a dataset (where each row has key / value pairs for the input).

Model Configuration

Every prompt instance can be configured to use a specific LLM and set of invocation parameters. Click on the model configuration button at the top of the prompt editor and configure your LLM of choice. Click on the "save as default" option to make your configuration sticky across playground sessions.

Comparing Prompts

The Prompt Playground offers the capability to compare multiple prompt variants directly within the playground. Simply click the + Compare button at the top of the first prompt to create duplicate instances. Each prompt variant manages its own independent template, model, and parameters. This allows you to quickly compare prompts (labeled A, B, C, and D in the UI) and run experiments to determine which prompt and model configuration is optimal for the given task.

Using Datasets with Prompts

Phoenix lets you run a prompt (or multiple prompts) on a dataset. Simply containing the input variables you want to use in your prompt template. When you click Run, Phoenix will apply each configured prompt to every example in the dataset, invoking the LLM for all possible prompt-example combinations. The result of your playground runs will be tracked as an experiment under the loaded dataset (see )

Playground Traces

All invocations of an LLM via the playground is recorded for analysis, annotations, evaluations, and dataset curation.

If you simply run an LLM in the playground using the free form inputs (e.g. not using a dataset), Your spans will be recorded in a project aptly titled "playground".

If however you run a prompt over dataset examples, the outputs and spans from your playground runs will be captured as an experiment. Each experiment will be named according to the prompt you ran the experiment over.

Test a prompt

Testing your prompts before you ship them is vital to deploying reliable AI applications

Testing in the Playground

Testing a prompt in the playground

The Playground is a fast and efficient way to refine prompt variations. You can load previous prompts and validate their performance by applying different variables.

Each single-run test in the Playground is recorded as a span in the Playground project, allowing you to revisit and analyze LLM invocations later. These spans can be added to datasets or reloaded for further testing.

Testing a prompt over a dataset

The ideal way to test a prompt is to construct a golden dataset where the dataset examples contains the variables to be applied to the prompt in the inputs and the outputs contains the ideal answer you want from the LLM. This way you can run a given prompt over N number of examples all at once and compare the synthesized answers against the golden answers.

Playground integrates with to help you iterate and incrementally improve your prompts. Experiment runs are automatically recorded and available for subsequent evaluation to help you understand how changes to your prompts, LLM model, or invocation parameters affect performance.

Testing prompt variations side-by-side

Prompt Playground supports side-by-side comparisons of multiple prompt variants. Click + Compare to add a new variant. Whether using Span Replay or testing prompts over a Dataset, the Playground processes inputs through each variant and displays the results for easy comparison.

Testing a prompt using code

Sometimes you may want to test a prompt and run evaluations on a given prompt. This can be particularly useful when custom manipulation is needed (e.x. you are trying to iterate on a system prompt on a variety of different chat messages). 🚧 This tutorial is coming soon

Tag a prompt

How to deploy prompts to different environments safely

Prompts in Phoenix are versioned in a linear history, creating a comprehensive audit trail of all modifications. Each change is tracked, allowing you to:

Review the complete history of a prompt
Understand who made specific changes
Revert to previous versions if needed

Creating a Tag

When you are ready to deploy a prompt to a certain environment (let's say staging), the best thing to do is to tag a specific version of your prompt as ready. By default Phoenix offers 3 tags, production, staging, and development but you can create your own tags as well.

Each tag can include an optional description to provide additional context about its purpose or significance. Tags are unique per prompt, meaning you cannot have two tags with the same name for the same prompt.

Creating a custom tag

It can be helpful to have custom tags to track different versions of a prompt. For example if you wanted to tag a certain prompt as the one that was used in your v0 release, you can create a custom tag with that name to keep track!

When creating a custom tag, you can provide:

A name for the tag (must be a valid identifier)
An optional description to provide context about the tag's purpose

Pulling a prompt by tag

Once a prompt version is tagged, you can pull this version of the prompt into any environment that you would like (an application, an experiment). Similar to git tags, prompt version tags let you create a "release" of a prompt (e.x. pushing a prompt to staging).

You can retrieve a prompt version by:

Using the tag name directly (e.g., "production", "staging", "development")
Using a custom tag name
Using the latest version (which will return the most recent version regardless of tags)

For full details on how to use prompts in code, see

Listing tags

You can list all tags associated with a specific prompt version. The list is paginated, allowing you to efficiently browse through large numbers of tags. Each tag in the list includes:

The tag's unique identifier
The tag's name
The tag's description (if provided)

This is particularly useful when you need to:

Review all tags associated with a prompt version
Verify which version is currently tagged for a specific environment
Track the history of tag changes for a prompt version

Using the Client

Tag Naming Rules

Tag names must be valid identifiers: lowercase letters, numbers, hyphens, and underscores, starting and ending with a letter or number.

Examples: staging, production-v1, release-2024

Creating and Managing Tags

Datasets & Experiments

Overview: Datasets & Experiments

The velocity of AI application development is bottlenecked by quality evaluations because AI engineers are often faced with hard tradeoffs: which prompt or LLM best balances performance, latency, and cost. High quality evaluations are critical as they can help developers answer these types of questions with greater confidence.

Datasets

Datasets are integral to evaluation. They are collections of examples that provide the inputs and, optionally, expected reference outputs for assessing your application. Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are used to run experiments and evaluations to track improvements to your prompt, LLM, or other parts of your LLM application.

Experiments

In AI development, it's hard to understand how a change will affect performance. This breaks the dev flow, making iteration more guesswork than engineering.

Experiments and evaluations solve this, helping distill the indeterminism of LLMs into tangible feedback that helps you ship more reliable product.

Specifically, good evals help you:

Understand whether an update is an improvement or a regression
Drill down into good / bad examples
Compare specific examples vs. prior runs
Avoid guesswork

How-to: Datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning,

How to create datasets

Datasets are critical assets for building robust prompts, evals, fine-tuning, and much more. Phoenix allows you to build datasets manually, programmatically, or from files.

Create datasets from CSV
Create datasets from Pandas
Create datasets from spans
Create datasets using synthetic data

Exporting datasets

Export datasets for offline analysis, evals, and fine-tuning.

Exporting to CSV - how to quickly download a dataset to use elsewhere
Exporting to OpenAI Ft - want to fine tune an LLM for better accuracy and cost? Export llm examples for fine-tuning.
Exporting to OpenAI Evals - have some good examples to use for benchmarking of llms using OpenAI evals? export to OpenAI evals format.

Creating Datasets

From CSV

When manually creating a dataset (let's say collecting hypothetical questions and answers), the easiest way to start is by using a spreadsheet. Once you've collected the information, you can simply upload the CSV of your data to the Phoenix platform using the UI. You can also programmatically upload tabular data using Pandas as

From Pandas

From Objects

Sometimes you just want to upload datasets using plain objects as CSVs and DataFrames can be too restrictive about the keys.

Synthetic Data

One of the quickest ways of getting started is to produce synthetic queries using an LLM.

One use case for synthetic data creation is when you want to test your RAG pipeline. You can leverage an LLM to synthesize hypothetical questions about your knowledge base.

In the below example we will use Phoenix's built-in llm_generate, but you can leverage any synthetic dataset creation tool you'd like.

Before running this example, ensure you've set your OPENAI_API_KEY environment variable.

Imagine you have a knowledge-base that contains the following documents:

Once your synthetic data has been created, this data can be uploaded to Phoenix for later re-use.

Once we've constructed a collection of synthetic questions, we can upload them to a Phoenix dataset.

From Spans

If you have an application that is traced using instrumentation, you can quickly add any span or group of spans using the Phoenix UI.

To add a single span to a dataset, simply select the span in the trace details view. You should see an add to dataset button on the top right. From there you can select the dataset you would like to add it to and make any changes you might need to make before saving the example.

You can also use the filters on the spans table and select multiple spans to add to a specific dataset.

Exporting Datasets

Exporting to CSV

Want to just use the contents of your dataset in another context? Simply click on the export to CSV button on the dataset page and you are good to go!

Exporting for Fine-Tuning

Fine-tuning lets you get more out of the models available by providing:

Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests. Phoenix natively exports OpenAI Fine-Tuning JSONL as long as the dataset contains compatible inputs and outputs.

Exporting OpenAI Evals

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. OpenAI Evals offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals. Phoenix can natively export the OpenAI Evals format as JSONL so you can use it with OpenAI Evals. See https://github.com/openai/evals for details.

How-to: Experiments

How to use evaluators

LLM Evaluators
Code Evaluators
Custom Evaluators

Evaluation

Overview: Evals

The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using a LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.

Phoenix Evals come with:

Pre-built evals - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling. Learn more about pretested templates here. Each eval is pre-tested on a variety of eval models. Find the most up-to-date template on GitHub.
Run evals on your own data - Phoenix Evals takes a dataframe as its primary input and output, making it easy to run evaluations on your own data - whether that's logs, traces, or datasets downloaded for benchmarking.
Speed - Phoenix evals are designed for maximum speed and throughput. Evals run in batches and typically run 10x faster than calling the APIs directly.
Built-in Explanations - All Phoenix evaluations include an explanation flag that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your eval.
Eval Models - Phoenix lets you configure which foundation model you'd like to use as a judge. This includes OpenAI, Anthropic, Gemini, and much more. See Eval Models

Quickstart: Evals

This quickstart guide will show you through the basics of evaluating data from your LLM application.

Install Phoenix Evals

Prepare your dataset

The first thing you'll need is a dataset to evaluate. This could be your own collected or generated set of examples, or data you've exported from Phoenix traces. If you've already collected some trace data, this makes a great starting point.

For the sake of this guide however, we'll download some pre-existing data to evaluate. Feel free to substitute this with your own data, just be sure it includes the following columns:

reference
query
response

Evaluate and Log Results

Set up evaluators (in this case for hallucinations and Q&A correctness), run the evaluations, and log the results to visualize them in Phoenix. We'll use OpenAI as our evaluation model for this example, but Phoenix also supports a number of other models. First, we need to add our OpenAI API key to our environment.

Explanation of the parameters used in run_evals above:

dataframe - a pandas dataframe that includes the data you want to evaluate. This could be spans exported from Phoenix, or data you've brought in from elsewhere. This dataframe must include the columns expected by the evaluators you are using. To see the columns expected by each built-in evaluator, check the corresponding page in the Using Phoenix Evaluators section.
evaluators - a list of built-in Phoenix evaluators to use.
provide_explanation - a binary flag that instructs the evaluators to generate explanations for their choices.

Analyze Your Evaluations

Combine your evaluation results and explanations with your original dataset:

(Optional) Log Results to Phoenix

Note: You'll only be able to log evaluations to the Phoenix UI if you used a trace or span dataset exported from Phoenix as your dataset in this quickstart. If you've used your own outside dataset, you won't be able to log these results to Phoenix.

Provided you started from a trace dataset, you can log your evaluation results to Phoenix using .

How to: Evals

(llm_classify)
(llm_generate)

Run evaluations via a job to visualize in the UI as traces stream in.

Evaluate traces captured in Phoenix and export results to the Phoenix UI.

Evaluate tasks with multiple inputs/outputs (ex: text, audio, image) using versatile evaluation tasks.

Pre-Built Evals

The following are simple functions on top of the LLM evals building blocks that are pre-tested with benchmark data.

All evals templates are tested against golden data that are available as part of the LLM eval library's and target precision at 70-90% and F1 at 70-85%.

Supported Models.

The models are instantiated and usable in the LLM Eval function. The models are also directly callable with strings.

We currently support a growing set of models for LLM Evals, please check out the .

Hallucinations

When To Use Hallucination Eval Template

This LLM Eval detects if the output of a model is a hallucination based on contextual data.

This Eval is specifically designed to detect hallucinations in generated answers from private or retrieved data. The Eval detects if an AI answer to a question is a hallucination based on the reference data used to generate the answer.

This Eval is designed to check for hallucinations on private data, specifically on data that is fed into the context window from retrieval.

It is not designed to check hallucinations on what the LLM was trained on. It is not useful for random public fact hallucinations. E.g. "What was Michael Jordan's birthday?"

Hallucination Eval Template

We are continually iterating our templates, view the most up-to-date template .

How To Run the Hallucination Eval

The above Eval shows how to the the hallucination template for Eval detection.

Benchmark Results

This benchmark was obtained using notebook below. It was run using the as a ground truth dataset. Each example in the dataset was evaluating using the HALLUCINATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the is_hallucination label in the HaluEval dataset to generate the confusion matrices below.

GPT-4 Results

Eval

GPT-4

Throughput

GPT-4

Toxicity

When To Use Toxicity Eval Template

The following shows the results of the toxicity Eval on a toxic dataset test to identify if the AI response is racist, biased, or toxic. The template variables are:

text: the text to be classified

Toxicity Eval Template

We are continually iterating our templates, view the most up-to-date template .

How To Run the Toxicity Eval

Benchmark Results

This benchmark was obtained using notebook below. It was run using the as a ground truth dataset. Each example in the dataset was evaluating using the TOXICITY_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth label in the benchmark dataset to generate the confusion matrices below.

GPT-4 Results

Note: Palm is not useful for Toxicity detection as it always returns "" string for toxic inputs

Toxicity Eval

GPT-4o

GPT-4

User Frustration

Teams that are using conversation bots and assistants desire to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back and forth or an entire span to detect whether a user has become frustrated by the conversation.

User Frustration Eval Template

We are continually iterating our templates, view the most up-to-date template .

The following is an example of code snippet showing how to use the eval above template:

SQL Generation Eval

SQL Generation is a common approach to using an LLM. In many cases the goal is to take a human description of the query and generate matching SQL to the human description.

Example of a Question: How many artists have names longer than 10 characters?

Example Query Generated:

SELECT COUNT(ArtistId) \nFROM artists \nWHERE LENGTH(Name) > 10

The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.

SQL Eval Template

We are continually iterating our templates, view the most up-to-date template .

Running an SQL Generation Eval

Agent Function Calling Eval

The Agent Function Call eval can be used to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

Function Calling Eval Template

We are continually iterating our templates, view the most up-to-date template .

Running an Agent Eval using the Function Calling Template

Parameters:

df - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:
- question - the query made to the model. If you've to evaluate, this will the llm.input_messages column in your exported data.
- tool_call - information on the tool called and parameters included. If you've to evaluate, this will be the llm.function_call column in your exported data.

Parameter Extraction Only

This template instead evaluates only the parameter extraction step of a router:

Agent Path Convergence

When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway it took to get there. You want most of your runs to be consistent and not take unnecessarily frivolous or wrong actions.

One way of doing this is to calculate convergence:

Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score: avg(minimum steps taken / steps taken for this run)

This will give a convergence score of 0-1, with 1 being a perfect score.

# Assume you have an output which has a list of messages, which is the path taken
all_outputs = [
]

optimal_path_length = min(all_outputs, key = lambda output: len(output))
ratios_sum = 0

for output in all_outputs:
    run_length = len(output)
    ratio = optimal_path_length / run_length
    ratios_sum += ratio

# Calculate the average ratio
if len(all_outputs) > 0:
    convergence = ratios_sum / len(all_outputs)
else:
    convergence = 0

print(f"The optimal path length is {optimal_path_length}")
print(f"The convergence is {convergence}")

Agent Planning

This template evaluates a plan generated by an agent. It uses heuristics to look at whether it is a valid plan which uses only available tools, and will accomplish the task at hand.

Prompt Template

Agent Reflection

Use this prompt template to evaluate an agent's final response. This is an optional step, which you can use as a gate to retry a set of actions if the response or state of the world is insufficient for the given task.

Prompt Template

This prompt template is heavily inspired by the paper: "Self Reflection in LLM Agents".

You are an expert in {topic}. I will give you a user query. Your task is to reflect on your provided solution and whether it has solved the problem.
First, explain whether you believe the solution is correct or incorrect.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, create a list of detailed instructions to help you correctly solve this problem in the future if it is incorrect.

Be concise in your response; however, capture all of the essential information.

Here is the data:
    [BEGIN DATA]
    ************
    [User Query]: {user_query}
    ************
    [Tools]: {tool_definitions}
    ************
    [State]: {current_state}
    ************
    [Provided Solution]: {solution}
    [END DATA]

Audio Emotion Detection

The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.

Template Details

The following is the structure of the EMOTION_PROMPT_TEMPLATE:

You are an AI system designed to classify emotions in audio files.

### TASK:
Analyze the provided audio file and classify the primary emotion based on these characteristics:
- Tone: General tone of the speaker (e.g., cheerful, tense, calm).
- Pitch: Level and variability of the pitch (e.g., high, low, monotone).
- Pace: Speed of speech (e.g., fast, slow, steady).
- Volume: Loudness of the speech (e.g., loud, soft, moderate).
- Intensity: Emotional strength or expression (e.g., subdued, sharp, exaggerated).

The classified emotion must be one of the following:
['anger', 'happiness', 'excitement', 'sadness', 'neutral', 'frustration', 'fear', 'surprise', 'disgust', 'other']

IMPORTANT: Choose the most dominant emotion expressed in the audio. Neutral should only be used when no other emotion is clearly present; do your best to avoid this label.

************

Here is the audio to classify:

{audio}

RESPONSE FORMAT:

Provide a single word from the list above representing the detected emotion.

************

EXAMPLE RESPONSE: excitement

************

Analyze the audio and respond in this format.

Template Module

The prompt and evaluation logic are part of the phoenix.evals.default_audio_templates module and are defined as:

EMOTION_AUDIO_RAILS: Output options for the evaluation template.
EMOTION_PROMPT_TEMPLATE: Prompt used for evaluating audio emotions.

Online Evals

You can use cron to run evals client-side as your traces and spans are generated, augmenting your dataset with evaluations in an online manner. View the example in Github.

This example:

Continuously queries a LangChain application to send new traces and spans to your Phoenix session
Queries new spans once per minute and runs evals, including:
- Hallucination
- Q&A Correctness
- Relevance
Logs evaluations back to Phoenix so they appear in the UI

The evaluation script is run as a cron job, enabling you to adjust the frequency of the evaluation job:

* * * * * /path/to/python /path/to/run_evals.py

The above script can be run periodically to augment Evals in Phoenix.

Retrieval

Overview: Retrieval

Many LLM applications use a technique called Retrieval Augmented Generation. These applications retrieve data from their knowledge base to help the LLM accomplish tasks with the appropriate context.

However, these retrieval systems can still hallucinate or provide answers that are not relevant to the user's input query. We can evaluate retrieval systems by checking for:

Are there certain types of questions the chatbot gets wrong more often?
Are the documents that the system retrieves irrelevant? Do we have the right documents to answer the question?
Does the response match the provided documents?

Phoenix supports retrievals troubleshooting and evaluation on both traces and inferences, but inferences are currently required to visualize your retrievals using a UMAP. See below on the differences.

Feature

Traces & Spans

Inferences

Check out our to get started. Look at our to better understand how to troubleshoot and evaluate different kinds of retrieval systems. For a high level overview on evaluation, check out our .

Inferences

How-to: Inferences

How to export your data for labeling, evaluation, or fine-tuning

Evals API Reference

Evals are LLM-powered functions that you can use to evaluate the output of your LLM or generative application

phoenix.evals.run_evals

def run_evals(
    dataframe: pd.DataFrame,
    evaluators: List[LLMEvaluator],
    provide_explanation: bool = False,
    use_function_calling_if_available: bool = True,
    verbose: bool = False,
    concurrency: int = 20,
) -> List[pd.DataFrame]

Evaluates a pandas dataframe using a set of user-specified evaluators that assess each row for relevance of retrieved documents, hallucinations, toxicity, etc. Outputs a list of dataframes, one for each evaluator, that contain the labels, scores, and optional explanations from the corresponding evaluator applied to the input dataframe.

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents an individual record to be evaluated. Each evaluator uses an LLM and an evaluation prompt template to assess the rows of the dataframe, and those template variables must appear as column names in the dataframe.
evaluators (List[LLMEvaluator]): A list of evaluators to apply to the input dataframe. Each evaluator class accepts a model as input, which is used in conjunction with an evaluation prompt template to evaluate the rows of the input dataframe and to output labels, scores, and optional explanations. Currently supported evaluators include:
- HallucinationEvaluator: Evaluates whether a response (stored under an "output" column) is a hallucination given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).
- RelevanceEvaluator: Evaluates whether a retrieved document (stored under a "reference" column) is relevant or irrelevant to the corresponding query (stored under an "input" column).
- ToxicityEvaluator: Evaluates whether a string (stored under an "input" column) contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
- QAEvaluator: Evaluates whether a response (stored under an "output" column) is correct or incorrect given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).
- SummarizationEvaluator: Evaluates whether a summary (stored under an "output" column) provides an accurate synopsis of an input document (stored under an "input" column).
- SQLEvaluator: Evaluates whether a generated SQL query (stored under the "query_gen" column) and a response (stored under the "response" column) appropriately answer a question (stored under the "question" column).
provide_explanation (bool, optional): If true, each output dataframe will contain an explanation column containing the LLM's reasoning for each evaluation.
use_function_calling_if_available (bool, optional): If true, function calling is used (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional): If true, prints detailed information such as model invocation parameters, retries on failed requests, etc.
concurrency (int, optional): The number of concurrent workers if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.

Returns

List[pandas.DataFrame]: A list of dataframes, one for each evaluator, all of which have the same number of rows as the input dataframe.

Usage

To use run_evals, you must first wrangle your LLM application data into a pandas dataframe either manually or by querying and exporting the spans collected by your Phoenix session. Once your dataframe is wrangled into the appropriate format, you can instantiate your evaluators by passing the model to be used during evaluation.

This example uses OpenAIModel, but you can use any of our supported evaluation models.

from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    QAEvaluator,
    run_evals,
)

api_key = None  # set your api key here or with the OPENAI_API_KEY environment variable
eval_model = OpenAIModel(model_name="gpt-4-turbo-preview", api_key=api_key)

hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)

Run your evaluations by passing your dataframe and your list of desired evaluators.

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=dataframe,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)

Assuming your dataframe contains the "input", "reference", and "output" columns required by HallucinationEvaluator and QAEvaluator, your output dataframes should contain the results of the corresponding evaluator applied to the input dataframe, including columns for labels (e.g., "factual" or "hallucinated"), scores (e.g., 0 for factual labels, 1 for hallucinated labels), and explanations. If your dataframe was exported from your Phoenix session, you can then ingest the evaluations using phoenix.log_evaluations so that the evals will be visible as annotations inside Phoenix.

For an end-to-end example, see the evals quickstart.

phoenix.evals.PromptTemplate

class PromptTemplate(
    text: str
    delimiters: List[str]
)

Class used to store and format prompt templates.

Parameters

text (str): The raw prompt text used as a template.
delimiters (List[str]): List of characters used to locate the variables within the prompt template text. Defaults to ["{", "}"].

Attributes

text (str): The raw prompt text used as a template.
variables (List[str]): The names of the variables that, once their values are substituted into the template, create the prompt text. These variable names are automatically detected from the template text using the delimiters passed when initializing the class (see Usage section below).

Usage

Define a PromptTemplate by passing a text string and the delimiters to use to locate the variables. The default delimiters are { and }.

from phoenix.evals import PromptTemplate

template_text = "My name is {name}. I am {age} years old and I am from {location}."
prompt_template = PromptTemplate(text=template_text)

If the prompt template variables have been correctly located, you can access them as follows:

print(prompt_template.variables)
# Output: ['name', 'age', 'location']

The PromptTemplate class can also understand any combination of delimiters. Following the example above, but getting creative with our delimiters:

template_text = "My name is :/name-!). I am :/age-!) years old and I am from :/location-!)."
prompt_template = PromptTemplate(text=template_text, delimiters=[":/", "-!)"])
print(prompt_template.variables)
# Output: ['name', 'age', 'location']

Once you have a PromptTemplate class instantiated, you can make use of its format method to construct the prompt text resulting from substituting values into the variables. To do so, a dictionary mapping the variable names to the values is passed:

value_dict = {
    "name": "Peter",
    "age": 20,
    "location": "Queens"
}
print(prompt_template.format(value_dict))
# Output: My name is Peter. I am 20 years old and I am from Queens

Note that once you initialize the PromptTemplate class, you don't need to worry about delimiters anymore, it will be handled for you.

phoenix.evals.llm_classify

def llm_classify(
    dataframe: pd.DataFrame,
    model: BaseEvalModel,
    template: Union[ClassificationTemplate, PromptTemplate, str],
    rails: List[str],
    system_instruction: Optional[str] = None,
    verbose: bool = False,
    use_function_calling_if_available: bool = True,
    provide_explanation: bool = False,
) -> pd.DataFrame

Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (ClassificationTemplate, or str): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class instance
rails (List[str]): A list of strings representing the possible output classes of the model's predictions.
system_instruction (Optional[str]): An optional system message for modals that support it
verbose (bool, optional): If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True): If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
provide_explanation (bool, default=False): If True, provides an explanation for each classification label. A column named explanation is added to the output dataframe. Note that this will default to using function calling if available. If the model supplied does not support function calling, llm_classify will need a prompt template that prompts for an explanation. For phoenix's pre-tested eval templates, the template is swapped out for a chain-of-thought based template that prompts for an explanation.

Returns

pandas.DataFrame: A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or "NOT_PARSABLE" if the model's output could not be parsed.

phoenix.evals.llm_generate

def llm_generate(
    dataframe: pd.DataFrame,
    template: Union[PromptTemplate, str],
    model: Optional[BaseEvalModel] = None,
    system_instruction: Optional[str] = None,
    output_parser: Optional[Callable[[str, int], Dict[str, Any]]] = None,
) -> List[str]

Generates a text using a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses

Parameters

dataframe (pandas.DataFrame): A pandas dataframe in which each row represents a record to be used as in input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]): The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to format can be made to substitute variable values.
model (BaseEvalModel): An LLM model class.
system_instruction (Optional[str], optional): An optional system message.
output_parser (Callable[[str, int], Dict[str, Any]], optional): An optional function that takes each generated response and response index and parses it to a dictionary. The keys of the dictionary should correspond to the column names of the output dataframe. If None, the output dataframe will have a single column named "output". Default None.

Returns

generations_dataframe (pandas.DataFrame): A dataframe where each row represents the generated output

Usage

Below we show how you can use llm_generate to use an llm to generate synthetic data. In this example, we use the llm_generate function to generate the capitals of countries but llm_generate can be used to generate any type of data such as synthetic questions, irrelevant responses, and so on.

import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate

countries_df = pd.DataFrame(
    {
        "country": [
            "France",
            "Germany",
            "Italy",
        ]
    }
)

capitals_df = llm_generate(
    dataframe=countries_df,
    template="The capital of {country} is ",
    model=OpenAIModel(model_name="gpt-4"),
    verbose=True,
)

llm_generate also supports an output parser so you can use this to generate data in a structured format. For example, if you want to generate data in JSON format, you ca prompt for a JSON object and then parse the output using the json library.

import json
from typing import Dict

import pandas as pd
from phoenix.evals import OpenAIModel, PromptTemplate, llm_generate


def output_parser(response: str) -> Dict[str, str]:
        try:
            return json.loads(response)
        except json.JSONDecodeError as e:
            return {"__error__": str(e)}

countries_df = pd.DataFrame(
    {
        "country": [
            "France",
            "Germany",
            "Italy",
        ]
    }
)

template = PromptTemplate("""
Given the country {country}, output the capital city and a description of that city.
The output must be in JSON format with the following keys: "capital" and "description".

response:
""")

capitals_df = llm_generate(
    dataframe=countries_df,
    template=template,
    model=OpenAIModel(
        model_name="gpt-4-turbo-preview",
        model_kwargs={
            "response_format": {"type": "json_object"}
        }
        ),
    output_parser=output_parser
)

Export Data & Query Spans

Various options for to help you get data out of Phoenix

Options for Exporting Data from Phoenix

Method

Description

Helpful for

Exports all spans in a project as a dataframe

Evaluation - Filtering your spans locally using pandas instead of Phoenix DSL.

Exports specific spans or traces based on filters

Evaluation - Querying spans from Phoenix

Exports specific groups of spans

Agent Evaluation - Easily export tool calls.

RAG Evaluation - Easily exporting retrieved documents or Q&A data from a RAG system.

Saves all traces as a local file

Storing Data - Backing up an entire Phoenix instance.

Connect to Phoenix

Before using any of the methods above, make sure you've connected to px.Client() . You'll need to set the following environment variables:

import os

os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key=..."
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.

Downloading all Spans as a Dataframe

If you prefer to handle your filtering locally, you can also download all spans as a dataframe using the get_spans_dataframe() function:

import phoenix as px

# Download all spans from your default project
px.Client().get_spans_dataframe()

# Download all spans from a specific project
px.Client().get_spans_dataframe(project_name='your project name')

# You can query for spans with the same filter conditions as in the UI
px.Client().get_spans_dataframe("span_kind == 'CHAIN'")

Running Span Queries

You can query for data using our query DSL (domain specific language).

This Query DSL is the same as what is used by the filter bar in the dashboard. It can be helpful to form your query string in the Phoenix dashboard for more immediate feedback, before moving it to code.

Below is an example of how to pull all retriever spans and select the input value. The output of this query is a DataFrame that contains the input values for all retriever spans.

import phoenix as px
from phoenix.trace.dsl import SpanQuery

query = SpanQuery().where(
    # Filter for the `RETRIEVER` span kind.
    # The filter condition is a string of valid Python boolean expression.
    "span_kind == 'RETRIEVER'",
).select(
    # Extract the span attribute `input.value` which contains the query for the
    # retriever. Rename it as the `input` column in the output dataframe.
    input="input.value",
)

# The Phoenix Client can take this query and return the dataframe.
px.Client().query_spans(query)

DataFrame Index By default, the result DataFrame is indexed by span_id, and if .explode() is used, the index from the exploded list is added to create a multi-index on the result DataFrame. For the special retrieval.documents span attribute, the added index is renamed as document_position.

How to Specify a Time Range

By default, all queries will collect all spans that are in your Phoenix instance. If you'd like to focus on most recent spans, you can pull spans based on time frames using start_time and end_time.

import phoenix as px
from phoenix.trace.dsl import SpanQuery
from datetime import datetime, timedelta

# Initiate Phoenix client
px_client = px.Client()

# Get spans from the last 7 days only
start = datetime.now() - timedelta(days=7)

# Get spans to exclude the last 24 hours
end = datetime.now() - timedelta(days=1)

phoenix_df = px_client.query_spans(start_time=start, end_time=end)

How to Specify a Project

By default all queries are executed against the default project or the project set via the PHOENIX_PROJECT_NAME environment variable. If you choose to pull from a different project, all methods on the Client have an optional parameter named project_name

import phoenix as px
from phoenix.trace.dsl import SpanQuery

# Get spans from a project
px.Client().get_spans_dataframe(project_name="<my-project>")

# Using the query DSL
query = SpanQuery().where("span_kind == 'CHAIN'").select(input="input.value")
px.Client().query_spans(query, project_name="<my-project>")

Querying for Retrieved Documents

Let's say we want to extract the retrieved documents into a DataFrame that looks something like the table below, where input denotes the query for the retriever, reference denotes the content of each document, and document_position denotes the (zero-based) index in each span's list of retrieved documents.

Note that this DataFrame can be used directly as input for the Retrieval (RAG) Relevance evaluations.

context.span_id

document_position

input

reference

5B8EF798A381

What was the author's motivation for writing ...

In fact, I decided to write a book about ...

5B8EF798A381

What was the author's motivation for writing ...

I started writing essays again, and wrote a bunch of ...

...

E19B7EC3GG02

What did the author learn about ...

The good part was that I got paid huge amounts of ...

We can accomplish this with a simple query as follows. Also see Predefined Queries for a helper function executing this query.

from phoenix.trace.dsl import SpanQuery

query = SpanQuery().where(
    # Filter for the `RETRIEVER` span kind.
    # The filter condition is a string of valid Python boolean expression.
    "span_kind == 'RETRIEVER'",
).select(
    # Extract the span attribute `input.value` which contains the query for the
    # retriever. Rename it as the `input` column in the output dataframe.
    input="input.value",
).explode(
    # Specify the span attribute `retrieval.documents` which contains a list of
    # objects and explode the list. Extract the `document.content` attribute from
    # each object and rename it as the `reference` column in the output dataframe.
    "retrieval.documents",
    reference="document.content",
)

# The Phoenix Client can take this query and return the dataframe.
px.Client().query_spans(query)

How to Explode Attributes

In addition to the document content, if we also want to explode the document score, we can simply add the document.score attribute to the .explode() method alongside document.content as follows. Keyword arguments are necessary to name the output columns, and in this example we name the output columns as reference and score. (Python's double-asterisk unpacking idiom can be used to specify arbitrary output names containing spaces or symbols. See here for an example.)

query = SpanQuery().explode(
    "retrieval.documents",
    reference="document.content",
    score="document.score",
)

How to Apply Filters

The .where() method accepts a string of valid Python boolean expression. The expression can be arbitrarily complex, but restrictions apply, e.g. making function calls are generally disallowed. Below is a conjunction filtering also on whether the input value contains the string 'programming'.

query = SpanQuery().where(
    "span_kind == 'RETRIEVER' and 'programming' in input.value"
)

Filtering Spans by Evaluation Results

Filtering spans by evaluation results, e.g. score or label, can be done via a special syntax. The name of the evaluation is specified as an indexer on the special keyword evals. The example below filters for spans with the incorrect label on their correctness evaluations. (See here for how to compute evaluations for traces, and here for how to ingest those results back to Phoenix.)

query = SpanQuery().where(
    "evals['correctness'].label == 'incorrect'"
)

Filtering on Metadata

metadata is an attribute that is a dictionary and it can be filtered like a dictionary.

query = SpanQuery().where(
    "metadata["topic"] == 'programming'"
)

Filtering for Substring

Note that Python strings do not have a contain method, and substring search is done with the in operator.

query = SpanQuery().where(
    "'programming' in metadata["topic"]"
)

Filtering for No Evaluations

Get spans that do not have an evaluation attached yet

query = SpanQuery().where(
    "evals['correctness'].label is None"
)
# correctness is whatever you named your evaluation metric

How to Apply Filters (UI)

You can also use Python boolean expressions to filter spans in the Phoenix UI. These expressions can be entered directly into the search bar above your experiment runs, allowing you to apply complex conditions involving span attributes. Any expressions that work with the .where() method above can also be used in the UI.

How to Extract Attributes

Span attributes can be selected by simply listing them inside .select() method.

query = SpanQuery().select(
    "input.value",
    "output.value",
)

Renaming Output Columns

Keyword-argument style can be used to rename the columns in the dataframe. The example below returns two columns named input and output instead of the original names of the attributes.

query = SpanQuery().select(
    input="input.value",
    output="output.value",
)

Arbitrary Output Column Names

If arbitrary output names are desired, e.g. names with spaces and symbols, we can leverage Python's double-asterisk idiom for unpacking a dictionary, as shown below.

query = SpanQuery().select(**{
    "Value (Input)": "input.value",
    "Value (Output)": "output.value",
})

Advanced Usage

Concatenating

The document contents can also be concatenated together. The query below concatenates the list of document.content with (double newlines), which is the default separator. Keyword arguments are necessary to name the output columns, and in this example we name the output column as reference. (Python's double-asterisk unpacking idiom can be used to specify arbitrary output names containing spaces or symbols. See here for an example.)

query = SpanQuery().concat(
    "retrieval.documents",
    reference="document.content",
)

Special Separators

If a different separator is desired, say \n************, it can be specified as follows.

query = SpanQuery().concat(
    "retrieval.documents",
    reference="document.content",
).with_concat_separator(
    separator="\n************\n",
)

Using Parent ID as Index

This is useful for joining a span to its parent span. To do that we would first index the child span by selecting its parent ID and renaming it as span_id. This works because span_id is a special column name: whichever column having that name will become the index of the output DataFrame.

query = SpanQuery().select(
    span_id="parent_id",
    output="output.value",
)

Joining a Span to Its Parent

To do this, we would provide two queries to Phoenix which will return two simultaneous dataframes that can be joined together by pandas. The query_for_child_spans uses parent_id as index as shown in Using Parent ID as Index, and px.Client().query_spans() returns a list of dataframes when multiple queries are given.

import pandas as pd

pd.concatenate(
    px.Client().query_spans(
        query_for_parent_spans,
        query_for_child_spans,
    ),
    axis=1,        # joining on the row indices
    join="inner",  # inner-join by the indices of the dataframes
)

How to use Data for Evaluation

Extract the Input and Output from LLM Spans

To learn more about extracting span attributes, see Extracting Span Attributes.

from phoenix.trace.dsl import SpanQuery

query = SpanQuery().where(
    "span_kind == 'LLM'",
).select(
    input="input.value",
    output="output.value,
)

# The Phoenix Client can take this query and return a dataframe.
px.Client().query_spans(query)

Retrieval (RAG) Relevance Evaluations

To extract the dataframe input for Retrieval (RAG) Relevance evaluations, we can apply the query described in the Example, or leverage the helper function implementing the same query.

Q&A on Retrieved Data Evaluations

To extract the dataframe input to the Q&A on Retrieved Data evaluations, we can use a helper function or use the following query (which is what's inside the helper function). This query applies techniques described in the Advanced Usage section.

import pandas as pd
from phoenix.trace.dsl import SpanQuery

query_for_root_span = SpanQuery().where(
    "parent_id is None",   # Filter for root spans
).select(
    input="input.value",   # Input contains the user's question
    output="output.value", # Output contains the LLM's answer
)

query_for_retrieved_documents = SpanQuery().where(
    "span_kind == 'RETRIEVER'",  # Filter for RETRIEVER span
).select(
    # Rename parent_id as span_id. This turns the parent_id
    # values into the index of the output dataframe.
    span_id="parent_id",
).concat(
    "retrieval.documents",
    reference="document.content",
)

# Perform an inner join on the two sets of spans.
pd.concat(
    px.Client().query_spans(
        query_for_root_span,
        query_for_retrieved_documents,
    ),
    axis=1,
    join="inner",
)

Pre-defined Queries

Phoenix also provides helper functions that executes predefined queries for the following use cases.

If you need to run the query against a specific project, you can add the project_name as a parameter to any of the pre-defined queries

Tool Calls

The query below will automatically export any tool calls selected by LLM calls. The output DataFrame can be easily combined with Agent Function Calling Eval.

from phoenix.trace.dsl.helpers import get_called_tools

tools_df = get_called_tools(client)
tools_df

Retrieved Documents

The query shown in the example can be done more simply with a helper function as follows. The output DataFrame can be used directly as input for the Retrieval (RAG) Relevance evaluations.

from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents = get_retrieved_documents(px.Client())
retrieved_documents

Q&A on Retrieved Data

To extract the dataframe input to the Q&A on Retrieved Data evaluations, we can use the following helper function.

from phoenix.session.evaluation import get_qa_with_reference

qa_with_reference = get_qa_with_reference(px.Client())
qa_with_reference

The output DataFrame would look something like the one below. The input contains contains the question, the output column contains the answer, and the reference column contains a concatenation of all the retrieved documents. This helper function assumes that the questions and answers are the input.value and output.value attributes of the root spans, and the list of retrieved documents are contained in a direct child span of the root span. (The helper function applies the techniques described in the Advanced Usage section.)

context.span_id

input

output

reference

CDBC4CE34

What was the author's trick for ...

The author's trick for ...

Even then it took me several years to understand ...

...

Save All Traces

Sometimes you may want to back up your Phoenix traces to a single file, rather than exporting specific spans to run evaluation.

Use the following command to save all traces from a Phoenix instance to a designated location.

my_traces = px.Client().get_trace_dataset().save()

You can specify the directory to save your traces by passing adirectory argument to the save method.

import os

# Specify and Create the Directory for Trace Dataset
directory = '/my_saved_traces'
os.makedirs(directory, exist_ok=True)

# Save the Trace Dataset
trace_id = px.Client().get_trace_dataset().save(directory=directory)

This output the trace ID and prints the path of the saved file:

💾 Trace dataset saved to under ID: f7733fda-6ad6-4427-a803-55ad2182b662

📂 Trace dataset path: /my_saved_traces/trace_dataset-f7733fda-6ad6-4427-a803-55ad2182b662.parquet

Setup Tracing (TS)

Phoenix is written and maintained in Python to make it natively runnable in Python notebooks. However, it can be stood up as a trace collector so that your LLM traces from your NodeJS application (e.g., LlamaIndex.TS, Langchain.js) can be collected. The traces collected by Phoenix can then be downloaded to a Jupyter notebook and used to run evaluations (e.g., LLM Evals, Ragas).

Getting Started

Instrumentation is the act of adding observability code to an app yourself.

If you’re instrumenting an app, you need to use the OpenTelemetry SDK for your language. You’ll then use the SDK to initialize OpenTelemetry and the API to instrument your code. This will emit telemetry from your app, and any library you installed that also comes with instrumentation.

Phoenix natively supports automatic instrumentation provided by OpenInference. For more details on OpenInference, checkout the project on GitHub.

Now lets walk through instrumenting, and then tracing, a sample express application.

instrumentation setup

Dependencies

Install OpenTelemetry API packages:

# npm, pnpm, yarn, etc
npm install @opentelemetry/semantic-conventions @opentelemetry/api @opentelemetry/instrumentation @opentelemetry/resources @opentelemetry/sdk-trace-base @opentelemetry/sdk-trace-node @opentelemetry/exporter-trace-otlp-proto

Install OpenInference instrumentation packages. Below is an example of adding instrumentation for OpenAI as well as the semantic conventions for OpenInference.

# npm, pnpm, yarn, etc
npm install openai @arizeai/openinference-instrumentation-openai @arizeai/openinference-semantic-conventions

Traces

Initialize Tracing

To enable tracing in your app, you’ll need to have an initialized TracerProvider.

If a TracerProvider is not created, the OpenTelemetry APIs for tracing will use a no-op implementation and fail to generate data. As explained next, create an instrumentation.ts (or instrumentation.js) file to include all of the provider initialization code in Node.

Node.js

Create instrumentation.ts (or instrumentation.js) to contain all the provider initialization code:

// instrumentation.ts
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
import { SEMRESATTRS_PROJECT_NAME } from "@arizeai/openinference-semantic-conventions";
import OpenAI from "openai";

// For troubleshooting, set the log level to DiagLogLevel.DEBUG
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

const tracerProvider = new NodeTracerProvider({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "openai-service",
    // Project name in Phoenix, defaults to "default"
    [SEMRESATTRS_PROJECT_NAME]: "openai-service",
  }),
  spanProcessors: [
    // BatchSpanProcessor will flush spans in batches after some time,
    // this is recommended in production. For development or testing purposes
    // you may try SimpleSpanProcessor for instant span flushing to the Phoenix UI.
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: `http://localhost:6006/v1/traces`,
        // (optional) if connecting to Phoenix Cloud
        // headers: { "api_key": process.env.PHOENIX_API_KEY },
        // (optional) if connecting to self-hosted Phoenix with Authentication enabled
        // headers: { "Authorization": `Bearer ${process.env.PHOENIX_API_KEY}` }
      })
    ),
  ],
});
tracerProvider.register();

const instrumentation = new OpenAIInstrumentation();
instrumentation.manuallyInstrument(OpenAI);

registerInstrumentations({
  instrumentations: [instrumentation],
});

console.log("👀 OpenInference initialized");

This basic setup has will instrument chat completions via native calls to the OpenAI client.

As shown above with OpenAI, you can register additional instrumentation libraries with the OpenTelemetry provider in order to generate telemetry data for your dependencies. For more information, see Integrations.

Picking the right span processor

In our instrumentation.ts file above, we use the BatchSpanProcessor. The BatchSpanProcessor processes spans in batches before they are exported. This is usually the right processor to use for an application.

In contrast, the SimpleSpanProcessor processes spans as they are created. This means that if you create 5 spans, each will be processed and exported before the next span is created in code. This can be helpful in scenarios where you do not want to risk losing a batch, or if you’re experimenting with OpenTelemetry in development. However, it also comes with potentially significant overhead, especially if spans are being exported over a network - each time a call to create a span is made, it would be processed and sent over a network before your app’s execution could continue.

In most cases, stick with BatchSpanProcessor over SimpleSpanProcessor.

Tracing instrumented libraries

Now that you have configured a tracer provider, and instrumented the openai package, lets see how we can generate traces for a sample application.

The following code assumes you have Phoenix running locally, on its default port of 6006. See our Quickstart: Tracing (TS) documentation if you'd like to learn more about running Phoenix.

First, install the dependencies required for our sample app.

# npm, pnpm, yarn, etc
npm install express

Next, create an app.ts (or app.js ) file, that hosts a simple express server for executing OpenAI chat completions.

// app.ts
import express from "express";
import OpenAI from "openai";

const PORT: number = parseInt(process.env.PORT || "8080");
const app = express();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.get("/chat", async (req, res) => {
  const message = req.query.message;
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: "user", content: message }],
    model: "gpt-4o",
  });
  res.send(chatCompletion.choices[0].message.content);
});

app.listen(PORT, () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Then, we will start our application, loading the instrumentation.ts file before app.ts so that our instrumentation code can instrument openai .

# node v23
node --require ./instrumentation.ts app.ts

We are using Node v23 above as this allows us to execute TypeScript code without a transpilation step. OpenTelemetry and OpenInference support Node versions from v18 onwards, and we are flexible with projects configured using CommonJS or ESM module syntaxes.

Learn more by visiting the Node.js documentation on TypeScript and ESM or see our Quickstart: Tracing (TS) documentation for an end to end example.

Finally, we can execute a request against our server

curl "http://localhost:8080/chat?message=write%20me%20a%20haiku"

After a few moments, a new project openai-service will appear in the Phoenix UI, along with the trace generated by our OpenAI chat completion!

Advanced: Manually Tracing

Acquiring a tracer

Anywhere in your application where you write manual tracing code should call getTracer to acquire a tracer. For example:

import opentelemetry from '@opentelemetry/api';
//...

const tracer = opentelemetry.trace.getTracer(
  'instrumentation-scope-name',
  'instrumentation-scope-version',
);

// You can now use a 'tracer' to do tracing!

The values of instrumentation-scope-name and instrumentation-scope-version should uniquely identify the Instrumentation Scope, such as the package, module or class name. While the name is required, the version is still recommended despite being optional.

It’s generally recommended to call getTracer in your app when you need it rather than exporting the tracer instance to the rest of your app. This helps avoid trickier application load issues when other required dependencies are involved.

Below is an example of acquiring a tracer within application scope.

// app.ts
import { trace } from '@opentelemetry/api';
import express from 'express';
import OpenAI from "openai";

const tracer = trace.getTracer('llm-server', '0.1.0');

const PORT: number = parseInt(process.env.PORT || "8080");
const app = express();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.get("/chat", async (req, res) => {
  const message = req.query.message;
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: "user", content: message }],
    model: "gpt-4o",
  });
  res.send(chatCompletion.choices[0].message.content);
});

app.listen(PORT, () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Create spans

Now that you have tracers initialized, you can create spans.

The API of OpenTelemetry JavaScript exposes two methods that allow you to create spans:

tracer.startSpan: Starts a new span without setting it on context.
tracer.startActiveSpan: Starts a new span and calls the given callback function passing it the created span as first argument. The new span gets set in context and this context is activated for the duration of the function call.

In most cases you want to use the latter (tracer.startActiveSpan), as it takes care of setting the span and its context active.

The code below illustrates how to create an active span.

import { trace, Span } from "@opentelemetry/api";
import { SpanKind } from "@opentelemetry/api";
import {
    SemanticConventions,
    OpenInferenceSpanKind,
} from "@arizeai/openinference-semantic-conventions";

export function chat(message: string) {
    // Create a span. A span must be closed.
    return tracer.startActiveSpan(
        "chat",
        (span: Span) => {
            span.setAttributes({
                [SemanticConventions.OPENINFERENCE_SPAN_KIND]: OpenInferenceSpanKind.chain,
                [SemanticConventions.INPUT_VALUE]: message,
            });
            let chatCompletion = await openai.chat.completions.create({
                messages: [{ role: "user", content: message }],
                model: "gpt-3.5-turbo",
            });
            span.setAttributes({
                attributes: {
                    [SemanticConventions.OUTPUT_VALUE]: chatCompletion.choices[0].message,
                },
            });
            // Be sure to end the span!
            span.end();
            return result;
        }
    );
}

The above instrumented code can now be pasted in the /chat handler. You should now be able to see spans emitted from your app.

Start your app as follows, and then send it requests by visiting http://localhost:8080/chat?message="how long is a pencil" with your browser or curl.

ts-node --require ./instrumentation.ts app.ts

After a while, you should see the spans printed in the console by the ConsoleSpanExporter, something like this:

{
  "traceId": "6cc927a05e7f573e63f806a2e9bb7da8",
  "parentId": undefined,
  "name": "chat",
  "id": "117d98e8add5dc80",
  "kind": 0,
  "timestamp": 1688386291908349,
  "duration": 501,
  "attributes": {
    "openinference.span.kind": "chain"
    "input.value": "how long is a pencil"
  },
  "status": { "code": 0 },
  "events": [],
  "links": []
}

Get the current span

Sometimes it’s helpful to do something with the current/active span at a particular point in program execution.

const activeSpan = opentelemetry.trace.getActiveSpan();

// do something with the active span, optionally ending it if that is appropriate for your use case.

Get a span from context

It can also be helpful to get the span from a given context that isn’t necessarily the active span.

const ctx = getContextFromSomewhere();
const span = opentelemetry.trace.getSpan(ctx);

// do something with the acquired span, optionally ending it if that is appropriate for your use case.

Attributes

Attributes let you attach key/value pairs to a Span so it carries more information about the current operation that it’s tracking. For OpenInference related attributes, use the @arizeai/openinference-semantic-conventions keys. However you are free to add any attributes you'd like!

function chat(message: string, user: User) {
  return tracer.startActiveSpan(`chat:${i}`, (span: Span) => {
    const result = Math.floor(Math.random() * (max - min) + min);

    // Add an attribute to the span
    span.setAttribute('mycompany.userid', user.id);

    span.end();
    return result;
  });
}

You can also add attributes to a span as it’s created:

tracer.startActiveSpan(
  'app.new-span',
  { attributes: { attribute1: 'value1' } },
  (span) => {
    // do some work...

    span.end();
  },
);

function chat(session: Session) {
  return tracer.startActiveSpan(
    'chat',
    { attributes: { 'mycompany.sessionid': session.id } },
    (span: Span) => {
      /* ... */
    },
  );
}

Semantic Attributes

There are semantic conventions for spans representing operations in well-known protocols like HTTP or database calls. OpenInference also publishes it's own set of semantic conventions related to LLM applications. Semantic conventions for these spans are defined in the specification under OpenInference. In the simple example of this guide the source code attributes can be used.

First add both semantic conventions as a dependency to your application:

npm install --save @opentelemetry/semantic-conventions @arizeai/openinfernece-semantic-conventions

Add the following to the top of your application file:

import { SemanticAttributes } from 'arizeai/openinfernece-semantic-conventions';

Finally, you can update your file to include semantic attributes:

const doWork = () => {
  tracer.startActiveSpan('app.doWork', (span) => {
    span.setAttribute(SemanticAttributes.INPUT_VALUE, 'work input');
    // Do some work...

    span.end();
  });
};

Span events

A Span Event is a human-readable message on an Span that represents a discrete event with no duration that can be tracked by a single timestamp. You can think of it like a primitive log.

span.addEvent('Doing something');

const result = doWork();

You can also create Span Events with additional Attributes

While Phoenix captures these, they are currently not displayed in the UI. Contact us if you would like to support!

span.addEvent('some log', {
  'log.severity': 'error',
  'log.message': 'Data not found',
  'request.id': requestId,
});

Span Status

A Status can be set on a Span, typically used to specify that a Span has not completed successfully - Error. By default, all spans are Unset, which means a span completed without error. The Ok status is reserved for when you need to explicitly mark a span as successful rather than stick with the default of Unset (i.e., “without error”).

The status can be set at any time before the span is finished.

import opentelemetry, { SpanStatusCode } from '@opentelemetry/api';

// ...

tracer.startActiveSpan('app.doWork', (span) => {
  for (let i = 0; i <= Math.floor(Math.random() * 40000000); i += 1) {
    if (i > 10000) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: 'Error',
      });
    }
  }

  span.end();
});

Recording exceptions

It can be a good idea to record exceptions when they happen. It’s recommended to do this in conjunction with setting span status.

import opentelemetry, { SpanStatusCode } from '@opentelemetry/api';

// ...

try {
  doWork();
} catch (ex) {
  span.recordException(ex);
  span.setStatus({ code: SpanStatusCode.ERROR });
}

Using `sdk-trace-base` and manually propagating span context

In some cases, you may not be able to use either the Node.js SDK nor the Web SDK. The biggest difference, aside from initialization code, is that you’ll have to manually set spans as active in the current context to be able to create nested spans.

Initializing tracing with sdk-trace-base

Initializing tracing is similar to how you’d do it with Node.js or the Web SDK.

import opentelemetry from '@opentelemetry/api';
import {
  BasicTracerProvider,
  BatchSpanProcessor,
  ConsoleSpanExporter,
} from '@opentelemetry/sdk-trace-base';

const provider = new BasicTracerProvider();

// Configure span processor to send spans to the exporter
provider.addSpanProcessor(new BatchSpanProcessor(new ConsoleSpanExporter()));
provider.register();

// This is what we'll access in all instrumentation code
const tracer = opentelemetry.trace.getTracer('example-basic-tracer-node');

Like the other examples in this document, this exports a tracer you can use throughout the app.

Creating nested spans with sdk-trace-base

To create nested spans, you need to set whatever the currently-created span is as the active span in the current context. Don’t bother using startActiveSpan because it won’t do this for you.

const mainWork = () => {
  const parentSpan = tracer.startSpan('main');

  for (let i = 0; i < 3; i += 1) {
    doWork(parentSpan, i);
  }

  // Be sure to end the parent span!
  parentSpan.end();
};

const doWork = (parent, i) => {
  // To create a child span, we need to mark the current (parent) span as the active span
  // in the context, then use the resulting context to create a child span.
  const ctx = opentelemetry.trace.setSpan(
    opentelemetry.context.active(),
    parent,
  );
  const span = tracer.startSpan(`doWork:${i}`, undefined, ctx);

  // simulate some random work.
  for (let i = 0; i <= Math.floor(Math.random() * 40000000); i += 1) {
    // empty
  }

  // Make sure to end this child span! If you don't,
  // it will continue to track work beyond 'doWork'!
  span.end();
};

All other APIs behave the same when you use sdk-trace-base compared with the Node.js SDKs.

Specifying a Custom Tracer Provider

OpenInference JavaScript instrumentations support specifying a custom tracer provider in multiple ways. This is useful when you need to use a different tracer provider than the default global one, or when you want to have more control over the tracing configuration.

Method 1: Pass tracerProvider on instantiation

You can pass a custom tracer provider directly to the instrumentation when creating it:

// Create a custom tracer provider
const customTracerProvider = new NodeTracerProvider({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "custom-service",
    [SEMRESATTRS_PROJECT_NAME]: "custom-project",
  }),
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: `http://localhost:6006/v1/traces`,
      })
    ),
  ],
});

// Pass the custom tracer provider to the instrumentation
const instrumentation = new OpenAIInstrumentation({
  tracerProvider: customTracerProvider,
});
instrumentation.manuallyInstrument(OpenAI);

Method 2: Set tracerProvider after instantiation

You can set a tracer provider after creating the instrumentation:

const instrumentation = new OpenAIInstrumentation();
instrumentation.setTracerProvider(customTracerProvider);
instrumentation.manuallyInstrument(OpenAI);

Method 3: Pass tracerProvider to registerInstrumentations

You can also specify the tracer provider when registering instrumentations:

const instrumentation = new OpenAIInstrumentation();
instrumentation.manuallyInstrument(OpenAI);

registerInstrumentations({
  instrumentations: [instrumentation],
  tracerProvider: customTracerProvider,
});

Supported Instrumentations

This functionality is supported across all OpenInference JavaScript instrumentations:

LangChain JS: @arizeai/openinference-instrumentation-langchain
BeeAI: @arizeai/openinference-instrumentation-beeai
OpenAI JS: @arizeai/openinference-instrumentation-openai

For specific examples with each instrumentation, see their respective documentation pages in the Integrations section.

Import Your Data

How to create Phoenix inferences and schemas for common data formats

This guide shows you how to define Phoenix inferences using your own data.

For a conceptual overview of the Phoenix API, including a high-level introduction to the notion of inferences and schemas, see here.
For a comprehensive description of phoenix.Dataset and phoenix.Schema, see the API reference.

Once you have a pandas dataframe df containing your data and a schema object describing the format of your dataframe, you can define your Phoenix dataset either by running

ds = px.Inferences(df, schema)

or by optionally providing a name for your dataset that will appear in the UI:

ds = px.Inferences(df, schema, name="training")

As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.

Predictions and Actuals

Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime objects that represent the time at which each inference was made in production.

Dataframe

timestamp

prediction_score

prediction

target

2023-03-01 02:02:19

0.91

click

2023-02-17 23:45:48

0.37

no_click

2023-01-30 15:30:03

0.54

click

no_click

2023-02-03 19:56:09

0.74

click

2023-02-24 04:23:43

0.37

no_click

click

Schema

schema = px.Schema(
    timestamp_column_name="timestamp",
    prediction_score_column_name="prediction_score",
    prediction_label_column_name="prediction",
    actual_label_column_name="target",
)

This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.

Features and Tags

Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.

Dataframe

fico_score

merchant_id

loan_amount

annual_income

home_ownership

num_credit_lines

inquests_in_last_6_months

months_since_last_delinquency

age

gender

predicted

target

578

Scammeds

4300

62966

RENT

110

male

not_fraud

fraud

507

Schiller Ltd

21000

52335

RENT

129

female

not_fraud

656

Kirlin and Sons

18000

94995

MORTGAGE

female

uncertain

414

Scammeds

18000

32034

LEASE

male

fraud

not_fraud

512

Champlin and Sons

20000

46005

OWN

148

male

uncertain

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    feature_column_names=[
        "fico_score",
        "merchant_id",
        "loan_amount",
        "annual_income",
        "home_ownership",
        "num_credit_lines",
        "inquests_in_last_6_months",
        "months_since_last_delinquency",
    ],
    tag_column_names=[
        "age",
        "gender",
    ],
)

Implicit Features

If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names field of your schema set to its default value of None, in which case, any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.

Dataframe

target

predicted

mean radius

mean texture

mean perimeter

mean area

mean smoothness

mean compactness

mean concavity

mean concave points

mean symmetry

mean fractal dimension

radius error

texture error

perimeter error

area error

smoothness error

compactness error

concavity error

concave points error

symmetry error

fractal dimension error

worst radius

worst texture

worst perimeter

worst area

worst smoothness

worst compactness

worst concavity

worst concave points

worst symmetry

worst fractal dimension

malignant

benign

15.49

19.97

102.40

744.7

0.11600

0.15620

0.18910

0.09113

0.1929

0.06744

0.6470

1.3310

4.675

66.91

0.007269

0.02928

0.04972

0.01639

0.01852

0.004232

21.20

29.41

142.10

1359.0

0.1681

0.3913

0.55530

0.21210

0.3187

0.10190

malignant

17.01

20.26

109.70

904.3

0.08772

0.07304

0.06950

0.05390

0.2026

0.05223

0.5858

0.8554

4.106

68.46

0.005038

0.01503

0.01946

0.01123

0.02294

0.002581

19.80

25.05

130.00

1210.0

0.1111

0.1486

0.19320

0.10960

0.3275

0.06469

malignant

17.99

10.38

122.80

1001.0

0.11840

0.27760

0.30010

0.14710

0.2419

0.07871

1.0950

0.9053

8.589

153.40

0.006399

0.04904

0.05373

0.01587

0.03003

0.006193

25.38

17.33

184.60

2019.0

0.1622

0.6656

0.71190

0.26540

0.4601

0.11890

benign

14.53

13.98

93.86

644.2

0.10990

0.09242

0.06895

0.06495

0.1650

0.06121

0.3060

0.7213

2.143

25.70

0.006133

0.01251

0.01615

0.01136

0.02207

0.003563

15.80

16.93

103.10

749.9

0.1347

0.1478

0.13730

0.10690

0.2606

0.07810

benign

10.26

14.71

66.20

321.6

0.09882

0.09159

0.03581

0.02037

0.1633

0.07005

0.3380

2.5090

2.394

19.33

0.017360

0.04671

0.02611

0.01296

0.03675

0.006758

10.88

19.48

70.89

357.1

0.1360

0.1636

0.07162

0.04074

0.2434

0.08488

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
)

Excluded Columns

You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields, otherwise, Phoenix will assume that they are features.

Dataframe

target

predicted

hospital

insurance_provider

mean radius

mean texture

mean perimeter

mean area

mean smoothness

mean compactness

mean concavity

mean concave points

mean symmetry

mean fractal dimension

radius error

texture error

perimeter error

area error

smoothness error

compactness error

concavity error

concave points error

symmetry error

fractal dimension error

worst radius

worst texture

worst perimeter

worst area

worst smoothness

worst compactness

worst concavity

worst concave points

worst symmetry

worst fractal dimension

malignant

benign

Pacific Clinics

uninsured

15.49

19.97

102.40

744.7

0.11600

0.15620

0.18910

0.09113

0.1929

0.06744

0.6470

1.3310

4.675

66.91

0.007269

0.02928

0.04972

0.01639

0.01852

0.004232

21.20

29.41

142.10

1359.0

0.1681

0.3913

0.55530

0.21210

0.3187

0.10190

malignant

Queens Hospital

Anthem Blue Cross

17.01

20.26

109.70

904.3

0.08772

0.07304

0.06950

0.05390

0.2026

0.05223

0.5858

0.8554

4.106

68.46

0.005038

0.01503

0.01946

0.01123

0.02294

0.002581

19.80

25.05

130.00

1210.0

0.1111

0.1486

0.19320

0.10960

0.3275

0.06469

malignant

St. Francis Memorial Hospital

Blue Shield of CA

17.99

10.38

122.80

1001.0

0.11840

0.27760

0.30010

0.14710

0.2419

0.07871

1.0950

0.9053

8.589

153.40

0.006399

0.04904

0.05373

0.01587

0.03003

0.006193

25.38

17.33

184.60

2019.0

0.1622

0.6656

0.71190

0.26540

0.4601

0.11890

benign

Pacific Clinics

Kaiser Permanente

14.53

13.98

93.86

644.2

0.10990

0.09242

0.06895

0.06495

0.1650

0.06121

0.3060

0.7213

2.143

25.70

0.006133

0.01251

0.01615

0.01136

0.02207

0.003563

15.80

16.93

103.10

749.9

0.1347

0.1478

0.13730

0.10690

0.2606

0.07810

benign

CityMed

Anthem Blue Cross

10.26

14.71

66.20

321.6

0.09882

0.09159

0.03581

0.02037

0.1633

0.07005

0.3380

2.5090

2.394

19.33

0.017360

0.04671

0.02611

0.01296

0.03675

0.006758

10.88

19.48

70.89

357.1

0.1360

0.1636

0.07162

0.04074

0.2434

0.08488

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    excluded_column_names=[
        "hospital",
        "insurance_provider",
    ],
)

Embedding Features

Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames to associate multiple dataframe columns with the same embedding feature.

For a conceptual overview of embeddings, see Embeddings.
For a comprehensive description of px.EmbeddingColumnNames, see the API reference.

The example in this section contain low-dimensional embeddings for the sake of easy viewing. Your embeddings in practice will typically have much higher dimension.

Embedding Vectors

To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name field on px.EmbeddingColumnNames. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:

Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names is a dictionary.
The key of this dictionary, "transaction_embedding," is not a column of your dataframe but is name you choose for your embedding feature that appears in the UI.
The values of this dictionary are instances of px.EmbeddingColumnNames.
Each entry in the "embedding_vector" column is a list of length 4.

Dataframe

predicted

target

embedding_vector

fico_score

merchant_id

loan_amount

annual_income

home_ownership

num_credit_lines

inquests_in_last_6_months

months_since_last_delinquency

fraud

not_fraud

[-0.97, 3.98, -0.03, 2.92]

604

Leannon Ward

22000

100781

RENT

108

fraud

not_fraud

[3.20, 3.95, 2.81, -0.09]

612

Scammeds

7500

116184

MORTGAGE

not_fraud

[-0.49, -0.62, 0.08, 2.03]

646

Leannon Ward

32000

73666

RENT

131

not_fraud

[1.69, 0.01, -0.76, 3.64]

560

Kirlin and Sons

19000

38589

MORTGAGE

131

uncertain

[1.46, 0.69, 3.26, -0.17]

636

Champlin and Sons

10000

100251

MORTGAGE

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    embedding_feature_column_names={
        "transaction_embeddings": px.EmbeddingColumnNames(
            vector_column_name="embedding_vector"
        ),
    },
)

The features in this example are implicitly inferred to be the columns of the dataframe that do not appear in the schema.

To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length, otherwise, Phoenix will throw an error.

Embeddings of Images

If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name field on px.EmbeddingColumnNames. The following example contains data for an image classification model that detects product defects on an assembly line.

Dataframe

defective

image

image_vector

okay

https://www.example.com/image0.jpeg

[1.73, 2.67, 2.91, 1.79, 1.29]

defective

https://www.example.com/image1.jpeg

[2.18, -0.21, 0.87, 3.84, -0.97]

okay

https://www.example.com/image2.jpeg

[3.36, -0.62, 2.40, -0.94, 3.69]

defective

https://www.example.com/image3.jpeg

[2.77, 2.79, 3.36, 0.60, 3.10]

okay

https://www.example.com/image4.jpeg

[1.79, 2.06, 0.53, 3.58, 0.24]

Schema

schema = px.Schema(
    actual_label_column_name="defective",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)

Local Images

For local image data, we recommend the following steps to serve your images via a local HTTP server:

In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.

For example, suppose your HTTP server is running in a directory with the following contents:

.
└── image-data
    └── example_image.jpeg

Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.

Embeddings of Text

If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name field on px.EmbeddingColumnNames. The embeddings below were generated by a sentiment classification model trained on product reviews.

Dataframe

name

text

text_vector

Schema

schema = px.Schema(
    actual_label_column_name="sentiment",
    feature_column_names=[
        "category",
    ],
    tag_column_names=[
        "name",
    ],
    embedding_feature_column_names={
        "product_review_embeddings": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        ),
    },
)

Multiple Embedding Features

Sometimes it is useful to have more than one embedding feature. The example below shows a multi-modal application in which one embedding represents the textual description and another embedding represents the image associated with products on an e-commerce site.

Dataframe

name

description

description_vector

image

image_vector

Magic Lamp

Enjoy the most comfortable setting every time for working, studying, relaxing or getting ready to sleep.

[2.47, -0.01, -0.22, 0.93]

https://www.example.com/image0.jpeg

[2.42, 1.95, 0.81, 2.60, 0.27]

Ergo Desk Chair

The perfect mesh chair, meticulously developed to deliver maximum comfort and high quality.

[-0.25, 0.07, 2.90, 1.57]

https://www.example.com/image1.jpeg

[3.17, 2.75, 1.39, 0.44, 3.30]

Cloud Nine Mattress

Our Cloud Nine Mattress combines cool comfort with maximum affordability.

[1.36, -0.88, -0.45, 0.84]

https://www.example.com/image2.jpeg

[-0.22, 0.87, 1.10, -0.78, 1.25]

Dr. Fresh's Spearmint Toothpaste

Natural toothpaste helps remove surface stains for a brighter, whiter smile with anti-plaque formula

[-0.39, 1.29, 0.92, 2.51]

https://www.example.com/image3.jpeg

[1.95, 2.66, 3.97, 0.90, 2.86]

Ultra-Fuzzy Bath Mat

The bath mats are made up of 1.18-inch height premium thick, soft and fluffy microfiber, making it great for bathroom, vanity, and master bedroom.

[0.37, 3.22, 1.29, 0.65]

https://www.example.com/image4.jpeg

[0.77, 1.79, 0.52, 3.79, 0.47]

Schema

schema = px.Schema(
    tag_column_names=["name"],
    embedding_feature_column_names={
        "description_embedding": px.EmbeddingColumnNames(
            vector_column_name="description_vector",
            raw_data_column_name="description",
        ),
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)

Distinct embedding features may have embedding vectors of differing length. The text embeddings in the above example have length 4 while the image embeddings have length 5.

English

Arize Phoenix

Features

Quickstarts

Next Steps

Try our Tutorials

Add Integrations

Community

Quickstarts

User Guide

🛠️ Develop

🧪 Testing/Staging

🚀 Production

Environments

Phoenix Cloud

Container

Notebooks

Terminal

Production Guide

Configuration

Enable Batch Processing

Use gRPC Transport

Scaling Facilities

Enable Database Backups

Resourcing Guidelines

Memory Sizing

Database Sizing

Backups

Tracing

Overview: Tracing

Next steps

Quickstart: Tracing

Quickstarts

Explore a Demo Trace

Quickstart: Tracing (Python)

Overview

Launch Phoenix

Connect to Phoenix

Trace your own functions

Trace all calls made to a library

View your Traces in Phoenix

Next Steps

Features: Tracing

Projects

Explore a Demo Project

Annotations

Next Steps

Sessions

Next Steps

Setup using Phoenix OTEL

Quickstart: phoenix.otel.register

Phoenix Authentication

Configuring the collector endpoint

Using environment variables

Specifying the endpoint directly

Additional configuration

Instrumentation

Setup Projects

Log to a specific project

Switching projects in a notebook

Add Metadata

Instrument Prompt Templates and Prompt Variables

Annotate Traces

Guides

Annotating in the UI

Configuring Annotations

Adding Annotations

Viewing Annotations

Exporting Traces with specific Annotation Values

Annotating Auto-Instrumented Spans

Usage

Working with Multiple Span Types

Resources

Running Evals on Traces

Install dependencies & Set environment variables

Connect to Phoenix

Prepare trace dataset

Download trace dataset from Phoenix

Generate evaluations

Upload evaluations to Phoenix

Quickstart: `phoenix.otel.register`

Specifying the `endpoint` directly