How to annotate traces in the UI for analysis and dataset curation
To annotate data in the UI, you first want to set up a rubric for how to annotate. Navigate to Settings and create annotation configs (e.g. a rubric) for your data. You can create several different types of annotations: Categorical, Continuous, and Freeform.
Once you have annotations configured, you can attach them to the data you have traced. Click on the Annotate button and fill out the form to rate different steps in your AI application.
You can also take notes as you go, either by clicking on the explain link or by adding your notes in the messages UI at the bottom. You can always come back and edit or delete your annotations; annotations can be deleted from the table view under the Annotations tab.
Once an annotation has been provided, you can also add a reason to explain why that particular label or score was given. This adds useful context to the annotation.
As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (i.e. whether the annotation was produced by a human, an LLM, or code), and so on. This is particularly useful if you want to see whether different annotators disagree.
Once you have collected feedback in the form of annotations, you can filter your traces by annotation values to narrow down to interesting samples (e.g. LLM spans that are incorrect). Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for experimentation, fine-tuning, or building a human-aligned eval. A rough programmatic parallel to the filtering step is sketched below.
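For reference, the snippet below is a minimal sketch of the filtering step done programmatically: it pulls only the spans whose "correctness" evaluation was labeled "incorrect" into a dataframe. The eval name and the filter expression are illustrative assumptions; the filter string uses Phoenix's span filter syntax.

import phoenix as px

# Pull spans whose "correctness" eval was labeled "incorrect" (example filter;
# adjust the eval/annotation name to match your own configuration).
incorrect_spans_df = px.Client().get_spans_dataframe(
    "evals['correctness'].label == 'incorrect'"
)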
Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:
Track performance over time
Identify areas for improvement
Compare different model versions or prompts
Gather data for fine-tuning or retraining
Provide stakeholders with concrete metrics on system effectiveness
Phoenix allows you to annotate traces through the Client, the REST API, or the UI.
To learn how to configure annotations and to annotate through the UI, see Annotating in the UI
To learn how to add human labels to your traces, either manually or programmatically, see Annotating via the Client
To learn how to evaluate traces captured in Phoenix, see Running Evals on Traces
To learn how to upload your own evaluation labels into Phoenix, see Log Evaluation Results
For more background on the concept of annotations, see Annotations
Use the capture_span_context context manager to annotate auto-instrumented spans
When working with spans that are automatically instrumented via OpenInference in your LLM applications, you often need to capture span contexts to apply feedback or annotations. The capture_span_context context manager provides a convenient way to capture all OpenInference spans within its scope, making it easier to apply feedback to specific spans in downstream operations.
The capture_span_context context manager allows you to:
Capture all spans created within a specific code block
Retrieve span contexts for later use in feedback systems
Maintain a clean separation between span creation and annotation logic
Apply feedback to spans without needing to track span IDs manually
You can use the captured span contexts to implement custom feedback logic. The captured span contexts integrate seamlessly with Phoenix's annotation system:
from openinference.instrumentation import capture_span_context
from opentelemetry.trace.span import format_span_id
from phoenix.client import Client

client = Client()

def process_llm_request_with_feedback(prompt: str):
    with capture_span_context() as capture:
        # Make LLM call (auto-instrumented)
        response = llm.invoke("Generate a summary")

        # Get user feedback (simulated)
        user_feedback = get_user_feedback(response)

        # Method 1: Get span ID using get_last_span_id (most recent span)
        last_span_id = capture.get_last_span_id()

        # Apply feedback to the most recent span
        if last_span_id:
            client.annotations.add_span_annotation(
                annotation_name="user_feedback",
                annotator_kind="HUMAN",
                span_id=last_span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )

        # Method 2: Get all captured span contexts and iterate
        span_contexts = capture.get_span_contexts()

        # Apply feedback to all captured spans
        for span_context in span_contexts:
            # Convert span context to span ID for annotation
            span_id = format_span_id(span_context.span_id)

            # Add annotation to Phoenix
            client.annotations.add_span_annotation(
                annotation_name="user_feedback_all",
                annotator_kind="HUMAN",
                span_id=span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )
You can filter spans based on their attributes:
with capture_span_context() as capture:
    # Make LLM call (auto-instrumented)
    response = llm.invoke("Generate a summary")
    span_contexts = capture.get_span_contexts()

    # Filter for specific span types
    llm_spans = [
        ctx for ctx in span_contexts
        if hasattr(ctx, 'attributes')
    ]

    # Apply different feedback logic to different span types
    for span_context in llm_spans:
        apply_llm_feedback(span_context)
How to use an LLM judge to label and score your application
This guide will walk you through the process of evaluating traces captured in Phoenix, and exporting the results to the Phoenix UI.
This process is similar to the evaluation quickstart guide, but instead of creating your own dataset or using an existing external one, you'll export a trace dataset from Phoenix and log the evaluation results to Phoenix.
pip install -q "arize-phoenix>=4.29.0"
pip install -q openai 'httpx<0.28'
import os
from getpass import getpass
import dotenv
dotenv.load_dotenv()
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key
Note: if you're self-hosting Phoenix, swap your collector endpoint variable in the snippet below, and remove the Phoenix Client Headers variable.
import os
PHOENIX_API_KEY = "ADD YOUR API KEY"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
Now that we have Phoenix configured, we can register a tracer provider with OpenTelemetry, which will allow us to collect traces from our application.
from phoenix.otel import register
tracer_provider = register(project_name="evaluating_traces_quickstart")
For the sake of making this guide fully runnable, we'll briefly generate some traces and track them in Phoenix. Typically, you would have already captured traces in Phoenix and would skip to "Download trace dataset from Phoenix"
%%bash
pip install -q openinference-instrumentation-openai
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI()

# Function to generate a joke
def generate_joke():
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates jokes."},
            {"role": "user", "content": "Tell me a joke."},
        ],
    )
    joke = response.choices[0].message.content
    return joke

# Generate 5 different jokes
jokes = []
for _ in range(5):
    joke = generate_joke()
    jokes.append(joke)
    print(f"Joke {len(jokes)}:\n{joke}\n")

print(f"Generated {len(jokes)} jokes and tracked them in Phoenix.")
import phoenix as px
spans_df = px.Client().get_spans_dataframe(project_name="evaluating_traces_quickstart")
spans_df.head()
Now that we have our trace dataset, we can generate evaluations for each trace. Evaluations can be generated in many different ways. Ultimately, we want to end up with a set of labels and/or scores for our traces.
You can generate evaluations using:
Plain code
Phoenix's built-in LLM as a Judge evaluators
Your own custom LLM as a Judge evaluator
Other evaluation packages
As long as you format your evaluation results properly, you can upload them to Phoenix and visualize them in the UI.
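Before diving into the plain-code route below, here is a minimal sketch of the custom LLM-as-judge route using phoenix.evals. The template, rails, model name, and the way the joke text is pulled out of the attributes.llm.output_messages column are illustrative assumptions; adapt them to your own data and rubric.

from phoenix.evals import OpenAIModel, llm_classify

# Illustrative judge template; {joke} is filled from the "joke" column below.
FUNNINESS_TEMPLATE = """You are judging whether a joke is funny.
Joke: {joke}
Respond with a single word: "funny" or "not_funny"."""

# Build the judge input from the spans dataframe exported above.
judge_df = spans_df[["context.span_id"]].copy()
judge_df["joke"] = spans_df["attributes.llm.output_messages"].apply(
    lambda messages: messages[0]["message.content"]
)
judge_df = judge_df.set_index("context.span_id")

judge_results = llm_classify(
    judge_df,
    model=OpenAIModel(model="gpt-4o-mini"),  # assumed model name
    template=FUNNINESS_TEMPLATE,
    rails=["funny", "not_funny"],
    provide_explanation=True,
)

# judge_results keeps the span_id index and has "label" and "explanation"
# columns, so it can be logged with SpanEvaluations just like the plain-code
# dataframe built below.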
Let's start with a simple example of generating evaluations using plain code. OpenAI has a habit of repeating jokes, so we'll generate evaluations to label whether a joke is a repeat of a previous joke.
# Create a new DataFrame with selected columns
eval_df = spans_df[["context.span_id", "attributes.llm.output_messages"]].copy()
eval_df.set_index("context.span_id", inplace=True)

# Create a set to store unique jokes
unique_jokes = set()

# Function to check if a joke is a duplicate
def is_duplicate(joke_data):
    joke = joke_data[0]["message.content"]
    if joke in unique_jokes:
        return True
    else:
        unique_jokes.add(joke)
        return False

# Apply the is_duplicate function to create the label column
eval_df["label"] = eval_df["attributes.llm.output_messages"].apply(is_duplicate)

# Convert the boolean label to an integer score (0 for False, 1 for True)
eval_df["score"] = eval_df["label"].astype(int)

# Reset unique_jokes to ensure correct results if the cell is run multiple times
unique_jokes.clear()
We now have a DataFrame with a column for whether each joke is a repeat of a previous joke. Let's upload this to Phoenix.
Our eval_df has the span_id as its index and columns for the evaluation results (label and score). The span_id is what allows us to connect each evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named "label" and "score" to display in the UI.
eval_df["score"] = eval_df["score"].astype(int)
eval_df["label"] = eval_df["label"].astype(str)
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(SpanEvaluations(eval_name="Duplicate", dataframe=eval_df))
You should now see evaluations in the Phoenix UI!
From here you can continue collecting and evaluating traces, or move on to one of these other guides:
If you're interested in more complex evaluation and evaluators, start with how to use LLM as a Judge evaluators
If you're ready to start testing your application in a more rigorous manner, check out how to run structured experiments
This guide shows how LLM evaluation results in dataframes can be sent to Phoenix.
An evaluation must have a name (e.g. "Q&A Correctness") and its DataFrame must contain identifiers for the subject of evaluation, e.g. a span or a document (more on that below), and values under either the score, label, or explanation columns.
Before accessing px.Client(), be sure you've set the following environment variables:
A dataframe of span evaluations would look similar to the table below. It must contain span_id as an index or as a column. Once ingested, Phoenix uses the span_id to associate the evaluation with its target span.
The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Q&A Correctness".
A dataframe of document evaluations would look something like the table below. It must contain span_id and document_position as either indices or columns. document_position is the document's (zero-based) index in the span's list of retrieved documents. Once ingested, Phoenix uses the span_id and document_position to associate the evaluation with its target span and document.
The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Relevance".
Multiple sets of Evaluations can be logged in a single px.Client().log_evaluations() function call.
By default the client will push evaluations to the project specified in the PHOENIX_PROJECT_NAME environment variable or to the default project. If you want to specify the destination project explicitly, you can pass the project name as a parameter.
import os
# Used by local phoenix deployments with auth:
os.environ["PHOENIX_API_KEY"] = "..."
# Used by Phoenix Cloud deployments:
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key=..."
# Be sure to modify this if you're self-hosting Phoenix:
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
Example span evaluations dataframe:

span_id        label      explanation
5B8EF798A381   correct    "this is correct ..."
E19B7EC3GG02   incorrect  "this is incorrect ..."
from phoenix.trace import SpanEvaluations
import os

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
)
Example document evaluations dataframe:

span_id        document_position   label        explanation
5B8EF798A381   0                   relevant     "this is ..."
5B8EF798A381   1                   irrelevant   "this is ..."
E19B7EC3GG02   0                   relevant     "this is ..."
from phoenix.trace import DocumentEvaluations

px.Client().log_evaluations(
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
)
px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
    SpanEvaluations(
        dataframe=hallucination_eval_df,
        eval_name="Hallucination",
    ),
    # ... as many as you like
)
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    project_name="<my-project>",
)
Use the phoenix client to capture end-user feedback
When building LLM applications, it is important to collect feedback to understand how your app is performing in production. Phoenix lets you attach feedback to spans and traces in the form of annotations.
Annotations come from a few different sources:
Human Annotators
End users of your application
LLMs-as-Judges
Basic code checks
You can use the Phoenix SDK and API to attach feedback to a span.
Phoenix expects feedback to be in the form of an annotation. Annotations consist of these fields:
{
  "span_id": "67f6740bbe1ddc3f",  // the id of the span to annotate
  "name": "correctness",          // the name of your annotation
  "annotator_kind": "HUMAN",      // HUMAN, LLM, or CODE
  "result": {
    "label": "correct",           // a human-readable category for the feedback
    "score": 0.85,                // a numeric score, can be 0 or 1, or a range like 0 to 100
    "explanation": "The response answered the question I asked"
  },
  "metadata": {
    "model": "gpt-4",
    "threshold_ms": 500,
    "confidence": "high"
  },
  "identifier": "user-123"        // optional, identifies the annotation and enables upserts
}
Note that you can provide a label, score, or explanation. With Phoenix an annotation has a name (like correctness), is associated with an annotator (LLM, HUMAN, or CODE), and can be attached to the spans you have logged to Phoenix.
Phoenix allows you to log multiple annotations of the same name to the same span. For example, a single span could have 5 different "correctness" annotations. This can be useful when collecting end user feedback.
Note: The API will overwrite span annotations of the same name, unless they have different "identifier" values.
If you want to track multiple annotations of the same name on the same span, make sure to include different "identifier" values on each.
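For example, the payload below (in the same format as above, wrapped in the "data" array used by the REST endpoint) logs two "correctness" annotations on the same span that will both be kept, because their identifier values differ. The identifier values shown are illustrative.

{
  "data": [
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "correctness",
      "annotator_kind": "HUMAN",
      "result": { "label": "correct", "score": 1 },
      "identifier": "user-123"
    },
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "correctness",
      "annotator_kind": "HUMAN",
      "result": { "label": "incorrect", "score": 0 },
      "identifier": "user-456"
    }
  ]
}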
Once you construct the annotation, you can send it to Phoenix via its REST API. You can POST an annotation from your application to /v1/span_annotations like so:
If you're self-hosting Phoenix, be sure to change the endpoint in the code below to <your phoenix endpoint>/v1/span_annotations?sync=false
Retrieve the current span_id
If you'd like to collect feedback on currently instrumented code, you can get the current span using the opentelemetry SDK.
from opentelemetry.trace import format_span_id, get_current_span
span = get_current_span()
span_id = format_span_id(span.get_span_context().span_id)
You can use the span_id to send an annotation associated with that span.
from phoenix.client import Client

client = Client()

annotation = client.annotations.add_span_annotation(
    annotation_name="user feedback",
    annotator_kind="HUMAN",
    span_id=span_id,
    label="thumbs-up",
    score=1,
)
Retrieve the current spanId
import { trace } from "@opentelemetry/api";

async function chat(req, res) {
  // ...
  const spanId = trace.getActiveSpan()?.spanContext().spanId;
}
You can use the spanId to send an annotation associated with that span.
import { createClient } from '@arizeai/phoenix-client';

const PHOENIX_API_KEY = 'your_api_key';

const px = createClient({
  options: {
    // change to self-hosted base url if applicable
    baseUrl: 'https://app.phoenix.arize.com',
    headers: {
      api_key: PHOENIX_API_KEY,
      Authorization: `Bearer ${PHOENIX_API_KEY}`,
    },
  },
});
export async function postFeedback(
  spanId: string,
  name: string,
  label: string,
  score: number,
  explanation?: string,
  metadata?: Record<string, unknown>
) {
  const response = await px.POST('/v1/span_annotations', {
    params: { query: { sync: true } },
    body: {
      data: [
        {
          span_id: spanId,
          name: name,
          annotator_kind: 'HUMAN',
          result: {
            label: label,
            score: score,
            explanation: explanation || null,
          },
          metadata: metadata || {},
        },
      ],
    },
  });
  if (!response || !response.data) {
    throw new Error('Annotation failed');
  }
  return response.data.data;
}
curl -X 'POST' \
  'https://app.phoenix.arize.com/v1/span_annotations?sync=false' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'api_key: <your phoenix api key>' \
  -d '{
  "data": [
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "correctness",
      "annotator_kind": "HUMAN",
      "result": {
        "label": "correct",
        "score": 0.85,
        "explanation": "The response answered the question I asked"
      },
      "metadata": {
        "model": "gpt-4",
        "threshold_ms": 500,
        "confidence": "high"
      }
    }
  ]
}'