Annotating in the UI

How to annotate traces in the UI for analysis and dataset curation

Configuring Annotations

To annotate data in the UI, you first need to set up a rubric for how to annotate. Navigate to Settings and create annotation configs (i.e. a rubric) for your data. You can create three types of annotations: Categorical, Continuous, and Freeform.

Annotation Types

  • Annotation Type:
    - Categorical: predefined labels for selection (e.g. 👍 or 👎)
    - Continuous: a score across a specified range (e.g. confidence score 0-100)
    - Freeform: open-ended text comments (e.g. "correct")

  • Optimization Direction, chosen based on your goal:
    - Maximize: higher scores are better (e.g. confidence)
    - Minimize: lower scores are better (e.g. hallucinations)
    - None: direction optimization does not apply (e.g. tone)
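Conceptually, a categorical rubric like the thumbs up/down example captures fields along these lines (an illustrative sketch of the rubric only, not the exact Phoenix config schema):

{
  "name": "response_quality",
  "type": "CATEGORICAL",
  "values": ["👍", "👎"],
  "optimization_direction": "MAXIMIZE"
}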

Different types of annotations change the way human annotators provide feedback
Configure an annotation to guide how a user should input an annotation

Adding Annotations

Once annotations are configured, you can add them to your project to build out a custom annotation form

Once you have annotations configured, you can associate them with the data you have traced. Click the Annotate button and fill out the form to rate different steps in your AI application. You can also take notes as you go, either by clicking the explain link or by adding your notes to the messages UI at the bottom. You can always come back to edit or delete your annotations; annotations can be deleted from the table view under the Annotations tab.

Once an annotation has been provided, you can also add a reason to explain why this particular label or score was provided. This is useful to add additional context to the annotation.

Viewing Annotations

As annotations come in from various sources (annotators, evals), the entire list of annotations can be found under the Annotations tab. Here you can see the author, the annotator kind (e.g. whether the annotation was performed by a human, an LLM, or code), and so on. This can be particularly useful if you want to see whether different annotators disagree.

You can view annotations from different users, LLMs, and other annotators

Exporting Traces with specific Annotation Values

Once you have collected feedback in the form of annotations, you can filter your traces by annotation values to narrow down to interesting samples (e.g. LLM spans that are incorrect), as sketched below. Once filtered down to a sample of spans, you can export your selection to a dataset, which in turn can be used for experimentation, fine-tuning, or building a human-aligned eval.

Narrow down your data to areas that need more attention or refinement
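A minimal sketch of that flow in code, assuming the eval-filter expression syntax and a project named "my-project" (the filter string, project name, dataset name, and column keys are all illustrative; adjust them to your setup):

import phoenix as px

# Pull only the spans whose "correctness" eval was labeled incorrect
spans_df = px.Client().get_spans_dataframe(
    "evals['correctness'].label == 'incorrect'",  # assumed eval name
    project_name="my-project",
)

# Upload the filtered spans as a dataset for experiments or fine-tuning
px.Client().upload_dataset(
    dataset_name="incorrect-llm-spans",
    dataframe=spans_df,
    input_keys=["attributes.input.value"],    # assumed column names
    output_keys=["attributes.output.value"],
)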

Annotate Traces

Further reading: Applying the scientific method to building AI products, by Eugene Yan

Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:

  1. Track performance over time

  2. Identify areas for improvement

  3. Compare different model versions or prompts

  4. Gather data for fine-tuning or retraining

  5. Provide stakeholders with concrete metrics on system effectiveness

Phoenix allows you to annotate traces through the Client, the REST API, or the UI.

Guides

  • To learn how to configure annotations and to annotate through the UI, see Annotating in the UI

  • To learn how to add human labels to your traces, either manually or programmatically, see Annotating via the Client

  • To learn how to evaluate traces captured in Phoenix, see Running Evals on Traces

  • To learn how to upload your own evaluation labels into Phoenix, see Log Evaluation Results

For more background on the concept of annotations, see Annotations

Adding manual annotations to traces

Annotating Auto-Instrumented Spans

Use the capture_span_context context manager to annotate auto-instrumented spans

Assumes you are using openinference-instrumentation>=0.1.34

When working with spans that are automatically instrumented via OpenInference in your LLM applications, you often need to capture span contexts to apply feedback or annotations. The capture_span_context context manager provides a convenient way to capture all OpenInference spans within its scope, making it easier to apply feedback to specific spans in downstream operations.

The capture_span_context context manager allows you to:

  • Capture all spans created within a specific code block

  • Retrieve span contexts for later use in feedback systems

  • Maintain a clean separation between span creation and annotation logic

  • Apply feedback to spans without needing to track span IDs manually

Usage

You can use the captured span contexts to implement custom feedback logic. The captured span contexts integrate seamlessly with Phoenix's annotation system:

from openinference.instrumentation import capture_span_context
from opentelemetry.trace.span import format_span_id
from phoenix.client import Client

client = Client()

def process_llm_request_with_feedback(prompt: str):
    with capture_span_context() as capture:
        # Make LLM call (auto-instrumented)
        response = llm.invoke("Generate a summary")
        # Get user feedback (simulated)
        user_feedback = get_user_feedback(response)
        
        # Method 1: Get span ID using get_last_span_id (most recent span)
        last_span_id = capture.get_last_span_id()
        # Apply feedback to the most recent span
        if last_span_id:
            client.annotations.add_span_annotation(
                annotation_name="user_feedback",
                annotator_kind="HUMAN",
                span_id=last_span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )
        
        # Method 2: Get all captured span contexts and iterate
        span_contexts = capture.get_span_contexts()
        # Apply feedback to all captured spans
        for span_context in span_contexts:
            # Convert span context to span ID for annotation
            span_id = format_span_id(span_context.span_id)
            
            # Add annotation to Phoenix
            client.annotations.add_span_annotation(
                annotation_name="user_feedback_all",
                annotator_kind="HUMAN",
                span_id=span_id,
                label=user_feedback.label,
                score=user_feedback.score,
                explanation=user_feedback.explanation
            )

Working with Multiple Span Types

You can filter spans based on their attributes:

with capture_span_context() as capture:
    # Make LLM call (auto-instrumented)
    response = llm.invoke("Generate a summary")
    
    span_contexts = capture.get_span_contexts()
    
    # Filter for specific span types (illustrative: this keeps spans that
    # expose attributes; adapt the predicate to the span types you care about)
    llm_spans = [
        ctx for ctx in span_contexts 
        if hasattr(ctx, 'attributes')
    ]
    
    # Apply different feedback logic to different span types
    for span_context in llm_spans:
        apply_llm_feedback(span_context)
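Here apply_llm_feedback is left to you. A minimal sketch, reusing the Client and format_span_id imports from the example above (the annotation name, label, and score are assumptions):

def apply_llm_feedback(span_context):
    # Convert the raw OpenTelemetry span context into the hex span ID Phoenix expects
    span_id = format_span_id(span_context.span_id)
    client.annotations.add_span_annotation(
        annotation_name="llm_feedback",  # hypothetical annotation name
        annotator_kind="HUMAN",
        span_id=span_id,
        label="thumbs-up",
        score=1,
    )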

Resources

  • OpenInference

Running Evals on Traces

How to use an LLM judge to label and score your application

This guide will walk you through the process of evaluating traces captured in Phoenix, and exporting the results to the Phoenix UI.

This process is similar to the evaluation quickstart guide, but instead of creating your own dataset or using an existing external one, you'll export a trace dataset from Phoenix and log the evaluation results to Phoenix.

Install dependencies & Set environment variables

pip install -q "arize-phoenix>=4.29.0"
pip install -q openai 'httpx<0.28'
import os
from getpass import getpass

import dotenv

dotenv.load_dotenv()

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

Connect to Phoenix

Note: if you're self-hosting Phoenix, swap your collector endpoint variable in the snippet below, and remove the Phoenix Client Headers variable.

import os

PHOENIX_API_KEY = "ADD YOUR API KEY"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

Now that we have Phoenix configured, we can register a tracer provider with OpenTelemetry, which will allow us to collect traces from our application.

from phoenix.otel import register

tracer_provider = register(project_name="evaluating_traces_quickstart")

Prepare trace dataset

For the sake of making this guide fully runnable, we'll briefly generate some traces and track them in Phoenix. Typically, you would have already captured traces in Phoenix and would skip to "Download trace dataset from Phoenix".

%%bash
pip install -q openinference-instrumentation-openai
from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI()


# Function to generate a joke
def generate_joke():
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates jokes."},
            {"role": "user", "content": "Tell me a joke."},
        ],
    )
    joke = response.choices[0].message.content
    return joke


# Generate 5 different jokes
jokes = []
for _ in range(5):
    joke = generate_joke()
    jokes.append(joke)
    print(f"Joke {len(jokes)}:\n{joke}\n")

print(f"Generated {len(jokes)} jokes and tracked them in Phoenix.")

Download trace dataset from Phoenix

import phoenix as px

spans_df = px.Client().get_spans_dataframe(project_name="evaluating_traces_quickstart")
spans_df.head()

Generate evaluations

Now that we have our trace dataset, we can generate evaluations for each trace. Evaluations can be generated in many different ways. Ultimately, we want to end up with a set of labels and/or scores for our traces.

You can generate evaluations using:

  • Plain code

  • Phoenix's built-in LLM as a Judge evaluators

  • Your own custom LLM as a Judge evaluator

  • Other evaluation packages

As long as you format your evaluation results properly, you can upload them to Phoenix and visualize them in the UI.

Let's start with a simple example of generating evaluations using plain code. OpenAI has a habit of repeating jokes, so we'll generate evaluations to label whether a joke is a repeat of a previous joke.

# Create a new DataFrame with selected columns
eval_df = spans_df[["context.span_id", "attributes.llm.output_messages"]].copy()
eval_df.set_index("context.span_id", inplace=True)

# Create a set to store unique jokes
unique_jokes = set()


# Function to check if a joke is a duplicate
def is_duplicate(joke_data):
    joke = joke_data[0]["message.content"]
    if joke in unique_jokes:
        return True
    else:
        unique_jokes.add(joke)
        return False


# Apply the is_duplicate function to create the label column
eval_df["label"] = eval_df["attributes.llm.output_messages"].apply(is_duplicate)

# Convert the boolean label to an integer score (0 for False, 1 for True)
eval_df["score"] = eval_df["label"].astype(int)

# Reset the unique_jokes set to ensure correct results if the cell is run multiple times
unique_jokes.clear()

We now have a DataFrame with a column for whether each joke is a repeat of a previous joke. Let's upload this to Phoenix.

Upload evaluations to Phoenix

Our eval_df has a column for the span_id and columns for the evaluation result. The span_id is what allows us to connect the evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named "label" and "score" to display in the UI.

eval_df["score"] = eval_df["score"].astype(int)
eval_df["label"] = eval_df["label"].astype(str)
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="Duplicate", dataframe=eval_df))

You should now see evaluations in the Phoenix UI!

From here you can continue collecting and evaluating traces, or move on to one of these other guides:

  • If you're interested in more complex evaluation and evaluators, start with how to use LLM as a Judge evaluators

  • If you're ready to start testing your application in a more rigorous manner, check out how to run structured experiments

Log Evaluation Results

This guide shows how LLM evaluation results in dataframes can be sent to Phoenix.

An evaluation must have a name (e.g. "Q&A Correctness") and its DataFrame must contain identifiers for the subject of evaluation, e.g. a span or a document (more on that below), and values under either the score, label, or explanation columns.
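For instance, a minimal span-evaluations dataframe could be assembled with pandas like this (a sketch; the span IDs and values are the placeholder examples from the table below):

import pandas as pd

# span_id identifies the target span; label, score, and explanation
# hold the evaluation results Phoenix looks for.
qa_correctness_eval_df = pd.DataFrame(
    {
        "span_id": ["5B8EF798A381", "E19B7EC3GG02"],  # placeholder span IDs
        "label": ["correct", "incorrect"],
        "score": [1, 0],
        "explanation": ["this is correct ...", "this is incorrect ..."],
    }
).set_index("span_id")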

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

import os

# Used by local phoenix deployments with auth:
os.environ["PHOENIX_API_KEY"] = "..."

# Used by Phoenix Cloud deployments:
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key=..."

# Be sure to modify this if you're self-hosting Phoenix:
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

Span Evaluations

A dataframe of span evaluations would look similar to the table below. It must contain span_id as an index or as a column. Once ingested, Phoenix uses the span_id to associate the evaluation with its target span.

span_id      | label     | score | explanation
------------ | --------- | ----- | ------------------------
5B8EF798A381 | correct   | 1     | "this is correct ..."
E19B7EC3GG02 | incorrect | 0     | "this is incorrect ..."

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Q&A Correctness".

from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
)

Document Evaluations

A dataframe of document evaluations would look something like the table below. It must contain span_id and document_position as either indices or columns. document_position is the document's (zero-based) index in the span's list of retrieved documents. Once ingested, Phoenix uses the span_id and document_position to associate the evaluation with its target span and document.

span_id      | document_position | label      | score | explanation
------------ | ----------------- | ---------- | ----- | --------------
5B8EF798A381 | 0                 | relevant   | 1     | "this is ..."
5B8EF798A381 | 1                 | irrelevant | 0     | "this is ..."
E19B7EC3GG02 | 0                 | relevant   | 1     | "this is ..."

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Relevance".

from phoenix.trace import DocumentEvaluations

px.Client().log_evaluations(
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
)

Logging Multiple Evaluation DataFrames

Multiple sets of evaluations can be logged in the same px.Client().log_evaluations() function call.

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
    SpanEvaluations(
        dataframe=hallucination_eval_df,
        eval_name="Hallucination",
    ),
    # ... as many as you like
)

Specifying A Project for the Evaluations

By default the client will push traces to the project specified in the PHOENIX_PROJECT_NAME environment variable or to the default project. If you want to specify the destination project explicitly, you can pass the project name as a parameter.

from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    project_name="<my-project>",
)

Annotating via the Client

Use the phoenix client to capture end-user feedback

This assumes the annotation APIs available as of arize-phoenix>=9.0.0.

When building LLM applications, it is important to collect feedback to understand how your app is performing in production. Phoenix lets you attach feedback to spans and traces in the form of annotations.

Annotations come from a few different sources:

  • Human Annotators

  • End users of your application

  • LLMs-as-Judges

  • Basic code checks

You can use the Phoenix SDK and API to attach feedback to a span.

Phoenix expects feedback to be in the form of an annotation. Annotations consist of these fields:

{
  "span_id": "67f6740bbe1ddc3f",  // the id of the span to annotate
  "name": "correctness",  // the name of your annotation
  "annotator_kind": "HUMAN",  // HUMAN, LLM, or CODE
  "result": {
    "label": "correct",  // A human-readable category for the feedback
    "score": 0.85,  // a numeric score, can be 0 or 1, or a range like 0 to 100
    "explanation": "The response answered the question I asked"
  },
  "metadata": {
    "model": "gpt-4",
    "threshold_ms": 500,
    "confidence": "high"
  },
  "identifier": "user-123"  // optional, identifies the annotation and enables upserts
}

Note that you can provide a label, score, or explanation. With Phoenix an annotation has a name (like correctness), is associated with an annotator (LLM, HUMAN, or CODE), and can be attached to the spans you have logged to Phoenix.

Phoenix allows you to log multiple annotations of the same name to the same span. For example, a single span could have 5 different "correctness" annotations. This can be useful when collecting end user feedback.

Note: The API will overwrite span annotations of the same name, unless they have different "identifier" values.

If you want to track multiple annotations of the same name on the same span, make sure to include different "identifier" values on each.
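For example, to keep two end users' ratings of the same span as separate "user_feedback" annotations, give each a distinct identifier (a sketch using the schema above; the span ID and identifier values are illustrative):

{
  "data": [
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "user_feedback",
      "annotator_kind": "HUMAN",
      "result": { "label": "thumbs-up", "score": 1 },
      "identifier": "user-123"
    },
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "user_feedback",
      "annotator_kind": "HUMAN",
      "result": { "label": "thumbs-down", "score": 0 },
      "identifier": "user-456"
    }
  ]
}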

Send Annotations to Phoenix

Once you construct the annotation, you can send it to Phoenix via its REST API. You can POST an annotation from your application to /v1/span_annotations like so:

If you're self-hosting Phoenix, be sure to change the endpoint in the code below to <your phoenix endpoint>/v1/span_annotations?sync=false

Retrieve the current span_id

If you'd like to collect feedback on currently instrumented code, you can get the current span using the OpenTelemetry SDK.

from opentelemetry.trace import format_span_id, get_current_span

span = get_current_span()
span_id = format_span_id(span.get_span_context().span_id)

You can use the span_id to send an annotation associated with that span.

from phoenix.client import Client

client = Client()
annotation = client.annotations.add_span_annotation(
    annotation_name="user feedback",
    annotator_kind="HUMAN",
    span_id=span_id,
    label="thumbs-up",
    score=1,
)

Retrieve the current spanId

import { trace } from "@opentelemetry/api";

async function chat(req, res) {
  // ...
  const spanId = trace.getActiveSpan()?.spanContext().spanId;
}

You can use the spanId to send an annotation associated with that span.

import { createClient } from '@arizeai/phoenix-client';

const PHOENIX_API_KEY = 'your_api_key';

const px = createClient({
  options: {
    // change to self-hosted base url if applicable
    baseUrl: 'https://app.phoenix.arize.com',
    headers: {
      api_key: PHOENIX_API_KEY,
      Authorization: `Bearer ${PHOENIX_API_KEY}`,
    },
  },
});

export async function postFeedback(
  spanId: string,
  name: string,
  label: string,
  score: number,
  explanation?: string,
  metadata?: Record<string, unknown>
) {
  const response = await px.POST('/v1/span_annotations', {
    params: { query: { sync: true } },
    body: {
      data: [
        {
          span_id: spanId,
          name: name,
          annotator_kind: 'HUMAN',
          result: {
            label: label,
            score: score,
            explanation: explanation || null,
          },
          metadata: metadata || {},
        },
      ],
    },
  });

  if (!response || !response.data) {
    throw new Error('Annotation failed');
  }

  return response.data.data;
}
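A usage sketch, combining postFeedback with the spanId captured earlier (the annotation name, label, score, and explanation are illustrative):

const spanId = trace.getActiveSpan()?.spanContext().spanId;
if (spanId) {
  // Post a thumbs-up rating for the currently active span
  await postFeedback(spanId, "user_feedback", "thumbs-up", 1, "Helpful answer");
}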
curl -X 'POST' \
  'https://app.phoenix.arize.com/v1/span_annotations?sync=false' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'api_key: <your phoenix api key>' \
  -d '{
  "data": [
    {
      "span_id": "67f6740bbe1ddc3f",
      "name": "correctness",
      "annotator_kind": "HUMAN",
      "result": {
        "label": "correct",
        "score": 0.85,
        "explanation": "The response answered the question I asked"
      },
      "metadata": {
        "model": "gpt-4",
        "threshold_ms": 500,
        "confidence": "high"
      }
    }
  ]
}'