Trace-Level Evaluations

How to measure LLM application performance at the trace level.

Overview

A trace is a complete record of all operations (spans) that occur during a single execution or request in your LLM application. Trace-level evaluations provide a holistic view of:

  • Whether the overall workflow or task was completed successfully

  • The quality and correctness of the end-to-end process

  • Aggregate metrics such as latency, errors, and evaluation labels for the entire trace

  • The success of multi-step workflows (e.g., agentic reasoning, RAG pipelines)
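
For intuition, the span rows that make up a single trace look roughly like this once exported. This is a minimal, hypothetical sketch: the column names mirror those used in the steps below, but the values are made up.

import pandas as pd

# Illustrative only: a trace is a set of spans sharing one context.trace_id.
# The root span has no parent_id; child spans point at their parent span.
toy_trace = pd.DataFrame([
    {"context.trace_id": "t-1", "context.span_id": "s-1", "parent_id": None,
     "attributes.input.value": "What is our refund policy?",
     "attributes.output.value": "Refunds are accepted within 30 days."},
    {"context.trace_id": "t-1", "context.span_id": "s-2", "parent_id": "s-1",
     "attributes.input.value": None,
     "attributes.output.value": "Retrieved 3 policy documents."},
])
print(toy_trace.groupby("context.trace_id").size())  # spans per trace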

How to Run Trace-Level Evaluations

This step-by-step guide shows how to evaluate the quality of an entire trace and log the results back to Arize.

1. Pull trace data from Arize

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

client = ArizeExportClient()

df = client.export_model_to_df(
    space_id = "YOUR_SPACE_ID",
    model_id = "YOUR_MODEL_ID",
    environment = Environments.TRACING,
    start_time = datetime.now(timezone.utc) - timedelta(days=7),
    end_time   = datetime.now(timezone.utc),
)
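
Before aggregating, it is worth confirming that the export returned spans for the window you requested. A quick sanity check, using the context.trace_id column relied on in the next step:

# Confirm the export returned data and count distinct traces
print(df.shape)
print(df["context.trace_id"].nunique(), "traces in the export")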

2. Prepare trace-level data for evaluation

Aggregate or filter your dataframe. For example, you might group by context.trace_id and concatenate the relevant fields for your evaluation.

import pandas as pd

# Example: Aggregate all outputs in a trace
trace_df = (
    df.groupby("context.trace_id")
      .agg({
          "attributes.input.value": "first",
          "attributes.output.value": lambda x: " ".join(x.dropna()),
          # Add other aggregations as needed
      })
      .reset_index()
)
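
The overview above also mentions aggregate metrics such as latency and errors. If you want those alongside the text fields, a similar groupby works; note that the start_time, end_time, and status_code column names below are assumptions about the export schema, so adjust them to the columns actually present in your dataframe.

# Optional: per-trace latency and error counts
# (assumes start_time / end_time are datetime columns and status_code is a string column)
trace_metrics = (
    df.groupby("context.trace_id")
      .agg(
          trace_start = ("start_time", "min"),
          trace_end   = ("end_time", "max"),
          error_spans = ("status_code", lambda s: (s == "ERROR").sum()),
      )
      .reset_index()
)
trace_metrics["latency_s"] = (
    (trace_metrics["trace_end"] - trace_metrics["trace_start"]).dt.total_seconds()
)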

3. Define your evaluation prompt

Create a prompt template for the LLM judge to assess the entire trace. For example, to judge overall correctness:

TRACE_EVAL_PROMPT = """
You are evaluating the overall quality and correctness of an LLM application's response to a user request.

You will be given:
1. The user input that initiated the trace
2. The full output(s) generated during the trace

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
- `correct` → the trace achieves the intended goal.
- `incorrect` → the trace fails to achieve the goal or is low quality.
"""

4. Run the evaluation

from phoenix.evals import llm_classify, OpenAIModel
import nest_asyncio, os

nest_asyncio.apply()

model = OpenAIModel(
    api_key = os.environ["OPENAI_API_KEY"],
    model   = "gpt-4o-mini",
    temperature = 0.0,
)

rails = ["correct", "incorrect"]
results = llm_classify(
    dataframe           = trace_df,
    template            = TRACE_EVAL_PROMPT,
    model               = model,
    rails               = rails,
    provide_explanation = True,   
    verbose             = False,
)
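
With provide_explanation=True, the returned dataframe should carry label and explanation columns aligned to trace_df's row index. A quick look before logging:

# Distribution of judge labels and a sample of explanations
print(results["label"].value_counts())
print(results[["label", "explanation"]].head())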

5. Log the results back to Arize

Attach the evaluation results to the root span of each trace for analysis in the UI.

from arize.pandas.logger import Client

# Join eval results back onto the trace data; llm_classify preserves
# trace_df's row index, so an index-to-index join lines the rows up
log_df = trace_df.join(results)

# Attach each trace's root span id (the span with no parent)
root_spans = df[df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
log_df = log_df.merge(root_spans, on="context.trace_id", how="left")
log_df.set_index("context.span_id", inplace=True)

# Rename the eval columns to the eval.<name>.<field> pattern Arize expects
# (the eval name "trace_correctness" is arbitrary; pick one that suits your eval)
log_df = log_df.rename(columns={
    "label": "eval.trace_correctness.label",
    "explanation": "eval.trace_correctness.explanation",
})

arize_client = Client(
    space_id = os.environ["ARIZE_SPACE_ID"],
    api_key  = os.environ["ARIZE_API_KEY"],
)
resp = arize_client.log_evaluations_sync(
    dataframe = log_df,
    model_id  = os.environ["ARIZE_MODEL_ID"],
)
