Trace-Level Evals
How to measure LLM application performance at the trace level.
A trace is a complete record of all operations (spans) that occur during a single execution or request in your LLM application. Trace-level evaluations provide a holistic view of:
Whether the overall workflow or task was completed successfully
The quality and correctness of the end-to-end process
Aggregate metrics such as latency, errors, and evaluation labels for the entire trace
The success of multi-step workflows (e.g., agentic reasoning, RAG pipelines)
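For intuition, a trace can be pictured as a group of spans that share a trace_id: span-level evals score individual rows, while a trace-level eval scores the group as a whole. A minimal sketch (field names are illustrative):

# Illustrative only: three spans belonging to a single trace.
# A span-level eval scores each row; a trace-level eval asks one question
# of the whole group (e.g., did the agent accomplish the user's goal?).
trace = [
    {"trace_id": "t-1", "span_id": "s-1", "parent_id": None,  "name": "agent"},
    {"trace_id": "t-1", "span_id": "s-2", "parent_id": "s-1", "name": "retriever"},
    {"trace_id": "t-1", "span_id": "s-3", "parent_id": "s-1", "name": "llm"},
]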
Trace-Level Evaluations via UI
To run evaluations at the trace level in the UI, set the evaluator scope to “Trace” for each evaluator you want to operate at that level.
You can also apply filters to focus the evaluation on specific parts of a trace. If no filters are applied, the evaluation will consider the entire trace by default.
Trace-Level Evaluations via Code
This step-by-step guide shows how to evaluate existing traces in code and log the results back to Arize.
1. Get trace data from Arize
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

client = ArizeExportClient()

# Export the last 7 days of tracing data for your space and model
primary_df = client.export_model_to_df(
    space_id="YOUR_ARIZE_SPACE_ID",
    model_id="YOUR_ARIZE_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
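Before moving on, it can help to sanity-check the export. The snippet below assumes the standard column names used throughout this guide:

# Optional sanity check: how many spans and distinct traces were exported?
print(primary_df.shape)
print(primary_df["context.trace_id"].nunique(), "traces exported")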
2. Prepare trace-level data for evaluation
Aggregate or filter your dataframe. For example, you might group by context.trace_id and concatenate the relevant fields for your evaluation:
import pandas as pd

# Example: aggregate all outputs in a trace into a single row per trace_id
trace_df = (
    primary_df.groupby("context.trace_id")
    .agg({
        "attributes.input.value": "first",
        "attributes.output.value": lambda x: " ".join(x.dropna()),
        # Add other aggregations as needed
    })
)
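Concatenating every span's output can get noisy for long traces. A common variation, sketched below using the parent_id column that also appears in step 5, is to keep only root spans before grouping:

# Alternative: keep only root spans (no parent), so each trace contributes
# a single input/output pair rather than a concatenation of every span
root_only_df = primary_df[primary_df["parent_id"].isna()]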
3. Define your evaluation prompt
Create a prompt template for the LLM judge to assess the entire trace. For example, to judge overall correctness:
TRACE_EVAL_PROMPT = """
You are evaluating the overall quality and correctness of an LLM application's response to a user request.
You will be given:
1. The user input that initiated the trace
2. The full output(s) generated during the trace
##
User Input:
{attributes.input.value}
Trace Output:
{attributes.output.value}
##
Respond with exactly one word: `correct` or `incorrect`.
- `correct` → the trace achieves the intended goal.
- `incorrect` → the trace fails to achieve the goal or is low quality.
"""4. Run the evaluation
4. Run the evaluation
Use Phoenix Evals to run your evaluation:
from phoenix.evals import create_classifier
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

correctness_evaluator = create_classifier(
    name="correctness",
    llm=llm,
    prompt_template=TRACE_EVAL_PROMPT,
    choices={"correct": 1.0, "incorrect": 0.0},
)
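# The choices mapping converts the judge's one-word label into a numeric
# score (correct -> 1.0, incorrect -> 0.0) that is logged alongside the label.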
results_df = await async_evaluate_dataframe(
    dataframe=trace_df,
    evaluators=[correctness_evaluator],
)
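The await call above assumes you are already inside an async context, such as a Jupyter notebook. In a plain Python script, you can drive the same call with asyncio, for example:

import asyncio

results_df = asyncio.run(
    async_evaluate_dataframe(
        dataframe=trace_df,
        evaluators=[correctness_evaluator],
    )
)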
5. Log the results back to Arize
Merge the evaluation results with the original trace data so that each trace_id lines up with its root span_id, and make sure to prefix the evaluation columns with trace_eval before logging:
from arize.pandas.logger import Client
from phoenix.evals.utils import to_annotation_dataframe
import pandas as pd

client = Client()

# Root spans have no parent; use them to map each trace_id to its root span_id
root_spans = primary_df[primary_df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
# Merge results with root spans to align on trace_id
results_with_spans = pd.merge(
    results_df.reset_index(), root_spans, on="context.trace_id", how="left"
).set_index("context.span_id", drop=False)
# Format for logging
correctness_eval_df = to_annotation_dataframe(results_with_spans)
# Rename columns using the required trace_eval prefix
correctness_eval_df = correctness_eval_df.rename(columns={
    "label": "trace_eval.correctness.label",
    "score": "trace_eval.correctness.score",
    "explanation": "trace_eval.correctness.explanation",
})
client.log_evaluations_sync(correctness_eval_df, 'your-project-name')
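Note that Client() above reads credentials from your environment. They can also be passed explicitly; the parameter names below are a sketch, so check them against your installed Arize SDK version:

# Assumption: space_id / api_key parameters as in recent arize SDK releases
client = Client(space_id="YOUR_ARIZE_SPACE_ID", api_key="YOUR_ARIZE_API_KEY")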