A trace is a complete record of all operations (spans) that occur during a single execution or request in your LLM application. Trace-level evaluations provide a holistic view of:

Whether the overall workflow or task was completed successfully
The quality and correctness of the end-to-end process
Aggregate metrics such as latency, errors, and evaluation labels for the entire trace
The success of multi-step workflows (e.g., agentic reasoning, RAG pipelines)
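To make the aggregate metrics concrete, here is a minimal sketch that rolls span records up to one summary per trace. The span fields (`trace_id`, `start`, `end`, `status`) are hypothetical stand-ins chosen for illustration, not the Arize export schema.

```python
from collections import defaultdict

# Hypothetical span records; field names are illustrative only.
spans = [
    {"trace_id": "t1", "name": "agent", "start": 0.0, "end": 2.5, "status": "OK"},
    {"trace_id": "t1", "name": "retriever", "start": 0.1, "end": 0.9, "status": "OK"},
    {"trace_id": "t1", "name": "llm", "start": 1.0, "end": 2.4, "status": "ERROR"},
    {"trace_id": "t2", "name": "llm", "start": 5.0, "end": 5.75, "status": "OK"},
]

def summarize_traces(spans):
    """Roll span-level records up to one summary per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    summaries = {}
    for trace_id, group in by_trace.items():
        summaries[trace_id] = {
            "span_count": len(group),
            # Trace latency: earliest start to latest end across all spans
            "latency": max(s["end"] for s in group) - min(s["start"] for s in group),
            "error_count": sum(s["status"] == "ERROR" for s in group),
        }
    return summaries

trace_summaries = summarize_traces(spans)
```

A trace-level evaluation works on this rolled-up view rather than on any single span.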

Trace-Level Evaluations via UI

To run evaluations at the trace level in the UI, set the evaluator scope to “Trace” for each evaluator you want to operate at that level.

You can also apply filters to focus the evaluation on specific parts of a trace. If no filters are applied, the evaluation will consider the entire trace by default.

Trace-Level Evaluations via Code

This step-by-step guide shows how to evaluate existing traces in code and log the results back to Arize.

1. Get trace data from Arize

Coming Soon

2. Prepare trace-level data for evaluation

Aggregate or filter your dataframe. For example, you might group by context.trace_id and concatenate relevant fields for your evaluation:

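import pandas as pd

# primary_df is the span-level dataframe from step 1 (one row per span).
# Example: aggregate all outputs in a trace into one row per trace.
trace_df = (
    primary_df.groupby("context.trace_id")
      .agg({
          # First non-null input encountered in the trace (depends on row order)
          "attributes.input.value": "first",
          # Join every non-null span output; cast to str in case of mixed types
          "attributes.output.value": lambda x: " ".join(x.dropna().astype(str)),
          # Add other aggregations as needed
      })
)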
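The example above covers aggregation. For the filtering case, one option is to keep only certain span kinds before grouping. This is a sketch on a toy dataframe; the `attributes.openinference.span.kind` column name and its values are assumptions about the export, so check them against your own data.

```python
import pandas as pd

# Toy span-level dataframe standing in for primary_df from step 1.
primary_df = pd.DataFrame({
    "context.trace_id": ["t1", "t1", "t1", "t2"],
    # Assumed column name and values for the span kind
    "attributes.openinference.span.kind": ["CHAIN", "RETRIEVER", "LLM", "LLM"],
    "attributes.input.value": ["What is a trace?", "trace", "What is a trace?", "Hi"],
    "attributes.output.value": [None, "doc-1 doc-2", "A trace is a set of spans.", "Hello!"],
})

# Keep only LLM spans so retriever payloads do not end up in the judged text.
llm_spans = primary_df[primary_df["attributes.openinference.span.kind"] == "LLM"]

filtered_trace_df = llm_spans.groupby("context.trace_id").agg({
    "attributes.input.value": "first",
    "attributes.output.value": lambda x: " ".join(x.dropna().astype(str)),
})
```

Filtering before grouping keeps the judged text short, which matters when long traces would otherwise exceed the judge model's context window.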

3. Define your evaluation prompt

Create a prompt template for the LLM judge to assess the entire trace. For example, to judge overall correctness:

TRACE_EVAL_PROMPT = """
You are evaluating the overall quality and correctness of an LLM application's response to a user request.

You will be given:
1. The user input that initiated the trace
2. The full output(s) generated during the trace

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
- `correct` → the trace achieves the intended goal.
- `incorrect` → the trace fails to achieve the goal or is low quality.
"""
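Before spending judge tokens, it can help to preview the prompt for a single trace. The sketch below fills the placeholders with plain string replacement; `render_preview` is a hypothetical helper, not part of Phoenix Evals, and the template is a shortened stand-in for TRACE_EVAL_PROMPT. Python's `str.format` is avoided on purpose, because it would read the dots in the placeholder names as attribute access.

```python
# Shortened stand-in for TRACE_EVAL_PROMPT, using the same placeholder names.
template = (
    "User Input:\n{attributes.input.value}\n\n"
    "Trace Output:\n{attributes.output.value}\n"
)

def render_preview(template, row):
    """Fill each {column} placeholder with the matching value from row."""
    rendered = template
    for column, value in row.items():
        rendered = rendered.replace("{" + column + "}", str(value))
    return rendered

# One aggregated trace, shaped like a row of trace_df (values are made up).
row = {
    "attributes.input.value": "What is a trace?",
    "attributes.output.value": "A trace is a set of spans.",
}
preview = render_preview(template, row)
print(preview)
```

If a placeholder is still visible in the preview, the corresponding column is missing from the aggregated dataframe.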

4. Run the evaluation

Use Phoenix Evals to run your evaluation:

from phoenix.evals import create_classifier
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

correctness_evaluator = create_classifier(
    name="correctness",
    llm=llm,
    prompt_template=TRACE_EVAL_PROMPT,
    choices={"correct": 1.0, "incorrect": 0.0},
)

results_df = await async_evaluate_dataframe(
    dataframe=trace_df,
    evaluators=[correctness_evaluator],
)
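The top-level `await` above works in a notebook or another environment that already has a running event loop. In a plain Python script, the call needs to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows the pattern with a stand-in coroutine; `fake_evaluate` is hypothetical and only mimics the shape of an async evaluation call.

```python
import asyncio

async def fake_evaluate(rows):
    """Stand-in for an async evaluation call; labels every row 'correct'."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [{"trace_id": r, "label": "correct"} for r in rows]

async def main():
    # In real code this line would be:
    # results_df = await async_evaluate_dataframe(dataframe=trace_df, evaluators=[...])
    return await fake_evaluate(["t1", "t2"])

# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
```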

5. Log the results back to Arize

Merge your evaluation results with the original trace data so that each trace_id is aligned with the span_id of its root span. Prefix the evaluation column names with trace_eval before logging the results:

Coming Soon
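The alignment step can be sketched in pandas. This assumes the root span is the one with a null `parent_id`, that the span ID column is named `context.span_id`, and that evaluation columns follow a `trace_eval.<name>.label` / `trace_eval.<name>.score` pattern; all three are assumptions to check against your export and the Arize logging documentation. The evaluation values are made up for illustration, and the logging call itself is omitted.

```python
import pandas as pd

# Toy span data: one root span (null parent_id) per trace.
primary_df = pd.DataFrame({
    "context.trace_id": ["t1", "t1", "t2"],
    "context.span_id": ["s1", "s2", "s3"],
    "parent_id": [None, "s1", None],
})

# Toy trace-level results indexed by trace ID (illustrative values only).
eval_df = pd.DataFrame(
    {"label": ["correct", "incorrect"], "score": [1.0, 0.0]},
    index=pd.Index(["t1", "t2"], name="context.trace_id"),
)

# Rename result columns with the trace_eval prefix and the evaluator name (assumed pattern).
eval_df = eval_df.rename(columns={
    "label": "trace_eval.correctness.label",
    "score": "trace_eval.correctness.score",
})

# Pick the root span of each trace, then attach the trace-level results to it.
root_spans = primary_df[primary_df["parent_id"].isna()]
to_log = root_spans[["context.trace_id", "context.span_id"]].merge(
    eval_df.reset_index(), on="context.trace_id", how="inner"
)
```

The resulting `to_log` dataframe has one row per trace, keyed by the root span ID, which is the shape the step above describes.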
