The most common way to evaluate an LLM application is end-to-end: take the user’s request, take the final answer, and judge whether the answer is good. That works for a single model call. It breaks down for agents, because an agent isn’t one call — it’s a sequence of decisions: which tools to call, in what order, with what arguments, and how to reason over the results. End-to-end eval only inspects the two endpoints — the input and the final output. The interesting failures happen in the middle, where you can’t see them:
  • Reasoning — did the agent reason coherently toward the goal, or get there by luck?
  • Tool selection — did it pick the right tools (and skip the wrong ones)?
  • Decision path — did it call those tools in a sensible order, with sensible arguments?
A correct-looking answer can come from a broken path — the right result for the wrong reasons, which fails the next time the inputs shift. This guide shows how to capture the full trace, reconstruct the agent’s intermediate signals from its spans, and evaluate those — using a movie recommendation agent as the worked example. We’ll go through the following steps:
  • Reconstruct an agent’s decision path (tool_path) and tool I/O (tool_io) from its spans
  • Run an endpoint check (recommendation relevance) alongside two intermediate checks (decision path and reasoning/support)
  • See each evaluator catch a distinct failure the endpoint check misses
  • Log trace-level evaluations back to Arize AX

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or watch the video above:

Trace level evals

After configuring tracing with arize.otel.register(...) and building a movie recommendation agent with three tools (movie_selector_llm, reviewer_llm, preview_summarizer_llm), run it against a handful of questions to generate traces. See the notebook for the full agent and tool definitions.

Get Span Data from Arize AX

Export your traces from Arize AX so we can reconstruct what each evaluator needs:
import os
from datetime import datetime, timedelta, timezone

from arize.client import ArizeClient

ax_client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])

# model_id holds the name of the project the agent's traces were sent to.
primary_df = ax_client.spans.export_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name=model_id,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)

Separate the endpoints from the trace

First pull the endpoints — the user’s question and the agent’s final answer. Take the final answer only, not a concatenation of every span’s output: folding in tool outputs would blur the line between “what the user saw” and “what happened inside the trace.”
With the OpenAI Agents instrumentation, the root AGENT span doesn’t record input.value / output.value — those attributes live on the underlying LLM spans, so we read the question and final reply from there with two small helpers. With an instrumentation that populates the root span, you could read attributes.input.value / attributes.output.value off the agent root directly.
import json
import pandas as pd

SPAN_KIND = "attributes.openinference.span.kind"


def _as_list(value):
    """Span attributes can arrive as a JSON string or an already-parsed list."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return []
    return value if isinstance(value, list) else []


def user_query(input_messages):
    """The first user message in an LLM span's input."""
    for message in _as_list(input_messages):
        if message.get("message.role") == "user":
            contents = message.get("message.contents") or []
            return message.get("message.content") or "".join(
                c.get("message_content.text", "") for c in contents
            )
    return None


def assistant_text(output_messages):
    """The assistant's text reply in an LLM span (empty on tool-calling turns)."""
    texts = []
    for message in _as_list(output_messages):
        if message.get("message.role") == "assistant":
            if message.get("message.content"):
                texts.append(message["message.content"])
            for c in message.get("message.contents") or []:
                if c.get("message_content.type") == "text":
                    texts.append(c.get("message_content.text", ""))
    return " ".join(t for t in texts if t).strip() or None


# Restrict to the agent's own traces: each has a root span (no parent) of kind
# AGENT. This keeps the evaluators' own LLM calls out of the evaluation set.
agent_roots = primary_df[(primary_df["parent_id"].isna()) & (primary_df[SPAN_KIND] == "AGENT")]
agent_trace_ids = agent_roots["context.trace_id"].unique()

llm_spans = primary_df[
    (primary_df[SPAN_KIND] == "LLM") & (primary_df["context.trace_id"].isin(agent_trace_ids))
].sort_values("start_time")

# Endpoint input  = the user's question (first user message).
# Endpoint output = the agent's FINAL answer only (its last text turn).
trace_df = pd.DataFrame(
    {
        "input": llm_spans.groupby("context.trace_id")["attributes.llm.input_messages"].apply(
            lambda s: next((q for q in s.map(user_query) if q), None)
        ),
        "output": llm_spans.groupby("context.trace_id")["attributes.llm.output_messages"].apply(
            lambda s: next((a for a in reversed(list(s.map(assistant_text))) if a), None)
        ),
    }
).dropna(subset=["input", "output"])
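To make the endpoint rule concrete, here is a minimal standard-library sketch of the same selection applied to plain Python lists instead of a DataFrame. The span dictionaries, their keys (`trace_id`, `start`, `question`, `reply`), and the function name `endpoints_by_trace` are hypothetical stand-ins for the exported columns, not part of the Arize AX schema.

```python
from collections import defaultdict

# Hypothetical, already-parsed LLM spans: one dict per span.
# "question" is the first user message found in the span's input (or None);
# "reply" is the assistant's text output (None on tool-calling turns).
llm_spans = [
    {"trace_id": "t1", "start": 1, "question": "Recommend a sci-fi movie", "reply": None},
    {"trace_id": "t1", "start": 2, "question": "Recommend a sci-fi movie", "reply": None},
    {"trace_id": "t1", "start": 3, "question": "Recommend a sci-fi movie", "reply": "Try Arrival."},
    {"trace_id": "t2", "start": 1, "question": "Best comedy?", "reply": None},
]


def endpoints_by_trace(spans):
    """Return {trace_id: (input, output)} for traces that have both endpoints."""
    grouped = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["start"]):
        grouped[span["trace_id"]].append(span)

    endpoints = {}
    for trace_id, ordered in grouped.items():
        # Input: the first non-empty user question in start-time order.
        question = next((s["question"] for s in ordered if s["question"]), None)
        # Output: the LAST non-empty assistant text, i.e. the final answer.
        answer = next((s["reply"] for s in reversed(ordered) if s["reply"]), None)
        if question and answer:  # mirrors dropna(subset=["input", "output"])
            endpoints[trace_id] = (question, answer)
    return endpoints


result = endpoints_by_trace(llm_spans)
print(result)
```

Trace `t2` never produced a text reply, so it is dropped, just as a trace with a missing endpoint is dropped from `trace_df`.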

Reconstruct the intermediate signals

To evaluate the agent’s process, reconstruct two signals the endpoints never show, both from the trace’s TOOL spans (sorted by start_time):
  • tool_path — the ordered tool calls the agent made, with their arguments. This is the decision path: tool selection, order, and the arguments each tool was called with.
  • tool_io — what each tool was called with and what it returned, so an evaluator can check whether the final answer is grounded in real tool results.
tool_spans = primary_df[primary_df[SPAN_KIND] == "TOOL"].sort_values("start_time")


# tool_path: the ordered tool calls WITH their arguments (selection, order, args).
def format_decision_path(group):
    return " -> ".join(
        f"{row['name']}({row['attributes.input.value']})" for _, row in group.iterrows()
    )


trace_df["tool_path"] = (
    tool_spans.groupby("context.trace_id")[["name", "attributes.input.value"]]
    .apply(format_decision_path)
    .reindex(trace_df.index)
    .fillna("No tools called")
)


# tool_io: each tool call's input AND output.
def format_tool_calls(group):
    lines = []
    for i, (_, row) in enumerate(group.iterrows(), start=1):
        lines.append(
            f"{i}. {row['name']} | input: {row['attributes.input.value']} "
            f"| output: {row['attributes.output.value']}"
        )
    return "\n".join(lines)


trace_df["tool_io"] = (
    tool_spans.groupby("context.trace_id")[["name", "attributes.input.value", "attributes.output.value"]]
    .apply(format_tool_calls)
    .reindex(trace_df.index)
    .fillna("No tools called")
)
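As an illustration of what the two columns end up holding, the following sketch applies the same formatting to a hand-written list of tool spans. The sample inputs and outputs are invented for illustration, and the short keys (`name`, `input`, `output`) are hypothetical stand-ins for `name`, `attributes.input.value`, and `attributes.output.value`.

```python
# Hypothetical TOOL spans for one trace, already sorted by start time.
tool_spans = [
    {"name": "movie_selector_llm", "input": '{"genre": "sci-fi"}', "output": "Arrival, Dune"},
    {"name": "reviewer_llm", "input": '{"movies": ["Arrival", "Dune"]}', "output": "1. Arrival 2. Dune"},
]


def format_decision_path(spans):
    """Ordered tool calls with their arguments, joined by arrows."""
    return " -> ".join(f"{s['name']}({s['input']})" for s in spans)


def format_tool_calls(spans):
    """One numbered line per tool call with its input and output."""
    return "\n".join(
        f"{i}. {s['name']} | input: {s['input']} | output: {s['output']}"
        for i, s in enumerate(spans, start=1)
    )


# Traces without tool spans get the same placeholder the DataFrame code uses.
tool_path = format_decision_path(tool_spans) if tool_spans else "No tools called"
tool_io = format_tool_calls(tool_spans) if tool_spans else "No tools called"
print(tool_path)
print(tool_io)
```

The decision path carries only names and arguments, while the tool I/O string also carries each output. That split is what lets the decision-path judge ignore tool results and the grounding judge inspect them.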

Define the three evaluators

Each evaluator reads a different column and answers a different question:
  1. Relevance — an endpoint check on input + output (exactly what end-to-end eval does).
  2. Decision path — an intermediate check on input + tool_path (right tools, right order, sensible arguments?).
  3. Reasoning / support — an intermediate check on input + tool_io + output (is the answer grounded in the actual tool results, or does it invent facts no tool produced?).
Each prompt is written to judge only its own concern. See the notebook for the full DECISION_PATH, RECOMMENDATION_RELEVANCE, and REASONING_SUPPORT templates; the decision-path judge, for example, never sees the final answer:
DECISION_PATH = """
You are evaluating an agent's DECISION PATH: the ordered tool calls it made to
answer a request — which tools, in what order, and with what arguments. You are
NOT judging the final answer text, and you are NOT judging whether each tool's
output was correct — only whether the agent's choices were sensible.

The agent has three tools available:
- movie_selector_llm(genre): returns candidate movies. Must come FIRST, because the
  other tools operate on the selected movies.
- reviewer_llm(movies): reviews and sorts movies. Only meaningful AFTER selection,
  and should be called with the movies that were actually selected.
- preview_summarizer_llm(movie): summarizes a movie. Only meaningful AFTER selection.

You will be given:
1. The user input that initiated the trace
2. The ordered tool calls the agent executed, with their arguments

##
User Input:
{input}

Decision Path (ordered tool calls with arguments):
{tool_path}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` ->
- movie_selector_llm is called before reviewer_llm or preview_summarizer_llm, AND
- each tool is called with sensible arguments (e.g. reviewer_llm receives the
  movies that were selected, not an empty or unrelated list).
2. `incorrect` ->
- a tool that operates on movies (reviewer_llm / preview_summarizer_llm) runs
  before any movies have been selected, the selection step is missing, OR a tool is
  called with nonsensical arguments.
"""
Run all three judges with arize-phoenix-evals. Each ClassificationEvaluator reads the columns named in its template and returns a Score (label + numeric score + explanation). Wrap the run in suppress_tracing() so the judges’ own LLM calls don’t get traced back into the same project:
from phoenix.evals import LLM, ClassificationEvaluator, async_evaluate_dataframe
from openinference.instrumentation import suppress_tracing
import nest_asyncio

nest_asyncio.apply()

judge_llm = LLM(provider="openai", model="gpt-4.1-mini")

relevance_evaluator = ClassificationEvaluator(
    name="relevance", llm=judge_llm, prompt_template=RECOMMENDATION_RELEVANCE,
    choices={"correct": 1.0, "incorrect": 0.0},
)
path_evaluator = ClassificationEvaluator(
    name="decision_path", llm=judge_llm, prompt_template=DECISION_PATH,
    choices={"correct": 1.0, "incorrect": 0.0},
)
reasoning_evaluator = ClassificationEvaluator(
    name="reasoning", llm=judge_llm, prompt_template=REASONING_SUPPORT,
    choices={"correct": 1.0, "incorrect": 0.0},
)
EVALUATORS = [relevance_evaluator, path_evaluator, reasoning_evaluator]

with suppress_tracing():
    results_df = await async_evaluate_dataframe(dataframe=trace_df, evaluators=EVALUATORS)
Each evaluator writes a <name>_score column holding a Score dict (label, score, explanation) — so the columns above are relevance_score, decision_path_score, and reasoning_score.
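Once the score columns exist, a per-evaluator pass rate is a quick first summary. The sketch below assumes each cell is a plain dict with `label`, `score`, and `explanation` keys, as described above; depending on the library version, cells may instead be objects or JSON strings and would need unpacking first. The function name `pass_rate` and the sample rows are hypothetical.

```python
def pass_rate(scores):
    """Mean of the numeric scores, skipping cells with no usable score."""
    values = [
        s["score"]
        for s in scores
        if isinstance(s, dict) and s.get("score") is not None
    ]
    return sum(values) / len(values) if values else None


# Hypothetical contents of a `decision_path_score` column.
decision_path_scores = [
    {"label": "correct", "score": 1.0, "explanation": "Selector ran first."},
    {"label": "incorrect", "score": 0.0, "explanation": "Reviewer ran before selection."},
    {"label": "correct", "score": 1.0, "explanation": "Order and arguments look sensible."},
    None,  # e.g. a judge call that returned nothing
]

rate = pass_rate(decision_path_scores)
print(f"decision_path pass rate: {rate:.2f}")
```

Because the choices map `correct` to 1.0 and `incorrect` to 0.0, the mean score is the fraction of traces the judge labelled `correct`.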

Seeing what each evaluator catches

On well-behaved traces the three evaluators agree. The point of trace-level evaluation is what happens when they don’t. Running the three judges on controlled cases that mirror the real schema — each with a relevant-looking answer, so the endpoint check passes every time — shows each intermediate check catching a distinct failure:
| case | relevance (endpoint) | decision path | reasoning / support |
| --- | --- | --- | --- |
| clean run | correct | correct | correct |
| broken order (reviews before selecting) | correct | incorrect | correct |
| unsupported claim (cites a rating no tool returned) | correct | correct | incorrect |
Two intermediate failures, two different lenses — and both invisible to the endpoint check.
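The broken-order case can also be caught without an LLM. As a complement to the decision-path judge (not a replacement, and not part of the original notebook), here is a minimal deterministic sketch of the ordering rule from the DECISION_PATH rubric. It checks order only; whether the arguments are sensible still needs the judge. The function name `check_tool_order` and the two constants are hypothetical.

```python
SELECTOR = "movie_selector_llm"
DEPENDENT_TOOLS = {"reviewer_llm", "preview_summarizer_llm"}


def check_tool_order(tool_names):
    """Return 'correct' if selection happens before any dependent tool runs."""
    selected = False
    for name in tool_names:
        if name == SELECTOR:
            selected = True
        elif name in DEPENDENT_TOOLS and not selected:
            return "incorrect"  # a dependent tool ran before any selection
    # A path with no selection step at all is also incorrect under the rubric.
    return "correct" if selected else "incorrect"


clean_run = ["movie_selector_llm", "reviewer_llm", "preview_summarizer_llm"]
broken_order = ["reviewer_llm", "movie_selector_llm"]

print(check_tool_order(clean_run))
print(check_tool_order(broken_order))
```

A rule like this is cheap enough to run on every trace, and any disagreement between it and the LLM judge is a useful prompt to inspect the judge's explanation.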

Log Results Back to Arize AX

Attach each trace’s three evaluations to its root span, using the trace_eval.<name>.label/score/explanation column convention that Arize AX expects:
import json


def _unpack(cell):
    """Phoenix returns each evaluator's result as a Score dict in `<name>_score`."""
    if isinstance(cell, str):
        try:
            cell = json.loads(cell)
        except json.JSONDecodeError:
            return None, None, None
    if isinstance(cell, dict):
        return cell.get("label"), cell.get("score"), cell.get("explanation")
    return getattr(cell, "label", None), getattr(cell, "score", None), getattr(cell, "explanation", None)


# Map each evaluator's score column to the Arize AX eval name shown in the UI.
EVAL_COLUMNS = {
    "RecommendationRelevance": "relevance_score",
    "DecisionPath": "decision_path_score",
    "ReasoningSupport": "reasoning_score",
}

eval_df = pd.DataFrame()
for ax_name, score_col in EVAL_COLUMNS.items():
    unpacked = results_df[score_col].apply(_unpack)
    eval_df[f"trace_eval.{ax_name}.label"] = unpacked.map(lambda t: t[0]).values
    eval_df[f"trace_eval.{ax_name}.score"] = unpacked.map(lambda t: t[1]).values
    eval_df[f"trace_eval.{ax_name}.explanation"] = unpacked.map(lambda t: t[2]).values
eval_df["context.trace_id"] = trace_df.index.values

# Map each trace to its root (AGENT) span so the evals attach to the root span.
# update_evaluations needs a context.span_id column (drop=False keeps it).
root_spans = agent_roots[["context.trace_id", "context.span_id"]]
log_df = eval_df.merge(root_spans, on="context.trace_id", how="inner").set_index(
    "context.span_id", drop=False
)

# Reuse the ArizeClient from the export step.
resp = ax_client.spans.update_evaluations(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name=model_id,
    dataframe=log_df,
)
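The column convention is easiest to see on a single row. This sketch builds the three `trace_eval.<name>.*` keys for one evaluator from one Score dict; the helper name `to_trace_eval_columns`, the span ID, and the sample score are hypothetical.

```python
def to_trace_eval_columns(ax_name, score):
    """Flatten one Score dict into the trace_eval.<name>.* column names."""
    prefix = f"trace_eval.{ax_name}"
    return {
        f"{prefix}.label": score.get("label"),
        f"{prefix}.score": score.get("score"),
        f"{prefix}.explanation": score.get("explanation"),
    }


# Hypothetical root span ID and judge result for one trace.
row = {"context.span_id": "span-123"}
row.update(
    to_trace_eval_columns(
        "DecisionPath",
        {"label": "incorrect", "score": 0.0, "explanation": "Reviewer ran before selection."},
    )
)

for column, value in row.items():
    print(f"{column}: {value}")
```

Each evaluator contributes three columns per row, so a trace scored by all three judges carries nine `trace_eval.*` columns alongside its `context.span_id`.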

View Results in Arize AX

After logging the evaluations, you can view the results in the Traces tab of your Arize AX project. Each trace’s root span carries all three labels — RecommendationRelevance, DecisionPath, and ReasoningSupport — with the judge’s score and explanation, so you can:
  • Monitor trace-level performance metrics
  • Spot answers that look right but reached the goal through a broken path
  • Track recommendation quality, decision-path correctness, and grounding side by side
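The same "looks right, broken path" filter can be applied offline to the labels before or after logging. The sketch below is a minimal illustration, assuming labels have already been unpacked into plain strings. The trace names mirror the three controlled cases from the comparison earlier, and `hidden_failures` is a hypothetical helper name.

```python
# Labels per trace, mirroring the three controlled cases.
labels = {
    "clean run": {"relevance": "correct", "decision_path": "correct", "reasoning": "correct"},
    "broken order": {"relevance": "correct", "decision_path": "incorrect", "reasoning": "correct"},
    "unsupported claim": {"relevance": "correct", "decision_path": "correct", "reasoning": "incorrect"},
}

INTERMEDIATE = ("decision_path", "reasoning")


def hidden_failures(labels_by_trace):
    """Traces where the endpoint check passes but an intermediate check fails."""
    hidden = {}
    for trace, row in labels_by_trace.items():
        failed = [name for name in INTERMEDIATE if row[name] == "incorrect"]
        if row["relevance"] == "correct" and failed:
            hidden[trace] = failed
    return hidden


hidden = hidden_failures(labels)
for trace, failed in hidden.items():
    print(f"{trace}: endpoint passed, failed {', '.join(failed)}")
```

These are the traces worth opening first, since an endpoint-only dashboard would report all of them as passing.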

Takeaway

We ran three evaluators over the same traces, each asking a different question and reading a different signal:
  • relevance (endpoint): did the final answer look right? Reads only input and output.
  • decision path (intermediate): did the agent pick the right tools, in the right order, with sensible arguments? Reads input and tool_path, never the final answer.
  • reasoning / support (intermediate): is the answer grounded in what the tools returned? Reads input, tool_io, and output.
The lesson: a good-looking answer can hide a broken process, and only intermediate evals — reading signals reconstructed from spans — can see it. The pattern generalizes: for any intermediate step you care about, reconstruct the relevant signal from spans into a column, write a judge that reads that column, and run it alongside your endpoint eval.
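As a sketch of that pattern for a concern this guide does not cover, suppose you care about redundant tool calls. The signal (`tool_counts`), the template (`REDUNDANT_CALLS`), and the helper name are all hypothetical; only the shape of the recipe comes from the guide: reconstruct a signal, then give a judge prompt a placeholder that reads it.

```python
from collections import Counter

# Hypothetical judge template for a new intermediate concern: redundant calls.
REDUNDANT_CALLS = """
You are evaluating whether an agent repeated tool calls unnecessarily.

User Input:
{input}

Tool call counts:
{tool_counts}

Respond with exactly one word: `correct` or `incorrect`.
"""


def tool_counts_signal(tool_names):
    """Reconstruct a 'tool_counts' signal: how often each tool was called."""
    counts = Counter(tool_names)
    if not counts:
        return "No tools called"
    return ", ".join(f"{name} x{n}" for name, n in sorted(counts.items()))


signal = tool_counts_signal(["movie_selector_llm", "reviewer_llm", "reviewer_llm"])
prompt = REDUNDANT_CALLS.format(input="Recommend a sci-fi movie", tool_counts=signal)
print(signal)
```

In a real run the signal would be written to a `tool_counts` column of the trace DataFrame and the template passed to another evaluator next to the existing three.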