Agent Trajectory Evaluations

Evaluate and monitor the quality of an agent's step-by-step tool-calling trajectory across traces.

Individual span or trace evaluations check that one step or response is correct, but they can miss costly mistakes an agent makes between steps. Agent trajectory evaluations measure the entire sequence of tool calls an agent takes to solve a task.

Tracking trajectory quality helps you:

  • Detect loops, unnecessary steps, or wrong tools that inflate cost and latency

  • Validate that the agent follows an expected "golden path"

  • Compare different agent versions or prompt strategies

  • Debug outlier traces directly in Arize
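As an illustration of the first point, some trajectory problems can be flagged with a cheap heuristic before any LLM judge runs. The window-based loop check below is a hypothetical sketch (not an Arize or Phoenix feature): it flags a trajectory when the same short pattern of tool calls repeats several times in a row.

```python
def has_loop(tool_calls, window=2, repeats=3):
    """Flag a trajectory whose tool calls repeat the same pattern
    (up to `window` calls long) at least `repeats` times in a row,
    e.g. search -> fetch -> search -> fetch -> search -> fetch."""
    for size in range(1, window + 1):
        for start in range(len(tool_calls) - size * repeats + 1):
            pattern = tool_calls[start:start + size]
            if all(
                tool_calls[start + k * size : start + (k + 1) * size] == pattern
                for k in range(repeats)
            ):
                return True
    return False

# A stuck agent alternating between the same two tools is flagged;
# a short, varied trajectory is not.
print(has_loop(["search", "fetch", "search", "fetch", "search", "fetch"]))
print(has_loop(["search", "fetch", "answer"]))
```

A check like this is a useful pre-filter: traces it flags are strong candidates for the LLM-judged evaluation described below.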

[Screenshot: trajectory evaluation labels in Arize]


How It Works

  1. Group tool-calling spans per trace – each tool call (function call) is captured as a span when you instrument with OpenInference.

  2. Send the ordered list of tool calls to an LLM judge – Phoenix Evals classifies the trajectory as correct or incorrect (and can produce an explanation).

  3. Log the evaluation back to Arize – the result is attached to the root span of the trace so you can filter and pivot in the UI.
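The three steps above can be sketched in plain Python. Everything here is an illustrative stand-in: the span dictionaries approximate what OpenInference instrumentation captures, and the `judge` callable represents the Phoenix Evals LLM classification step (in practice an LLM call that returns `"correct"` or `"incorrect"`); field names and helpers are assumptions, not the actual APIs.

```python
from collections import defaultdict

def group_tool_calls(spans):
    """Step 1: group tool-call spans per trace, ordered by start time."""
    by_trace = defaultdict(list)
    for span in spans:
        if span["kind"] == "TOOL":
            by_trace[span["trace_id"]].append(span)
    return {
        trace_id: [s["tool_name"] for s in sorted(calls, key=lambda s: s["start_time"])]
        for trace_id, calls in by_trace.items()
    }

def build_judge_prompt(question, trajectory):
    """Step 2: format the ordered tool calls for the LLM judge."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(trajectory, 1))
    return (
        "Given the user question and the agent's ordered tool calls, "
        "answer 'correct' or 'incorrect'.\n"
        f"Question: {question}\nTool calls:\n{steps}"
    )

def evaluate_trajectories(spans, question, judge):
    """Steps 2-3: classify each trajectory; the resulting label is what
    gets logged back to Arize against the trace's root span."""
    results = {}
    for trace_id, trajectory in group_tool_calls(spans).items():
        label = judge(build_judge_prompt(question, trajectory))
        results[trace_id] = {"trajectory": trajectory, "label": label}
    return results

# Minimal usage with a stubbed judge in place of a real LLM call:
spans = [
    {"trace_id": "t1", "kind": "LLM",  "tool_name": None,     "start_time": 0},
    {"trace_id": "t1", "kind": "TOOL", "tool_name": "fetch",  "start_time": 1},
    {"trace_id": "t1", "kind": "TOOL", "tool_name": "search", "start_time": 2},
]
print(evaluate_trajectories(spans, "Find the report", lambda prompt: "correct"))
```

Note that non-tool spans (the `"LLM"` span here) are excluded, so the judge sees only the ordered tool-calling trajectory, and sorting by start time preserves the order the agent actually executed.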

Getting Started

Follow the step-by-step implementation, with full code, in our Cookbook guide.
