Agent Trajectory Evaluations

When an agent tackles a task, it usually takes multiple steps: invoking tools, writing code, making API calls, and reasoning along the way. Even if the final answer is right, a poor sequence of steps can waste time and money or expose users to risk.

This guide shows how to measure the quality of an agent's internal trajectory using Arize Phoenix Evals and log the results back to Arize for monitoring.

A Colab notebook that walks through the complete workflow is available in the Agent Trajectory Evaluation Notebook.


Prerequisites

  1. Agent traces instrumented with the OpenInference schema and sent to Arize

  2. Python 3.10+ and the following packages:

pip install arize arize-phoenix arize-phoenix-evals pandas openai nest_asyncio
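
  3. Arize and OpenAI credentials. The snippets in this guide read them from environment variables; one way to set them (placeholder values, substitute your own) is shown below.

import os

# Placeholder credentials; replace with the values from your Arize space
# and your OpenAI account.
os.environ["ARIZE_SPACE_ID"] = "YOUR_SPACE_ID"
os.environ["ARIZE_API_KEY"]  = "YOUR_ARIZE_API_KEY"
os.environ["ARIZE_MODEL_ID"] = "YOUR_MODEL_ID"
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"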

Implementation

1 Pull trace data from Arize

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

client = ArizeExportClient()

df = client.export_model_to_df(
    space_id = "YOUR_SPACE_ID",
    model_id = "YOUR_MODEL_ID",
    environment = Environments.TRACING,
    start_time = datetime.now(timezone.utc) - timedelta(days=7),
    end_time   = datetime.now(timezone.utc),
)
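
Before filtering, a quick optional sanity check confirms the export contains spans and shows which span kinds are present (the column names below match the export schema used in the rest of this guide):

# Optional sanity check on the exported spans.
print(f"{len(df)} spans across {df['context.trace_id'].nunique()} traces")
print(df["attributes.openinference.span.kind"].value_counts())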

2 Filter to the spans you want to score

Most agents emit many spans (retrieval, LLM calls, DB writes, …). For trajectory scoring we usually care about LLM spans that contain tool calls.

from typing import Dict, Any
import pandas as pd

# A reusable helper that applies both trace-level and span-level filters
from agent_trajectory_utils import filter_spans_by_trace_criteria  # provided in the notebook

trajectory_spans = filter_spans_by_trace_criteria(
    df            = df,
    trace_filters = {"name": {"contains": "searchrouter"}},      # tailor to your app
    span_filters  = {"attributes.openinference.span.kind": {"==": "LLM"}},
)
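
The notebook provides `filter_spans_by_trace_criteria`; its exact implementation may differ, but a minimal sketch of the idea (keep every span whose trace matches the trace-level filters, then apply the span-level filters, supporting only the `contains` and `==` operators used above) could look like this:

from typing import Any, Dict
import pandas as pd

# Illustrative sketch only, not the notebook's exact implementation.
def _matches(series: pd.Series, condition: Dict[str, Any]) -> pd.Series:
    mask = pd.Series(True, index=series.index)
    for op, value in condition.items():
        if op == "contains":
            mask &= series.astype(str).str.contains(value, case=False, na=False)
        elif op == "==":
            mask &= series == value
    return mask

def filter_spans_by_trace_criteria(df, trace_filters, span_filters):
    # Traces where at least one span satisfies every trace-level filter
    trace_mask = pd.Series(True, index=df.index)
    for col, cond in trace_filters.items():
        trace_mask &= _matches(df[col], cond)
    matching_traces = df.loc[trace_mask, "context.trace_id"].unique()

    # Within those traces, keep only spans satisfying the span-level filters
    out = df[df["context.trace_id"].isin(matching_traces)]
    span_mask = pd.Series(True, index=out.index)
    for col, cond in span_filters.items():
        span_mask &= _matches(out[col], cond)
    return out[span_mask]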

3 Extract ordered tool calls for each trace

from agent_trajectory_utils import (
    extract_tool_calls,
    prepare_trace_data_for_evaluation,
)

# Parse `attributes.llm.output_messages` → list of {name, arguments}
trajectory_spans["tool_calls"] = trajectory_spans[
    "attributes.llm.output_messages"
].apply(extract_tool_calls)

# Collapse every trace into a single row that contains its ordered tool calls
trace_df = prepare_trace_data_for_evaluation(
    df = trajectory_spans,
    extract_cols = {
        "tool_calls": "tool_calls",
        "attributes.llm.tools": "attributes.llm.tools",           # reference tool schema
        "attributes.input.value": "attributes.input.value",       # original user input
    },
)
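
`extract_tool_calls` and `prepare_trace_data_for_evaluation` also come from the notebook: the former parses each span's output messages into ordered `{name, arguments}` records, the latter groups spans by `context.trace_id` into one row per trace. The exact message layout depends on how your exporter flattens OpenInference attributes, but a hedged sketch of the parser (key names are assumptions; adapt them to your export) might look like:

import json

# Illustrative sketch. Assumes each output message is a dict whose
# "message.tool_calls" entry is a list of records keyed by
# "tool_call.function.name" and "tool_call.function.arguments".
def extract_tool_calls(output_messages):
    calls = []
    if output_messages is None:
        return calls
    for message in output_messages:
        for tc in message.get("message.tool_calls", []) or []:
            name = tc.get("tool_call.function.name")
            raw_args = tc.get("tool_call.function.arguments", "{}")
            try:
                arguments = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
            except json.JSONDecodeError:
                arguments = raw_args
            calls.append({"name": name, "arguments": arguments})
    return calls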

4 Define the evaluation prompt

The LLM judge receives:

  • {tool_calls} – the actual trajectory (step → tool → arguments); an illustrative example follows this list

  • {attributes.input.value} – the user input that kicked off the trace

  • {attributes.llm.tools} – the JSON schema of available tools

  • (Optional) {reference_outputs} – a golden trajectory you expect (see the optional template extension after the prompt below)
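
For context, here is a hypothetical value of one row's tool_calls column, i.e. what the judge sees in place of {tool_calls} (tool names and arguments are made up for illustration):

# Hypothetical example of one trace's ordered tool calls.
example_tool_calls = [
    {"name": "search_documents", "arguments": {"query": "2023 refund policy"}},
    {"name": "summarize_results", "arguments": {"max_sentences": 3}},
]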

TRAJECTORY_ACCURACY_PROMPT = """
You are a helpful AI bot that checks whether an AI agent's internal trajectory is accurate and effective.

You will be given:
1. The agent's actual trajectory of tool calls
2. The user input that initiated the trajectory
3. The definition of each tool that can be called

An accurate trajectory:
- Progresses logically from step to step
- Uses the right tools for the task
- Is reasonably efficient (no unnecessary detours)

##
Actual Trajectory:
{tool_calls}

User Input:
{attributes.input.value}

Tool Definitions:
{attributes.llm.tools}
##

Respond with **exactly** one word: `correct` or `incorrect`.
- `correct` → trajectory adheres to the rubric and achieves the task.
- `incorrect` → trajectory is confusing, inefficient, or fails the task.
"""

5 Run the evaluation

from phoenix.evals import llm_classify, OpenAIModel
import nest_asyncio, os

nest_asyncio.apply()

model = OpenAIModel(
    api_key = os.environ["OPENAI_API_KEY"],
    model   = "gpt-4o-mini",
    temperature = 0.0,
)

rails = ["correct", "incorrect"]
results = llm_classify(
    dataframe           = trace_df,
    template            = TRAJECTORY_ACCURACY_PROMPT,
    model               = model,
    rails               = rails,
    provide_explanation = True,   # add a free-text rationale for debugging
    verbose             = False,
)
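
llm_classify returns a dataframe aligned to trace_df's index, with a label for each trace and, because provide_explanation=True, a free-text explanation. A quick optional look before logging:

# Spot-check the judge's verdicts and a few explanations.
print(results["label"].value_counts(dropna=False))
print(results[["label", "explanation"]].head())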

6 Log the results back to Arize

Link the evaluation to the root span of each trace so you can slice & dice in the UI.

from arize.pandas.logger import Client

# Merge eval results with original trace data to grab span id
merged = trace_df.merge(results, left_index=True, right_index=True)
merged.rename(
    columns={
        "label": "trace_eval.AgentTrajectoryAccuracy.label",
        "explanation": "trace_eval.AgentTrajectoryAccuracy.explanation",
    },
    inplace=True,
)

root_spans = df[df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
log_df = merged.merge(root_spans, on="context.trace_id", how="left")
log_df.set_index("context.span_id", inplace=True)

arize_client = Client(
    space_id = os.environ["ARIZE_SPACE_ID"],
    api_key  = os.environ["ARIZE_API_KEY"],
)
resp = arize_client.log_evaluations_sync(
    dataframe = log_df,
    model_id  = os.environ["ARIZE_MODEL_ID"],
)
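
The exact return value of log_evaluations_sync can vary by SDK version, so treat the following check as a defensive assumption rather than a guaranteed contract:

# Defensive check (assumption: a requests-style response with status_code).
status = getattr(resp, "status_code", None)
if status is not None and status != 200:
    print(f"Logging may have failed with HTTP status {status}")
else:
    print(f"Submitted {len(log_df)} trajectory evaluations to Arize")

Once logged, the labels appear as trace-level evals on the root spans, where you can filter and aggregate them in the Arize UI.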
