Agent Trajectory Evaluations
When an agent tackles a task, it usually takes multiple steps: invoking tools, writing code, making API calls, and reasoning along the way. Even if the final answer is right, a poor sequence of steps can waste time and money, or expose users to unnecessary risk.
This guide shows how to measure the quality of an agent's internal trajectory using Arize Phoenix Evals and log the results back to Arize for monitoring.
Prerequisites
Traces of your agent, instrumented with the OpenInference schema
Python 3.10+ and the following packages:
pip install arize arize-phoenix arize-phoenix-evals pandas openai nest_asyncio
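If your agent is not instrumented yet, a minimal setup for an OpenAI-based agent might look like the sketch below. It assumes two extra packages, `arize-otel` and `openinference-instrumentation-openai`, plus a hypothetical project name; swap in the instrumentor that matches your stack.
# Minimal sketch: auto-instrument OpenAI calls and ship traces to Arize
import os

from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name="my-agent",  # hypothetical project name
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)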
Implementation
1 Pull trace data from Arize
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

client = ArizeExportClient()

# Export the last seven days of tracing data
df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
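Before filtering, it is worth a quick sanity check that the export contains what you expect. Plain pandas; the column names are the ones used throughout this guide:
# What came back, and which span kinds and names are present?
print(df.shape)
print(df["attributes.openinference.span.kind"].value_counts())
print(df["name"].unique()[:10])  # useful for choosing trace_filters below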
2 Filter to the spans you want to score
Most agents emit many spans (retrieval, LLM calls, DB writes, …). For trajectory scoring we usually care about LLM spans that contain tool calls.
# A reusable helper that applies both trace-level and span-level filters
# (provided in the accompanying notebook)
from agent_trajectory_utils import filter_spans_by_trace_criteria

trajectory_spans = filter_spans_by_trace_criteria(
    df=df,
    trace_filters={"name": {"contains": "searchrouter"}},  # tailor to your app
    span_filters={"attributes.openinference.span.kind": {"==": "LLM"}},
)
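The helper ships with the notebook; if you would rather not depend on it, a minimal stand-in that supports the `contains` and `==` operators used above could look like this. A sketch, not the notebook's exact implementation:
import pandas as pd

def filter_spans_by_trace_criteria(df, trace_filters, span_filters):
    """Keep spans matching span_filters, from traces where any span matches trace_filters."""

    def matches(frame, filters):
        mask = pd.Series(True, index=frame.index)
        for col, ops in filters.items():
            for op, value in ops.items():
                if op == "contains":
                    mask &= frame[col].astype(str).str.contains(value, case=False, na=False)
                elif op == "==":
                    mask &= frame[col] == value
        return mask

    # Trace level: find traces where at least one span satisfies trace_filters
    trace_ids = df.loc[matches(df, trace_filters), "context.trace_id"].unique()
    within = df[df["context.trace_id"].isin(trace_ids)]
    # Span level: keep only the spans of interest inside those traces
    return within[matches(within, span_filters)]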
3 Extract ordered tool calls for each trace
from agent_trajectory_utils import (
    extract_tool_calls,
    prepare_trace_data_for_evaluation,
)

# Parse `attributes.llm.output_messages` into an ordered list of {name, arguments}
trajectory_spans["tool_calls"] = trajectory_spans[
    "attributes.llm.output_messages"
].apply(extract_tool_calls)

# Collapse every trace into a single row that contains its ordered tool calls
trace_df = prepare_trace_data_for_evaluation(
    df=trajectory_spans,
    extract_cols={
        "tool_calls": "tool_calls",
        "attributes.llm.tools": "attributes.llm.tools",      # reference tool schema
        "attributes.input.value": "attributes.input.value",  # original user input
    },
)
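For reference, `extract_tool_calls` is essentially a walk over the OpenInference message structure. A sketch, assuming each output message carries a `message.tool_calls` list with `tool_call.function.name` / `tool_call.function.arguments` keys; the exact flattening can vary between exporter versions:
def extract_tool_calls(output_messages):
    """Return an ordered list of {name, arguments} dicts from OpenInference messages."""
    calls = []
    if output_messages is None:
        return calls
    for message in output_messages:
        for tc in message.get("message.tool_calls") or []:
            calls.append({
                "name": tc.get("tool_call.function.name"),
                "arguments": tc.get("tool_call.function.arguments"),
            })
    return calls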
4 Define the evaluation prompt
The LLM judge receives:
- `{tool_calls}` – the actual trajectory (step → tool → arguments)
- `{attributes.input.value}` – the user input that kicked off the trace
- `{attributes.llm.tools}` – the JSON schema of available tools
- `{reference_outputs}` (optional) – a golden trajectory you expect
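For example, a rendered `{tool_calls}` value for a two-step trace might look like this (hypothetical tool names and arguments):
[
    {"name": "search_web", "arguments": '{"query": "2024 EV tax credit rules"}'},
    {"name": "summarize_results", "arguments": '{"max_sentences": 3}'},
]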
TRAJECTORY_ACCURACY_PROMPT = """
You are a helpful AI bot that checks whether an AI agent's internal trajectory is accurate and effective.
You will be given:
1. The agent's actual trajectory of tool calls
2. The user input that initiated the trajectory
3. The definition of each tool that can be called
An accurate trajectory:
- Progresses logically from step to step
- Uses the right tools for the task
- Is reasonably efficient (no unnecessary detours)
##
Actual Trajectory:
{tool_calls}
User Input:
{attributes.input.value}
Tool Definitions:
{attributes.llm.tools}
##
Respond with **exactly** one word: `correct` or `incorrect`.
- `correct` → trajectory adheres to the rubric and achieves the task.
- `incorrect` → trajectory is confusing, inefficient, or fails the task.
"""
5 Run the evaluation
from phoenix.evals import llm_classify, OpenAIModel
import os

import nest_asyncio

nest_asyncio.apply()  # allow nested event loops (e.g., inside Jupyter)

model = OpenAIModel(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o-mini",
    temperature=0.0,
)

rails = ["correct", "incorrect"]

results = llm_classify(
    dataframe=trace_df,
    template=TRAJECTORY_ACCURACY_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,  # add a free-text rationale for debugging
    verbose=False,
)
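`llm_classify` returns a DataFrame aligned with `trace_df`, with a `label` column and (because `provide_explanation=True`) an `explanation` column. A quick look at the label distribution and one failing rationale helps validate the judge before logging:
# How often does the judge flag trajectories, and why?
print(results["label"].value_counts())
incorrect = results[results["label"] == "incorrect"]
if not incorrect.empty:
    print(incorrect["explanation"].iloc[0])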
6 Log the results back to Arize
Link each evaluation to the root span of its trace so you can slice and dice the results in the Arize UI.
from arize.pandas.logger import Client

# Merge eval results with the original trace data to recover each trace id
merged = trace_df.merge(results, left_index=True, right_index=True)
merged.rename(
    columns={
        "label": "trace_eval.AgentTrajectoryAccuracy.label",
        "explanation": "trace_eval.AgentTrajectoryAccuracy.explanation",
    },
    inplace=True,
)

# Root spans have no parent; they carry the span id we log evals against
root_spans = df[df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
log_df = merged.merge(root_spans, on="context.trace_id", how="left")
log_df.set_index("context.span_id", inplace=True)
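# Sanity check (sketch): the left join can leave rows whose trace had no
# root span in the export window; drop them rather than log a null index
log_df = log_df[log_df.index.notna()]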
arize_client = Client(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
)

resp = arize_client.log_evaluations_sync(
    dataframe=log_df,
    model_id=os.environ["ARIZE_MODEL_ID"],
)