When an agent tackles a task, it usually takes multiple steps: invoking tools, writing code, making API calls, and reasoning along the way. Even if the final answer is right, a poor sequence of steps can waste time and money or expose users to risk. Individual span or trace evaluations check that a single step or response is correct, but they can miss costly mistakes an agent makes between steps. Agent trajectory evaluations instead measure the entire sequence of tool calls an agent takes to solve a task.
1. **Group tool-calling spans per trace** – each tool call (function call) is captured as a span when you instrument with OpenInference.
2. **Send the ordered list of tool calls to an LLM judge** – Phoenix Evals classifies the trajectory as `correct` or `incorrect` (and can produce an explanation).
3. **Log the evaluation back to Arize** – the result is attached to the root span of the trace so you can filter and pivot in the UI.
Most agents emit many spans (retrieval, LLM calls, DB writes, …). For trajectory scoring we usually care about LLM spans that contain tool calls.
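The snippets below assume the spans are already in a pandas DataFrame named `df`. As a minimal sketch, assuming your traces live in Phoenix, you can pull them with `get_spans_dataframe`; if they live in Arize, export them to a DataFrame with the Arize export client instead, and the rest of the workflow is identical.

import phoenix as px

# Pull every span in the project into a DataFrame; the next step filters it
# down to the tool-calling LLM spans we care about.
df = px.Client().get_spans_dataframe()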
from typing import Dict, Any

import pandas as pd

# A reusable helper that applies both trace-level and span-level filters
from agent_trajectory_utils import filter_spans_by_trace_criteria  # provided in the notebook

trajectory_spans = filter_spans_by_trace_criteria(
    df=df,
    trace_filters={"name": {"contains": "searchrouter"}},  # tailor to your app
    span_filters={"attributes.openinference.span.kind": {"==": "LLM"}},
)
from agent_trajectory_utils import (
    extract_tool_calls,
    prepare_trace_data_for_evaluation,
)

# Parse `attributes.llm.output_messages` → list of {name, arguments}
trajectory_spans["tool_calls"] = trajectory_spans[
    "attributes.llm.output_messages"
].apply(extract_tool_calls)

# Collapse every trace into a single row that contains its ordered tool calls
trace_df = prepare_trace_data_for_evaluation(
    df=trajectory_spans,
    extract_cols={
        "tool_calls": "tool_calls",
        "attributes.llm.tools": "attributes.llm.tools",  # reference tool schema
        "attributes.input.value": "attributes.input.value",  # original user input
    },
)
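`extract_tool_calls` and `prepare_trace_data_for_evaluation` ship with the companion notebook. For intuition, here is a hypothetical sketch of what the parsing step might look like, assuming OpenInference's flattened message layout (`message.tool_calls` entries carrying `tool_call.function.name` / `tool_call.function.arguments`); the real helper handles more edge cases.

def extract_tool_calls_sketch(output_messages) -> list:
    """Return the ordered [{"name": ..., "arguments": ...}] tool calls in a span."""
    calls = []
    if not isinstance(output_messages, (list, tuple)):
        return calls  # span produced no output messages
    for message in output_messages:
        # Assumed key names follow the flattened OpenInference message schema
        for tool_call in message.get("message.tool_calls") or []:
            calls.append(
                {
                    "name": tool_call.get("tool_call.function.name"),
                    "arguments": tool_call.get("tool_call.function.arguments"),
                }
            )
    return calls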
- {tool_calls} – the actual trajectory (step → tool → arguments)
- {attributes.input.value} – the user input that kicked off the trace
- {attributes.llm.tools} – the JSON schema of available tools
- (Optional) {reference_outputs} – a golden trajectory you expect
TRAJECTORY_ACCURACY_PROMPT = """
You are a helpful AI bot that checks whether an AI agent's internal trajectory is accurate and effective.

You will be given:
1. The agent's actual trajectory of tool calls
2. The user input that initiated the trajectory
3. The definition of each tool that can be called

An accurate trajectory:
- Progresses logically from step to step
- Uses the right tools for the task
- Is reasonably efficient (no unnecessary detours)

##

Actual Trajectory:
{tool_calls}

User Input:
{attributes.input.value}

Tool Definitions:
{attributes.llm.tools}

##

Respond with **exactly** one word: `correct` or `incorrect`.
- `correct` → trajectory adheres to the rubric and achieves the task.
- `incorrect` → trajectory is confusing, inefficient, or fails the task.
"""
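With the template and the per-trace DataFrame in place, step 2 runs the judge. A minimal sketch using Phoenix Evals' `llm_classify`, assuming an OpenAI judge model (swap in whichever provider and model you use):

from phoenix.evals import OpenAIModel, llm_classify

rails = ["correct", "incorrect"]  # the only labels the judge may emit

trajectory_eval_df = llm_classify(
    dataframe=trace_df,                  # one row per trace, built above
    template=TRAJECTORY_ACCURACY_PROMPT,
    model=OpenAIModel(model="gpt-4o"),   # assumed judge model; use your own
    rails=rails,
    provide_explanation=True,            # adds an `explanation` column
)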
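Step 3 attaches the labels back to the traces so you can filter and pivot in the UI. The exact call depends on where your spans live; as one hedged example with the Phoenix client, assuming the per-trace DataFrame retained each trace's root span id in a `context.span_id` column (a hypothetical column name), and noting that the Arize pandas logger offers an analogous evaluation-logging path:

import phoenix as px
from phoenix.trace import SpanEvaluations

# SpanEvaluations expects the evaluation DataFrame to be indexed by span id;
# here we assume `trace_df` kept the root span id of each trace.
trajectory_eval_df.index = trace_df["context.span_id"]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Trajectory Accuracy", dataframe=trajectory_eval_df)
)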