Agent Trajectory Evaluations
Evaluate and monitor the quality of an agent's step-by-step tool-calling trajectory across traces.
Individual span or trace evaluations check that one step or response is correct, but they can miss costly mistakes an agent makes between steps. Agent trajectory evaluations measure the entire sequence of tool calls an agent takes to solve a task.
Tracking trajectory quality helps you:
- Detect loops, unnecessary steps, or wrong tools that inflate cost and latency
- Validate that the agent follows an expected "golden path"
- Compare different agent versions or prompt strategies
- Debug outlier traces directly in Arize

How It Works
1. Group tool-calling spans per trace – each tool call (function call) is captured as a span when you instrument with OpenInference.
2. Send the ordered list of tool calls to an LLM judge – Phoenix Evals classifies the trajectory as `correct` or `incorrect` (and can produce an explanation).
3. Log the evaluation back to Arize – the result is attached to the root span of the trace so you can filter and pivot in the UI.
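The grouping in step 1 can be sketched in plain Python. This is a minimal illustration, not the real OpenInference schema: the span records, field names (`trace_id`, `span_kind`, `start`), and tool names below are all hypothetical stand-ins for what instrumentation would actually emit.

```python
from collections import defaultdict

# Toy span records; real spans come from OpenInference instrumentation
# (field names here are illustrative, not the exact schema).
spans = [
    {"trace_id": "t1", "span_kind": "LLM", "name": "chat", "start": 0},
    {"trace_id": "t1", "span_kind": "TOOL", "name": "search_docs", "start": 1},
    {"trace_id": "t1", "span_kind": "TOOL", "name": "summarize", "start": 2},
    {"trace_id": "t2", "span_kind": "TOOL", "name": "search_docs", "start": 0},
]

def tool_trajectories(spans):
    """Group TOOL spans by trace and order each group by start time."""
    by_trace = defaultdict(list)
    for s in spans:
        if s["span_kind"] == "TOOL":
            by_trace[s["trace_id"]].append(s)
    return {
        tid: [s["name"] for s in sorted(group, key=lambda s: s["start"])]
        for tid, group in by_trace.items()
    }

trajectories = tool_trajectories(spans)
# One ordered tool-call list per trace, ready to render into a judge prompt:
prompt = "Tool calls, in order:\n" + "\n".join(trajectories["t1"])
```

Each ordered list would then be formatted into the judge's prompt template (step 2); the judge's `correct`/`incorrect` label is what gets logged back to the trace's root span (step 3).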
Getting Started
Follow the step-by-step implementation (with full code) in our Cookbook guide: