Agent Trajectory Evaluations

Evaluate and monitor the quality of an agent's step-by-step tool-calling trajectory across traces.

Individual span or trace evaluations check that one step or response is correct, but they can miss costly mistakes an agent makes between steps. Agent trajectory evaluations measure the entire sequence of tool calls an agent takes to solve a task.

Tracking trajectory quality helps you:

  • Detect loops, unnecessary steps, or wrong tools that inflate cost and latency

  • Validate that the agent follows an expected "golden path"

  • Compare different agent versions or prompt strategies

  • Debug outlier traces directly in Arize
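As an illustration of the first point, some trajectory problems can be flagged with a cheap heuristic before any LLM judge runs. The window-based loop check below is a hypothetical sketch (not an Arize or Phoenix feature): it flags a trajectory when the same short pattern of tool calls repeats several times in a row.

```python
def has_loop(tool_calls, window=2, repeats=3):
    """Flag a trajectory whose tool calls repeat the same pattern
    (up to `window` calls long) at least `repeats` times in a row,
    e.g. search -> fetch -> search -> fetch -> search -> fetch."""
    for size in range(1, window + 1):
        for start in range(len(tool_calls) - size * repeats + 1):
            pattern = tool_calls[start:start + size]
            if all(
                tool_calls[start + k * size : start + (k + 1) * size] == pattern
                for k in range(repeats)
            ):
                return True
    return False

# A stuck agent alternating between the same two tools is flagged;
# a short, varied trajectory is not.
print(has_loop(["search", "fetch", "search", "fetch", "search", "fetch"]))
print(has_loop(["search", "fetch", "answer"]))
```

A check like this is a useful pre-filter: traces it flags are strong candidates for the LLM-judged evaluation described below.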

[Screenshot: trajectory evaluation labels in Arize]


How It Works

  1. Group tool-calling spans per trace – each tool call (function call) is captured as a span when you instrument with OpenInference.

  2. Send the ordered list of tool calls to an LLM judge – Phoenix Evals classifies the trajectory as correct or incorrect (and can produce an explanation).

  3. Log the evaluation back to Arize – the result is attached to the root span of the trace so you can filter and pivot in the UI.
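The three steps above can be sketched in plain Python. Everything here is an illustrative stand-in: the span dictionaries approximate what OpenInference instrumentation captures, and the `judge` callable represents the Phoenix Evals LLM classification step (in practice an LLM call that returns `"correct"` or `"incorrect"`); field names and helpers are assumptions, not the actual APIs.

```python
from collections import defaultdict

def group_tool_calls(spans):
    """Step 1: group tool-call spans per trace, ordered by start time."""
    by_trace = defaultdict(list)
    for span in spans:
        if span["kind"] == "TOOL":
            by_trace[span["trace_id"]].append(span)
    return {
        trace_id: [s["tool_name"] for s in sorted(calls, key=lambda s: s["start_time"])]
        for trace_id, calls in by_trace.items()
    }

def build_judge_prompt(question, trajectory):
    """Step 2: format the ordered tool calls for the LLM judge."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(trajectory, 1))
    return (
        "Given the user question and the agent's ordered tool calls, "
        "answer 'correct' or 'incorrect'.\n"
        f"Question: {question}\nTool calls:\n{steps}"
    )

def evaluate_trajectories(spans, question, judge):
    """Steps 2-3: classify each trajectory; the resulting label is what
    gets logged back to Arize against the trace's root span."""
    results = {}
    for trace_id, trajectory in group_tool_calls(spans).items():
        label = judge(build_judge_prompt(question, trajectory))
        results[trace_id] = {"trajectory": trajectory, "label": label}
    return results

# Minimal usage with a stubbed judge in place of a real LLM call:
spans = [
    {"trace_id": "t1", "kind": "LLM",  "tool_name": None,     "start_time": 0},
    {"trace_id": "t1", "kind": "TOOL", "tool_name": "fetch",  "start_time": 1},
    {"trace_id": "t1", "kind": "TOOL", "tool_name": "search", "start_time": 2},
]
print(evaluate_trajectories(spans, "Find the report", lambda prompt: "correct"))
```

Note that non-tool spans (the `"LLM"` span here) are excluded, so the judge sees only the ordered tool-calling trajectory, and sorting by start time preserves the order the agent actually executed.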

Getting Started

Follow the step-by-step implementation, with full code, in our Cookbook guide.
