Trace & Session Evals

Choosing the right evaluation scope serves different purposes depending on what you want to assess. This ensures your evaluation has the appropriate data to actually measure performance. If you’re asking “Did this specific step work as intended?”, a span-level evaluation helps isolate and measure performance at the most granular level. If your goal is to understand “Did the model reason effectively across steps to reach the right conclusion?”, a trace-level evaluation gives visibility into reasoning quality, context retention, and flow. When the question becomes “Did this interaction achieve what the user needed?”, a session-level evaluation captures the overall effectiveness and experience of the system.

Each evaluation scope aligns with a different type of insight:

Span-Level Evaluation: Evaluates an individual step, such as a single LLM call or tool call.
Trace-Level Evaluation: Evaluates a full chain of steps to assess reasoning quality and flow across multiple calls.
Session-Level Evaluation: Evaluates the overall end-to-end interaction or conversation, focusing on user experience and outcome quality.

Code Evals Span-Level Evals

⌘I

Alyx

Observe

Evaluate

Develop

Prompts

Machine Learning

Security & Settings