Session-Level Evaluations

Evaluate entire user conversations to measure coherence, context retention, and overall goal achievement in your LLM applications.

While individual trace evaluations are useful for assessing single interactions, session-level evaluations allow you to analyze the entire lifecycle of a user's conversation with your AI agent or chatbot. This is crucial for understanding the overall user experience and identifying issues that only emerge over multiple turns.

A session-level evaluation assesses:

  • Coherence: Does the agent maintain a consistent and logical conversation flow?

  • Context Retention: Does the agent remember and correctly utilize information from earlier in the conversation?

  • Goal Achievement: Did the user successfully achieve their overall goal by the end of the session?

  • Task Progression: For multi-step tasks, does the conversation progress logically toward completion?
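
The four criteria above can be folded into a single LLM-as-a-judge rubric. The template and function below are an illustrative sketch, not an Arize API; the variable names are assumptions.

```python
# Hypothetical session-level rubric covering the four criteria above.
SESSION_EVAL_TEMPLATE = """You are evaluating an entire multi-turn conversation
between a user and an AI assistant.

Judge the conversation on these criteria:
- Coherence: is the conversation flow consistent and logical?
- Context Retention: does the assistant correctly reuse earlier information?
- Goal Achievement: did the user achieve their overall goal by the end?
- Task Progression: for multi-step tasks, does each turn move toward completion?

Conversation:
{conversation}

Respond with a label ("correct" or "incorrect") and a brief explanation."""


def build_session_eval_prompt(conversation: str) -> str:
    """Fill the rubric template with the aggregated session transcript."""
    return SESSION_EVAL_TEMPLATE.format(conversation=conversation)
```

The resulting string would be sent to your judge model of choice; the label vocabulary ("correct"/"incorrect") is one common convention, but any label set works.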

*Figure: the Sessions tab in Arize, showing multiple traces grouped by a session ID.*

Why Use Session-Level Evaluations?

Evaluating entire sessions helps you uncover insights that are invisible at the single-prompt level. For example, a chatbot might answer a single question correctly but fail to handle a follow-up question, leading to a poor user experience.

By evaluating the entire session, you can:

  • Pinpoint where and why a conversation went off-track.

  • Identify patterns of user frustration or confusion over a series of interactions.

  • Measure and improve metrics related to overall user satisfaction and task success rates.

  • Validate the performance of complex, multi-turn agents.

How It Works

Session-level evaluations work by grouping all traces that share the same session.id. The content from these traces (e.g., all user inputs and agent outputs) is aggregated and then assessed by an LLM-as-a-judge against a set of criteria you define.
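The grouping step can be sketched as follows. This is a minimal illustration assuming a spans DataFrame exported from your tracing backend; the column names (`attributes.session.id`, `attributes.input.value`, `attributes.output.value`, `start_time`) are assumptions and may differ in your setup.

```python
import pandas as pd

# Toy spans export: two sessions, three traces total.
spans = pd.DataFrame({
    "attributes.session.id": ["s1", "s1", "s2"],
    "start_time": [1, 2, 1],
    "attributes.input.value": ["What's the weather?", "And tomorrow?", "Refund order 42"],
    "attributes.output.value": ["Sunny today.", "Rain tomorrow.", "Refund issued."],
})


def aggregate_session(group: pd.DataFrame) -> str:
    """Concatenate one session's turns in chronological order."""
    group = group.sort_values("start_time")
    turns = [
        f"User: {row['attributes.input.value']}\n"
        f"Assistant: {row['attributes.output.value']}"
        for _, row in group.iterrows()
    ]
    return "\n".join(turns)


# One aggregated transcript per session.id, ready for the judge model.
sessions = {
    session_id: aggregate_session(group)
    for session_id, group in spans.groupby("attributes.session.id")
}
```

Each aggregated transcript is then passed to the judge model along with your evaluation criteria.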

The resulting evaluation (e.g., a "correctness" label and an explanation) is then attached to the first trace in the session, allowing you to filter for and analyze high- or low-performing sessions directly within Arize.
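Attaching the result to the first trace might look like the sketch below. The column names and the `judge` stub are hypothetical placeholders, not Arize's actual schema or client; the point is only that each session-level label is anchored to the session's earliest span.

```python
import pandas as pd

# Toy spans export with one row per trace.
spans = pd.DataFrame({
    "context.span_id": ["a", "b", "c"],
    "attributes.session.id": ["s1", "s1", "s2"],
    "start_time": [1, 2, 1],
})


def judge(session_id: str) -> tuple[str, str]:
    """Placeholder for a real LLM-as-a-judge call on the aggregated session."""
    return "correct", f"Session {session_id} achieved the user's goal."


# Anchor each evaluation to the earliest span in its session.
first_spans = (
    spans.sort_values("start_time")
    .groupby("attributes.session.id")
    .head(1)
)

rows = []
for _, row in first_spans.iterrows():
    label, explanation = judge(row["attributes.session.id"])
    rows.append({
        "context.span_id": row["context.span_id"],
        "label": label,
        "explanation": explanation,
    })

evals = pd.DataFrame(rows)
```

The `evals` frame, keyed by span ID, could then be logged back to your observability platform so sessions can be filtered by label.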

Getting Started

To learn how to implement session-level evaluations for your application, refer to our comprehensive guide and example notebook.
