Session-Level Evaluations
Evaluate entire user conversations to measure coherence, context retention, and overall goal achievement in your LLM applications.
While individual trace evaluations are useful for assessing single interactions, session-level evaluations allow you to analyze the entire lifecycle of a user's conversation with your AI agent or chatbot. This is crucial for understanding the overall user experience and identifying issues that only emerge over multiple turns.
A session-level evaluation assesses:
Coherence: Does the agent maintain a consistent and logical conversation flow?
Context Retention: Does the agent remember and correctly utilize information from earlier in the conversation?
Goal Achievement: Did the user successfully achieve their overall goal by the end of the session?
Task Progression: For multi-step tasks, does the conversation progress logically toward completion?
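As a concrete illustration, these criteria can be phrased as an LLM-as-a-judge prompt template. The sketch below is illustrative only, not an official Arize template; the label set, wording, and the `{transcript}` placeholder are assumptions you would adapt to your own application.

```python
# Illustrative session-level judge prompt template (not an official Arize template).
# {transcript} is assumed to be the full multi-turn conversation, formatted as
# alternating "User:" / "Agent:" lines.
SESSION_EVAL_TEMPLATE = """
You are evaluating an entire conversation between a user and an AI agent.

Conversation:
{transcript}

Assess the conversation on the following criteria:
1. Coherence: Does the agent maintain a consistent, logical flow across turns?
2. Context retention: Does the agent correctly reuse information given earlier?
3. Goal achievement: Did the user accomplish their overall goal by the end?
4. Task progression: For multi-step tasks, does each turn move toward completion?

Respond with a single label, either "correct" or "incorrect", followed by a
brief explanation of which criteria were or were not met.
"""
```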

Why Use Session-Level Evaluations?
Evaluating entire sessions helps you uncover insights that are invisible at the single-prompt level. For example, a chatbot might answer a single question correctly but fail to handle a follow-up question, leading to a poor user experience.
By evaluating the entire session, you can:
Pinpoint where and why a conversation went off-track.
Identify patterns of user frustration or confusion over a series of interactions.
Measure and improve metrics related to overall user satisfaction and task success rates.
Validate the performance of complex, multi-turn agents.
How It Works
Session-level evaluations work by grouping all traces that share the same session.id. The content from these traces (e.g., all user inputs and agent outputs) is aggregated and then assessed by an LLM-as-a-judge against a set of criteria you define.
The resulting evaluation (e.g., a "correctness" label and an explanation) is then attached to the first trace in the session, allowing you to filter for and analyze high- or low-performing sessions directly within Arize.
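To make this flow concrete, here is a minimal sketch of the grouping and judging steps. It assumes your traces have been exported to a pandas DataFrame, and the column names (`attributes.session.id`, `attributes.input.value`, `attributes.output.value`, `start_time`, `context.trace_id`) as well as the `run_judge` helper are illustrative assumptions that will vary with your instrumentation and judge setup; it reuses the `SESSION_EVAL_TEMPLATE` sketched above.

```python
import pandas as pd

# Assume `spans_df` holds exported traces, one row per span. The column names
# below are illustrative; match them to your own instrumentation.
SESSION_COL = "attributes.session.id"
INPUT_COL = "attributes.input.value"
OUTPUT_COL = "attributes.output.value"


def build_transcript(session_spans: pd.DataFrame) -> str:
    """Aggregate one session's spans into a single multi-turn transcript."""
    session_spans = session_spans.sort_values("start_time")
    turns = []
    for _, row in session_spans.iterrows():
        turns.append(f"User: {row[INPUT_COL]}")
        turns.append(f"Agent: {row[OUTPUT_COL]}")
    return "\n".join(turns)


def evaluate_sessions(spans_df: pd.DataFrame, run_judge) -> pd.DataFrame:
    """Group spans by session ID, judge each transcript, and return one row per session.

    `run_judge` is a hypothetical callable that sends the prompt to your
    LLM-as-a-judge and returns a (label, explanation) tuple.
    """
    results = []
    for session_id, session_spans in spans_df.groupby(SESSION_COL):
        session_spans = session_spans.sort_values("start_time")
        transcript = build_transcript(session_spans)
        label, explanation = run_judge(
            SESSION_EVAL_TEMPLATE.format(transcript=transcript)
        )
        results.append(
            {
                "session.id": session_id,
                # The evaluation is associated with the first trace in the session.
                "context.trace_id": session_spans["context.trace_id"].iloc[0],
                "label": label,
                "explanation": explanation,
            }
        )
    return pd.DataFrame(results)
```

The resulting per-session labels and explanations could then be logged back as evaluations on the first trace of each session, so you can filter for high- or low-performing sessions in Arize.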
Getting Started
To learn how to implement session-level evaluations for your application, refer to our comprehensive guide and example notebook.