session_id, aggregate its spans into one ordered transcript, run session-scoped judges that read the whole conversation, then log each result back onto the session.
We’ll go through the following steps:
- Run a multi-turn AI tutor traced as sessions — with a simulated student so the notebook runs top-to-bottom, no manual typing
- Aggregate each session’s spans into one ordered transcript
- Evaluate each session on four session-only dimensions — coherence, goal completion, frustration, and correctness
- See a controlled example where every turn looks fine but the session fails
- Log the results back to Arize AX as session-level evaluations
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or walkthrough video.Session level evals for chatbot guide
Build AI Tutor with Session Tracking
The tutor teaches Socratically across several turns. To keep the notebook runnable with no manual typing, a second LLM call plays the student. Only the tutor calls are wrapped inusing_attributes(session_id=...), so each session’s spans share a session.id; the student calls run under suppress_tracing() so they stay out of the project.
Aggregate Spans into Session Transcripts
Export every span, group byattributes.session.id, and rebuild a clean role-tagged transcript from each tutor turn’s structured llm.input_messages / llm.output_messages. We also keep one context.span_id per session — the handle we’ll attach the session evals onto at the end.
Define the Session-Level Evaluators
Each evaluator reads the entire transcript and scores one session-only property — none could be computed from a single turn. All four read the samemessages column and run together with async_evaluate_dataframe. See the notebook for the full prompts.
Seeing what session-level evals catch
Running the four judges on four hand-written transcripts — each built so every turn is locally fine but one session property fails — shows each dimension catching its own failure. A turn-by-turn check would pass every individual turn in all four:| case | coherence | goal_completion | frustration | correctness |
|---|---|---|---|---|
| clean | coherent | completed | not_frustrated | correct |
| incoherent (tutor contradicts itself) | incoherent | completed | not_frustrated | incorrect |
| goal not met (drifts onto tangents) | coherent | not_completed | not_frustrated | correct |
| frustrated (student visibly impatient) | coherent | completed | frustrated | correct |
incoherent row shows a real cascade: a self-contradiction also reads as less correct, because these dimensions aren’t fully independent — itself worth seeing.
Log Evaluations Back to Arize AX
Arize AX routes columns namedsession_eval.<name>.label/score/explanation to the session that the row’s context.span_id belongs to — so we attach each session’s four results to one of its spans (the span_id we kept earlier). They appear on each session in the Sessions tab.
View Results in Arize AX
After logging, open the Sessions tab of your Arize AX project. Each session carries its four session-level evaluations —coherence, goal_completion, frustration, and correctness — each with a label, score, and explanation, letting you:
- Monitor session-level quality, not just per-turn
- Spot sessions that pass every turn yet fail as a whole — an unmet goal, a contradiction, a frustrated user
- Track engagement and goal completion across sessions