Session-Level Evaluation: Scoring the Whole Conversation

Most evals score a single turn. But a tutor, like a support agent or any assistant you talk to over time, is not one turn; it is a session, and the properties you care about are emergent: they exist only across the whole conversation. Coherence is a relationship between turns, goal completion happens over the arc of the session, and frustration builds gradually. The trap is that every individual turn can look fine while the session as a whole fails, and turn-level checks, which judge each turn in isolation, never see the failure.

The method: trace the full multi-turn conversation under a shared session_id, aggregate its spans into one ordered transcript, run session-scoped judges that read the whole conversation, then log each result back onto the session. We’ll go through the following steps:

Run a multi-turn AI tutor traced as sessions — with a simulated student so the notebook runs top-to-bottom, no manual typing
Aggregate each session’s spans into one ordered transcript
Evaluate each session on four session-only dimensions — coherence, goal completion, frustration, and correctness
See a controlled example where every turn looks fine but the session fails
Log the results back to Arize AX as session-level evaluations

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or walkthrough video.

Build AI Tutor with Session Tracking

The tutor teaches Socratically across several turns. To keep the notebook runnable with no manual typing, a second LLM call plays the student. Only the tutor calls are wrapped in using_attributes(session_id=...), so each session’s spans share a session.id; the student calls run under suppress_tracing() so they stay out of the project.

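import uuid

from openai import OpenAI
from openinference.instrumentation import suppress_tracing, using_attributes

MODEL = "gpt-5.4-mini"
client = OpenAI()  # reads OPENAI_API_KEY; assumes the notebook's tracing setup has already run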


def run_session(user_id: str, topic: str, question: str, persona: str, max_turns: int = 5) -> str:
    session_id = f"tutor-{uuid.uuid4()}"
    system = (
        f"You are a thoughtful AI tutor teaching {topic}. "
        "Ask questions, give hints, and only give the full answer once the student "
        "shows correct reasoning. Keep each reply to 3-5 sentences."
    )
    messages = [{"role": "user", "content": question}]
    transcript = [("student", question)]

    for _ in range(max_turns):
        # Only the tutor call carries the session attributes.
        with using_attributes(session_id=session_id, user_id=user_id):
            resp = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "system", "content": system}] + messages,
            )
        tutor_text = (resp.choices[0].message.content or "").strip()  # content can be None
        messages.append({"role": "assistant", "content": tutor_text})
        transcript.append(("tutor", tutor_text))

        student_text = student_reply(persona, transcript)  # untraced, see notebook
        if student_text.strip().upper().startswith("DONE"):
            break
        messages.append({"role": "user", "content": student_text})
        transcript.append(("student", student_text))

    return session_id
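
The simulated student itself is left to the notebook, but the core idea can be sketched under stated assumptions: the student model sees the same transcript with the roles flipped, so the tutor's lines arrive as user messages and the student's earlier lines as assistant messages. The helper name build_student_messages and the persona wording are hypothetical, not the notebook's implementation; the resulting list would be sent in a chat completion call made inside suppress_tracing().

```python
from typing import List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, text) pairs, as built in run_session


def build_student_messages(persona: str, transcript: Transcript) -> List[dict]:
    """Hypothetical helper: render the transcript from the student's point of view."""
    system = (
        f"You are a student with this persona: {persona}. "
        "Reply to the tutor in 1-3 sentences. "
        "Reply with DONE once you fully understand the answer."
    )
    messages = [{"role": "system", "content": system}]
    for speaker, text in transcript:
        # Roles are flipped relative to the tutor call: the student model
        # "speaks" as the assistant and "hears" the tutor as the user.
        role = "assistant" if speaker == "student" else "user"
        messages.append({"role": role, "content": text})
    return messages


example = build_student_messages(
    "confused but curious",
    [("student", "Why is the sky blue?"), ("tutor", "What do you know about light?")],
)
print([m["role"] for m in example])
```

The DONE convention mirrors the startswith("DONE") check in run_session, which ends the loop once the student signals understanding.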

Aggregate Spans into Session Transcripts

Export every span, group by attributes.session.id, and rebuild a clean role-tagged transcript from each tutor turn’s structured llm.input_messages / llm.output_messages. We also keep one context.span_id per session — the handle we’ll attach the session evals onto at the end.

import os
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize.client import ArizeClient

ax_client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])
primary_df = ax_client.spans.export_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name=model_id,  # model_id: the project name defined earlier in the notebook
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)

SESSION_ID = "attributes.session.id"


def prepare_sessions(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df[SESSION_ID].notna()]
    sessions = []
    for session_id, group in df.sort_values("start_time").groupby(SESSION_ID):
        lines = []
        for _, row in group.iterrows():
            inputs = _as_list(row.get("attributes.llm.input_messages"))  # helpers in notebook
            user_turns = [_message_text(m) for m in inputs if m.get("message.role") == "user"]
            if user_turns and user_turns[-1]:
                lines.append(f"user: {user_turns[-1]}")
            for m in _as_list(row.get("attributes.llm.output_messages")):
                if (text := _message_text(m)):
                    lines.append(f"assistant: {text}")
        sessions.append(
            {
                "session_id": session_id,
                "messages": "\n\n".join(lines),
                "trace_count": group["context.trace_id"].nunique(),
                "span_id": group["context.span_id"].iloc[0],  # handle for logging
            }
        )
    return pd.DataFrame(sessions)


sessions_df = prepare_sessions(primary_df)

Define the Session-Level Evaluators

Each evaluator reads the entire transcript and scores one session-only property — none could be computed from a single turn. All four read the same messages column and run together with async_evaluate_dataframe. See the notebook for the full prompts.

from openinference.instrumentation import suppress_tracing
from phoenix.evals import LLM, ClassificationEvaluator, async_evaluate_dataframe

judge = LLM(provider="openai", model="gpt-4.1")

coherence_evaluator = ClassificationEvaluator(
    name="coherence", llm=judge, prompt_template=SESSION_COHERENCE_PROMPT,
    choices={"coherent": 1.0, "incoherent": 0.0},
)
goal_completion_evaluator = ClassificationEvaluator(
    name="goal_completion", llm=judge, prompt_template=SESSION_GOAL_COMPLETION_PROMPT,
    choices={"completed": 1.0, "not_completed": 0.0},
)
frustration_evaluator = ClassificationEvaluator(
    name="frustration", llm=judge, prompt_template=SESSION_FRUSTRATION_PROMPT,
    choices={"not_frustrated": 1.0, "frustrated": 0.0},
)
correctness_evaluator = ClassificationEvaluator(
    name="correctness", llm=judge, prompt_template=SESSION_CORRECTNESS_PROMPT,
    choices={"correct": 1.0, "incorrect": 0.0},
)
session_evaluators = [
    coherence_evaluator, goal_completion_evaluator, frustration_evaluator, correctness_evaluator,
]

# Top-level await works in a notebook cell; suppress_tracing() keeps the judge calls untraced.
with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=sessions_df, evaluators=session_evaluators, concurrency=10
    )
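
The full prompts are in the notebook. Purely as an illustration of their general shape, a session-level prompt might look like the sketch below. The wording is an assumption, as is the {messages} placeholder being filled from the messages column; the two label words match the choices passed to the coherence evaluator.

```python
# Hypothetical stand-in for the notebook's SESSION_COHERENCE_PROMPT.
EXAMPLE_COHERENCE_PROMPT = """\
You are evaluating a full tutoring session, not a single reply.

Read the whole transcript and decide whether the tutor stays consistent
across turns: no self-contradictions, no forgotten context, and each reply
follows from what came before.

[BEGIN TRANSCRIPT]
{messages}
[END TRANSCRIPT]

Answer with exactly one word: "coherent" or "incoherent".
"""

# Render the template with one session's transcript to inspect what the judge sees.
transcript = "user: What is a limit?\n\nassistant: What happens near x = 0?"
rendered = EXAMPLE_COHERENCE_PROMPT.format(messages=transcript)
print(rendered)
```

The key design point is that the prompt asks about relationships between turns, which is what makes the judgment session-scoped rather than turn-scoped.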

Seeing what session-level evals catch

Running the four judges on four hand-written transcripts — each built so every turn is locally fine but one session property fails — shows each dimension catching its own failure. A turn-by-turn check would pass every individual turn in all four:

case	coherence	goal_completion	frustration	correctness
clean	coherent	completed	not_frustrated	correct
incoherent (tutor contradicts itself)	incoherent	completed	not_frustrated	incorrect
goal not met (drifts onto tangents)	coherent	not_completed	not_frustrated	correct
frustrated (student visibly impatient)	coherent	completed	frustrated	correct

The incoherent row shows a real cascade: the self-contradiction is also judged incorrect, because these dimensions aren’t fully independent. That overlap is worth keeping in mind when reading the scores.
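
The table can also be read programmatically. The sketch below maps each label in the table to the score defined in the evaluators' choices and lists the dimensions that fail for each case. The names LABEL_SCORES, cases, and failures are introduced here for illustration only.

```python
# Label -> score, copied from the `choices` of the four evaluators.
LABEL_SCORES = {
    "coherence": {"coherent": 1.0, "incoherent": 0.0},
    "goal_completion": {"completed": 1.0, "not_completed": 0.0},
    "frustration": {"not_frustrated": 1.0, "frustrated": 0.0},
    "correctness": {"correct": 1.0, "incorrect": 0.0},
}

# Rows of the table: case -> label per dimension, in the order of LABEL_SCORES.
cases = {
    "clean": ("coherent", "completed", "not_frustrated", "correct"),
    "incoherent": ("incoherent", "completed", "not_frustrated", "incorrect"),
    "goal not met": ("coherent", "not_completed", "not_frustrated", "correct"),
    "frustrated": ("coherent", "completed", "frustrated", "correct"),
}

# A dimension fails when its label maps to a score of 0.0.
failures = {
    case: [dim for dim, label in zip(LABEL_SCORES, labels) if LABEL_SCORES[dim][label] == 0.0]
    for case, labels in cases.items()
}
for case, failed in failures.items():
    print(f"{case}: {failed or 'all pass'}")
```

Only the incoherent case fails more than one dimension, which is the cascade between coherence and correctness visible in the table.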

Log Evaluations Back to Arize AX

Arize AX routes columns named session_eval.<name>.label/score/explanation to the session that the row’s context.span_id belongs to — so we attach each session’s four results to one of its spans (the span_id we kept earlier). They appear on each session in the Sessions tab.

import json


def _unpack(cell):
    """async_evaluate_dataframe stores each evaluator's result as a Score dict."""
    if isinstance(cell, str):
        # Defensive: also accept a Score that arrives serialized as a JSON string.
        cell = json.loads(cell)
    if isinstance(cell, dict):
        return cell.get("label"), cell.get("score"), cell.get("explanation")
    return getattr(cell, "label", None), getattr(cell, "score", None), getattr(cell, "explanation", None)


SESSION_EVALS = ["coherence", "goal_completion", "frustration", "correctness"]

eval_df = pd.DataFrame()
for name in SESSION_EVALS:
    unpacked = results_df[f"{name}_score"].apply(_unpack)
    eval_df[f"session_eval.{name}.label"] = unpacked.map(lambda t: t[0]).values
    eval_df[f"session_eval.{name}.score"] = unpacked.map(lambda t: t[1]).values
    eval_df[f"session_eval.{name}.explanation"] = unpacked.map(lambda t: t[2]).values

# Attach to one span per session (drop=False keeps the column update_evaluations requires);
# Arize AX routes session_eval.* to that span's session.
span_for_session = dict(zip(sessions_df["session_id"], sessions_df["span_id"]))
eval_df["context.span_id"] = [span_for_session[sid] for sid in results_df["session_id"]]
log_df = eval_df.set_index("context.span_id", drop=False)

resp = ax_client.spans.update_evaluations(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name=model_id,
    dataframe=log_df,
)
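
Because the routing depends entirely on column names, a quick check before uploading can catch typos. The sketch below validates names against the session_eval.<name>.label/score/explanation pattern described above. The helper invalid_session_eval_columns is hypothetical and not part of the Arize SDK, and the set of characters allowed in <name> is an assumption.

```python
import re
from typing import Iterable, List

# <name> is assumed to consist of letters, digits, and underscores.
SESSION_EVAL_COLUMN = re.compile(r"^session_eval\.\w+\.(label|score|explanation)$")


def invalid_session_eval_columns(columns: Iterable[str]) -> List[str]:
    """Return columns that are neither context.span_id nor a session_eval.* column."""
    return [
        col for col in columns
        if col != "context.span_id" and not SESSION_EVAL_COLUMN.match(col)
    ]


columns = [
    "context.span_id",
    "session_eval.coherence.label",
    "session_eval.coherence.score",
    "session_eval.coherence.explanation",
    "session_eval.frustration.labels",  # typo: "labels"
    "sesion_eval.correctness.score",    # typo: "sesion"
]
print(invalid_session_eval_columns(columns))
```

In the walkthrough's terms, calling invalid_session_eval_columns(log_df.columns) and confirming it returns an empty list would be a reasonable guard before update_evaluations.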

View Results in Arize AX

After logging, open the Sessions tab of your Arize AX project. Each session carries its four session-level evaluations — coherence, goal_completion, frustration, and correctness — each with a label, score, and explanation, letting you:

Monitor session-level quality, not just per-turn
Spot sessions that pass every turn yet fail as a whole — an unmet goal, a contradiction, a frustrated user
Track engagement and goal completion across sessions

The pattern generalizes: for any session-only property, aggregate the session’s spans into one transcript, write a judge that reads the whole thing, and log it back as a session-level evaluation — right next to your turn-level and trace-level evals.