Session-Level Evaluations for an AI Tutor

This tutorial shows you how to run session-level evaluations on conversations with an AI tutor using Arize.

Session-level evaluations provide a holistic view of entire interactions, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.

We'll go through the following steps:

  • Set up tracing for multi-turn AI tutor conversations

  • Aggregate spans into structured sessions with truncation support

  • Evaluate sessions across multiple dimensions (Correctness, Goal Completion, Frustration)

  • Format evaluation outputs to match Arize's schema

  • Log results back to Arize for monitoring and analysis


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or walkthrough video.

Build AI Tutor with Session Tracking

Create an AI tutor that tracks conversations with the using_attributes context manager, which attaches the session ID and user ID to every span created inside it:

import os
import uuid

import anthropic
from openinference.instrumentation import using_attributes

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def run_session(user_id: str, topic: str, question: str):
    session_id = f"tutor-{uuid.uuid4()}"
    system_prompt = (
        f"You are a thoughtful AI tutor teaching {topic}. "
        "Ask questions, give hints, and only suggest full answers "
        "when the student shows correct reasoning."
    )
    chat = [{"role": "user", "content": question}]

    while True:
        # Tag every span created in this block with the session and user IDs
        with using_attributes(session_id=session_id, user_id=user_id):
            resp = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                system=system_prompt,
                messages=chat,
                max_tokens=1000,
                temperature=0.5,
            )
        assistant_msg = resp.content[0].text.strip()
        assistant_msg += "\n\n(You can type 'DONE' if you're finished.)"

        chat.append({"role": "assistant", "content": assistant_msg})
        print(f"Tutor: {assistant_msg}")

        student_input = input("> your answer: ")
        if student_input.strip().upper() == "DONE":
            print("✅ Student is DONE — ending session.")
            break

        chat.append({"role": "user", "content": student_input})
    return session_id
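
Note that run_session assumes tracing has already been registered, so that each Anthropic call is captured as a span and tagged with the session attributes. A minimal setup sketch, assuming the arize-otel register helper and the OpenInference Anthropic instrumentor (the project name here is a placeholder):

import os
from arize.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

# Send spans to your Arize space under a project of your choosing
tracer_provider = register(
    space_id=os.environ["SPACE_ID"],
    api_key=os.environ["API_KEY"],
    project_name="ai-tutor-sessions",  # placeholder project name
)

# Auto-instrument the Anthropic SDK so every LLM call becomes a span
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)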

Prepare Spans for Session-Level Evaluation

Use the ArizeExportClient to export your spans as a dataframe:

import os
from datetime import datetime, timedelta, timezone

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient(api_key=os.environ["API_KEY"])

primary_df = client.export_model_to_df(
    space_id=os.environ["SPACE_ID"],
    model_id=model_id,  # the name of your Arize project
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
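
A quick check that the export contains the sessions you expect (the column names below are the default OpenInference span attributes used throughout this tutorial):

# How many distinct sessions and traces were exported
print(primary_df["attributes.session.id"].nunique(), "sessions")
print(primary_df["context.trace_id"].nunique(), "traces")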

Group Spans by Session with Truncation

Here, we group the spans into a session dataframe with one row per session. We also include logic to truncate session messages when character limits are exceeded, which keeps long sessions from overflowing the evaluation model's context window.

import pandas as pd

def truncate_text(text, max_chars, strategy="end"):
    """Truncate text to max_chars using the specified strategy."""
    if not text or len(text) <= max_chars:
        return text

    if strategy == "start":
        return "..." + text[-(max_chars - 3):]
    elif strategy == "middle":
        half = (max_chars - 3) // 2
        return text[:half] + "..." + text[-half:]
    else:  # "end"
        return text[:max_chars - 3] + "..."

def estimate_session_size(user_inputs, output_messages):
    """Estimate total character count of session content."""
    total_chars = sum(len(msg) for msg in user_inputs + output_messages if isinstance(msg, str))
    return total_chars

def prepare_sessions(
    df: pd.DataFrame,
    max_chars_per_value=5000,
    max_chars_per_session=100000,
    truncation_strategy="end"
) -> pd.DataFrame:
    """
    Collapse spans into a single row per session with truncation support.
    """
    sessions = []

    # Sort and group
    grouped = df.sort_values("start_time").groupby("attributes.session.id", as_index=False)

    for session_id, group in grouped:
        # Drop NA values and apply per-value truncation
        user_inputs = [
            truncate_text(msg, max_chars_per_value, truncation_strategy)
            for msg in group["attributes.input.value"].dropna().tolist()
        ]
        output_messages = [
            truncate_text(msg, max_chars_per_value, truncation_strategy)
            for msg in group["attributes.output.value"].dropna().tolist()
        ]

        # Estimate total session size
        total_chars = estimate_session_size(user_inputs, output_messages)

        # Truncate session-level size if needed
        if total_chars > max_chars_per_session:
            print(f"Session {session_id} exceeds {max_chars_per_session} chars. Truncating...")

            # Keep only half of the messages, split between the start and end of the session
            def smart_truncate(msgs):
                keep_half = len(msgs) // 2
                return msgs[:keep_half // 2] + msgs[-(keep_half - keep_half // 2):]

            user_inputs = smart_truncate(user_inputs)
            output_messages = smart_truncate(output_messages)

            # Optional: truncate remaining messages again more aggressively
            total_chars = estimate_session_size(user_inputs, output_messages)
            if total_chars > max_chars_per_session:
                aggressive_limit = max_chars_per_value // 2
                user_inputs = [truncate_text(m, aggressive_limit, truncation_strategy) for m in user_inputs]
                output_messages = [truncate_text(m, aggressive_limit, truncation_strategy) for m in output_messages]

        sessions.append({
            "session_id": session_id,
            "user_inputs": user_inputs,
            "output_messages": output_messages,
            "trace_count": group["context.trace_id"].nunique(),
        })

    return pd.DataFrame(sessions)

sessions_df = prepare_sessions(primary_df, truncation_strategy="middle")

Session Evaluations

Session Correctness Evaluation

Evaluate if the AI tutor provides factually accurate and educationally sound responses:

from phoenix.evals import llm_classify, AnthropicModel
import nest_asyncio

nest_asyncio.apply()

SESSION_CORRECTNESS_PROMPT = """
You are an expert tutor assistant evaluating the **correctness and educational quality** of an AI tutor's session with a student.

A session consists of multiple traces (interactions) between a student and an AI tutor. I will provide you with:
1. The student's messages (user inputs) in order.
2. The AI tutor's responses (output messages) in order.

An effective and correct tutoring session should:
- Provide factually and conceptually accurate explanations
- Correctly answer student questions
- Clarify misunderstandings if they occur
- Build upon previous context in a coherent way
- Avoid hallucinations, vague responses, or incorrect reasoning

##
Student Inputs:
{user_inputs}

Tutor Outputs:
{output_messages}
##

Based on the above, evaluate the session **only for correctness and educational soundness**.

Respond with a single word: `correct` or `incorrect`.

- Respond with `correct` if the AI tutor consistently provides accurate, clear, and educationally sound answers.
- Respond with `incorrect` if the AI tutor gives factually wrong, misleading, or incoherent explanations at any point.
"""

# Configure your evaluation model
model = AnthropicModel(
    model="claude-3-7-sonnet-latest",
)

# Run the evaluation
rails = ["correct", "incorrect"]
eval_results_correctness = llm_classify(
    data=sessions_df,
    template=SESSION_CORRECTNESS_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)
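
llm_classify returns a dataframe with a label column and, because provide_explanation=True, an explanation column for each session. A quick way to inspect the results before logging them:

# Distribution of correctness labels across sessions
print(eval_results_correctness["label"].value_counts())

# Read one explanation to sanity-check the judge's reasoning
print(eval_results_correctness["explanation"].iloc[0])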

Session Frustration Evaluation

Evaluate if the student shows signs of frustration during the session:

SESSION_FRUSTRATION_PROMPT = """
You are an AI assistant evaluating whether a student became frustrated during a tutoring session with an AI tutor.

A session consists of multiple traces (interactions) between a student and an AI tutor. You will be given:
1. The student's messages (user inputs), in order.
2. The AI tutor's messages (output messages), in order.

Signs of student frustration may include:
- Repeating or rephrasing the same question multiple times
- Expressing confusion ("I don't get it", "This doesn't make sense", etc.)
- Disagreeing with the tutor's responses
- Asking for clarification frequently without resolution
- Expressing annoyance, impatience, or disengagement
- Abruptly ending the session

##
Student Inputs:
{user_inputs}

Tutor Outputs:
{output_messages}
##

Based on the above, evaluate whether the student showed signs of frustration at any point in the session.

Respond with a single word: `frustrated` or `not_frustrated`.

- Respond with `frustrated` if there is evidence of confusion, dissatisfaction, or emotional frustration.
- Respond with `not_frustrated` if the student appears to stay engaged and satisfied throughout.
"""

# Run the evaluation
rails = ["frustrated", "not_frustrated"]
eval_results_frustration = llm_classify(
    data=sessions_df,
    template=SESSION_FRUSTRATION_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

Session Goal Achievement Evaluation

Evaluate if the tutor successfully helped the student achieve their learning goals:

SESSION_GOAL_ACHIEVEMENT_PROMPT = """
You are an AI assistant evaluating whether the AI tutor successfully helped the student achieve their learning goals during a tutoring session.

A session consists of multiple interactions between a student and an AI tutor. You will be given:
1. The student's messages (user inputs), in chronological order.
2. The AI tutor's responses (output messages), in chronological order.

To determine if the student's goals were achieved, consider:
- Whether the AI tutor addressed the student's questions and requests directly
- Whether the explanations provided resolved the student's doubts or problems
- Whether the student's inputs indicate understanding or closure by the end
- Whether the conversation logically progressed toward completing the student's objectives

##
Student Inputs:
{user_inputs}

Tutor Outputs:
{output_messages}
##

Evaluate the session and respond with a single word: `achieved` or `not_achieved`.

- Respond with `achieved` if the tutoring session successfully met the student's learning goals and resolved their questions.
- Respond with `not_achieved` if the session left the student's questions unanswered or goals unmet.
"""

# Run the evaluation
rails = ["achieved", "not_achieved"]
eval_results_goal_achievement = llm_classify(
    data=sessions_df,
    template=SESSION_GOAL_ACHIEVEMENT_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

Log Evaluations Back to Arize

Format and log the evaluation results back to Arize for monitoring:

import os

import pandas as pd
from arize.pandas.logger import Client

# Rename columns to match Arize schema
eval_results_correctness = eval_results_correctness.rename(columns={
    "label": "SessionCorrectness.label",
    "explanation": "SessionCorrectness.explanation",
})[["SessionCorrectness.label", "SessionCorrectness.explanation"]]

eval_results_goal_achievement = eval_results_goal_achievement.rename(columns={
    "label": "GoalCompletion.label",
    "explanation": "GoalCompletion.explanation",
})[["GoalCompletion.label", "GoalCompletion.explanation"]]

eval_results_frustration = eval_results_frustration.rename(columns={
    "label": "Frustration.label",
    "explanation": "Frustration.explanation",
})[["Frustration.label", "Frustration.explanation"]]

# Combine all the evaluation results
combined_eval_results = eval_results_correctness \
    .join(eval_results_goal_achievement, how="outer") \
    .join(eval_results_frustration, how="outer")

# Merge evaluation results with session data
merged_df = pd.merge(sessions_df, combined_eval_results, left_index=True, right_index=True)
merged_df.rename(
    columns={
        "SessionCorrectness.label": "session_eval.SessionCorrectness.label",
        "SessionCorrectness.explanation": "session_eval.SessionCorrectness.explanation",
        "GoalCompletion.label": "session_eval.GoalCompletion.label",
        "GoalCompletion.explanation": "session_eval.GoalCompletion.explanation",
        "Frustration.label": "session_eval.Frustration.label",
        "Frustration.explanation": "session_eval.Frustration.explanation",
    },
    inplace=True,
)

# Get the root span for each session to log the evaluation against
root_spans = (
    primary_df.sort_values("start_time")
    .drop_duplicates(subset=["attributes.session.id"], keep="first")[
        ["attributes.session.id", "context.span_id"]
    ]
)

# Merge to get the root span_id for each session
final_df = pd.merge(
    merged_df,              # left
    root_spans,             # right
    left_on="session_id",   # column in merged_df
    right_on="attributes.session.id",  # column in root_spans
    how="left",
)
final_df = final_df.set_index("context.span_id", drop=False)

# Log evaluations back to Arize
arize_client = Client(space_id=os.environ["SPACE_ID"], api_key=os.environ["API_KEY"])
response = arize_client.log_evaluations_sync(
    dataframe=final_df,
    model_id=model_id,
)

View Results in Arize

After logging the evaluations, you can view the results in the Sessions tab of your Arize project. The evaluation results will populate for each session, allowing you to:

  • Monitor session-level performance metrics

  • Identify patterns in tutor effectiveness

  • Track student satisfaction and engagement

  • Compare different evaluation dimensions across sessions
