Session-Level Evaluations for an AI Tutor
This tutorial shows you how to run session-level evaluations on conversations with an AI tutor using Arize.
Session-level evaluations provide a holistic view of entire interactions, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.
We'll go through the following steps:
Set up tracing for multi-turn AI tutor conversations
Aggregate spans into structured sessions with truncation support
Evaluate sessions across multiple dimensions (Correctness, Goal Completion, Frustration)
Format evaluation outputs to match Arize's schema
Log results back to Arize for monitoring and analysis
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or walkthrough video.
Build AI Tutor with Session Tracking
Create an AI tutor that tracks conversations using `using_attributes`:
import os
import uuid
import anthropic
from openinference.instrumentation import using_attributes
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
def run_session(user_id: str, topic: str, question: str):
    session_id = f"tutor-{uuid.uuid4()}"
    # Anthropic's Messages API takes the system prompt as a top-level
    # `system` parameter rather than a "system"-role message.
    system_prompt = (
        f"You are a thoughtful AI tutor teaching {topic}. "
        "Ask questions, give hints, and only suggest full answers "
        "when the student shows correct reasoning."
    )
    chat = [{"role": "user", "content": question}]
    while True:
        with using_attributes(session_id=session_id, user_id=user_id):
            resp = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                system=system_prompt,
                messages=chat,
                max_tokens=1000,
                temperature=0.5,
            )
assistant_msg = resp.content[0].text.strip()
assistant_msg += "\n\n(You can type 'DONE' if you're finished.)"
chat.append({"role": "assistant", "content": assistant_msg})
print(f"Tutor: {assistant_msg}")
student_input = input("> your answer: ")
if student_input.strip().upper() == "DONE":
print("✅ Student is DONE — ending session.")
break
chat.append({"role": "user", "content": student_input})
return session_id
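For example, a session can be started like this (a minimal sketch; the user ID, topic, and question are placeholders, and student replies are read from stdin until the student types 'DONE'):
# Hypothetical invocation — user_id, topic, and question are placeholders.
session_id = run_session(
    user_id="student-123",
    topic="linear algebra",
    question="Why doesn't matrix multiplication commute?",
)
print(f"Completed session: {session_id}")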
Prepare Spans for Session-Level Evaluation
Use the Arize Client to export your spans as a dataframe:
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone
export_client = ArizeExportClient(api_key=os.environ["API_KEY"])
primary_df = export_client.export_model_to_df(
space_id=os.environ["SPACE_ID"],
model_id=model_id,
environment=Environments.TRACING,
start_time=datetime.now(timezone.utc) - timedelta(days=7),
end_time=datetime.now(timezone.utc),
)
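Before grouping, it is worth confirming that the export contains the session and input/output attributes the next step relies on (a quick, purely illustrative sanity check):
# Confirm the OpenInference attributes used below are present in the export.
expected_cols = [
    "attributes.session.id",
    "attributes.input.value",
    "attributes.output.value",
    "context.trace_id",
    "context.span_id",
]
missing = [c for c in expected_cols if c not in primary_df.columns]
print(f"Exported {len(primary_df)} spans; missing columns: {missing or 'none'}")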
Group Spans by Session with Truncation
Here, we group spans into a single row per session to build a session dataframe. We also include logic that truncates session messages when they exceed character limits (a rough proxy for tokens), which prevents context-window issues when evaluating longer sessions.
import pandas as pd
def truncate_text(text, max_chars, strategy="end"):
"""Truncate text to max_chars using the specified strategy."""
if not text or len(text) <= max_chars:
return text
if strategy == "start":
return "..." + text[-(max_chars - 3):]
elif strategy == "middle":
half = (max_chars - 3) // 2
return text[:half] + "..." + text[-half:]
else: # "end"
return text[:max_chars - 3] + "..."
def estimate_session_size(user_inputs, output_messages):
"""Estimate total character count of session content."""
total_chars = sum(len(msg) for msg in user_inputs + output_messages if isinstance(msg, str))
return total_chars
def prepare_sessions(
df: pd.DataFrame,
max_chars_per_value=5000,
max_chars_per_session=100000,
truncation_strategy="end"
) -> pd.DataFrame:
"""
Collapse spans into a single row per session with truncation support.
"""
sessions = []
# Sort and group
grouped = df.sort_values("start_time").groupby("attributes.session.id", as_index=False)
for session_id, group in grouped:
# Drop NA values and apply per-value truncation
user_inputs = [
truncate_text(msg, max_chars_per_value, truncation_strategy)
for msg in group["attributes.input.value"].dropna().tolist()
]
output_messages = [
truncate_text(msg, max_chars_per_value, truncation_strategy)
for msg in group["attributes.output.value"].dropna().tolist()
]
# Estimate total session size
total_chars = estimate_session_size(user_inputs, output_messages)
# Truncate session-level size if needed
if total_chars > max_chars_per_session:
print(f"Session {session_id} exceeds {max_chars_per_session} chars. Truncating...")
# Keep messages evenly from start and end (half-half)
def smart_truncate(msgs):
keep_half = len(msgs) // 2
return msgs[:keep_half // 2] + msgs[-(keep_half - keep_half // 2):]
user_inputs = smart_truncate(user_inputs)
output_messages = smart_truncate(output_messages)
# Optional: truncate remaining messages again more aggressively
total_chars = estimate_session_size(user_inputs, output_messages)
if total_chars > max_chars_per_session:
aggressive_limit = max_chars_per_value // 2
user_inputs = [truncate_text(m, aggressive_limit, truncation_strategy) for m in user_inputs]
output_messages = [truncate_text(m, aggressive_limit, truncation_strategy) for m in output_messages]
sessions.append({
"session_id": session_id,
"user_inputs": user_inputs,
"output_messages": output_messages,
"trace_count": group["context.trace_id"].nunique(),
})
return pd.DataFrame(sessions)
sessions_df = prepare_sessions(primary_df, truncation_strategy="middle")
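The result has one row per session, with the message lists that the evaluation prompts below will consume:
# Quick look at the session dataframe (illustrative only).
print(f"Prepared {len(sessions_df)} sessions")
print(sessions_df[["session_id", "trace_count"]].head())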
Session Evaluations
Session Correctness Evaluation
Evaluate if the AI tutor provides factually accurate and educationally sound responses:
from phoenix.evals import llm_classify, AnthropicModel
import nest_asyncio
nest_asyncio.apply()
SESSION_CORRECTNESS_PROMPT = """
You are an expert tutor assistant evaluating the **correctness and educational quality** of an AI tutor's session with a student.
A session consists of multiple traces (interactions) between a student and an AI tutor. I will provide you with:
1. The student's messages (user inputs) in order.
2. The AI tutor's responses (output messages) in order.
An effective and correct tutoring session should:
- Provide factually and conceptually accurate explanations
- Correctly answer student questions
- Clarify misunderstandings if they occur
- Build upon previous context in a coherent way
- Avoid hallucinations, vague responses, or incorrect reasoning
##
Student Inputs:
{user_inputs}
Tutor Outputs:
{output_messages}
##
Based on the above, evaluate the session **only for correctness and educational soundness**.
Respond with a single word: `correct` or `incorrect`.
- Respond with `correct` if the AI tutor consistently provides accurate, clear, and educationally sound answers.
- Respond with `incorrect` if the AI tutor gives factually wrong, misleading, or incoherent explanations at any point.
"""
# Configure your evaluation model
model = AnthropicModel(
model="claude-3-7-sonnet-latest",
)
# Run the evaluation
rails = ["correct", "incorrect"]
eval_results_correctness = llm_classify(
data=sessions_df,
template=SESSION_CORRECTNESS_PROMPT,
model=model,
rails=rails,
provide_explanation=True,
verbose=False,
)
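llm_classify returns a dataframe aligned row-for-row with sessions_df, containing a label and (because provide_explanation=True) an explanation for each session. A quick spot check before moving on:
# Distribution of correctness labels and a sample explanation.
print(eval_results_correctness["label"].value_counts())
print(eval_results_correctness["explanation"].iloc[0])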
Session Frustration Evaluation
Evaluate if the student shows signs of frustration during the session:
SESSION_FRUSTRATION_PROMPT = """
You are an AI assistant evaluating whether a student became frustrated during a tutoring session with an AI tutor.
A session consists of multiple traces (interactions) between a student and an AI tutor. You will be given:
1. The student's messages (user inputs), in order.
2. The AI tutor's messages (output messages), in order.
Signs of student frustration may include:
- Repeating or rephrasing the same question multiple times
- Expressing confusion ("I don't get it", "This doesn't make sense", etc.)
- Disagreeing with the tutor's responses
- Asking for clarification frequently without resolution
- Expressing annoyance, impatience, or disengagement
- Abruptly ending the session
##
Student Inputs:
{user_inputs}
Tutor Outputs:
{output_messages}
##
Based on the above, evaluate whether the student showed signs of frustration at any point in the session.
Respond with a single word: `frustrated` or `not_frustrated`.
- Respond with `frustrated` if there is evidence of confusion, dissatisfaction, or emotional frustration.
- Respond with `not_frustrated` if the student appears to stay engaged and satisfied throughout.
"""
# Run the evaluation
rails = ["frustrated", "not_frustrated"]
eval_results_frustration = llm_classify(
data=sessions_df,
template=SESSION_FRUSTRATION_PROMPT,
model=model,
rails=rails,
provide_explanation=True,
verbose=False,
)
Session Goal Achievement Evaluation
Evaluate if the tutor successfully helped the student achieve their learning goals:
SESSION_GOAL_ACHIEVEMENT_PROMPT = """
You are an AI assistant evaluating whether the AI tutor successfully helped the student achieve their learning goals during a tutoring session.
A session consists of multiple interactions between a student and an AI tutor. You will be given:
1. The student's messages (user inputs), in chronological order.
2. The AI tutor's responses (output messages), in chronological order.
To determine if the student's goals were achieved, consider:
- Whether the AI tutor addressed the student's questions and requests directly
- Whether the explanations provided resolved the student's doubts or problems
- Whether the student's inputs indicate understanding or closure by the end
- Whether the conversation logically progressed toward completing the student's objectives
##
Student Inputs:
{user_inputs}
Tutor Outputs:
{output_messages}
##
Evaluate the session and respond with a single word: `achieved` or `not_achieved`.
- Respond with `achieved` if the tutoring session successfully met the student's learning goals and resolved their questions.
- Respond with `not_achieved` if the session left the student's questions unanswered or goals unmet.
"""
# Run the evaluation
rails = ["achieved", "not_achieved"]
eval_results_goal_achievement = llm_classify(
data=sessions_df,
template=SESSION_GOAL_ACHIEVEMENT_PROMPT,
model=model,
rails=rails,
provide_explanation=True,
verbose=False,
)
Log Evaluations Back to Arize
Format and log the evaluation results back to Arize for monitoring:
import pandas as pd
from arize.pandas.logger import Client
# Rename columns to match Arize schema
eval_results_correctness = eval_results_correctness.rename(columns={
"label": "SessionCorrectness.label",
"explanation": "SessionCorrectness.explanation",
})[["SessionCorrectness.label", "SessionCorrectness.explanation"]]
eval_results_goal_achievement = eval_results_goal_achievement.rename(columns={
"label": "GoalCompletion.label",
"explanation": "GoalCompletion.explanation",
})[["GoalCompletion.label", "GoalCompletion.explanation"]]
eval_results_frustration = eval_results_frustration.rename(columns={
"label": "Frustration.label",
"explanation": "Frustration.explanation",
})[["Frustration.label", "Frustration.explanation"]]
# Combine all the evaluation results
combined_eval_results = eval_results_correctness \
.join(eval_results_goal_achievement, how="outer") \
.join(eval_results_frustration, how="outer")
# Merge evaluation results with session data
merged_df = pd.merge(sessions_df, combined_eval_results, left_index=True, right_index=True)
merged_df.rename(
columns={
"SessionCorrectness.label": "session_eval.SessionCorrectness.label",
"SessionCorrectness.explanation": "session_eval.SessionCorrectness.explanation",
"GoalCompletion.label": "session_eval.GoalCompletion.label",
"GoalCompletion.explanation": "session_eval.GoalCompletion.explanation",
"Frustration.label": "session_eval.Frustration.label",
"Frustration.explanation": "session_eval.Frustration.explanation",
},
inplace=True,
)
# Get the root span for each session to log the evaluation against
root_spans = (
primary_df.sort_values("start_time")
.drop_duplicates(subset=["attributes.session.id"], keep="first")[
["attributes.session.id", "context.span_id"]
]
)
# Merge to get the root span_id for each session
final_df = pd.merge(
merged_df, # left
root_spans, # right
left_on="session_id", # column in merged_df
right_on="attributes.session.id", # column in root_spans
how="left",
)
final_df = final_df.set_index("context.span_id", drop=False)
# Log evaluations back to Arize
arize_client = Client(space_id=os.environ["SPACE_ID"], api_key=os.environ["API_KEY"])
response = arize_client.log_evaluations_sync(
dataframe=final_df,
model_id=model_id,
)
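As a final check, confirm that every session was matched to an anchor span (sessions with a missing context.span_id will not be logged) and that the upload succeeded. The exact return type of log_evaluations_sync can vary across SDK versions, so treat this as a sketch:
# Sessions that did not match a span have a missing span ID and are skipped.
unmatched = final_df["context.span_id"].isna().sum()
print(f"Sessions without a matching span: {unmatched}")
# Inspect the response to confirm the upload succeeded (shape varies by SDK version).
print(response)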
View Results in Arize
After logging the evaluations, you can view the results in the Sessions tab of your Arize project. The evaluation results will populate for each session, allowing you to:
Monitor session-level performance metrics
Identify patterns in tutor effectiveness
Track student satisfaction and engagement
Compare different evaluation dimensions across sessions