Session-Level Evals
Evaluate entire user conversations to measure coherence, context retention, and overall goal achievement in your LLM applications.
While individual trace evaluations are useful for assessing single interactions, session-level evaluations allow you to analyze the entire lifecycle of a user's conversation with your AI agent or chatbot.
This is crucial for understanding the overall user experience and identifying issues that only emerge over multiple turns. For example, a chatbot might answer a single question correctly but fail to handle a follow-up question, leading to a poor user experience.
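For instance, a purely illustrative two-turn exchange (hypothetical, not drawn from any real application) shows the kind of failure that only a session-level check would catch:
# Hypothetical session transcript: each reply looks reasonable in isolation,
# but the second reply ignores the constraint set in the first turn
session_turns = [
    {"role": "user", "content": "Find me a vegetarian dinner recipe."},
    {"role": "assistant", "content": "How about a creamy mushroom risotto?"},
    {"role": "user", "content": "Sounds good. What sides go well with it?"},
    {"role": "assistant", "content": "Grilled chicken skewers pair nicely."},  # context lost
]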
Session-level evaluations are crucial for assessing:
Coherence: Does the agent maintain a consistent and logical conversation flow?
Context Retention: Does the agent remember and correctly utilize information from earlier in the conversation?
Goal Achievement: Did the user successfully achieve their overall goal by the end of the session?
Task Progression: For multi-step tasks, does the conversation progress logically toward completion?
To run evaluations at the session level in the UI, set the evaluator scope to "Session" for each evaluator you want to operate at that level. You will see the evaluation output populate next to each session. You can hover over an evaluation to filter by results or view details like the score and explanation.
Session-Level Evaluations via Code
1. Get session data from Arize
First, pull the trace data for your model from Arize using the ArizeExportClient. You will need to specify your Space ID, Model ID, and a time range.
pip install arize arize-otel arize-phoenix-evals
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone
client = ArizeExportClient(api_key="ARIZE_API_KEY")
primary_df = client.export_model_to_df(
    space_id="ARIZE_SPACE_ID",
    model_id="YOUR_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
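As a quick optional sanity check (not part of the Arize docs), you can confirm the export contains the session and message columns the next steps rely on:
# Columns used in the following steps; adjust if your tracing setup differs
expected_cols = [
    "attributes.session.id",
    "attributes.input.value",
    "attributes.output.value",
    "context.trace_id",
]
missing = [col for col in expected_cols if col not in primary_df.columns]
print(f"Exported {len(primary_df)} spans; missing columns: {missing or 'none'}")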
2. Prepare session data for evaluation
Next, transform the raw trace data into a format suitable for an LLM judge. The function below groups all interactions by session.id, orders them chronologically, and collects the user inputs and agent outputs for each session.
import pandas as pd
def prepare_sessions(df: pd.DataFrame) -> pd.DataFrame:
"""Collapse spans into a single row per session with ordered user/assistant messages."""
sessions = []
# Sort spans chronologically, then group by the session identifier
grouped = (
df.sort_values("start_time")
.groupby("attributes.session.id", as_index=False)
)
for session_id, group in grouped:
sessions.append(
{
"session_id": session_id,
# Collect all user inputs for the session (dropping any nulls)
"user_inputs": group["attributes.input.value"].dropna().tolist(),
# Collect all assistant responses for the session (dropping any nulls)
"output_messages": group["attributes.output.value"].dropna().tolist(),
# Count how many distinct traces are in this session
"trace_count": group["context.trace_id"].nunique(),
}
)
return pd.DataFrame(sessions)
# Build the session-level dataframe from the traces exported in step 1
sessions_df = prepare_sessions(primary_df)
For a more complete, end-to-end example (including optional filtering utilities and logging), see the Session Evals Notebook.
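The notebook's filtering utilities are not reproduced here, but a minimal sketch of that idea might drop single-trace sessions before evaluation (the threshold below is just an illustrative choice):
# Optional: keep only multi-turn sessions, since single-trace sessions
# gain little from session-level evaluation (illustrative threshold)
multi_turn_sessions_df = sessions_df[sessions_df["trace_count"] > 1].reset_index(drop=True)
print(f"Evaluating {len(multi_turn_sessions_df)} of {len(sessions_df)} sessions")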
3. Define your evaluation prompt
The core of the evaluation is a carefully designed prompt that instructs an LLM on how to assess session-level quality. The prompt should ask the LLM to evaluate coherence, context utilization, and goal progression.
Here is an example prompt template. You can customize it for your specific needs.
SESSION_CORRECTNESS_PROMPT = """
You are a helpful AI bot that evaluates the effectiveness and correctness of an AI agent's session.
A session consists of multiple traces (interactions) between a user and an AI system. I will provide you with:
1. The user inputs that initiated each trace in the session, in chronological order.
2. The AI's output messages for each trace in the session, in chronological order.
An effective and correct session:
- Shows consistent understanding of user intentions across traces
- Maintains context and coherence between interactions
- Successfully achieves the overall user goals
- Builds upon previous interactions in the conversation
##
User Inputs:
{user_inputs}
Output Messages:
{output_messages}
##
Evaluate the session based on the given criteria. Your response must be a single string, either `correct` or `incorrect`, and must not include any additional text.
- Respond with `correct` if the session effectively accomplishes user goals with appropriate responses and coherence.
- Respond with `incorrect` if the session shows confusion, inappropriate responses, or fails to accomplish user goals.
"""4. Run the evaluation
With the data prepared and the prompt defined, use phoenix.evals to run the evaluation: create a classifier evaluator with create_classifier and apply it to the session dataframe with async_evaluate_dataframe. Each session's data is sent to the specified model and classified as correct or incorrect.
from phoenix.evals import create_classifier
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")
correctness_evaluator = create_classifier(
    name="correctness",
    llm=llm,
    prompt_template=SESSION_CORRECTNESS_PROMPT,
    choices={"correct": 1.0, "incorrect": 0.0},
)
results_df = await async_evaluate_dataframe(
    dataframe=sessions_df,
    evaluators=[correctness_evaluator],
)
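The same pattern extends to the other session-level criteria listed earlier. As a sketch (the prompt and evaluator below are illustrative, not part of the Arize docs), you could add a context-retention classifier and run both evaluators in one pass:
# Hypothetical second evaluator focused on context retention
CONTEXT_RETENTION_PROMPT = """
You are evaluating whether an AI agent retains context across a session.
User Inputs:
{user_inputs}
Output Messages:
{output_messages}
Respond with a single word: `retained` if the agent correctly reuses information
from earlier turns, or `forgotten` if it loses or contradicts that information.
"""
context_retention_evaluator = create_classifier(
    name="context_retention",
    llm=llm,
    prompt_template=CONTEXT_RETENTION_PROMPT,
    choices={"retained": 1.0, "forgotten": 0.0},
)
# Run both evaluators over the same session dataframe
multi_eval_results_df = await async_evaluate_dataframe(
    dataframe=sessions_df,
    evaluators=[correctness_evaluator, context_retention_evaluator],
)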
5. Log the results back to Arize
Finally, merge the evaluation results back with your session data and log them to Arize. The evaluation results are attached as attributes to the root spans of each session's traces, so you can track session-level performance directly within Arize AX.
from arize.pandas.logger import Client
from phoenix.evals.utils import to_annotation_dataframe
import pandas as pd
client = Client(space_id="ARIZE_SPACE_ID", api_key="ARIZE_API_KEY")
root_spans = primary_df[primary_df["parent_id"].isna()][["attributes.session.id", "context.span_id"]]
results_with_spans = pd.merge(
    results_df.reset_index(),
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
).set_index("context.span_id", drop=False)
# Format for logging
correctness_eval_df = to_annotation_dataframe(results_with_spans)
# Rename columns using the session_eval prefix
correctness_eval_df = correctness_eval_df.rename(columns={
    "label": "session_eval.correctness.label",
    "score": "session_eval.correctness.score",
    "explanation": "session_eval.correctness.explanation",
})
client.log_evaluations_sync(correctness_eval_df, 'your-project-name')
Once logged, you can view these session-level evaluations in the Arize AX UI and use them to analyze session performance, identify problematic conversation patterns, and improve your AI agent's multi-turn capabilities.
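Before switching to the UI, a quick local spot-check of the logged dataframe can surface obvious issues (assuming the renamed session_eval.* columns above are present):
# Summarize labels and the mean score across sessions
print(correctness_eval_df["session_eval.correctness.label"].value_counts())
print(f"Mean correctness score: {correctness_eval_df['session_eval.correctness.score'].mean():.2f}")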