Session-Level Evaluations
Evaluate entire user conversations to measure coherence, context retention, and overall goal achievement in your LLM applications.
While individual trace evaluations are useful for assessing single interactions, session-level evaluations allow you to analyze the entire lifecycle of a user's conversation with your AI agent or chatbot. This is crucial for understanding the overall user experience and identifying issues that only emerge over multiple turns.
Session-level evaluations help you assess:
Coherence: Does the agent maintain a consistent and logical conversation flow?
Context Retention: Does the agent remember and correctly utilize information from earlier in the conversation?
Goal Achievement: Did the user successfully achieve their overall goal by the end of the session?
Task Progression: For multi-step tasks, does the conversation progress logically toward completion?

Why Use Session-Level Evaluations?
Evaluating entire sessions helps you uncover insights that are invisible at the single-prompt level. For example, a chatbot might answer a single question correctly but fail to handle a follow-up question, leading to a poor user experience.
By evaluating the entire session, you can:
Pinpoint where and why a conversation went off-track
Identify patterns of user frustration or confusion over a series of interactions
Measure and improve metrics related to overall user satisfaction and task success rates
Validate the performance of complex, multi-turn agents
How It Works
Session-level evaluations work by grouping all traces that share the same session.id. The content from these traces (e.g., all user inputs and agent outputs) is aggregated and then assessed by an LLM-as-a-judge against a set of criteria you define.
The resulting evaluation (e.g., a "correctness" label and an explanation) is then attached to the first trace in the session, allowing you to filter for and analyze high- or low-performing sessions directly within Arize.
Prerequisites
Install required libraries: If you don't already have the necessary libraries, install them with pip:
pip install arize arize-phoenix pandas openai nest_asyncio arize-phoenix-evals
Instrumented traces: You'll need an application with instrumented traces. See Tracing.
Session IDs: Your traces must be instrumented with session.id attributes to group interactions into sessions. Refer to the Sessions documentation for instructions on how to add session and user IDs to your spans; a minimal sketch of setting these attributes follows this list.
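If you instrument spans manually with OpenTelemetry, a sketch of tagging each turn with the attributes this guide relies on might look like the following. This is an illustrative example only: run_agent is a hypothetical stand-in for your application logic, and the Sessions documentation covers the supported helpers for attaching session IDs.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_user_message(session_id: str, user_message: str) -> str:
    with tracer.start_as_current_span("agent_turn") as span:
        # "session.id" is the attribute the session grouping below relies on;
        # it appears as "attributes.session.id" in the exported dataframe.
        span.set_attribute("session.id", session_id)
        span.set_attribute("input.value", user_message)
        response = run_agent(user_message)  # hypothetical call into your agent
        span.set_attribute("output.value", response)
        return response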
Implementation
The process involves extracting session data from Arize, preparing it, running an LLM-as-a-judge evaluation, and logging the results back to your Arize project.
Data Extraction
First, pull the trace data for your model from Arize using the ArizeExportClient. You will need to specify your Space ID, Model ID, and a time range.
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

client = ArizeExportClient(api_key="YOUR_API_KEY")

primary_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
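Before moving on, it can help to sanity-check the export. The column name below assumes the OpenInference attribute naming used throughout this page.

# Confirm the export contains session-tagged spans and see how many
# distinct sessions fall within the time range.
print(primary_df.shape)
print(primary_df["attributes.session.id"].nunique(), "sessions exported")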
[Optional] Filtering for Relevant Sessions
To evaluate complete conversational flows, it's best to filter for sessions that contain specific traces of interest and then pull all traces belonging to those sessions.
This helper function identifies sessions containing traces that meet your criteria (e.g., traces with a specific tool call, or parent spans only) and returns all traces from those matching sessions.
from typing import Dict, Any

import pandas as pd

# This helper function is defined in the accompanying notebook.
# It allows filtering sessions based on trace or span attributes.
from session_eval_utils import filter_sessions_by_trace_criteria

# Example: filter for sessions that have root spans (spans with no parent).
# This helps focus on the top-level interactions.
eval_traces = filter_sessions_by_trace_criteria(
    df=primary_df,
    span_filters={"parent_id": {"==": None}},
)
Preparing Session Data for Evaluation
Next, transform the raw trace data into a format suitable for an LLM judge. This function groups all spans by session.id, orders them chronologically, and collects the user inputs and agent outputs for each session.
import pandas as pd

def prepare_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse spans into a single row per session with ordered user/assistant messages."""
    sessions = []
    # Sort spans chronologically, then group by the session identifier
    grouped = (
        df.sort_values("start_time")
        .groupby("attributes.session.id", as_index=False)
    )
    for session_id, group in grouped:
        sessions.append(
            {
                "session_id": session_id,
                # Collect all user inputs for the session (dropping any nulls)
                "user_inputs": group["attributes.input.value"].dropna().tolist(),
                # Collect all assistant responses for the session (dropping any nulls)
                "output_messages": group["attributes.output.value"].dropna().tolist(),
                # Count how many distinct traces are in this session
                "trace_count": group["context.trace_id"].nunique(),
            }
        )
    return pd.DataFrame(sessions)

# Build the session-level dataframe using the filtered traces from the previous step
sessions_df = prepare_sessions(eval_traces)
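Because session-level signals such as context retention and goal progression are most meaningful across multiple turns, you may optionally want to check how many sessions are multi-trace before evaluating. This is a purely inspective step and does not change the pipeline.

# Optional: see how many sessions span more than one trace.
multi_turn_sessions = sessions_df[sessions_df["trace_count"] > 1]
print(f"{len(multi_turn_sessions)} of {len(sessions_df)} sessions have more than one trace")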
For a more complete, end-to-end example (including optional filtering utilities and logging), see the Session Evals Notebook.
Designing the Evaluation Prompt
The core of the evaluation is a carefully designed prompt that instructs an LLM on how to assess session-level quality. The prompt should ask the LLM to evaluate coherence, context utilization, and goal progression.
Here is an example prompt template. You can customize it for your specific needs.
SESSION_CORRECTNESS_PROMPT = """
You are a helpful AI bot that evaluates the effectiveness and correctness of an AI agent's session.
A session consists of multiple traces (interactions) between a user and an AI system. I will provide you with:
1. The user inputs that initiated each trace in the session, in chronological order.
2. The AI's output messages for each trace in the session, in chronological order.
An effective and correct session:
- Shows consistent understanding of user intentions across traces
- Maintains context and coherence between interactions
- Successfully achieves the overall user goals
- Builds upon previous interactions in the conversation
##
User Inputs:
{user_inputs}
Output Messages:
{output_messages}
##
Evaluate the session based on the given criteria. Your response must be a single string, either `correct` or `incorrect`, and must not include any additional text.
- Respond with `correct` if the session effectively accomplishes user goals with appropriate responses and coherence.
- Respond with `incorrect` if the session shows confusion, inappropriate responses, or fails to accomplish user goals.
"""
Running the Evaluation
With the data prepared and the prompt defined, use phoenix.evals.llm_classify to execute the evaluation. This sends each session's data to the specified model and classifies it as correct or incorrect.
from phoenix.evals import llm_classify, OpenAIModel
import nest_asyncio

nest_asyncio.apply()

# Configure your evaluation model (e.g., GPT-4o-mini)
model = OpenAIModel(
    api_key="YOUR_OPENAI_API_KEY",
    model="gpt-4o-mini",
)

# Run the evaluation
rails = ["correct", "incorrect"]
eval_results = llm_classify(
    data=sessions_df,
    template=SESSION_CORRECTNESS_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)
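The result is a dataframe aligned with sessions_df by index, with label and explanation columns. A quick look before logging can catch obvious issues:

# Distribution of session-level labels and one sample explanation.
print(eval_results["label"].value_counts())
print(eval_results["explanation"].iloc[0])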
Analyzing and Logging Results
Finally, merge the evaluation results back with your session data and log them to Arize. The evaluation results are logged as attributes on the root span of the first trace in each session. This allows you to track session-level performance directly within the Arize platform.
import pandas as pd
from arize.pandas.logger import Client

# Merge evaluation results with session data
merged_df = pd.merge(sessions_df, eval_results, left_index=True, right_index=True)
merged_df.rename(
    columns={
        "label": "session_eval.SessionCorrectness.label",
        "explanation": "session_eval.SessionCorrectness.explanation",
    },
    inplace=True,
)

# Get the root span for each session to log the evaluation against
root_spans = (
    primary_df.sort_values("start_time")
    .drop_duplicates(subset=["attributes.session.id"], keep="first")[
        ["attributes.session.id", "context.span_id"]
    ]
)

# Merge to get the root span_id for each session
final_df = pd.merge(
    merged_df,                              # left: session-level evals
    root_spans,                             # right: root span per session
    left_on="session_id",                   # column in merged_df
    right_on="attributes.session.id",       # column in root_spans
    how="left",
)
final_df = final_df.set_index("context.span_id", drop=False)

# Log evaluations back to Arize
arize_client = Client(space_id="YOUR_SPACE_ID", api_key="YOUR_API_KEY")
response = arize_client.log_evaluations_sync(
    dataframe=final_df,
    model_id="YOUR_MODEL_ID",
)
Once logged, you can view these session-level evaluations in the Arize UI and use them to analyze session performance, identify problematic conversation patterns, and improve your AI agent's multi-turn capabilities.