Session-Level Evaluations
This guide demonstrates how to evaluate the effectiveness of AI agent interactions at the session level, where a session consists of multiple traces (individual interactions) between a user and the system.
Session-level evaluations are crucial for assessing:
Coherence across multiple interactions
Context retention between interactions
Overall goal achievement across an entire conversation
Appropriate progression through complex multi-step tasks
Prerequisites
If you don't already have the necessary libraries, install them with pip:
pip install arize arize-phoenix pandas openai nest_asyncio arize-phoenix-evals
You'll need an application with instrumented traces. See the Tracing documentation to get started.
Your traces must be instrumented with `session.id` attributes to group interactions into sessions. Refer to the Sessions documentation for instructions on how to add session and user IDs to your spans.
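For example, if your application is instrumented with OpenTelemetry, one way to attach the attribute is to set it on the active span. The sketch below is illustrative only: the `session.id` key matches the convention used in this guide, and `handle_user_message` is a hypothetical handler in your application.
from opentelemetry import trace

def handle_user_message(session_id: str, user_message: str) -> str:
    # Tag the current span with the session identifier so Arize can group
    # this trace with the other traces from the same conversation.
    span = trace.get_current_span()
    span.set_attribute("session.id", session_id)
    # ... invoke your agent / LLM here and return its response ...
    return "agent response"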
Implementation
The process involves extracting session data from Arize, preparing it, running an LLM-as-a-judge evaluation, and logging the results back to your Arize project.
1. Data Extraction
First, pull the trace data for your model from Arize using the `ArizeExportClient`. You will need to specify your Space ID, Model ID, and a time range.
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone
client = ArizeExportClient(api_key="YOUR_API_KEY")
# Export the last 7 days of tracing data for the model
primary_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)
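Before moving on, it can help to confirm that the export actually contains session identifiers. The column names below follow the export format used throughout this guide; adjust them if your instrumentation differs.
# Sanity check: the export should include a session ID column with at least one session
print(primary_df.shape)
print(primary_df["attributes.session.id"].nunique(), "unique sessions")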
2. [Optional] Filtering for Relevant Sessions
To evaluate complete conversational flows, it's best to filter for sessions that contain specific traces of interest and then pull all traces belonging to those sessions.
This helper function identifies sessions containing traces that meet your criteria (e.g., traces with a specific tool call, or root spans only) and returns all traces from those matching sessions.
import pandas as pd

# This helper function is defined in the accompanying notebook.
# It allows filtering sessions based on trace or span attributes.
from session_eval_utils import filter_sessions_by_trace_criteria

# Example: Filter for sessions that have root spans (spans with no parent).
# This helps focus on the top-level interactions.
eval_traces = filter_sessions_by_trace_criteria(
    df=primary_df,
    span_filters={"parent_id": {"==": None}},
)
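If you are not working from the notebook, a minimal version of this helper might look like the sketch below. It assumes the filter syntax shown above (a mapping of column names to {operator: value} pairs) and handles only the `==` and `!=` operators; the notebook utility may support more.
import pandas as pd

def filter_sessions_by_trace_criteria(df: pd.DataFrame, span_filters: dict) -> pd.DataFrame:
    """Return every span from sessions that contain at least one span matching the filters."""
    mask = pd.Series(True, index=df.index)
    for column, conditions in span_filters.items():
        for op, value in conditions.items():
            if op == "==":
                # Treat None as "is null" so filters like {"parent_id": {"==": None}} work
                mask &= df[column].isna() if value is None else (df[column] == value)
            elif op == "!=":
                mask &= df[column].notna() if value is None else (df[column] != value)
    # Keep all spans belonging to any session with at least one matching span
    matching_sessions = df.loc[mask, "attributes.session.id"].unique()
    return df[df["attributes.session.id"].isin(matching_sessions)]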
3. Preparing Session Data for Evaluation
Next, transform the raw trace data into a format suitable for an LLM judge. This function groups all interactions by `session.id`, orders them chronologically, and concatenates the user inputs and agent outputs for each session.
import pandas as pd
def prepare_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse spans into a single row per session with ordered user/assistant messages."""
    sessions = []
    # Sort spans chronologically, then group by the session identifier
    grouped = (
        df.sort_values("start_time")
        .groupby("attributes.session.id", as_index=False)
    )
    for session_id, group in grouped:
        sessions.append(
            {
                "session_id": session_id,
                # Collect all user inputs for the session (dropping any nulls)
                "user_inputs": group["attributes.input.value"].dropna().tolist(),
                # Collect all assistant responses for the session (dropping any nulls)
                "output_messages": group["attributes.output.value"].dropna().tolist(),
                # Count how many distinct traces are in this session
                "trace_count": group["context.trace_id"].nunique(),
            }
        )
    return pd.DataFrame(sessions)
# Build the session-level dataframe using the filtered traces from step 2
sessions_df = prepare_sessions(eval_traces)
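A quick look at the result confirms that interactions were grouped as expected before anything is sent to the judge:
# Spot-check the prepared sessions
print(f"{len(sessions_df)} sessions prepared")
print(sessions_df[["session_id", "trace_count"]].head())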
For a more complete, end-to-end example (including optional filtering utilities and logging), see the Session Evals Notebook.
4. Designing the Evaluation Prompt
The core of the evaluation is a carefully designed prompt that instructs an LLM on how to assess session-level quality. The prompt should ask the LLM to evaluate coherence, context utilization, and goal progression.
Here is an example prompt template. You can customize it for your specific needs.
SESSION_CORRECTNESS_PROMPT = """
You are a helpful AI bot that evaluates the effectiveness and correctness of an AI agent's session.
A session consists of multiple traces (interactions) between a user and an AI system. I will provide you with:
1. The user inputs that initiated each trace in the session, in chronological order.
2. The AI's output messages for each trace in the session, in chronological order.
An effective and correct session:
- Shows consistent understanding of user intentions across traces
- Maintains context and coherence between interactions
- Successfully achieves the overall user goals
- Builds upon previous interactions in the conversation
##
User Inputs:
{user_inputs}
Output Messages:
{output_messages}
##
Evaluate the session based on the given criteria. Your response must be a single string, either `correct` or `incorrect`, and must not include any additional text.
- Respond with `correct` if the session effectively accomplishes user goals with appropriate responses and coherence.
- Respond with `incorrect` if the session shows confusion, inappropriate responses, or fails to accomplish user goals.
"""
5. Running the Evaluation
With the data prepared and the prompt defined, use `phoenix.evals.llm_classify` to execute the evaluation. This sends each session's data to the specified model and classifies it as `correct` or `incorrect`.
from phoenix.evals import llm_classify, OpenAIModel
import nest_asyncio
nest_asyncio.apply()
# Configure your evaluation model (e.g., GPT-4o-mini)
model = OpenAIModel(
    api_key="YOUR_OPENAI_API_KEY",
    model="gpt-4o-mini",
)

# Run the evaluation
rails = ["correct", "incorrect"]
eval_results = llm_classify(
    data=sessions_df,
    template=SESSION_CORRECTNESS_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)
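Before logging, it is worth inspecting the judge's verdicts. `llm_classify` returns a dataframe with a `label` column and, because `provide_explanation=True`, an `explanation` column:
# Summarize verdicts and spot-check a few explanations
print(eval_results["label"].value_counts())
print(eval_results[["label", "explanation"]].head())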
6. Analyzing and Logging Results
Finally, merge the evaluation results back with your session data and log them to Arize. The evaluation results are logged as attributes on the root span of the first trace in each session. This allows you to track session-level performance directly within the Arize platform.
import pandas as pd
from arize.pandas.logger import Client
# Merge evaluation results with session data
merged_df = pd.merge(sessions_df, eval_results, left_index=True, right_index=True)
merged_df.rename(
    columns={
        "label": "session_eval.SessionCorrectness.label",
        "explanation": "session_eval.SessionCorrectness.explanation",
    },
    inplace=True,
)
# Get the root span for each session to log the evaluation against
root_spans = (
    primary_df.sort_values("start_time")
    .drop_duplicates(subset=["attributes.session.id"], keep="first")[
        ["attributes.session.id", "context.span_id"]
    ]
)
# Merge to get the root span_id for each session
final_df = pd.merge(
    merged_df,                         # session data with evaluation results
    root_spans,                        # root span for each session
    left_on="session_id",              # column in merged_df
    right_on="attributes.session.id",  # column in root_spans
    how="left",
)
final_df = final_df.set_index("context.span_id", drop=False)
# Log evaluations back to Arize
arize_client = Client(space_id="YOUR_SPACE_ID", api_key="YOUR_API_KEY")
response = arize_client.log_evaluations_sync(
    dataframe=final_df,
    model_id="YOUR_MODEL_ID",
)
Once logged, you can view these session-level evaluations in the Arize UI.