Session-Level Evals

Evaluate entire user conversations to measure coherence, context retention, and overall goal achievement in your LLM applications.

While individual trace evaluations are useful for assessing single interactions, session-level evaluations allow you to analyze the entire lifecycle of a user's conversation with your AI agent or chatbot.

This is crucial for understanding the overall user experience and identifying issues that only emerge over multiple turns. For example, a chatbot might answer a single question correctly but fail to handle a follow-up question, leading to a poor user experience.

Session-level evaluations are particularly useful for assessing:

  • Coherence: Does the agent maintain a consistent and logical conversation flow?

  • Context Retention: Does the agent remember and correctly utilize information from earlier in the conversation?

  • Goal Achievement: Did the user successfully achieve their overall goal by the end of the session?

  • Task Progression: For multi-step tasks, does the conversation progress logically toward completion?

To run evaluations at the session level in the UI, set the evaluator scope to “Session” for each evaluator you want to operate at that level. You will then see the evaluation output populate next to each session. You can hover over an evaluation to view details like its score and explanation, or filter by results.
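
If you want to prototype the same idea outside the UI, the sketch below shows the general shape of a session-level evaluator: concatenate the session's turns into a single transcript and score it as one unit against criteria like coherence, context retention, and goal achievement. Everything here (`Turn`, `evaluate_session`, the rubric text, and the `judge` callable that wraps your own LLM client) is a hypothetical illustration, not a reference to this product's SDK.

```python
# Hypothetical sketch of a session-level evaluator. All names here
# (Turn, evaluate_session, RUBRIC, judge) are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

RUBRIC = (
    "Rate the ENTIRE conversation below from 0.0 to 1.0, considering "
    "coherence, context retention, and whether the user achieved their "
    "overall goal. Reply with a single number only."
)

def evaluate_session(turns: list[Turn], judge: Callable[[str], str]) -> float:
    """Score the whole session as one unit, not turn by turn."""
    transcript = "\n".join(f"{t.role}: {t.content}" for t in turns)
    reply = judge(f"{RUBRIC}\n\nConversation:\n{transcript}")
    return float(reply.strip())

# Example with a stub judge; in practice `judge` would call your LLM.
session = [
    Turn("user", "Book me a flight to Berlin on Friday."),
    Turn("assistant", "Done! Anything else?"),
    Turn("user", "Make it Saturday instead."),
    Turn("assistant", "Which city are you flying to?"),  # context lost
]
print(evaluate_session(session, judge=lambda prompt: "0.4"))
```

In practice you would parse the judge's reply more defensively and ask it to return an explanation alongside the score, mirroring the score-and-explanation details shown in the UI.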
