Trace-Level Evaluations for a Recommendation Agent

This tutorial shows you how to run trace-level evaluations on a movie recommendation agent using Arize.

Trace-level evaluations provide granular insights into individual user interactions, enabling you to assess performance on a per-request basis. This approach is particularly valuable for identifying specific successes and failures in end-to-end system performance.

We'll go through the following steps:

  • Set up tracing for a movie recommendation agent built with the OpenAI Agents SDK

  • Build and capture individual traces representing single user requests

  • Evaluate each trace across key dimensions (tool calling order and recommendation relevance)

  • Format evaluation outputs to match Arize's schema

  • Log results back to Arize for monitoring and analysis


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the accompanying notebook.
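Set Up Tracing

Before building the agent, register a tracer so that every agent run is exported to Arize. Below is a minimal sketch assuming the arize-otel and openinference-instrumentation-openai-agents packages are installed; the space ID, API key, and project name are placeholders to replace with your own values (see the accompanying notebook for the full setup).

import os

from arize.otel import register
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

# Register an OpenTelemetry tracer provider that exports spans to Arize.
# Reuse the same project name later as model_id when exporting spans for evaluation.
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name="movie-recommendation-agent",  # placeholder project name
)

# Auto-instrument the OpenAI Agents SDK so each Runner.run call produces a trace.
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)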

Build Movie Recommendation Agent

Create a movie recommendation agent with three specialized tools:

from agents import Agent, Runner, function_tool
from typing import List, Union
from openai import OpenAI
import ast

client = OpenAI()

@function_tool
def movie_selector_llm(genre: str) -> List[str]:
    prompt = (
        f"List up to 5 recent popular streaming movies in the {genre} genre. "
        "Provide only movie titles as a Python list of strings."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
    )
    content = response.choices[0].message.content
    try:
        movie_list = ast.literal_eval(content)
        if isinstance(movie_list, list):
            return movie_list[:5]
    except (ValueError, SyntaxError):
        pass
    # Fall back to newline splitting if the response isn't a valid Python list
    return content.split('\n')

@function_tool
def reviewer_llm(movies: Union[str, List[str]]) -> str:
    if isinstance(movies, list):
        movies_str = ", ".join(movies)
        prompt = f"Sort the following movies by rating from highest to lowest and provide a short review for each:\n{movies_str}"
    else:
        prompt = f"Provide a short review and rating for the movie: {movies}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()

@function_tool
def preview_summarizer_llm(movie: str) -> str:
    prompt = f"Write a 1-2 sentence summary describing the movie '{movie}'."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()

Create and Test the Agent

agent = Agent(
    name="MovieRecommendationAgentLLM",
    tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],
    instructions=(
        "You are a helpful movie recommendation assistant with access to three tools:\n"
        "1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\n"
        "2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\n"
        "3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\n\n"
        "Your goal is to provide a helpful, user-friendly response combining relevant information."
    ),
)

import asyncio

async def main():
    user_input = "Which comedy movie should I watch?"
    result = await Runner.run(agent, user_input)
    print(result.final_output)

await main()
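Top-level await works here because a notebook already runs an event loop. If you move this code into a standalone script, wrap the call with asyncio.run instead:

import asyncio

if __name__ == "__main__":
    asyncio.run(main())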

Generate Multiple Traces

Run the agent with various questions to generate multiple traces for evaluation:

questions = [
    "Which Batman movie should I watch?",
    "I want to watch a good romcom",
    "What is a very scary horror movie?",
    "Name a feel-good holiday movie",
    "Recommend a musical with great songs",
    "Give me a classic drama from the 90s"
]

for question in questions:
    result = await Runner.run(agent, question)

Get Span Data from Arize

Export your traces from Arize and prepare them for evaluation:

import os
from datetime import datetime, timedelta, timezone

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

# Use a dedicated name so the OpenAI client defined earlier isn't shadowed
export_client = ArizeExportClient(api_key=os.environ["ARIZE_API_KEY"])

# model_id is the name of the Arize project the agent traces were sent to
primary_df = export_client.export_model_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    model_id=model_id,
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)

import pandas as pd

# Collapse spans into one row per trace: keep the trace's first input and join all span outputs
trace_df = (
    primary_df.groupby("context.trace_id")
      .agg({
          "attributes.input.value": "first",
          "attributes.output.value": lambda x: " ".join(x.dropna()),
      })
)

trace_df.head()

Define and Run Evaluators

Tool Calling Order Evaluation

Evaluate whether the agent uses tools in the correct logical sequence:

from phoenix.evals import llm_classify, OpenAIModel
import nest_asyncio

nest_asyncio.apply()

TOOL_CALLING_ORDER = """
You are evaluating the correctness of the tool calling order in an LLM application's trace.

You will be given:
1. The user input that initiated the trace
2. The full trace output, including the sequence of tool calls made by the agent 

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` → 
- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively. 
- A proper answer involves calls to reviews, summaries, and recommendations where relevant.
2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.
"""

model = OpenAIModel(
    api_key = os.environ["OPENAI_API_KEY"],
    model   = "gpt-4o-mini",
    temperature = 0.0,
)

rails = ["correct", "incorrect"]

tool_eval_results = llm_classify(
    dataframe           = trace_df,
    template            = TOOL_CALLING_ORDER,
    model               = model,
    rails               = rails,
    provide_explanation = True,   
    verbose             = False,
)
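llm_classify returns one row per trace with a label column and, because provide_explanation=True, an explanation column. A quick, optional sanity check before logging the results:

# Distribution of labels across traces
print(tool_eval_results["label"].value_counts())

# Spot-check a few explanations alongside their labels
print(tool_eval_results[["label", "explanation"]].head())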

Recommendation Relevance Evaluation

Evaluate whether the movie recommendations match the user's request:

RECOMMENDATION_RELEVANCE = """
You are evaluating the relevance of movie recommendations provided by an LLM application.

You will be given:
1. The user input that initiated the trace
2. The list of movie recommendations output by the system

##
User Input:
{attributes.input.value}

Recommendations:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` →
- All recommended movies match the requested genre or criteria in the user input.
- The recommendations are relevant to the user's request and are not repetitive.
2. `incorrect` → One or more recommendations do not match the requested genre or criteria.
"""

relevance_eval_results = llm_classify(
    dataframe           = trace_df,
    template            = RECOMMENDATION_RELEVANCE,
    model               = model,
    rails               = rails,
    provide_explanation = True,   
    verbose             = False,
)

Log Results Back to Arize

Format and log the evaluation results back to Arize for monitoring:

from arize.pandas.logger import Client

# Rename columns to match Arize schema
tool_eval_results = tool_eval_results.rename(columns={
    "label": "ToolEvaluation.label",
    "explanation": "ToolEvaluation.explanation",
})[["ToolEvaluation.label", "ToolEvaluation.explanation"]]

relevance_eval_results = relevance_eval_results.rename(columns={
    "label": "RecommendationRelevance.label",
    "explanation": "RecommendationRelevance.explanation",
})[["RecommendationRelevance.label", "RecommendationRelevance.explanation"]]

# Combine evaluation results
combined_eval_results = tool_eval_results \
    .join(relevance_eval_results, how="outer")

# Merge with trace data
merged_df = pd.merge(trace_df, combined_eval_results, left_index=True, right_index=True)
merged_df.rename(
    columns={
        "ToolEvaluation.label": "trace_eval.ToolEvaluation.label",
        "ToolEvaluation.explanation": "trace_eval.ToolEvaluation.explanation",
        "RecommendationRelevance.label": "trace_eval.RecommendationRelevance.label",
        "RecommendationRelevance.explanation": "trace_eval.RecommendationRelevance.explanation",
    },
    inplace=True,
)

# Get root spans for logging (trace_df is indexed by trace ID, so reset it to a column before merging)
root_spans = primary_df[primary_df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
log_df = merged_df.reset_index().merge(root_spans, on="context.trace_id", how="left")
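
# Optional sanity check (uses the columns created above): every row should carry a
# root span ID and both trace-level evaluation labels before logging.
print(log_df[[
    "context.span_id",
    "trace_eval.ToolEvaluation.label",
    "trace_eval.RecommendationRelevance.label",
]].head())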

# Log evaluations back to Arize
arize_client = Client(
    space_id = os.environ["ARIZE_SPACE_ID"],
    api_key  = os.environ["ARIZE_API_KEY"],
)
resp = arize_client.log_evaluations_sync(
    dataframe = log_df,
    model_id  = model_id,
)

View Results in Arize

After logging the evaluations, you can view the results in the Traces tab of your Arize project. The evaluation results will populate for each trace, allowing you to:

  • Monitor trace-level performance metrics

  • Identify patterns in agent effectiveness

  • Track recommendation quality and relevance
