1 of 26

Cookbooks

Featured Tutorials

Agent Cookbooks

Tracing and Evaluating Agents

Agent Demos

Example agents are fully instrumented with OpenInference and utilize end-to-end tracing with Phoenix for comprehensive performance analysis. Enter your Phoenix and OpenAI keys to view traces.

Tracing

Cookbooks

Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.

Structured Data Extraction

Framework

Example notebook

Open AI Functions

Overview

Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense - LLMs are great at inherently capturing the structure of language, so extracting that structure from text using LLM prompting is a low cost, high scale method to pull out relevant data from unstructured text.

Structured Extraction at a Glance

LLM Input: Unstructured text + schema + system message

LLM Output: Response based on provided text + schema

Evaluation Metrics:

Did the LLM extract the text correctly? (correctness)

One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:

User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?

As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:

{
    budget: "low",
    location: "San Francisco",
    purpose: "pleasure"
}

With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.

Implementing a structured extraction application

Structured extraction is a place where it’s simplest to work directly with the OpenAI function calling API. Open AI functions for structured data extraction recommends providing the following JSON schema object in the form ofparameters_schema(the desired fields for structured data output).

parameters_schema = {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": 'The desired destination location. Use city, state, and country format when possible. If no destination is provided, return "unstated".',
        },
        "budget_level": {
            "type": "string",
            "enum": ["low", "medium", "high", "not_stated"],
            "description": 'The desired budget level. If no budget level is provided, return "not_stated".',
        },
        "purpose": {
            "type": "string",
            "enum": ["business", "pleasure", "other", "non_stated"],
            "description": 'The purpose of the trip. If no purpose is provided, return "not_stated".',
        },
    },
    "required": ["location", "budget_level", "purpose"],
}
function_schema = {
    "name": "record_travel_request_attributes",
    "description": "Records the attributes of a travel request",
    "parameters": parameters_schema,
}
system_message = (
    "You are an assistant that parses and records the attributes of a user's travel request."
)

The ChatCompletion call to Open AI would look like

response = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": travel_request},
    ],
    functions=[function_schema],
    # By default, the LLM will choose whether or not to call a function given the conversation context.
    # The line below forces the LLM to call the function so that the output conforms to the schema.
    function_call={"name": function_schema["name"]},
)

Inspecting structured extraction with Phoenix

You can use phoenix spans and traces to inspect the invocation parameters of the function to

verify the inputs to the model in form of the the user message
verify your request to Open AI
verify the corresponding generated outputs from the model match what's expected from the schema and are correct

Evaluating the Extraction Performance

Point level evaluation is a great starting point, but verifying correctness of extraction at scale or in a batch pipeline can be challenging and expensive. Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:

The diverse nature and format of source data.
The potential absence of a 'ground truth' for comparison.
The intricacies of context and meaning in extracted data.

To learn more about how to evaluate structured extraction applications, head to our documentation on LLM assisted evals!

Annotations

Using Human Annotations for Eval-Driven Development

How to leverage human annotations to build evaluations and experiments that improve your system

In this tutorial, we will explore how to build a custom human annotation interface for Phoenix using . We will then leverage those annotations to construct experiments and evaluate your application.

The purpose of a custom annotations UI is to make it easy for anyone to provide structured human feedback on traces, capturing essential details directly in Phoenix. Annotations are vital for collecting feedback during human review, enabling iterative improvement of your LLM applications.

By establishing this feedback loop and an evaluation pipeline, you can effectively monitor and enhance your system’s performance.

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or video above.

Generate traces to annotate

We will generate some LLM traces and send them to Phoenix. We will then annotate these traces to add labels, scores, or explanations directly onto specific spans.

We deliberately generate some bad or nonsensical traces in the system prompt to demonstrate annotating and experimenting with different types of results.

Launch Custom Annotation UI

Visit our implementation here:

Note: This annotation UI was built for Phoenix Cloud demo purposes and is not optimized for high-volume trace workflows.

How to annotate your traces in Lovable:

Enter your Phoenix Cloud endpoint, API key, and project name. Optionally, also include an identifier to tie annotations to a specific user.
Click Refresh Traces.
Select the traces you want to annotate and click Send to Phoenix.
See your annotations appear instantly in Phoenix.

This tool was built using the Phoenix . For more details on how to build your own custom annotations tool to fit your needs, see .

Create a dataset from annotated spans

Build an Eval based on annotations

Next, you will construct an LLM-as-a-Judge template to evaluate your experiments. This evaluator will mark nonsensical outputs as incorrect. As you experiment, you’ll see evaluation results improve. Once your annotated trace dataset shows consistent improvement, you can confidently apply these changes to your production system.

Experimentation Example: Improving the System Prompt

The next step is to form a hypothesis about why some outputs are failing. In our full walkthrough, we demonstrate the experimentation process by testing out different hypotheses such as swapping out models. However, for demonstration purposes, we will show an experiment that will almost certainly improve your results: modifying the weak system prompt we originally used.

Here, we expect to see improvements in our experiment. The evaluator should flag significantly fewer nonsensical answers as you have refined your system prompt.

Applying Improvements

Now that we’ve completed a successful experimentation cycle and confirmed our improvements on the annotated traces dataset, we can update the application and test the results on the broader dataset. This helps ensure that improvements made during experimentation translate effectively to real-world usage and that your system performs reliably at scale.

Tips for building your custom annotation UI

Here is a sample prompt you can feed into (or a similar tool) to start building your custom LLM trace annotation interface. Feel free to adjust it to your needs. Note that you will need to implement functionality to fetch spans and send annotations to Phoenix. We’ve also included a brief explanation of how we approached this in our own implementation. A tool like this can benefit teams that want to collect human annotation data without requiring annotators to work directly within the Phoenix platform. You can also configure features like “thumbs up” and “thumbs down” buttons to streamline filling in annotation fields. Once submitted, the annotations immediately appear in Phoenix.

Prompt for Lovable:

Build a platform for annotating LLM spans and traces:

Connect to Phoenix Cloud by collecting endpoint, API Key, and project name from the user
Load traces and spans from Phoenix (via or ).
Display spans grouped by trace_id, with clear visual separation.
Allow annotators to assign a label, score, and explanation to each span or entire trace.
Support sending annotations back to Phoenix and reloading to see updates.
Use a clean, modern design

Details on how we built our Annotation UI:

✅ Frontend (Lovable):

Built in Lovable for easy UI generation.
Allows loading LLM traces, displaying spans grouped by trace_id, and annotating spans with label, score, explanation.

✅ Backend (Render, FastAPI):

Hosted on Render using FastAPI.
Adds CORS for your Lovable frontend to communicate securely.
Uses two key endpoints:
1. GET /v1/projects/{project_identifier}/spans
2. POST /v1/span_annotations

Prompt Engineering

Few Shot Prompting

Few-shot prompting is a powerful technique in prompt engineering that helps LLMs perform tasks more effectively by providing a few examples within the prompt.

Unlike zero-shot prompting, where the model must infer the task with no prior context, or one-shot prompting, where a single example is provided, few-shot prompting leverages multiple examples to guide the model’s responses more accurately.

In this tutorial you will:

Explore how different prompting strategies impact performance in a sentiment analysis task on a dataset of reviews.
Run an evaluation to measure how the prompt affects the model’s performance
Track your how your prompt and experiment changes overtime in Phoenix

By the end of this tutorial, you’ll have a clear understanding of how structured prompting can significantly enhance the results of any application.

⚠️You will need an OpenAI Key for this tutorial.

Let’s get started! 🚀

Setup Dependencies and Keys

!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset Into Phoenix

This dataset contains reviews along with their corresponding sentiment labels. Throughout this notebook, we will use the same dataset to evaluate the impact of different prompting techniques, refining our approach with each iteration.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

from datasets import load_dataset

ds = load_dataset("syeddula/fridgeReviews")["train"]
ds = ds.to_pandas()
ds.head()

Set up Phoenix Client

import uuid

import phoenix as px
from phoenix.client import Client as PhoenixClient

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Review"],
    output_keys=["Sentiment"],
    dataset_name=f"review-classification-{unique_id}",
)

Zero-Shot Prompting

Zero-shot prompting is a technique where a language model is asked to perform a task without being given any prior examples. Instead, the model relies solely on its pre-trained knowledge to generate a response. This approach is useful when you need quick predictions without providing specific guidance.

In this section, we will apply zero-shot prompting to our sentiment analysis dataset, asking the model to classify reviews as positive, negative, or neutral without any labeled examples. We’ll then evaluate its performance to see how well it can infer the task based on the prompt alone.

from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

from phoenix.client.types import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications.",
        },
        {"role": "user", "content": "{{Review}}"},
    ],
)

prompt_identifier = "fridge-sentiment-reviews"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="A prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.

Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.

Next we will define a task and evaluator for the experiment.

Because our dataset has ground truth labels, we can use a simple function to check if the output of the task matches the expected output.

def zero_shot_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(**prompt.format(variables={"Review": input["Review"]}))
    return resp.choices[0].message.content.strip()


def evaluate_response(output, expected):
    return output.lower() == expected["Sentiment"].lower()

If you’d like to instrument your code, you can run the cell below. While this step isn’t required for running prompts and evaluations, it enables trace visualization for deeper insights into the model’s behavior.

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name="few-shot-examples")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Finally, we run our experiment. We can view the results of the experiment in Phoenix.

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset,
    task=zero_shot_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot Prompt",
    experiment_name="zero-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

In the following sections, we refine the prompt to enhance the model's performance and improve the evaluation results on our dataset.

One-Shot Prompting

One-shot prompting provides the model with a single example to guide its response. By including a labeled example in the prompt, we give the model a clearer understanding of the task, helping it generate more accurate predictions compared to zero-shot prompting.

In this section, we will apply one-shot prompting to our sentiment analysis dataset by providing one labeled review as a reference. We’ll then evaluate how this small amount of guidance impacts the model’s ability to classify sentiments correctly.

ds = load_dataset("syeddula/fridgeReviews")["test"]
one_shot_example = ds.to_pandas().sample(1)

one_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."

Here is one example of a review and the sentiment:

{examples}
"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": one_shot_template.format(examples=one_shot_example)},
        {"role": "user", "content": "{{Review}}"},
    ],
)

one_shot_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="One-shot prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

Under the prompts tab in Phoenix, we can see that our prompt has an updated version. The prompt includes one random example from the test dataset to help the model make its classification.

Similar to the previous step, we will define the task and run the evaluator. This time, we will be using our updated prompt for One-Shot Prompting and see how the evaluation changes.

def one_shot_prompt_template(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **one_shot_prompt.format(variables={"Review": input["Review"]})
    )
    return resp.choices[0].message.content.strip()

one_shot_experiment = run_experiment(
    dataset,
    task=one_shot_prompt_template,
    evaluators=[evaluate_response],
    experiment_description="One-Shot Prompting",
    experiment_name="one-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + one_shot_prompt.id},
)

In this run, we observe a slight improvement in the evaluation results. Let’s see if we can further enhance performance in the next section.

Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM, the randomness of selected test examples, and other inherent model behaviors.

Few-Shot Prompting

Finally, we will explore few-shot Prompting which enhances a model’s performance by providing multiple labeled examples within the prompt. By exposing the model to several instances of the task, it gains a better understanding of the expected output, leading to more accurate and consistent responses.

In this section, we will apply few-shot prompting to our sentiment analysis dataset by including multiple labeled reviews as references. This approach helps the model recognize patterns and improves its ability to classify sentiments correctly. We’ll then evaluate its performance to see how additional examples impact accuracy compared to zero-shot and one-shot prompting.

ds = load_dataset("syeddula/fridgeReviews")["test"]
few_shot_examples = ds.to_pandas().sample(10)

few_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."

Here are examples of a review and the sentiment:

{examples}
"""
params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": few_shot_template.format(examples=few_shot_examples)},
        {"role": "user", "content": "{{Review}}"},
    ],
)

few_shot_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Few-shot prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

Our updated prompt also lives in Phoenix. We can clearly see how the linear version history of our prompt was built.

Just like previous steps, we run our task and evaluation.

def few_shot_prompt_template(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **few_shot_prompt.format(variables={"Review": input["Review"]})
    )
    return resp.choices[0].message.content.strip()

few_shot_experiment = run_experiment(
    dataset,
    task=few_shot_prompt_template,
    evaluators=[evaluate_response],
    experiment_description="Few Shot Prompting",
    experiment_name="few-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + few_shot_prompt.id},
)

Final Results

In this final run, we observe the most significant improvement in evaluation results. By incorporating multiple examples into our prompt, we provide clearer guidance to the model, leading to better sentiment classification.

Note: Performance may still vary, and in some cases, results might decline. Like before, this is not necessarily "wrong," as factors like the choice of LLM, the randomness of selected test examples, and inherent model behaviors can all influence outcomes.

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️

ReAct Prompting

ReAct (Reasoning + Acting) is a prompting technique that enables LLMs to think step-by-step before taking action. Unlike traditional prompting, where a model directly provides an answer, ReAct prompts guide the model to reason through a problem first, then decide which tools or actions are necessary to reach the best solution.

ReAct is ideal for situations that require multi-step problem-solving with external tools. It also improves transparency by clearly showing the reasoning behind each tool choice, making it easier to understand and refine the model's actions.

In this tutorial, you will:

Learn how to craft prompts, tools, and evaluators in Phoenix
Refine your prompts to understand the power of ReAct prompting
Leverage Phoenix and LLM as a Judge techniques to evaluate accuracy at each step, gaining insight into the model's thought process.
Learn how to apply ReAct prompting in real-world scenarios for improved task execution and problem-solving.

⚠️ You'll need an OpenAI Key for this tutorial.

Let’s get started! 🚀

Set up Dependencies and Keys

!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

import nest_asyncio
import pandas as pd
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from openinference.instrumentation.openai import OpenAIInstrumentor

import phoenix as px
from phoenix.client import Client as PhoenixClient
from phoenix.client.types import PromptVersion
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments import run_experiment
from phoenix.otel import register

nest_asyncio.apply()

Instrument Application

tracer_provider = register(
    project_name="ReAct-examples", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Load Dataset Into Phoenix

This dataset contains 20 customer service questions that a customer might ask a store's chatbot. As we dive into ReAct prompting, we'll use these questions to guide the LLM in selecting the appropriate tools.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

After running this cell, the dataset should will be under the Datasets tab in Phoenix.

from datasets import load_dataset

ds = load_dataset("syeddula/customer_questions")["train"]
ds = ds.to_pandas()
ds.head()
import uuid

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Questions"],
    dataset_name=f"customer-questions-{unique_id}",
)

Define Tools

Next, let's define the tools available for the LLM to use. We have five tools at our disposal, each serving a specific purpose: Product Comparison, Product Details, Discounts, Customer Support, and Track Package.

Depending on the customer's question, the LLM will determine the optimal sequence of tools to use.

tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A.",
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B.",
                    },
                },
                "required": ["product_a_id", "product_b_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Get detailed features on one product.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The unique identifier of the Product.",
                    }
                },
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Checks for discounts and promotions. Applies a discount code to an order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The unique identifier of the order.",
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply.",
                    },
                },
                "required": ["order_id", "discount_code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support).",
                    }
                },
                "required": ["issue_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
]

Initial Prompt

Let's start by defining a simple prompt that instructs the system to utilize the available tools to answer the questions. The choice of which tools to use, and how to apply them, is left to the model's discretion based on the context of each customer query.

params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="auto",
    messages=[
        {
            "role": "system",
            "content": """You are a helpful customer service agent.
            Your task is to determine the best tools to use to answer a customer's question.
            Output the tools and pick 3 tools at maximum.
            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support",
    version=PromptVersion.from_openai(params),
)

Next, we will define the Tool Calling Prompt Template. In this step, we use LLM as a Judge to evaluate the output. LLM as a Judge is a technique where one LLM assesses the performance of another LLM.

This prompt is provided to the LLM-as-Judge model, which takes in both the user's query and the tools the system has selected. The model then uses reasoning to assess how effectively the chosen tools addressed the query, providing an explanation for its evaluation.

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would reasonably help answer the question.
The tool calls have been generated by a separate agent, chosen from the list of
tools provided below. Your job is to decide whether that agent's response was relevant to solving the customer's question.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_calls}
    [END DATA]

Your response must be one of the following:
1. **"correct"** – The chosen tool(s) would sufficiently answer the question.
2. **"mostly_correct"** – The tool(s) are helpful, but a better selection could have been made (at most 1 missing or unnecessary tool).
3. **"incorrect"** – The tool(s) would not meaningfully help answer the question.

Explain why you made your choice.

    [Tool Definitions]:
    product_comparison: Compare features of two products.
    product_details: Get detailed features on one product.
    apply_discount_code: Applies a discount code to an order.
    customer_support: Get contact information for customer support regarding an issue.
    track_package: Track the status of a package based on the tracking number.
"""

In the following cells, we will define a task for the experiment. Then, in the evaluate_response function, we define our LLM as a Judge evaluator. Finally, we run our experiment.

def prompt_task(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score

Experiment

initial_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="initial-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

After running our experiment and evaluation, we can dive deeper into the results. By clicking into the experiment, we can explore the tools that the LLM selected for the specific input. Next, if we click on the trace for the evaluation, we can see the reasoning behind the score assigned by LLM as a Judge for the output.

ReAct Prompt

Next, we iterate on our system prompt using ReAct Prompting techniques. We emphasize that the model should think through the problem step-by-step, break it down logically, and then determine which tools to use and in what order. The model is instructed to output the relevant tools along with their corresponding parameters.

This approach differs from our initial prompt because it encourages reasoning before action, guiding the model to select the best tools and parameters based on the specific context of the query, rather than simply using predefined actions.

params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="required",
    messages=[
        {
            "role": "system",
            "content": """
              You are a helpful customer service agent. Carefully analyze the customer’s question to fully understand their request.
              Step 1: Think step-by-step. Identify the key pieces of information needed to answer the question. Consider any dependencies between these pieces of information.
              Step 2: Decide which tools to use. Choose up to 3 tools that will best retrieve the required information. If multiple tools are needed, determine the correct order to call them.
              Step 3: Output the chosen tools and any relevant parameters.

            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support ReAct Prompt",
    version=PromptVersion.from_openai(params),
)

In the Prompts tab, you will see the updated prompt. As you iterate, you can build a version history.

Just like above, we define our task, construct the evaluator, and run the experiment.

def prompt_task(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score

Experiment

initial_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="improved-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

With our updated ReAct prompt, we can observe that the LLM as a Judge Evaluator rated more outputs as correct. By clicking into the traces, we can gain insights into the reasons behind this improvement. By prompting our LLM to be more thoughtful and purposeful, we can see the reasoning and acting aspects of ReAct.

You can explore the evaluators outputs to better understand the improvements in detail.

Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.

To refine and test these prompts against other datasets, experiment with alternative techniques like Chain of Thought (CoT) prompting to assess how they complement or contrast with ReAct in your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using both the TypeScript and Python Clients.

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️

Chain-of-Thought Prompting

LLMs excel at text generation, but their reasoning abilities depend on how we prompt them. Chain of Thought (CoT) prompting enhances logical reasoning by guiding the model to think step by step, improving accuracy in tasks like math, logic, and multi-step problem solving.

In this tutorial, you will:

Examine how different prompting techniques influence reasoning by evaluating model performance on a dataset.
Refine prompting strategies, progressing from basic approaches to structured reasoning.
Utilize Phoenix to assess accuracy at each stage and explore the model's thought process.
Learn how to apply CoT prompting effectively in real-world tasks.

⚠️ You'll need an OpenAI Key for this tutorial.

Let’s dive in! 🚀

Set up Dependencies and Keys

!pip install -qqqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset Into Phoenix

This dataset includes math word problems, step-by-step explanations, and their corresponding answers. As we refine our prompt, we'll test it against the dataset to measure and track improvements in performance.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

import uuid

from datasets import load_dataset

import phoenix as px
from phoenix.client import Client as PhoenixClient

ds = load_dataset("syeddula/math_word_problems")["train"]
ds = ds.to_pandas()
ds.head()

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Word Problem"],
    output_keys=["Answer"],
    dataset_name=f"wordproblems-{unique_id}",
)

Zero-Shot Prompting - Baseline

Zero-shot prompting is the simplest way to interact with a language model—it involves asking a question without providing any examples or reasoning steps. The model generates an answer based solely on its pre-trained knowledge.

This serves as our baseline for comparison. By evaluating its performance on our dataset, we can see how well the model solves math word problems without explicit guidance. In later sections, we’ll introduce structured reasoning techniques like Chain of Thought (CoT) to measure improvements in accuracy and answers.

from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

from phoenix.client.types import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "You are an evaluator who outputs the answer to a math word problem. Only respond with the integer answer. Be sure not include words, explanations, symbols, labels, or units and round all decimals answers.",
        },
        {"role": "user", "content": "{{Problem}}"},
    ],
)

prompt_identifier = "wordproblems"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="A prompt for computing answers to word problems.",
    version=PromptVersion.from_openai(params),
)

Next, we will define a task and evaluator for the experiment. Then, we run our experiment.

Because our dataset has ground truth labels, we can use a simple function to extract the answer and check if the calculated answer matches the expected output.

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()


def zero_shot_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"Problem": input["Word Problem"]})
    )
    return resp.choices[0].message.content.strip()


def evaluate_response(output, expected):
    if not output.isdigit():
        return False
    return int(output) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=zero_shot_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot Prompt",
    experiment_name="zero-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

We can review the results of the experiment in Phoenix. We achieved ~75% accuracy in this run. In the following sections, we will iterate on this prompt and see how our evaluation changes!

Note: Throughout this tutorial, you will encounter various evaluator outcomes. At times, you may notice a decline in performance compared to the initial experiment. However, this is not necessarily a flaw. Variations in results can arise due to factors such as the choice of LLM, inherent model behaviors, and randomness.

Zero-Shot CoT Prompting

Zero-shot prompting provides a direct answer, but it often struggles with complex reasoning. Zero-Shot Chain of Thought (CoT) prompting improves this by explicitly instructing the model to think step by step before arriving at a final answer.

By adding a simple instruction like “Let’s think through this step by step,” we encourage the model to break down the problem logically. This structured reasoning can lead to more accurate answers, especially for multi-step math problems.

In this section, we'll compare Zero-Shot CoT against our baseline to evaluate its impact on performance. First, let's create the prompt.

zero_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem.

You must always think through the problem logically before providing an answer.

First, show some of your reasoning.

Then output the integer answer ONLY on a final new line. In this final answer, be sure not include words, commas, labels, or units and round all decimals answers.

"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": zero_shot_COT_template},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

zero_shot_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Zero Shot COT prompt",
    version=PromptVersion.from_openai(params),
)

This updated prompt is now lives in Phoenix as a new prompt version.

Next, we run our task and evaluation by extracting the answer from the output of our LLM.

import re


def zero_shot_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **zero_shot_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=zero_shot_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot COT Prompt",
    experiment_name="zero-shot-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + zero_shot_COT.id},
)

By clicking into the experiment in Phoenix, you can take a look at the steps the model took the reach the answer. By telling the model to think through the problem and output reasoning, we see a performance improvement.

Self-Consistency CoT Prompting

Even with Chain of Thought prompting, a single response may not always be reliable. Self-Consistency CoT enhances accuracy by generating multiple reasoning paths and selecting the most common answer. Instead of relying on one response, we sample multiple outputs and aggregate them, reducing errors caused by randomness or flawed reasoning steps.

This method improves robustness, especially for complex problems where initial reasoning steps might vary. In this section, we'll compare Self-Consistency CoT to our previous prompts to see how using multiple responses impacts overall performance.

Let's repeat the same process as above with a new prompt and evaluate the outcome.

consistency_COT_template = """

You are an evaluator who outputs the answer to a math word problem.

Follow these steps:
1. Solve the problem **multiple times independently**, thinking through the solution carefully each time.
2. Show some of your reasoning for each independent attempt.
3. Identify the integer answer that appears most frequently across your attempts.
4. On a **new line**, output only this majority answer as a plain integer with **no words, commas, labels, units, or special characters**.
"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": consistency_COT_template},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

self_consistency_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="self consistency COT prompt",
    version=PromptVersion.from_openai(params),
)

def self_consistency_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **self_consistency_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=self_consistency_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Self Consistency COT Prompt",
    experiment_name="self-consistency-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + self_consistency_COT.id},
)

We've observed a significant improvement in performance! Since the prompt instructs the model to compute the answer multiple times independently, you may notice that the experiment takes slightly longer to run. You can click into the experiment explore to view the independent computations the model performed for each problem.

Few Shot CoT Prompting

Few-shot CoT prompting enhances reasoning by providing worked examples before asking the model to solve a new problem. By demonstrating step-by-step solutions, the model learns to apply similar logical reasoning to unseen questions.

This method leverages in-context learning, allowing the model to generalize patterns from the examples.

In this final section, we’ll compare Few-Shot CoT against our previous prompts.

First, let's construct our prompt by sampling examples from a test dataset.

ds = load_dataset("syeddula/math_word_problems")["test"]
few_shot_examples = ds.to_pandas().sample(5)
few_shot_examples

We now will construct our final prompt, run the experiment, and view the results. Under the Prompts tab in Phoenix, you can track the version history of your prompt and see what random examples were chosen.

few_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem. You must always think through the problem logically before providing an answer. Show some of your reasoning.

Finally, output the integer answer ONLY on a final new line. In this final answer, be sure not include words, commas, labels, or units and round all decimals answers.

Here are some examples of word problems, step by step explanations, and solutions to guide your reasoning:

{examples}
"""
params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": few_shot_COT_template.format(examples=few_shot_examples)},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

few_shot_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Few Shot COT prompt",
    version=PromptVersion.from_openai(params),
)

def few_shot_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **few_shot_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset,
    task=few_shot_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Few-Shot COT Prompt",
    experiment_name="few-shot-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + few_shot_COT.id},
)

Final Results

After running all of your experiments, you can compare the performance of different prompting techniques. Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.

You can review your prompt version history in the Prompts tab and explore the Playground to iterate further and run additional experiments.

To refine and test these prompts against other datasets, experiment with Chain of Thought (CoT) prompting to see its relevance to your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using the TypeScript and Python Clients.

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️

Prompt Optimization

This tutorial will use Phoenix to compare the performance of different prompt optimization techniques.

You'll start by creating an experiment in Phoenix that can house the results of each of your resulting prompts. Next you'll use a series of prompt optimization techniques to improve the performance of a jailbreak classification task. Each technique will be applied to the same base prompt, and the results will be compared using Phoenix.

The techniques you'll use are:

Few Shot Examples: Adding a few examples to the prompt to help the model understand the task.
Meta Prompting: Prompting a model to generate a better prompt based on previous inputs, outputs, and expected outputs.
Prompt Gradients: Using the gradient of the prompt to optimize individual components of the prompt using embeddings.
DSPy Prompt Tuning: Using DSPy, an automated prompt tuning library, to optimize the prompt.

⚠️ This tutorial requires and OpenAI API key.

Let's get started!

Setup Dependencies & Keys

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.

Load Dataset into Phoenix

Since we'll be running a series of experiments, we'll need a dataset of test cases that we can run each time. This dataset will be used to test the performance of each prompt optimization technique.

Next, you can define a base template for the prompt. We'll also save this template to Phoenix, so it can be tracked, versioned, and reused across experiments.

You should now see that prompt in Phoenix:

Next you'll need a task and evaluator for the experiment. A task is a function that will be run across each example in the dataset. The task is also the piece of your code that you'll change between each run of the experiment. To start off, the task is simply a call to GPT 3.5 Turbo with a basic prompt.

You'll also need an evaluator that will be used to test the performance of the task. The evaluator will be run across each example in the dataset after the task has been run. Here, because you have ground truth labels, you can use a simple function to check if the output of the task matches the expected output.

You can also instrument your code to send all models calls to Phoenix. This isn't necessary for the experiment to run, but it does mean all your experiment task runs will be tracked in Phoenix. The overall experiment score and evaluator runs will be tracked regardless of whether you instrument your code or not.

Now you can run the initial experiment. This will be the base prompt that you'll be optimizing.

You should now see the initial experiment results in Phoenix:

Prompt Optimization Technique #1: Few Shot Examples

One common prompt optimization technique is to use few shot examples to guide the model's behavior.

Here you can add few shot examples to the prompt to help improve performance. Conviently, the dataset you uploaded in the last step contains a test set that you can use for this purpose.

Define a new prompt that includes the few shot examples. Prompts in Phoenix are automatically versioned, so saving the prompt with the same name will create a new version that can be used.

You'll notice you now have a new version of the prompt in Phoenix:

Define a new task with your new prompt:

Now you can run another experiment with the new prompt. The dataset of test cases and the evaluator will be the same as the previous experiment.

Prompt Optimization Technique #2: Meta Prompting

Meta prompting involves prompting a model to generate a better prompt, based on previous inputs, outputs, and expected outputs.

The experiment from round 1 serves as a great starting point for this technique, since it has each of those components.

Now construct a new prompt that will be used to generate a new prompt.

Now save that as a prompt in Phoenix:

Run this new prompt through the same experiment

Redefine the task, using the new prompt.

Prompt Optimization Technique #3: Prompt Gradient Optimization

Prompt gradient optimization is a technique that uses the gradient of the prompt to optimize individual components of the prompt using embeddings. It involves:

Converting the prompt into an embedding.
Comparing the outputs of successful and failed prompts to find the gradient direction.
Moving in the gradient direction to optimize the prompt.

Here you'll define a function to get embeddings for prompts, and then use that function to calculate the gradient direction between successful and failed prompts.

Run experiment with gradient-optimized prompt

Redefine the task, using the new prompt.

Prompt Optimization Technique #4: Prompt Tuning with DSPy

Finally, you can use an optimization library to optimize the prompt, like DSPy. supports each of the techniques you've used so far, and more.

DSPy makes a series of calls to optimize the prompt. It can be useful to see these calls in action. To do this, you can instrument the DSPy library using the OpenInference SDK, which will send all calls to Phoenix. This is optional, but it can be useful to have.

Now you'll setup the DSPy language model and define a prompt classification task.

Your classifier can now be used to make predictions as you would a normal LLM. It will expect a prompt input and will output a label prediction.

However, DSPy really shines when it comes to optimizing prompts. By defining a metric to measure successful runs, along with a training set of examples, you can use one of many different optimizers built into the library.

In this case, you'll use the MIPROv2 optimizer to find the best prompt for your task.

DSPy takes care of our prompts in this case, however you could still save the resulting prompt value in Phoenix:

Run experiment with DSPy-optimized classifier

Redefine the task, using the new prompt.

Prompt Optimization Technique #5: DSPy with GPT-4o

In the last example, you used GPT-3.5 Turbo to both run your pipeline, and optimize the prompt. However, you can also use a different model to optimize the prompt, and a different model to run your pipeline.

It can be useful to use a more powerful model for your optimization step, and a cheaper or faster model for your pipeline.

Here you'll use GPT-4o to optimize the prompt, and keep GPT-3.5 Turbo as your pipeline model.

Run experiment with DSPy-optimized classifier using GPT-4o

Redefine the task, using the new prompt.

Results

And just like that, you've run a series of prompt optimization techniques to improve the performance of a jailbreak classification task, and compared the results using Phoenix.

You should have a set of experiments that looks like this:

From here, you can check out more , and if you haven't already, ⭐️

LLM as a Judge Prompt Optimization

LLM as a Judge

An LLM as a Judge refers to using an LLM as a tool for evaluating and scoring responses based on predefined criteria.

While LLMs are powerful tools for evaluation, their performance can be inconsistent. Factors like ambiguity in the prompt, biases in the model, or a lack of clear guidelines can lead to unreliable results. By fine-tuning your LLM as a Judveprompts, you can improve the model's consistency, fairness, and accuracy, ensuring it delivers more reliable evaluations.

In this tutorial, you will:

Generate an LLM as a Judge evaluation prompt and test it against a datset
Learn about various optimization techniques to improve the template, measuring accuracy at each step using Phoenix evaluations
Understand how to apply these techniques together for better evaluation across your specific use cases

Set Up Dependencies and Keys

!pip install -q "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset into Phoenix

Phoenix offers many pre-built evaluation templates for LLM as a Judge, but often, you may need to build a custom evaluator for specific use cases.

In this tutorial, we will focus on creating an LLM as a Judge prompt designed to assess empathy and emotional intelligence in chatbot responses. This is especially useful for use cases like mental health chatbots or customer support interactions.

We will start by loading a dataset containing 30 chatbot responses, each with a score for empathy and emotional intelligence (out of 10). Throughout the tutorial, we’ll use our prompt to evaluate these responses and compare the output to the ground-truth labels. This will allow us to assess how well our prompt performs.

from datasets import load_dataset

ds = load_dataset("syeddula/empathy_scores")["test"]
ds = ds.to_pandas()
ds.head()

import uuid

import phoenix as px

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["AI_Response", "EI_Empathy_Score"],
    output_keys=["EI_Empathy_Score"],
    dataset_name=f"empathy-{unique_id}",
)

Generate LLM as a Judge Template using Meta Prompting

Before iterating on our template, we need to establish a prompt. Running the cell below will generate an LLM as a Judge prompt specifically for evaluating empathy and emotional intelligence. When generating this template, we emphasize:

Picking evaluation criteria (e.g., empathy, emotional support, emotional intelligence).
Defining a clear scoring system (1-10 scale with defined descriptions).
Setting response formatting guidelines for clarity and consistency.
Including an explanation for why the LLM selects a given score.

from openai import OpenAI

client = OpenAI()


def generate_eval_template():
    meta_prompt = """
    You are an expert in AI evaluation and emotional intelligence assessment. Your task is to create a structured evaluation template for assessing the emotional intelligence and empathy of AI responses to user inputs.

    ### Task Overview:
    Generate a detailed evaluation template that measures the AI’s ability to recognize user emotions, respond empathetically, and provide emotionally appropriate responses. The template should:
    - Include 3 to 5 distinct evaluation criteria that assess different aspects of emotional intelligence.
    - Define a scoring system on a scale of 1 to 10, ensuring a broad distribution of scores across different responses.
    - Provide clear, tiered guidelines for assigning scores, distinguishing weak, average, and strong performance.
    - Include a justification section requiring evaluators to explain the assigned score with specific examples.
    - Ensure the scoring rubric considers complexity and edge cases, preventing generic or uniform scores.

    ### Format:
    Return the evaluation template as plain text, structured with headings, criteria, and a detailed scoring rubric. The template should be easy to follow and apply to real-world datasets.

    ### Scoring Guidelines:
    - The scoring system must be on a **scale of 1 to 10** and encourage a full range of scores.
    - Differentiate between strong, average, and weak responses using specific, well-defined levels.
    - Require evaluators to justify scores

    Do not include any concluding remarks such as 'End of Template' or similar statements. The template should end naturally after the final section.

    """

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": meta_prompt}],
            temperature=0.9,  # High temperature for more creativity
        )

        return response.choices[0].message.content
    except Exception as e:
        return {"error": str(e)}


print("Generating new evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE = generate_eval_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE)

Testing Our Initial Prompt

Instrument the application to send traces to Phoenix:

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register(
    project_name="LLM-as-a-Judge", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Now that we have our baseline prompt, we need to set up two key components:

Task: The LLM as a Judge evaluation, where the model scores chatbot responses based on empathy and emotional intelligence.
Evaluator: A function that compares the LLM as a Judge output to the ground-truth labels from our dataset

Finally, we run our experiment. With this setup, we can measure how well our prompt initially performs.

import pandas as pd

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)


def initial_prompt(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


def evaluate_response(input, output):
    expected_score = input["EI_Empathy_Score"]
    predicted_score = output
    return abs(expected_score - predicted_score) <= 2

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset, task=initial_prompt, evaluators=[evaluate_response], experiment_name="initial_prompt"
)

Iteration 1: Improve Accuracy

If you find that your LLM as a Judge prompt has low accuracy, we can make adjustmenets to the prompt to improve that. In this section, we explore 2 techniques for this: few shot examples and keeping a human in the loop.

Technique 1: Few Shot Examples

Few-shot examples help improve the accuracy of an LLM as a Judge prompt by providing clear reference points for evaluation. Instead of relying solely on general instructions, the model learns from labeled examples that demonstrate correct scoring and reasoning.

By including a mix of high, medium, and low-scoring responses, we help the model:

Understand nuanced criteria like empathy and emotional intelligence.
Reduce inconsistencies by aligning with real-world judgments.

few_shot_examples = """
---
To help guide your reasoning, compare the text to the following examples and scorings:

Example 1: "You should try to focus and get your work done. Everyone has tough days."
  Score: 2
  Explanation: No acknowledgment of the user’s emotions, dismissive and offers no real support

Example 2: "That sounds really stressful. Have you considered talking to a supervisor or taking breaks?"
  Score: 5
  Explanation: Acknowledges stress, but in a generic way. Provides advice, but not very personal. Could be warmer in tone.

Example 3: "I’m really sorry you’re feeling this way. It’s completely understandable to feel overwhelmed. You’re not alone in this. Have you had a chance to take a break or talk to someone who can support you?"
  Score: 9
  Explanation: Validates emotions, reassures the user, and offers support
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples

def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset,
    task=llm_as_a_judge,
    evaluators=[evaluate_response],
    experiment_name="few_shot_examples",
)

Technique 2: Human in the Loop

Keeping a human in the loop improves the accuracy of an LLM as a Judge by providing oversight, validation, and corrections where needed. In Phoenix, we can do this with annotations. While LLMs can evaluate responses based on predefined criteria, human reviewers help:

Catch edge cases and biases that the model may overlook.
Refine scoring guidelines by identifying inconsistencies in LLM outputs.
Continuously improve the prompt by analyzing where the model struggles and adjusting instructions accordingly.

However, human review can be costly and time-intensive, making full-scale annotation impractical. Fortunately, even a small number of human-labeled examples can significantly enhance accuracy.

Iteration 2: Reduce Bias

Style Invariant Evaluation

One common bias in LLM as a Judge evaluations is favoring certain writing styles over others. For example, the model might unintentionally rate formal, structured responses higher than casual or concise ones, even if both convey the same level of empathy or intelligence.

To reduce this bias, we focus on style-invariant evaluation, ensuring that the LLM judges responses based on content rather than phrasing or tone. This can be achieved by:

Providing diverse few-shot examples that include different writing styles.
Testing for bias by evaluating responses with varied phrasing and ensuring consistent scoring.

By making evaluations style-agnostic, we create a more robust scoring system that doesn’t unintentionally penalize certain tones.

style_invariant = """
----
To help guide your reasoning, below is an example of how different response styles and tones can achieve similar scores:

#### Scenario: Customer Support Handling a Late Order
User: "My order is late, and I needed it for an important event. This is really frustrating."

Response A (Formal): "I sincerely apologize for the delay..."
Response B (Casual): "Oh no, that’s really frustrating!..."
Response C (Direct): "Sorry about that. I’ll check..."
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + style_invariant

def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="style_invariant"
)

Iteration 3: Reduce Cost and Latency

Longer prompts increase computation costs and response times, making evaluations slower and more expensive. To optimize efficiency, we focus on condensing the prompt while preserving clarity and effectiveness. This is done by:

Removing redundant instructions and simplifying wording.
Using bullet points or structured formats for concise guidance.
Eliminating unnecessary explanations while keeping critical evaluation criteria intact.

A well-optimized prompt reduces token count, leading to faster, more cost-effective evaluations without sacrificing accuracy or reliability.

def generate_condensed_template():
    meta_prompt = """
    You are an expert in prompt engineering and LLM evaluation. Your task is to optimize a given LLM-as-a-judge prompt by reducing its word count significantly while maintaining all essential information, including evaluation criteria, scoring system, and purpose.

    Requirements:
    Preserve all key details such as metrics, scoring guidelines, and judgment criteria.

    Eliminate redundant phrasing and unnecessary explanations.

    Ensure clarity and conciseness without losing meaning.

    Maintain the prompt’s effectiveness for consistent evaluations.

    Output Format:
    Return only the optimized prompt as plain text, with no explanations or commentary.

    """

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "user",
                    "content": "Provided LLM-as-a-judge prompt"
                    + EMPATHY_EVALUATION_PROMPT_TEMPLATE,
                },
                {"role": "user", "content": meta_prompt},
            ],
            temperature=0.9,  # High temperature for more creativity
        )

        return response.choices[0].message.content
    except Exception as e:
        return {"error": str(e)}


print("Generating condensed evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED = generate_condensed_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED)

def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="condensed_prompt"
)

Iteration 4: Self-Refinement (Iterative LLM as Judge)

Self-refinement allows a Judge to improve its own evaluations by critically analyzing and adjusting its initial judgments. Instead of providing a static score, the model engages in an iterative process:

Generate an initial score based on the evaluation criteria.
Reflect on its reasoning, checking for inconsistencies or biases.
Refine the score if needed, ensuring alignment with the evaluation guidelines.

By incorporating this style of reasoning, the model can justify its decisions and self-correct errors.

refinement_text = """
---
After you have done the evaluation, follow these two steps:
1. Self-Critique
Review your initial score:
- Was it too harsh or lenient?
- Did it consider the full context?
- Would others agree with your score?
Explain any inconsistencies briefly.

2. Final Refinement
Based on your critique, adjust your score if necessary.
- Only output a number (1-10)
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + refinement_text

def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="self_refinement"
)

Iteration 5: Combining Techniques

To maximize the accuracy and fairness of our Judge, we will combine multiple optimization techniques. In this example, we will incorporate few-shot examples and style-invariant evaluation to ensure the model focuses on content rather than phrasing or tone.

By applying these techniques together, we aim to create a more reliable evaluation framework.

EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = (
    EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples + style_invariant
)

def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="combined"
)

Final Results

Techniques like few-shot examples, self-refinement, style-invariant evaluation, and prompt condensation each offer unique benefits, but their effectiveness will vary depending on the task.

Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM and other inherent model behaviors.

By systematically testing and combining these approaches, you can refine your evaluation framework.

Datasets & Experiments

Cookbooks

Iteratively improve your LLM task by building datasets, running experiments, and evaluating performance using code and LLM-as-a-judge.

Summarization

Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.

In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:

Upload a dataset of examples containing articles and human-written reference summaries to Phoenix
Define an experiment task that summarizes a news article
Devise evaluators for length and ROUGE score
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs

⚠️ This tutorial requires and OpenAI API key, and optionally, an Anthropic API key.

Let's get started!

Install Dependencies and Import Libraries

Install requirements and import libraries.

pip install anthropic "arize-phoenix>=4.6.0" openai openinference-instrumentation-openai rouge tiktoken

from typing import Any, Dict

import nest_asyncio
import pandas as pd

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", None)  # display full cells of dataframes

Launch Phoenix

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.

import phoenix as px

px.launch_app()

Instrument Your Application

from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Create Your Dataset

Download your data from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM generated summaries.

Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.

from datetime import datetime

from datasets import load_dataset

hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
df = (
    hf_ds["test"]
    .to_pandas()
    .sample(n=10, random_state=0)
    .set_index("id")
    .rename(columns={"highlights": "summary"})
)
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
dataset = px.Client().upload_dataset(
    dataframe=df,
    input_keys=["article"],
    output_keys=["summary"],
    dataset_name=f"news-article-summaries-{now}",
)

Define Your Experiment Task

A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.

You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.

import os
from getpass import getpass

if os.environ.get("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

Next, define a function to format a prompt template and invoke an OpenAI model on an example.

from openai import AsyncOpenAI
from phoenix.experiments import Example

openai_client = AsyncOpenAI()


async def summarize_article_openai(example: Example, prompt_template: str, model: str) -> str:
    formatted_prompt_template = prompt_template.format(article=example.input["article"])
    response = await openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "assistant", "content": formatted_prompt_template},
        ],
    )
    assert response.choices
    return response.choices[0].message.content

From this function, you can use functools.partial to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example.

import textwrap
from functools import partial

template = """
Summarize the article in two to four sentences:

ARTICLE
=======
{article}

SUMMARY
=======
"""
gpt_4o = "gpt-4o-2024-05-13"
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
test_example = dataset.examples[0]
print(textwrap.fill(await task(test_example), width=100))

Define Your Evaluators

Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:

ROUGE-1 precision is the proportion of overlapping tokens (present in both reference and generated summaries) that are present in the generated summary (number of overlapping tokens / number of tokens in the generated summary)
ROUGE-1 recall is the proportion of overlapping tokens that are present in the reference summary (number of overlapping tokens / number of tokens in the reference summary)
ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.

Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 1 / 2 are considered excellent, and a model fine-tuned on this particular dataset achieved a rouge score of ~0.44.

Since we also care about conciseness, you'll also define an evaluator to count the number of tokens in each generated summary.

Note that you can use any third-party library you like while defining evaluators (in your case, rouge and tiktoken).

import tiktoken
from rouge import Rouge


# convenience functions
def _rouge_1(hypothesis: str, reference: str) -> Dict[str, Any]:
    scores = Rouge().get_scores(hypothesis, reference)
    return scores[0]["rouge-1"]


def _rouge_1_f1_score(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["f"]


def _rouge_1_precision(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["p"]


def _rouge_1_recall(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["r"]


# evaluators
def rouge_1_f1_score(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_f1_score(hypothesis=output, reference=expected["summary"])


def rouge_1_precision(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_precision(hypothesis=output, reference=expected["summary"])


def rouge_1_recall(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_recall(hypothesis=output, reference=expected["summary"])


def num_tokens(output: str) -> int:
    encoding = tiktoken.encoding_for_model(gpt_4o)
    return len(encoding.encode(output))


EVALUATORS = [rouge_1_f1_score, rouge_1_precision, rouge_1_recall, num_tokens]

Run Experiments and Iterate on Your Prompt Template

Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.

from phoenix.experiments import run_experiment

experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="initial-template",
    experiment_description="first experiment using a simple prompt template",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

Our initial prompt template contained little guidance. It resulted in an ROUGE-1 F1-score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.

template = """
Summarize the article in two to four sentences. Be concise and include only the most important information.

ARTICLE
=======
{article}

SUMMARY
=======
"""
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="concise-template",
    experiment_description="explicitly instuct the llm to be concise",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

Inspecting the experiment results, you'll notice that the average num_tokens has indeed increased, but the generated summaries are still far more verbose than the reference summaries.

Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.

# examples to include (not included in the uploaded dataset)
train_df = (
    hf_ds["train"]
    .to_pandas()
    .sample(n=5, random_state=42)
    .head()
    .rename(columns={"highlights": "summary"})
)

example_template = """
ARTICLE
=======
{article}

SUMMARY
=======
{summary}
"""

examples = "\n".join(
    [
        example_template.format(article=row["article"], summary=row["summary"])
        for _, row in train_df.iterrows()
    ]
)

template = """
Summarize the article in two to four sentences. Be concise and include only the most important information, as in the examples below.

EXAMPLES
========

{examples}


Now summarize the following article.

ARTICLE
=======
{article}

SUMMARY
=======
"""

template = template.format(
    examples=examples,
    article="{article}",
)
print(template)

Now run the experiment.

task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="few-shot-template",
    experiment_description="include examples",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.

Compare With Another Model (Optional)

⚠️ This section requires an Anthropic API key.

Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.

First, enter your Anthropic API key if it is not already present.

import os
from getpass import getpass

if os.environ.get("ANTHROPIC_API_KEY") is None:
    os.environ["ANTHROPIC_API_KEY"] = getpass("🔑 Enter your Anthropic API key: ")

Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.

from anthropic import AsyncAnthropic

client = AsyncAnthropic()


async def summarize_article_anthropic(example: Example, prompt_template: str, model: str) -> str:
    formatted_prompt_template = prompt_template.format(article=example.input["article"])
    message = await client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": formatted_prompt_template}],
    )
    return message.content[0].text


claude_35_sonnet = "claude-3-5-sonnet-20240620"
task = partial(summarize_article_anthropic, prompt_template=template, model=claude_35_sonnet)

experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="anthropic-few-shot",
    experiment_description="anthropic",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)

If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:

Here is a concise 3-sentence summary of the article...

See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.

Synopsis and Next Steps

Congrats! In this tutorial, you have:

Created a Phoenix dataset
Defined an experimental task and custom evaluators
Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall

As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the OpenAI fine-tuning API to train a bespoke model for your needs.

Text2SQL

Building effective text-to-SQL systems requires rigorous evaluation and systematic experimentation. In this tutorial, we'll walk through the complete evaluation-driven development process, starting from scratch without pre-existing datasets of questions or expected responses.

We'll use a movie database containing recent titles, ratings, box office performance, and metadata to demonstrate how to build, evaluate, and systematically improve a text-to-SQL system using Phoenix's experimentation framework. Think of Phoenix as your scientific laboratory, meticulously recording every experiment to help you build better AI systems.

!pip install "arize-phoenix>=11.0.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio openinference-instrumentation-openai --quiet

Let's first start a phoenix server to act as our evaluation dashboard and experiment tracker. This will be our central hub for observing, measuring, and improving our text-to-SQL system.

Note: this step is not necessary if you already have a Phoenix server running.

import phoenix as px

px.launch_app().view()

Let's also setup tracing for OpenAI. Tracing is crucial for evaluation-driven development - it allows Phoenix to observe every step of our text-to-SQL pipeline, capturing inputs, outputs, and metrics like latency and cost that we'll use to systematically improve our system.

from phoenix.otel import register

tracer_provider = register(
    endpoint="http://localhost:6006/v1/traces", auto_instrument=True, verbose=False
)  # Instruments all OpenAI calls

tracer = tracer_provider.get_tracer(__name__)

Let's make sure we can run async code in the notebook.

import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure we have our OpenAI API key set up.

import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

Download Data

We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data.

import duckdb
from datasets import load_dataset

data = load_dataset("wykonos/movies")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("movies", data.to_pandas())

records = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")

for record in records:
    print(record)

Implement Text2SQL

Let's start by implementing a simple text2sql logic.

import os

import openai

client = openai.AsyncClient()

columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}


system_prompt = (
    "You are a SQL expert, and you are given a single table named movies with the following columns:\n"
    f'{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n'
    "Write a SQL query corresponding to the user's request. Return just the query text, "
    "with no formatting (backticks, markdown, etc.)."
)


@tracer.chain
async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content

query = await generate_query("what was the most popular movie?")
print(query)

Awesome, looks like the we are producing SQL! let's try running the query and see if we get the expected results.

@tracer.tool
def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


execute_query(query)

The Three Pillars of Evaluation

Effective AI evaluation rests on three fundamental pillars:

Data: Curated examples that represent real-world use cases
Task: The actual function or workflow being evaluated
Evaluators: Quantitative measures of performance

Let's start by creating our data - a set of movie-related questions that we want our text-to-SQL system to handle correctly.

questions = [
    "Which Brad Pitt movie received the highest rating?",
    "What is the top grossing Marvel movie?",
    "What foreign-language fantasy movie was the most popular?",
    "what are the best sci-fi movies of 2017?",
    "What anime topped the box office in the 2010s?",
    "Recommend a romcom that stars Paul Rudd.",
]

Let's store the data above as a versioned dataset in phoenix.

import pandas as pd

ds = px.Client().upload_dataset(
    dataset_name="movie-example-questions",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="movie-example-questions")

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

@tracer.chain
async def text2sql(question):
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

Finally, we'll define the evaluation scores. We'll use the following simple functions to see if the generated SQL queries are correct. Note that has_results is a good metric here because we know that all the questions we added to the dataset can be answered via SQL.

# Test if there are no sql execution errors


def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0

Now let's run the evaluation experiment.

import phoenix as px
from phoenix.experiments import run_experiment


# Define the task to run text2sql on the input question
def task(input):
    return text2sql(input["question"])


experiment = run_experiment(
    ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

Great! Let's see how our baseline model performed on the movie questions. We can analyze both successful queries and any failures to understand where improvements are needed.

Interpreting the results

Now that we ran the initial evaluation, let's analyze what might be causing any failures.

From looking at the query where there are no results, genre-related queries might fail because the model doesn't know how genres are stored (e.g., "Sci-Fi" vs "Science Fiction")

These types of issues would probably be improved by showing a sample of the data to the model (few-shot examples) since the data will show the LLM what is queryable.

Let's try to improve the prompt with few-shot examples and see if we can get better results.

samples = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")

example_row = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
    for column in columns
)

column_header = " | ".join(column["column_name"] for column in columns)

few_shot_examples = "\n".join(
    " | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples
)

system_prompt = (
    "You are a SQL expert, and you are given a single table named `movies` with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{example_row}\n"
    "\n"
    "Examples:\n"
    f"{column_header}\n"
    f"{few_shot_examples}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content


print(await generate_query("what are the best sci-fi movies in the 2000s?"))

Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.

experiment = run_experiment(
    ds, task=task, evaluators=[has_results, no_error], experiment_metadata=CONFIG
)

Amazing. It looks like the LLM is generating a valid query for all questions. Let's try out using LLM as a judge to see how well it can assess the results.

import json

from openai import OpenAI

from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import EvaluationResult

openai_client = OpenAI()

judge_instructions = """
You are a judge that determines if a given question can be answered with the provided SQL query and results.
Make sure to ensure that the SQL query maps to the question accurately.

Provide the label `correct` if the SQL query and results accurately answer the question.
Provide the label `invalid` if the SQL query does not map to the question or is not valid.
"""


@create_evaluator(name="qa_correctness", kind="llm")
def qa_correctness(input, output):
    question = input.get("question")
    query = output.get("query")
    results = output.get("results")
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": judge_instructions},
            {
                "role": "user",
                "content": f"Question: {question}\nSQL Query: {query}\nSQL Results: {results}",
            },
        ],
        tool_choice="required",
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "qa_correctness",
                    "description": "Determine if the SQL query and results accurately answer the question.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "explanation": {
                                "type": "string",
                                "description": "Explain why the label is correct or invalid.",
                            },
                            "label": {"type": "string", "enum": ["correct", "invalid"]},
                        },
                    },
                },
            }
        ],
    )
    if response.choices[0].message.tool_calls is None:
        raise ValueError("No tool call found in response")
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    label = args["label"]
    explanation = args["explanation"]
    score = 1 if label == "correct" else 0
    return EvaluationResult(score=score, label=label, explanation=explanation)


evaluate_experiment(experiment, evaluators=[qa_correctness])

The LLM judge's scoring closely matches our manual evaluation, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement.

The LLM judge also shows an advantage in nuanced understanding - for example, it correctly identifies that 'Anime' and 'Animation' are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation.

We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in Golden Dataset for regression testing as well.

Generating more data

Let's generate some training data by having the model describe existing SQL queries from our dataset

import json
from typing import List

from pydantic import BaseModel


class Question(BaseModel):
    sql: str
    question: str


class Questions(BaseModel):
    questions: List[Question]


sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
    for column in columns
)
synthetic_data_prompt = f"""You are a SQL expert, and you are given a single table named movies with the following columns:

Column | Type | Example
-------|------|--------
{sample_rows}

Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as the
question that the query answers. Keep the questions bounded so that they are not too broad or too narrow."""

response = await client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": synthetic_data_prompt,
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "generate_questions",
                "description": "Generate SQL queries that would be interesting to ask about this table.",
                "parameters": Questions.model_json_schema(),
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "generate_questions"}},
)

assert response.choices[0].message.tool_calls is not None
generated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[
    "questions"
]

print("Generated N questions: ", len(generated_questions))
print("First question: ", generated_questions[0])

generated_dataset = []
for q in generated_questions:
    try:
        result = execute_query(q["sql"])
        example =  {
                "input": q["question"],
                "expected": {
                    "results": result or [],
                    "query": q["sql"],
                },
                "metadata": {
                    "category": "Generated",
                },
            }
        print(example)
        generated_dataset.append(example)
    except duckdb.Error as e:
        print(f"Query failed: {q['sql']}", e)
        print("Skipping...")

generated_dataset[0]

Awesome, let's create a dataset with the new synthetic data.

synthetic_dataset = px.Client().upload_dataset(
    dataset_name="movies-golden-synthetic",
    inputs=[{"question": example["input"]} for example in generated_dataset],
    outputs=[example["expected"] for example in generated_dataset],
);

exp = run_experiment(
    synthetic_dataset, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

exp.as_dataframe()

Great! We now have more data to work with. Here are some ways to improve it:

Review the generated data for issues
Refine the prompt
Show errors to the model

This gives us a process to keep improving our system.

Conclusion

In this tutorial, we built a text-to-SQL system for querying movie data. We started with basic examples and evaluators, then improved performance by adding few-shot examples as well as using an LLM judge for evaluation.

Key takeaways:

Start with simple evaluators to catch basic issues
Use few-shot examples to improve accuracy
Generate more training data using LLMs
Track progress with Phoenix's experiments

You can further improve this system by adding better evaluators or handling edge cases.

Evaluation

Cookbooks

Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.

Evaluate RAG

Building a RAG pipeline and evaluating it with Phoenix Evals.

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the the following sections:

Understanding Retrieval Augmented Generation (RAG).
Building RAG (with the help of a framework such as LlamaIndex).
Evaluating RAG with Phoenix Evals.

Retrieval Augmented Generation (RAG)

LLMs are trained on vast amounts of data, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.

RAG is a critical component for building applications such a chatbots or agents and you will want to know RAG techniques on how to get data into your application.

Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger RAG application.

Loading: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
Querying: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.

Build a RAG system

Now that we have understood the stages of RAG, let's build a pipeline. We will use for RAG and for evaluation.

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.

For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation.

Let's use an to build our RAG pipeline.

Load Data and Build an Index

Build a QueryEngine and start querying.

Check the response that you get from the query.

By default LlamaIndex retrieves two similar nodes/ chunks. You can modify that in vector_index.as_query_engine(similarity_top_k=k).

Let's check the text in each of these retrieved nodes.

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.

We can access the traces by directly pulling the spans from the phoenix session.

context.span_id

name

span_kind

attributes.input.value

attributes.retrieval.documents

Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.

context.span_id

attributes.input.value

attributes.retrieval.documents

We have built a RAG pipeline and also have instrumented it using Phoenix Tracing. We now need to evaluate it's performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.

While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

Retrieval Evaluation: To assess the accuracy and relevance of the documents that were retrieved
Response Evaluation: Measure the appropriateness of the response generated by the system when the context was provided.

Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's llm_generate to help us create the question-context pairs.

First, let's create a dataframe of all the document chunks that we have indexed.

text

Now that we have the document chunks, let's prompt an LLM to generate us 3 questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.

question_1

question_2

question_3

The LLM has generated three questions per chunk. Let's take a quick look.

text

question

Retrieval Evaluation

We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether or not that the correct context is retrieved.

Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the ).

context.trace_id

input

reference

document_score

Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regards to the query. Note, we've turned on explanations which prompts the LLM to explain it's reasoning. This can be useful for debugging and for figuring out potential corrective actions.

We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.

Let's compute Normalized Discounted Cumulative Gain at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure effectiveness of search engine algorithms and related applications.

Let's also compute precision at 2 for all our retrieval steps.

Lastly, let's compute whether or not a correct document was retrieved at all for each query (e.g. a hit)

Let's now view the results in a combined dataframe.

Observations

Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.

As we can see from the above numbers, our RAG system is not perfect, there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results but non-relevant information is also included in the context. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.

We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.

Response Evaluation

The retrieval evaluations demonstrates that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.

Now that we have a dataset of the question, context, and response (input, reference, and output), we now can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the .

Observations

Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

Our QA Correctness score of 0.91 and a Hallucinations score 0.05 signifies that the generated answers are correct ~91% of the time and that the responses contain hallucinations 5% of the time - there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.

Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.

We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response.

Conclusion

We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and generated responses within the pipelines.

Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the documentation.

OpenAI Agents SDK Cookbook

This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:

Create an agent using the OpenAI agents SDK
Trace the agent activity
Create a dataset to benchmark performance
Run an experiment to evaluate agent performance using LLM as a judge
Learn how to evaluate traces in production

Initial setup

Install Libraries

!pip install -q "arize-phoenix>=8.0.0" openinference-instrumentation-openai-agents openinference-instrumentation-openai --upgrade
!pip install -q openai nest_asyncio openai-agents

Setup Dependencies and Keys

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os

import nest_asyncio

nest_asyncio.apply()

from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Setup Tracing

from phoenix.otel import register

# Setup Tracing
tracer_provider = register(
    project_name="openai-agents-cookbook",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    auto_instrument=True,
)

Create your first agent with the OpenAI SDK

Here we've setup a basic agent that can solve math problems. We have a function tool that can solve math equations, and an agent that can use this tool.

We'll use the Runner class to run the agent and get the final output.

from agents import Runner, function_tool


@function_tool
def solve_equation(equation: str) -> str:
    """Use python to evaluate the math equation, instead of thinking about it yourself.

    Args:
       equation: string which to pass into eval() in python
    """
    return str(eval(equation))

from agents import Agent

agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)

result = await Runner.run(agent, "what is 15 + 28?")

# Run Result object
print(result)

# Get the final output
print(result.final_output)

# Get the entire list of messages recorded to generate the final output
print(result.to_input_list())

Now we have a basic agent, let's evaluate whether the agent responded correctly!

Evaluating our agent

Agents can go awry for a variety of reasons.

Tool call accuracy - did our agent choose the right tool with the right arguments?
Tool call results - did the tool respond with the right results?
Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?

We'll setup a simple evaluator that will check if the agent's response is correct, you can read about different types of agent evals here.

Let's setup our evaluation by defining our task function, our evaluator, and our dataset.

import asyncio


# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.
async def solve_math_problem(dataset_row: dict):
    result = await Runner.run(agent, dataset_row.get("question"))
    return {
        "final_output": result.final_output,
        "messages": result.to_input_list(),
    }


dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)

Next, we create our evaluator.

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify


def correctness_eval(input, output):
    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Assess if the answer to the math problem is correct. First work out the correct answer yourself,
    then compare with the provided response. Consider that there may be different ways to express the same answer
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").

    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=rails,
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    return score

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent.

MATH_GEN_TEMPLATE = """
You are an assistant that generates diverse math problems for testing a math solver agent.
The problems should include:

Basic Operations: Simple addition, subtraction, multiplication, division problems.
Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.
Exponents and Roots: Problems involving powers, square roots, and other nth roots.
Percentages: Problems involving calculating percentages of numbers or finding percentage changes.
Fractions: Problems with addition, subtraction, multiplication, or division of fractions.
Algebra: Simple algebraic expressions that can be evaluated with specific values.
Sequences: Finding sums, products, or averages of number sequences.
Word Problems: Converting word problems into mathematical equations.

Do not include any solutions in your generated problems.

Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.
Generate 25 diverse math problems. Ensure there are no duplicate problems.
"""

import nest_asyncio

nest_asyncio.apply()
pd.set_option("display.max_colwidth", 500)

# Initialize the model
model = OpenAIModel(model="gpt-4o", max_tokens=1300)

# Generate math problems
resp = model(MATH_GEN_TEMPLATE)

# Create DataFrame
split_response = resp.strip().split("\n")
math_problems_df = pd.DataFrame(split_response, columns=["question"])
print(math_problems_df.head())

Experiment in Development

During development, experimentation helps iterate quickly by revealing agent failures during evaluation. You can test against datasets to refine prompts, logic, and tool usage before deploying.

In this section, we run our agent against the dataset defined above and evaluate for correctness using LLM as Judge.

Create an experiment

With our dataset of questions we generated above, we can use our experiment feature to track changes across models, prompts, parameters for our agent.

Let's create this dataset and upload it into the platform.

import uuid

import phoenix as px

unique_id = uuid.uuid4()

dataset_name = "math-questions-" + str(uuid.uuid4())[:5]

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=math_problems_df,
    input_keys=["question"],
    dataset_name=f"math-questions-{unique_id}",
)
print(dataset)

from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    dataset,
    task=solve_math_problem,
    evaluators=[correctness_eval],
    experiment_description="Solve Math Problems",
    experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)

View Traces in Phoenix

Evaluating in Production

In production, evaluation provides real-time insights into how agents perform on user data.

This section simulates a live production setting, showing how you can collect traces, model outputs, and evaluation results in real time.

Another option is to pull traces from completed production runs and batch process evaluations on them. You can then log the results of those evaluations in Phoenix.

!pip install openinference-instrumentation

from opentelemetry.trace import StatusCode, format_span_id

from phoenix.trace import SpanEvaluations

After importing the necessary libraries, we set up a tracer object to enable span creation for tracing our task function.

tracer = tracer_provider.get_tracer(__name__)

Next, we update our correctness evaluator to return both a label and an explanation, enabling metadata to be captured during tracing.

We also revise the task function to include with clauses that generate structured spans in Phoenix. These spans capture key details such as input values, output values, and the results of the evaluation.

# This is our modified correctness evaluator.
def correctness_eval(input, output):
    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Assess if the answer to the math problem is correct. First work out the correct answer yourself,
    then compare with the provided response. Consider that there may be different ways to express the same answer
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").

    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=rails,
        provide_explanation=True,
    )

    return eval_df

# This is our modified task function.
async def solve_math_problem(dataset_row: dict):
    with tracer.start_as_current_span(name="agent", openinference_span_kind="agent") as agent_span:
        question = dataset_row.get("question")
        agent_span.set_input(question)
        agent_span.set_status(StatusCode.OK)

        result = await Runner.run(agent, question)
        agent_span.set_output(result.final_output)

        task_result = {
            "final_output": result.final_output,
            "messages": result.to_input_list(),
        }

        # Evaluation span for correctness
        with tracer.start_as_current_span(
            "correctness-evaluator",
            openinference_span_kind="evaluator",
        ) as eval_span:
            evaluation_result = correctness_eval(dataset_row, task_result)
            eval_span.set_attribute("eval.label", evaluation_result["label"][0])
            eval_span.set_attribute("eval.explanation", evaluation_result["explanation"][0])

        # Logging our evaluation
        span_id = format_span_id(eval_span.get_span_context().span_id)
        score = 1 if evaluation_result["label"][0] == "correct" else 0
        eval_data = {
            "span_id": span_id,
            "label": evaluation_result["label"][0],
            "score": score,
            "explanation": evaluation_result["explanation"][0],
        }
        df = pd.DataFrame([eval_data])
        px.Client().log_evaluations(
            SpanEvaluations(
                dataframe=df,
                eval_name="correctness",
            ),
        )

    return task_result


dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)

Finally, we run an experiment to simulate traces in production.

from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    dataset,
    task=solve_math_problem,
    experiment_description="Solve Math Problems",
    experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)

View Traces and Evaluator Results in Phoenix as Traces Populate

Retrieval & Inferences

Cookbooks

Use embeddings to explore lower-dimensional representations of your data, identifying clusters of high drift and performance degradation. Complement this with statistical analysis of structured data for A/B testing, temporal drift detection, and deeper performance insights.

Embeddings Analysis

Embedding Details

For each described in the inference (s), Phoenix serves a embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.

Embedding Drift Over Time

The picture below shows a time series graph of the drift between two groups of vectors –- the primary (typically production) vectors and reference / baseline vectors. Phoenix uses euclidean distance as the primary measure of embedding drift and helps us identify times where your inference set is diverging from a given reference baseline.

Note that when you are troubleshooting search and retrieval using inferences, the euclidean distance of your queries to your knowledge base vectors is presented as query distance.

Moments of high euclidean distance is an indication that the primary inference set is starting to drift from the reference inference set. As the primary inferences move further away from the reference (both in angle and in magnitude), the euclidean distance increases as well. For this reason times of high euclidean distance are a good starting point for trying to identify new anomalies and areas of drift.

For an in-depth guide of euclidean distance and embedding drift, check out

In Phoenix, you can views the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at particular time.

Clusters

Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called . This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.

When twos are used to initialize phoenix, the clusters are automatically ordered by drift. This means that clusters that are suffering from the highest amount of under-sampling (more in the primary inferences than the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.

UMAP Point-Cloud

Phoenix projects the embeddings you provided into lower dimensional space (3 dimensions) using a dimension reduction algorithm called (stands for Uniform Manifold Approximation and Projection). This lets us understand how your in a visually understandable way.

In addition to the point-cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box phoenix let's you assign colors to the UMAP point-cloud by dimension (features, tags, predictions, actuals), performance (correctness which distinguishes true positives and true negatives from the incorrect predictions), and inference (to highlight areas of drift). This helps you explore your point-cloud from different perspectives depending on what you are looking for.

Evaluate an Agent

This notebook serves as an end-to-end example of how to trace and evaluate an agent. The example uses a "talk-to-your-data" agent as its example.

The notebook shows examples of:

Manually instrumenting an agent using Phoenix decorators
Evaluating function calling accuracy using LLM as a Judge
Evaluating function calling accuracy by comparing to ground truth
Evaluating SQL query generation
Evaluating Python code generation
Evaluating the path of an agent

Install Dependencies, Import Libraries, Set API Keys

!pip install -q openai "arize-phoenix>=8.8.0" "arize-phoenix-otel>=0.8.0" openinference-instrumentation-openai python-dotenv duckdb "openinference-instrumentation>=0.1.21"

import dotenv

dotenv.load_dotenv()

import json
import os
from getpass import getpass

import duckdb
import pandas as pd
from IPython.display import Markdown
from openai import OpenAI
from openinference.instrumentation import (
    suppress_tracing,
)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import StatusCode
from pydantic import BaseModel, Field
from tqdm import tqdm

from phoenix.otel import register

if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI()
model = "gpt-4o-mini"
project_name = "talk-to-your-data-agent"

Enable Phoenix Tracing

if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

tracer_provider = register(
    project_name=project_name,
)

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

tracer = tracer_provider.get_tracer(__name__)

Prepare dataset

Your agent will interact with a local database. Start by loading in that data:

store_sales_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/llama-index/Store_Sales_Price_Elasticity_Promotions_Data.parquet"
)
store_sales_df.head()

Define the tools

Now you can define your agent tools.

Tool 1: Database Lookup

SQL_GENERATION_PROMPT = """
Generate an SQL query based on a prompt. Do not reply with anything besides the SQL query.
The prompt is: {prompt}

The available columns are: {columns}
The table name is: {table_name}
"""


def generate_sql_query(prompt: str, columns: list, table_name: str) -> str:
    """Generate an SQL query based on a prompt"""
    formatted_prompt = SQL_GENERATION_PROMPT.format(
        prompt=prompt, columns=columns, table_name=table_name
    )

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": formatted_prompt}],
    )

    return response.choices[0].message.content


@tracer.tool()
def lookup_sales_data(prompt: str) -> str:
    """Implementation of sales data lookup from parquet file using SQL"""
    try:
        table_name = "sales"
        # Read the parquet file into a DuckDB table
        duckdb.sql(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM store_sales_df")

        print(store_sales_df.columns)
        print(table_name)
        sql_query = generate_sql_query(prompt, store_sales_df.columns, table_name)
        sql_query = sql_query.strip()
        sql_query = sql_query.replace("```sql", "").replace("```", "")

        with tracer.start_as_current_span(
            "execute_sql_query", openinference_span_kind="chain"
        ) as span:
            span.set_input(value=sql_query)

            # Execute the SQL query
            result = duckdb.sql(sql_query).df()
            span.set_output(value=str(result))
            span.set_status(StatusCode.OK)
        return result.to_string()
    except Exception as e:
        return f"Error accessing data: {str(e)}"

example_data = lookup_sales_data("Show me all the sales for store 1320 on November 1st, 2021")
example_data

Tool 2: Data Visualization

class VisualizationConfig(BaseModel):
    chart_type: str = Field(..., description="Type of chart to generate")
    x_axis: str = Field(..., description="Name of the x-axis column")
    y_axis: str = Field(..., description="Name of the y-axis column")
    title: str = Field(..., description="Title of the chart")


@tracer.chain()
def extract_chart_config(data: str, visualization_goal: str) -> dict:
    """Generate chart visualization configuration

    Args:
        data: String containing the data to visualize
        visualization_goal: Description of what the visualization should show

    Returns:
        Dictionary containing line chart configuration
    """
    prompt = f"""Generate a chart configuration based on this data: {data}
    The goal is to show: {visualization_goal}"""

    response = client.beta.chat.completions.parse(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format=VisualizationConfig,
    )

    try:
        # Extract axis and title info from response
        content = response.choices[0].message.content

        # Return structured chart config
        return {
            "chart_type": content.chart_type,
            "x_axis": content.x_axis,
            "y_axis": content.y_axis,
            "title": content.title,
            "data": data,
        }
    except Exception:
        return {
            "chart_type": "line",
            "x_axis": "date",
            "y_axis": "value",
            "title": visualization_goal,
            "data": data,
        }


@tracer.chain()
def create_chart(config: VisualizationConfig) -> str:
    """Create a chart based on the configuration"""
    prompt = f"""Write python code to create a chart based on the following configuration.
    Only return the code, no other text.
    config: {config}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    code = response.choices[0].message.content
    code = code.replace("```python", "").replace("```", "")
    code = code.strip()

    return code


@tracer.tool()
def generate_visualization(data: str, visualization_goal: str) -> str:
    """Generate a visualization based on the data and goal"""
    config = extract_chart_config(data, visualization_goal)
    code = create_chart(config)
    return code

# code = generate_visualization(example_data, "A line chart of sales over each day in november.")

@tracer.tool()
def run_python_code(code: str) -> str:
    """Execute Python code in a restricted environment"""
    # Create restricted globals/locals dictionaries with plotting libraries
    restricted_globals = {
        "__builtins__": {
            "print": print,
            "len": len,
            "range": range,
            "sum": sum,
            "min": min,
            "max": max,
            "int": int,
            "float": float,
            "str": str,
            "list": list,
            "dict": dict,
            "tuple": tuple,
            "set": set,
            "round": round,
            "__import__": __import__,
            "json": __import__("json"),
        },
        "plt": __import__("matplotlib.pyplot"),
        "pd": __import__("pandas"),
        "np": __import__("numpy"),
        "sns": __import__("seaborn"),
    }

    try:
        # Execute code in restricted environment
        exec_locals = {}
        exec(code, restricted_globals, exec_locals)

        # Capture any printed output or return the plot
        exec_locals.get("__builtins__", {}).get("_", "")
        if "plt" in exec_locals:
            return exec_locals["plt"]

        # Try to parse output as JSON before returning
        return "Code executed successfully"

    except Exception as e:
        return f"Error executing code: {str(e)}"

Tool 3: Data Analysis

@tracer.tool()
def analyze_sales_data(prompt: str, data: str) -> str:
    """Implementation of AI-powered sales data analysis"""
    # Construct prompt based on analysis type and data subset
    prompt = f"""Analyze the following data: {data}
    Your job is to answer the following question: {prompt}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    analysis = response.choices[0].message.content
    return analysis if analysis else "No analysis could be generated"

# analysis = analyze_sales_data("What is the most popular product SKU?", example_data)
# analysis

Tool Schema:

You'll need to pass your tool descriptions into your agent router. The following code allows you to easily do so:

# Define tools/functions that can be called by the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",
            "description": "Look up data from Store Sales Price Elasticity Promotions dataset",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    }
                },
                "required": ["prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "analyze_sales_data",
            "description": "Analyze sales data to extract insights",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    },
                },
                "required": ["data", "prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_visualization",
            "description": "Generate Python code to create data visualizations",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "visualization_goal": {
                        "type": "string",
                        "description": "The goal of the visualization.",
                    },
                },
                "required": ["data", "visualization_goal"],
            },
        },
    },
    # {
    #     "type": "function",
    #     "function": {
    #         "name": "run_python_code",
    #         "description": "Run Python code in a restricted environment",
    #         "parameters": {
    #             "type": "object",
    #             "properties": {
    #                 "code": {"type": "string", "description": "The Python code to run."}
    #             },
    #             "required": ["code"]
    #         }
    #     }
    # }
]

# Dictionary mapping function names to their implementations
tool_implementations = {
    "lookup_sales_data": lookup_sales_data,
    "analyze_sales_data": analyze_sales_data,
    "generate_visualization": generate_visualization,
    # "run_python_code": run_python_code
}

Agent logic

With the tools defined, you're ready to define the main routing and tool call handling steps of your agent.

@tracer.chain()
def handle_tool_calls(tool_calls, messages):
    for tool_call in tool_calls:
        function = tool_implementations[tool_call.function.name]
        function_args = json.loads(tool_call.function.arguments)
        result = function(**function_args)

        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
    return messages

def start_main_span(messages):
    print("Starting main span with messages:", messages)

    with tracer.start_as_current_span("AgentRun", openinference_span_kind="agent") as span:
        span.set_input(value=messages)
        ret = run_agent(messages)
        print("Main span completed with return value:", ret)
        span.set_output(value=ret)
        span.set_status(StatusCode.OK)
        return ret


def run_agent(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router call span")
        with tracer.start_as_current_span(
            "router_call",
            openinference_span_kind="chain",
        ) as span:
            span.set_input(value=messages)

            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=tools,
            )

            messages.append(response.choices[0].message.model_dump())
            tool_calls = response.choices[0].message.tool_calls
            print("Received response with tool calls:", bool(tool_calls))
            span.set_status(StatusCode.OK)

            if tool_calls:
                # Tool calls span
                print("Processing tool calls")
                messages = handle_tool_calls(tool_calls, messages)
                span.set_output(value=tool_calls)
            else:
                print("No tool calls, returning final response")
                span.set_output(value=response.choices[0].message.content)

                return response.choices[0].message.content

Run the agent

Your agent is now good to go! Let's try it out with some example questions:

ret = start_main_span([{"role": "user", "content": "Create a line chart showing sales in 2021"}])
print(Markdown(ret))

agent_questions = [
    "What was the most popular product SKU?",
    "What was the total revenue across all stores?",
    "Which store had the highest sales volume?",
    "Create a bar chart showing total sales by store",
    "What percentage of items were sold on promotion?",
    "Plot daily sales volume over time",
    "What was the average transaction value?",
    "Create a box plot of transaction values",
    "Which products were frequently purchased together?",
    "Plot a line graph showing the sales trend over time with a 7-day moving average",
]

for question in tqdm(agent_questions, desc="Processing questions"):
    try:
        ret = start_main_span([{"role": "user", "content": question}])
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        continue

Evaluating the agent

So your agent looks like it's working, but how can you measure its performance?

OpenAIInstrumentor().uninstrument()  # Uninstrument the OpenAI client to avoid capturing LLM as a Judge evaluation calls in your same project.

import nest_asyncio

import phoenix as px
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify
from phoenix.experiments import evaluate_experiment, run_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

nest_asyncio.apply()

px_client = px.Client()
eval_model = OpenAIModel(model="gpt-4o-mini")

Function Calling Evals using LLM as a Judge

This first evaluation will evaluate your agent router choices using another LLM.

It follows a standard pattern:

Export traces from Phoenix
Prepare those exported traces in a dataframe with the correct columns
Use llm_classify to run a standard template across each row of that dataframe and produce an eval label
Upload the results back into Phoenix

query = (
    SpanQuery()
    .where(
        "span_kind == 'LLM'",
    )
    .select(question="input.value", output_messages="llm.output_messages")
)

# The Phoenix Client can take this query and return the dataframe.
tool_calls_df = px.Client().query_spans(query, project_name=project_name, timeout=None)
tool_calls_df.dropna(subset=["output_messages"], inplace=True)


def get_tool_call(outputs):
    if outputs[0].get("message").get("tool_calls"):
        return (
            outputs[0]
            .get("message")
            .get("tool_calls")[0]
            .get("tool_call")
            .get("function")
            .get("name")
        )
    else:
        return "No tool used"


tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_call)
tool_calls_df.head()

tool_call_eval = llm_classify(
    dataframe=tool_calls_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
        "{tool_definitions}",
        "generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
    ),
    rails=["correct", "incorrect"],
    model=eval_model,
    provide_explanation=True,
)

tool_call_eval["score"] = tool_call_eval.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

tool_call_eval.head()

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)

You should now see eval labels in Phoenix.

Function Calling Evals using Ground Truth

The above example works, however if you have ground truth labled data, you can use that data to get an even more accurate measure of your router's performance by running an experiments.

Experiments also follow a standard step-by-step process in Phoenix:

Create a dataset of test cases, and optionally, expected outputs
Create a task to run on each test case - usually this is invoking your agent or a specifc step of it
Create evaluator(s) to run on each output of your task
Visualize results in Phoenix

import uuid

id = str(uuid.uuid4())

agent_tool_responses = {
    "What was the most popular product SKU?": "lookup_sales_data, analyze_sales_data",
    "What was the total revenue across all stores?": "lookup_sales_data, analyze_sales_data",
    "Which store had the highest sales volume?": "lookup_sales_data, analyze_sales_data",
    "Create a bar chart showing total sales by store": "generate_visualization, lookup_sales_data, run_python_code",
    "What percentage of items were sold on promotion?": "lookup_sales_data, analyze_sales_data",
    "Plot daily sales volume over time": "generate_visualization, lookup_sales_data, run_python_code",
    "What was the average transaction value?": "lookup_sales_data, analyze_sales_data",
    "Create a box plot of transaction values": "generate_visualization, lookup_sales_data, run_python_code",
    "Which products were frequently purchased together?": "lookup_sales_data, analyze_sales_data",
    "Plot a line graph showing the sales trend over time with a 7-day moving average": "generate_visualization, lookup_sales_data, run_python_code",
}


tool_calling_df = pd.DataFrame(agent_tool_responses.items(), columns=["question", "tool_calls"])
dataset = px_client.upload_dataset(
    dataframe=tool_calling_df,
    dataset_name=f"tool_calling_ground_truth_{id}",
    input_keys=["question"],
    output_keys=["tool_calls"],
)

For your task, you can simply run just the router call of your agent:

def run_router_step(example: Example) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
    ]
    messages.append({"role": "user", "content": example.input.get("question")})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=tools,
    )
    tool_calls = []
    for tool_call in response.choices[0].message.tool_calls:
        tool_calls.append(tool_call.function.name)
    return tool_calls

Your evaluator can also be simple, since you have expected outputs. If you didn't have those expected outputs, you could instead use an LLM as a Judge here, or even basic code:

def tools_match(expected: str, output: str) -> bool:
    expected_tools = expected.get("tool_calls").split(", ")
    return expected_tools == output

experiment = run_experiment(
    dataset,
    run_router_step,
    evaluators=[tools_match],
    experiment_name="Tool Calling Eval",
    experiment_description="Evaluating the tool calling step of the agent",
)

Tool Evals

The next piece of your agent to evaluate is its tools. Each tool is usually evaluated differently - we've included some examples below. If you need other ideas, Phoenix's built-in evaluators give you an idea of other metrics to use.

Evaluating our SQL generation tool

# This step will be replaced by a human annotated set of ground truth data, instead of generated examples

db_lookup_questions = [
    "What was the most popular product SKU?",
    "Which store had the highest total sales value?",
    "How many items were sold on promotion?",
    "What was the average quantity sold per transaction?",
    "Which product class code generated the most revenue?",
    "What day of the week had the highest sales volume?",
    "How many unique stores made sales?",
    "What was the highest single transaction value?",
    "Which products were frequently sold together?",
    "What's the trend in sales over time?",
]

expected_results = []

for question in tqdm(db_lookup_questions, desc="Processing SQL lookup questions"):
    try:
        with suppress_tracing():
            expected_results.append(lookup_sales_data(question))
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        db_lookup_questions.remove(question)

# Create a DataFrame with the questions
questions_df = pd.DataFrame({"question": db_lookup_questions, "expected_result": expected_results})

display(questions_df)

dataset = px_client.upload_dataset(
    dataframe=questions_df,
    dataset_name=f"sales_db_lookup_questions_{id}",
    input_keys=["question"],
    output_keys=["expected_result"],
)

def run_sql_query(example: Example) -> str:
    with suppress_tracing():
        return lookup_sales_data(example.input.get("question"))

def evaluate_sql_result(output: str, expected: str) -> bool:
    # Extract just the numbers from both strings
    result_nums = "".join(filter(str.isdigit, output))
    expected_nums = "".join(filter(str.isdigit, expected.get("expected_result")))
    return result_nums == expected_nums

experiment = run_experiment(
    dataset,
    run_sql_query,
    evaluators=[evaluate_sql_result],
    experiment_name="SQL Query Eval",
    experiment_description="Evaluating the SQL query generation step of the agent",
)

Evaluating our Python code generation tool

# Replace this with a human annotated set of ground truth data, instead of generated examples

code_generation_questions = [
    "Create a bar chart showing total sales by store",
    "Plot daily sales volume over time",
    "Plot a line graph showing the sales trend over time with a 7-day moving average",
    "Create a histogram of quantities sold per transaction",
    "Generate a pie chart showing sales distribution across product classes",
    "Create a stacked bar chart showing promotional vs non-promotional sales by store",
    "Generate a heatmap of sales by day of week and store number",
    "Plot a line chart comparing sales trends between top 5 stores",
]

example_data = []
chart_configs = []
for question in tqdm(code_generation_questions[:], desc="Processing code generation questions"):
    try:
        with suppress_tracing():
            example_data.append(lookup_sales_data(question))
            chart_configs.append(json.dumps(extract_chart_config(example_data[-1], question)))
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        code_generation_questions.remove(question)

code_generation_df = pd.DataFrame(
    {
        "question": code_generation_questions,
        "example_data": example_data,
        "chart_configs": chart_configs,
    }
)

dataset = px_client.upload_dataset(
    dataframe=code_generation_df,
    dataset_name=f"code_generation_questions_{id}",
    input_keys=["question", "example_data", "chart_configs"],
)

def run_code_generation(example: Example) -> str:
    with suppress_tracing():
        chart_config = extract_chart_config(
            data=example.input.get("example_data"), visualization_goal=example.input.get("question")
        )
        code = generate_visualization(
            visualization_goal=example.input.get("question"), data=example.input.get("example_data")
        )

    return {"code": code, "chart_config": chart_config}

In this case, you don't have ground truth data to compare to. Instead you can just use a simple code evaluator: trying to run the generated code and catching any errors.

def code_is_runnable(output: str) -> bool:
    """Check if the code is runnable"""
    output = output.get("code")
    output = output.strip()
    output = output.replace("```python", "").replace("```", "")
    try:
        exec(output)
        return True
    except Exception:
        return False

def evaluate_chart_config(output: str, expected: str) -> bool:
    return output.get("chart_config") == expected.get("chart_config")

experiment = run_experiment(
    dataset,
    run_code_generation,
    evaluators=[code_is_runnable, evaluate_chart_config],
    experiment_name="Code Generation Eval",
    experiment_description="Evaluating the code generation step of the agent",
)

Evaluating the agent path and convergence

Finally, the last piece of your agent to evaluate is its path. This is important to evaluate to understand how efficient your agent is in its execution. Does it need to call the same tool multiple times? Does it skip steps it shouldn't, and have to backtrack later? Convergence or path evals can tell you this.

Convergence evals operate slightly differently. The one you'll use below relies on knowing the minimum number of steps taken by the agent for a given type of query. Instead of just running an experiment, you'll run an experiment then after it completes, attach a second evaluator to calculate convergence.

The workflow is as follows:

Create a dataset of the same type of question, phrased different ways each time - the agent should take the same path for each, but you'll often find it doesn't.
Create a task that runs the agent on each question, while tracking the number of steps it takes.
Run the experiment without an evaluator.
Calculate the minimum number of steps taken to complete the task.
Create an evaluator that compares the steps taken of each run against that min step number.
Run this evaluator on your experiment from step 3.
View your results in Phoenix

# Replace this with a human annotated set of ground truth data, instead of generated examples

convergence_questions = [
    "What was the average quantity sold per transaction?",
    "What is the mean number of items per sale?",
    "Calculate the typical quantity per transaction",
    "Show me the average number of units sold in each transaction",
    "What's the mean transaction size in terms of quantity?",
    "On average, how many items were purchased per transaction?",
    "What is the average basket size per sale?",
    "Calculate the mean number of products per purchase",
    "What's the typical number of units per order?",
    "Find the average quantity of items in each transaction",
    "What is the average number of products bought per purchase?",
    "Tell me the mean quantity of items in a typical transaction",
    "How many items does a customer buy on average per transaction?",
    "What's the usual number of units in each sale?",
    "Calculate the average basket quantity per order",
    "What is the typical amount of products per transaction?",
    "Show the mean number of items customers purchase per visit",
    "What's the average quantity of units per shopping trip?",
    "How many products do customers typically buy in one transaction?",
    "What is the standard basket size in terms of quantity?",
]

convergence_df = pd.DataFrame({"question": convergence_questions})

dataset = px_client.upload_dataset(
    dataframe=convergence_df, dataset_name="convergence_questions", input_keys=["question"]
)

def format_message_steps(messages):
    """
    Convert a list of message objects into a readable format that shows the steps taken.

    Args:
        messages (list): A list of message objects containing role, content, tool calls, etc.

    Returns:
        str: A readable string showing the steps taken.
    """
    steps = []
    for message in messages:
        role = message.get("role")
        if role == "user":
            steps.append(f"User: {message.get('content')}")
        elif role == "system":
            steps.append("System: Provided context")
        elif role == "assistant":
            if message.get("tool_calls"):
                for tool_call in message["tool_calls"]:
                    tool_name = tool_call["function"]["name"]
                    steps.append(f"Assistant: Called tool '{tool_name}'")
            else:
                steps.append(f"Assistant: {message.get('content')}")
        elif role == "tool":
            steps.append(f"Tool response: {message.get('content')}")

    return "\n".join(steps)

def run_agent_and_track_path(example: Example) -> str:
    print("Starting main span with messages:", example.input.get("question"))
    messages = [{"role": "user", "content": example.input.get("question")}]
    ret = run_agent_messages(messages)
    return {"path_length": len(ret), "messages": format_message_steps(ret)}


def run_agent_messages(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router")

        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )

        messages.append(response.choices[0].message.model_dump())
        tool_calls = response.choices[0].message.tool_calls
        print("Received response with tool calls:", bool(tool_calls))

        if tool_calls:
            # Tool calls span
            print("Processing tool calls")
            tool_calls = response.choices[0].message.tool_calls
            messages = handle_tool_calls(tool_calls, messages)
        else:
            print("No tool calls, returning final response")
            return messages

experiment = run_experiment(
    dataset,
    run_agent_and_track_path,
    experiment_name="Convergence Eval",
    experiment_description="Evaluating the convergence of the agent",
)

experiment.as_dataframe()

outputs = experiment.as_dataframe()["output"].to_dict().values()
optimal_path_length = min(
    output.get("path_length")
    for output in outputs
    if output and output.get("path_length") is not None
)
print(f"The optimal path length is {optimal_path_length}")

@create_evaluator(name="Convergence Eval", kind="CODE")
def evaluate_path_length(output: str) -> float:
    if output and output.get("path_length"):
        return optimal_path_length / float(output.get("path_length"))
    else:
        return 0

experiment = evaluate_experiment(experiment, evaluators=[evaluate_path_length])

Advanced - Combining all the evals into our experiment

As an optional final step, you can combine all the evaluators and experiments above into a single experiment. This requires some more advanced data wrangling, but gives you a single report on your agent's performance.

Build a version of our agent that tracks all the necessary information for evals

def process_messages(messages):
    tool_calls = []
    tool_responses = []
    final_output = None

    for i, message in enumerate(messages):
        # Extract tool calls
        if "tool_calls" in message and message["tool_calls"]:
            for tool_call in message["tool_calls"]:
                tool_name = tool_call["function"]["name"]
                tool_input = tool_call["function"]["arguments"]
                tool_calls.append(tool_name)

                # Prepare tool response structure with tool name and input
                tool_responses.append(
                    {"tool_name": tool_name, "tool_input": tool_input, "tool_response": None}
                )

        # Extract tool responses
        if message["role"] == "tool" and "tool_call_id" in message:
            for tool_response in tool_responses:
                if message["tool_call_id"] in message.values():
                    tool_response["tool_response"] = message["content"]

        # Extract final output
        if (
            message["role"] == "assistant"
            and not message.get("tool_calls")
            and not message.get("function_call")
        ):
            final_output = message["content"]

    result = {
        "tool_calls": tool_calls,
        "tool_responses": tool_responses,
        "final_output": final_output,
        "unchanged_messages": messages,
        "path_length": len(messages),
    }

    return result

def run_agent_and_track_path_combined(example: Example) -> str:
    print("Starting main span with messages:", example.input.get("question"))
    messages = [{"role": "user", "content": example.input.get("question")}]
    ret = run_agent_messages_combined(messages)
    return process_messages(ret)


def run_agent_messages_combined(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router")

        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )

        messages.append(response.choices[0].message.model_dump())
        tool_calls = response.choices[0].message.tool_calls
        print("Received response with tool calls:", bool(tool_calls))

        if tool_calls:
            # Tool calls span
            print("Processing tool calls")
            tool_calls = response.choices[0].message.tool_calls
            messages = handle_tool_calls(tool_calls, messages)
        else:
            print("No tool calls, returning final response")
            return messages

generate_sql_query("What was the most popular product SKU?", store_sales_df.columns, "sales")

overall_experiment_questions = [
    {
        "question": "What was the most popular product SKU?",
        "sql_result": "   SKU_Coded  Total_Qty_Sold 0    6200700         52262.0",
    },
    {
        "question": "What was the total revenue across all stores?",
        "sql_result": "   Total_Revenue 0   1.327264e+07",
    },
    {
        "question": "Which store had the highest sales volume?",
        "sql_result": "   Store_Number  Total_Sales_Volume 0          2970             59322.0",
    },
    {
        "question": "Create a bar chart showing total sales by store",
        "sql_result": "    Store_Number    Total_Sales 0            880  420302.088397 1           1650  580443.007953 2           4180  272208.118542 3            550  229727.498752 4           1100  497509.528013 5           3300  619660.167018 6           3190  335035.018792 7           2970  836341.327191 8           3740  359729.808228 9           2530  324046.518720 10          4400   95745.620250 11          1210  508393.767785 12           330  370503.687331 13          2750  453664.808068 14          1980  242290.828499 15          1760  350747.617798 16          3410  410567.848126 17           990  378433.018639 18          4730  239711.708869 19          4070  322307.968330 20          3080  495458.238811 21          2090  309996.247965 22          1320  592832.067579 23          2640  308990.318559 24          1540  427777.427815 25          4840  389056.668316 26          2860  132320.519487 27          2420  406715.767402 28           770  292968.918642 29          3520  145701.079372 30           660  343594.978075 31          3630  405034.547846 32          2310  412579.388504 33          2200  361173.288199 34          1870  401070.997685",
    },
    {
        "question": "What percentage of items were sold on promotion?",
        "sql_result": "   Promotion_Percentage 0              0.625596",
    },
    {
        "question": "What was the average transaction value?",
        "sql_result": "   Average_Transaction_Value 0                  19.018132",
    },
    {
        "question": "Create a line chart showing sales in 2021",
        "sql_result": "  sale_month  total_quantity_sold  total_sales_value 0 2021-11-01              43056.0      499984.428193 1 2021-12-01              75724.0      910982.118423",
    },
]

overall_experiment_questions[0]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[0]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[1]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[1]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[2]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[2]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[3]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[3]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[4]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[4]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[5]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[5]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[6]["sql_generated"] = generate_sql_query(
    overall_experiment_questions[6]["question"], store_sales_df.columns, "sales"
)

print(overall_experiment_questions[6])

# overall_experiment_df = pd.DataFrame(overall_experiment_questions)

# dataset = px_client.upload_dataset(dataframe=overall_experiment_df, dataset_name="overall_experiment_questions_all", input_keys=["question"], output_keys=["sql_result"])

print(overall_experiment_questions[6])

[
    {
        "question": "What was the most popular product SKU?",
        "sql_result": "   SKU_Coded  Total_Qty_Sold 0    6200700         52262.0",
        "sql_generated": "```sql\nSELECT SKU_Coded, SUM(Qty_Sold) AS Total_Qty_Sold\nFROM sales\nGROUP BY SKU_Coded\nORDER BY Total_Qty_Sold DESC\nLIMIT 1;\n```",
    },
    {
        "question": "What was the total revenue across all stores?",
        "sql_result": "   Total_Revenue 0   1.327264e+07",
        "sql_generated": "```sql\nSELECT SUM(Total_Sale_Value) AS Total_Revenue\nFROM sales;\n```",
    },
    {
        "question": "Which store had the highest sales volume?",
        "sql_result": "   Store_Number  Total_Sales_Volume 0          2970             59322.0",
        "sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales_Volume\nFROM sales\nGROUP BY Store_Number\nORDER BY Total_Sales_Volume DESC\nLIMIT 1;\n```",
    },
    {
        "question": "Create a bar chart showing total sales by store",
        "sql_result": "    Store_Number    Total_Sales 0            880  420302.088397 1           1650  580443.007953 2           4180  272208.118542 3            550  229727.498752 4           1100  497509.528013 5           3300  619660.167018 6           3190  335035.018792 7           2970  836341.327191 8           3740  359729.808228 9           2530  324046.518720 10          4400   95745.620250 11          1210  508393.767785 12           330  370503.687331 13          2750  453664.808068 14          1980  242290.828499 15          1760  350747.617798 16          3410  410567.848126 17           990  378433.018639 18          4730  239711.708869 19          4070  322307.968330 20          3080  495458.238811 21          2090  309996.247965 22          1320  592832.067579 23          2640  308990.318559 24          1540  427777.427815 25          4840  389056.668316 26          2860  132320.519487 27          2420  406715.767402 28           770  292968.918642 29          3520  145701.079372 30           660  343594.978075 31          3630  405034.547846 32          2310  412579.388504 33          2200  361173.288199 34          1870  401070.997685",
        "sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nGROUP BY Store_Number;\n```",
    },
    {
        "question": "What percentage of items were sold on promotion?",
        "sql_result": "   Promotion_Percentage 0              0.625596",
        "sql_generated": "```sql\nSELECT \n    (SUM(CASE WHEN On_Promo = 'Yes' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS Promotion_Percentage\nFROM \n    sales;\n```",
    },
    {
        "question": "What was the average transaction value?",
        "sql_result": "   Average_Transaction_Value 0                  19.018132",
        "sql_generated": "```sql\nSELECT AVG(Total_Sale_Value) AS Average_Transaction_Value\nFROM sales;\n```",
    },
    {
        "question": "Create a line chart showing sales in 2021",
        "sql_result": "  sale_month  total_quantity_sold  total_sales_value 0 2021-11-01              43056.0      499984.428193 1 2021-12-01              75724.0      910982.118423",
        "sql_generated": "```sql\nSELECT MONTH(Sold_Date) AS Month, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nWHERE YEAR(Sold_Date) = 2021\nGROUP BY MONTH(Sold_Date)\nORDER BY MONTH(Sold_Date);\n```",
    },
]

CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one
that is vague, disorganized, or difficult to understand, even if it may be factually correct.

Your response should be a single word: either "clear" or "unclear," and it should not include any other
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and
appropriately addresses the query. "unclear" indicates that the answer is ambiguous, poorly organized, or
not effectively communicated. Please carefully consider the query and answer before determining your
response.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your
explanation. Your reasoning should include specific points about how the answer does or does not meet the
criteria for clarity.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""

ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to determine whether all
the entities mentioned in the answer are correctly identified and accurately match those in the query. An
entity refers to any specific person, place, organization, date, or other proper noun. Your evaluation
should focus on whether the entities in the answer are correctly named and appropriately associated with
the context in the query.

Your response should be a single word: either "correct" or "incorrect," and it should not include any
other text or characters. "correct" indicates that all entities mentioned in the answer match those in the
query and are properly identified. "incorrect" indicates that the answer contains errors or mismatches in
the entities referenced compared to the query.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "correct" or "incorrect." Avoid stating the final label at the beginning of
your explanation. Your reasoning should include specific points about how the entities in the answer do or
do not match the entities in the query.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating whether the entities in the answer are
correct and consistent with the query.
LABEL: "correct" or "incorrect"
"""

TOOL_CALLING_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json.dumps(tools))

def function_calling_eval(input: str, output: str) -> float:
    function_calls = output.get("tool_calls")
    if function_calls:
        eval_df = pd.DataFrame(
            {"question": [input.get("question")] * len(function_calls), "tool_call": function_calls}
        )

        tool_call_eval = llm_classify(
            dataframe=eval_df,
            template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
                "{tool_definitions}", json.dumps(tools).replace("{", '"').replace("}", '"')
            ),
            rails=["correct", "incorrect"],
            model=eval_model,
            provide_explanation=True,
        )

        tool_call_eval["score"] = tool_call_eval.apply(
            lambda x: 1 if x["label"] == "correct" else 0, axis=1
        )
        return tool_call_eval["score"].mean()
    else:
        return 0


def code_is_runnable(output: str) -> bool:
    """Check if the code is runnable"""
    generated_code = output.get("tool_responses")
    if not generated_code:
        return True

    # Find first lookup_sales_data response
    generated_code = next(
        (r for r in generated_code if r.get("tool_name") == "generate_visualization"), None
    )
    if not generated_code:
        return True

    # Get the first response
    generated_code = generated_code.get("tool_response", "")
    generated_code = generated_code.strip()
    generated_code = generated_code.replace("```python", "").replace("```", "")
    try:
        exec(generated_code)
        return True
    except Exception:
        return False


def evaluate_sql_result(output, expected) -> bool:
    sql_result = output.get("tool_responses")
    if not sql_result:
        return True

    # Find first lookup_sales_data response
    sql_result = next((r for r in sql_result if r.get("tool_name") == "lookup_sales_data"), None)
    if not sql_result:
        return True

    # Get the first response
    sql_result = sql_result.get("tool_response", "")

    # Extract just the numbers from both strings
    result_nums = "".join(filter(str.isdigit, sql_result))
    expected_nums = "".join(filter(str.isdigit, expected.get("sql_result")))
    return result_nums == expected_nums


def evaluate_clarity(output: str, input: str) -> bool:
    df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
    response = llm_classify(
        dataframe=df,
        template=CLARITY_LLM_JUDGE_PROMPT,
        rails=["clear", "unclear"],
        model=eval_model,
        provide_explanation=True,
    )
    return response["label"] == "clear"


def evaluate_entity_correctness(output: str, input: str) -> bool:
    df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
    response = llm_classify(
        dataframe=df,
        template=ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT,
        rails=["correct", "incorrect"],
        model=eval_model,
        provide_explanation=True,
    )
    return response["label"] == "correct"

def run_overall_experiment(example: Example) -> str:
    with suppress_tracing():
        return run_agent_and_track_path_combined(example)


experiment = run_experiment(
    dataset,
    run_overall_experiment,
    evaluators=[
        function_calling_eval,
        evaluate_sql_result,
        evaluate_clarity,
        evaluate_entity_correctness,
        code_is_runnable,
    ],
    experiment_name="Overall Experiment",
    experiment_description="Evaluating the overall experiment",
)

Congratulations! 🎉

You've now evaluated every aspect of your agent. If you've made it this far, you're now an expert in evaluating agent routers, tools, and paths!