
Agent Cookbooks

Tracing and Evaluating Agents

Featured Tutorials

Agent Demos

Example agents are fully instrumented with OpenInference and utilize end-to-end tracing with Phoenix for comprehensive performance analysis. Enter your Phoenix and OpenAI keys to view traces.

Cookbooks

Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.

Cookbooks

Iteratively improve your LLM task by building datasets, running experiments, and evaluating performance using code and LLM-as-a-judge.

Cookbooks

Use embeddings to explore lower-dimensional representations of your data, identifying clusters of high drift and performance degradation. Complement this with statistical analysis of structured data for A/B testing, temporal drift detection, and deeper performance insights.

Embeddings Analysis

Embedding Details

Embedding Drift Over Time

The picture below shows a time series graph of the drift between two groups of vectors: the primary (typically production) vectors and the reference / baseline vectors. Phoenix uses Euclidean distance as the primary measure of embedding drift and helps you identify times when your inference set is diverging from a given reference baseline.

Moments of high Euclidean distance are an indication that the primary inference set is starting to drift from the reference inference set. As the primary inferences move further away from the reference (both in angle and in magnitude), the Euclidean distance increases as well. For this reason, times of high Euclidean distance are a good starting point for identifying new anomalies and areas of drift.

In Phoenix, you can view the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at a particular time.
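Under the hood, the drift value at each point in time boils down to comparing summary statistics of the two groups of vectors. Here is a minimal sketch (not Phoenix's internal implementation) of a centroid-based Euclidean drift computation with numpy; the arrays and sizes are illustrative:

import numpy as np


def centroid_euclidean_distance(primary: np.ndarray, reference: np.ndarray) -> float:
    """Euclidean distance between the centroids of two groups of embedding vectors.

    Both arrays have shape (n_vectors, embedding_dim).
    """
    return float(np.linalg.norm(primary.mean(axis=0) - reference.mean(axis=0)))


# Toy example: the primary (production) vectors have shifted away from the reference.
rng = np.random.default_rng(0)
reference_vectors = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
primary_vectors = rng.normal(loc=0.5, scale=1.0, size=(500, 16))
print(centroid_euclidean_distance(primary_vectors, reference_vectors))  # noticeably greater than zero

Computing a value like this for each time bucket of primary data against a fixed reference produces the kind of time series shown in the drift graph.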

Clusters

When two inference sets are used to initialize Phoenix, the clusters are automatically ordered by drift. This means that clusters suffering from the highest amount of under-sampling (more points in the primary inferences than in the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.

UMAP Point-Cloud

In addition to the point-cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box, Phoenix lets you assign colors to the UMAP point-cloud by dimension (features, tags, predictions, actuals), by performance (correctness, which distinguishes true positives and true negatives from incorrect predictions), and by inference set (to highlight areas of drift). This helps you explore your point-cloud from different perspectives depending on what you are looking for.

Cookbooks

Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.


Evaluating Agents with Ragas

Create a customer support agent using the OpenAI Agents SDK, trace its interactions, and evaluate performance using Ragas.


Agents Cookbook

Build a customer support agent to trace activity, assess performance, and experiment with prompts and models.


For each embedding described in the inference(s) schema, Phoenix serves an embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.

Note that when you are troubleshooting search and retrieval using corpus inferences, the Euclidean distance of your queries to your knowledge base vectors is presented as query distance.

For an in-depth guide to Euclidean distance and embedding drift, check out Arize's ML course.

Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called HDBSCAN. This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.

Phoenix projects the embeddings you provided into lower-dimensional space (3 dimensions) using a dimension reduction algorithm called UMAP (which stands for Uniform Manifold Approximation and Projection). This lets us understand how your embeddings have encoded semantic meaning in a visually understandable way.
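If you want to reproduce a similar view outside of Phoenix, the same open-source building blocks can be combined directly. A rough sketch using the umap-learn and hdbscan packages, with illustrative parameters (Phoenix's exact pipeline and defaults may differ):

import numpy as np
import hdbscan
import umap

embeddings = np.random.default_rng(0).normal(size=(1000, 768))  # stand-in for your real embedding vectors

# Project to 3 dimensions for visualization, as Phoenix does.
projected = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1).fit_transform(embeddings)

# Group nearby points into clusters; a label of -1 marks noise points.
# (Whether to cluster in the original or the projected space is a design
# choice; Phoenix's exact approach may differ.)
cluster_labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(projected)
print(np.unique(cluster_labels))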

Structured Data Extraction

Framework
Example notebook

OpenAI Functions

Overview

Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense: LLMs are inherently good at capturing the structure of language, so extracting that structure from text via LLM prompting is a low-cost, high-scale way to pull relevant data out of unstructured text.

Structured Extraction at a Glance

LLM Input: Unstructured text + schema + system message

LLM Output: Response based on provided text + schema

Evaluation Metrics:

  1. Did the LLM extract the text correctly? (correctness)

One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:

User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?

As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:

{
    budget: "low",
    location: "San Francisco",
    purpose: "pleasure"
}

With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.
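For illustration only, a hypothetical downstream lookup might filter a catalog on those extracted fields (the hotels list and field names below are made up for this sketch):

extracted = {"budget": "low", "location": "San Francisco", "purpose": "pleasure"}

# Hypothetical catalog the application might query.
hotels = [
    {"name": "Bridge View Inn", "city": "San Francisco", "price_tier": "low"},
    {"name": "Market Street Suites", "city": "San Francisco", "price_tier": "high"},
]

matches = [
    hotel
    for hotel in hotels
    if hotel["city"] == extracted["location"] and hotel["price_tier"] == extracted["budget"]
]
print(matches)  # only the budget-friendly San Francisco option remains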

Implementing a structured extraction application

parameters_schema = {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": 'The desired destination location. Use city, state, and country format when possible. If no destination is provided, return "unstated".',
        },
        "budget_level": {
            "type": "string",
            "enum": ["low", "medium", "high", "not_stated"],
            "description": 'The desired budget level. If no budget level is provided, return "not_stated".',
        },
        "purpose": {
            "type": "string",
            "enum": ["business", "pleasure", "other", "non_stated"],
            "description": 'The purpose of the trip. If no purpose is provided, return "not_stated".',
        },
    },
    "required": ["location", "budget_level", "purpose"],
}
function_schema = {
    "name": "record_travel_request_attributes",
    "description": "Records the attributes of a travel request",
    "parameters": parameters_schema,
}
system_message = (
    "You are an assistant that parses and records the attributes of a user's travel request."
)

The ChatCompletion call to OpenAI would look like:

response = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": travel_request},
    ],
    functions=[function_schema],
    # By default, the LLM will choose whether or not to call a function given the conversation context.
    # The line below forces the LLM to call the function so that the output conforms to the schema.
    function_call={"name": function_schema["name"]},
)

Inspecting structured extraction with Phoenix

You can use Phoenix spans and traces to inspect the invocation parameters of the function to:

  1. verify the inputs to the model in the form of the user message

  2. verify your request to OpenAI

  3. verify that the corresponding generated outputs from the model match what's expected from the schema and are correct
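One way to collect those spans is to launch Phoenix locally and turn on the OpenInference OpenAI instrumentation used elsewhere in these docs. A sketch, assuming the openinference-instrumentation-openai package is installed (details may differ for the legacy openai<1.0 client shown above):

import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()  # start a local Phoenix instance and open the UI

tracer_provider = register(project_name="structured-extraction")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, chat completion calls are recorded as spans, so you can
# inspect the system message, the function schema you passed, and the
# structured output the model returned.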

Evaluating the Extraction Performance

Point-level evaluation is a great starting point, but verifying the correctness of extraction at scale or in a batch pipeline can be challenging and expensive (one batch approach is sketched after the list below). Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:

  • The diverse nature and format of source data.

  • The potential absence of a 'ground truth' for comparison.

  • The intricacies of context and meaning in extracted data.
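One pragmatic batch approach is to use an LLM as a judge over a dataframe of requests and extracted outputs, in the spirit of the llm_classify evaluators used later in these docs. This is a hedged sketch: the template, model choice, and column names are illustrative rather than a built-in Phoenix eval.

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

EXTRACTION_EVAL_TEMPLATE = """
You are checking whether attributes were extracted correctly from a travel request.

[Travel Request]: {request}
[Extracted Attributes]: {extraction}

Answer with a single word, "correct" or "incorrect".
"""

eval_df = pd.DataFrame(
    [
        {
            "request": "I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation.",
            "extraction": '{"location": "San Francisco", "budget_level": "low", "purpose": "pleasure"}',
        }
    ]
)

results = llm_classify(
    dataframe=eval_df,
    template=EXTRACTION_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)
print(results[["label", "explanation"]])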

OpenAI Agents SDK Cookbook

Create an agent with the OpenAI Agents SDK, trace its activity, benchmark with datasets, run experiments, and evaluate traces in production.

Evaluate an Agent

Trace and evaluate a "talk-to-your-data" agent. Includes evaluations for function calling accuracy, SQL query generation, code generation, and agent execution path.

Figure captions (Embeddings Analysis): Euclidean distance over time; centroids of the two inference sets are used to calculate Euclidean and cosine distance; click on a particular time to view why the inference embeddings are drifting; color by inference set vs. color by correctness vs. color by prediction for a computer vision model.

Structured extraction is a place where it's simplest to work directly with the OpenAI function calling API. OpenAI functions for structured data extraction recommend providing a JSON schema object in the form of the parameters_schema shown above (the desired fields for structured data output).

To learn more about how to evaluate structured extraction applications, head to our documentation on LLM assisted evals!

Tracing and Evaluating Amazon Bedrock Agents

Build an Amazon Bedrock agent, instrument and trace it with Phoenix, and add evaluations to your agent traces.


Tracing and Evaluating a LangChain OpenAI Agent

Build your own LangChain OpenAI agent using the function-calling API and inspect the agent's internals—all in a minimal setup with conversation and tool use.


Tracing and Evaluating a LlamaIndex OpenAI Agent

Use the function-calling API to create a LlamaIndex OpenAI agent capable of conversation and tool use, and explore its behavior with Phoenix.


Code Generation Agent

Explore a Code Generator Copilot Agent designed to generate, optimize, and validate code.


Computer Use Agent

Test out a Computer Use (Operator) Agent built to execute commands, edit files, and manage system operations.


RAG Agent

Enter a source URL and collect traces in Phoenix to see how a RAG Agent can retrieve and generate accurate responses.

Agents Cookbook

Chatbot with User Feedback


Evaluating a Search and Retrieval Application


LLM Ops - Tracing, Evaluation, and Analysis


Prompt Optimization


Optimizing LLM as a Judge Prompts


Embeddings Analysis: Data Exploration


RAG Use Cases

Python
TypeScript

Evaluations Use Cases

Evaluating and Improving RAG Applications

Common Evaluations


Structured Data Analysis


Embeddings Analysis: Model Performance

Analyze GPT Prompt and Response Data
Find Clusters, Export, and Explore with GPT

Comprehensive Use Cases

Answer and Context Relevancy Evals
Response Guideline Evals
LlamaIndex RAG with Reranker

Tracing with Sessions


Tracing Applications


Tracing Use Cases

Retrieved Document Relevance
Code Readability Evals
Question-Answering Evals
Evaluating Agents using Ragas
End-to-End RAG Application Evaluation
LlamaIndex Application
LlamaIndex Application using Milvus Vector Store
Hallucination Evals
Toxicity Evals
Summarization Evals
Text2SQL
Summarization Service
Email Text Extraction
Pairwise Evaluator
Detecting Fraud with Tabular Embeddings
Active Learning for a Drifting Image Classification Model
Root-Cause Analysis for a Drifting Sentiment Classification Model
Troubleshooting an LLM Summarization Task
OpenAI (Python)
OpenAI (JS/TS)
LlamaIndex
OpenAI Application
LlamaIndex Application
DSPy Application
Haystack Application
Groq Application
CrewAI Application
LangChain OpenAI Agent
LlamaIndex OpenAI Agent
OpenAI Structured Data Extraction Service
RAG Chatbot Application
LangChain + OpenAI RAG Application
LlamaIndex + OpenAI RAG Application
Verifying an individual trace invocation parameters

Few Shot Prompting

Few-shot prompting is a powerful technique in prompt engineering that helps LLMs perform tasks more effectively by providing a few examples within the prompt.

Unlike zero-shot prompting, where the model must infer the task with no prior context, or one-shot prompting, where a single example is provided, few-shot prompting leverages multiple examples to guide the model’s responses more accurately.

In this tutorial you will:

  • Explore how different prompting strategies impact performance in a sentiment analysis task on a dataset of reviews.

  • Run an evaluation to measure how the prompt affects the model’s performance

  • Track how your prompt and experiment results change over time in Phoenix

By the end of this tutorial, you’ll have a clear understanding of how structured prompting can significantly enhance the results of any application.

⚠️You will need an OpenAI Key for this tutorial.

Let’s get started! 🚀

Setup Dependencies and Keys

!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset Into Phoenix

This dataset contains reviews along with their corresponding sentiment labels. Throughout this notebook, we will use the same dataset to evaluate the impact of different prompting techniques, refining our approach with each iteration.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

from datasets import load_dataset

ds = load_dataset("syeddula/fridgeReviews")["train"]
ds = ds.to_pandas()
ds.head()

Set up Phoenix Client

import uuid

import phoenix as px
from phoenix.client import Client as PhoenixClient

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Review"],
    output_keys=["Sentiment"],
    dataset_name=f"review-classification-{unique_id}",
)

Zero-Shot Prompting

Zero-shot prompting is a technique where a language model is asked to perform a task without being given any prior examples. Instead, the model relies solely on its pre-trained knowledge to generate a response. This approach is useful when you need quick predictions without providing specific guidance.

In this section, we will apply zero-shot prompting to our sentiment analysis dataset, asking the model to classify reviews as positive, negative, or neutral without any labeled examples. We’ll then evaluate its performance to see how well it can infer the task based on the prompt alone.

from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

from phoenix.client.types import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications.",
        },
        {"role": "user", "content": "{{Review}}"},
    ],
)

prompt_identifier = "fridge-sentiment-reviews"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="A prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.

Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.

Next we will define a task and evaluator for the experiment.

Because our dataset has ground truth labels, we can use a simple function to check if the output of the task matches the expected output.

def zero_shot_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(**prompt.format(variables={"Review": input["Review"]}))
    return resp.choices[0].message.content.strip()


def evaluate_response(output, expected):
    return output.lower() == expected["Sentiment"].lower()

If you’d like to instrument your code, you can run the cell below. While this step isn’t required for running prompts and evaluations, it enables trace visualization for deeper insights into the model’s behavior.

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name="few-shot-examples")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Finally, we run our experiment. We can view the results of the experiment in Phoenix.

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset,
    task=zero_shot_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot Prompt",
    experiment_name="zero-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

In the following sections, we refine the prompt to enhance the model's performance and improve the evaluation results on our dataset.

One-Shot Prompting

One-shot prompting provides the model with a single example to guide its response. By including a labeled example in the prompt, we give the model a clearer understanding of the task, helping it generate more accurate predictions compared to zero-shot prompting.

In this section, we will apply one-shot prompting to our sentiment analysis dataset by providing one labeled review as a reference. We'll then evaluate how this small amount of guidance impacts the model's ability to classify sentiments correctly.

ds = load_dataset("syeddula/fridgeReviews")["test"]
one_shot_example = ds.to_pandas().sample(1)
one_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."

Here is one example of a review and the sentiment:

{examples}
"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": one_shot_template.format(examples=one_shot_example)},
        {"role": "user", "content": "{{Review}}"},
    ],
)

one_shot_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="One-shot prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

Under the prompts tab in Phoenix, we can see that our prompt has an updated version. The prompt includes one random example from the test dataset to help the model make its classification.

Similar to the previous step, we will define the task and run the evaluator. This time, we will use our updated one-shot prompt and see how the evaluation changes.

def one_shot_prompt_template(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **one_shot_prompt.format(variables={"Review": input["Review"]})
    )
    return resp.choices[0].message.content.strip()
one_shot_experiment = run_experiment(
    dataset,
    task=one_shot_prompt_template,
    evaluators=[evaluate_response],
    experiment_description="One-Shot Prompting",
    experiment_name="one-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + one_shot_prompt.id},
)

In this run, we observe a slight improvement in the evaluation results. Let’s see if we can further enhance performance in the next section.

Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM, the randomness of selected test examples, and other inherent model behaviors.

Few-Shot Prompting

Finally, we will explore few-shot Prompting which enhances a model’s performance by providing multiple labeled examples within the prompt. By exposing the model to several instances of the task, it gains a better understanding of the expected output, leading to more accurate and consistent responses.

In this section, we will apply few-shot prompting to our sentiment analysis dataset by including multiple labeled reviews as references. This approach helps the model recognize patterns and improves its ability to classify sentiments correctly. We’ll then evaluate its performance to see how additional examples impact accuracy compared to zero-shot and one-shot prompting.

ds = load_dataset("syeddula/fridgeReviews")["test"]
few_shot_examples = ds.to_pandas().sample(10)
few_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."

Here are examples of a review and the sentiment:

{examples}
"""
params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": few_shot_template.format(examples=few_shot_examples)},
        {"role": "user", "content": "{{Review}}"},
    ],
)

few_shot_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Few-shot prompt for classifying reviews based on sentiment.",
    version=PromptVersion.from_openai(params),
)

Our updated prompt also lives in Phoenix. We can clearly see how the linear version history of our prompt was built.

Just like previous steps, we run our task and evaluation.

def few_shot_prompt_template(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **few_shot_prompt.format(variables={"Review": input["Review"]})
    )
    return resp.choices[0].message.content.strip()
few_shot_experiment = run_experiment(
    dataset,
    task=few_shot_prompt_template,
    evaluators=[evaluate_response],
    experiment_description="Few Shot Prompting",
    experiment_name="few-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + few_shot_prompt.id},
)

Final Results

In this final run, we observe the most significant improvement in evaluation results. By incorporating multiple examples into our prompt, we provide clearer guidance to the model, leading to better sentiment classification.

Note: Performance may still vary, and in some cases, results might decline. Like before, this is not necessarily "wrong," as factors like the choice of LLM, the randomness of selected test examples, and inherent model behaviors can all influence outcomes.

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

From here, you can check out more examples on Phoenix, and if you haven't already, ⭐️ please give us a star on GitHub!

Summarization

Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.

In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:

  • Upload a dataset of examples containing articles and human-written reference summaries to Phoenix

  • Define an experiment task that summarizes a news article

  • Devise evaluators for length and ROUGE score

  • Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs

⚠️ This tutorial requires an OpenAI API key and, optionally, an Anthropic API key.

Let's get started!

Install Dependencies and Import Libraries

Install requirements and import libraries.

pip install anthropic "arize-phoenix>=4.6.0" openai openinference-instrumentation-openai rouge tiktoken
from typing import Any, Dict

import nest_asyncio
import pandas as pd

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", None)  # display full cells of dataframes

Launch Phoenix

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.

import phoenix as px

px.launch_app()

Instrument Your Application

from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Create Your Dataset

Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.

from datetime import datetime

from datasets import load_dataset

hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
df = (
    hf_ds["test"]
    .to_pandas()
    .sample(n=10, random_state=0)
    .set_index("id")
    .rename(columns={"highlights": "summary"})
)
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
dataset = px.Client().upload_dataset(
    dataframe=df,
    input_keys=["article"],
    output_keys=["summary"],
    dataset_name=f"news-article-summaries-{now}",
)

Define Your Experiment Task

A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.

You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.

import os
from getpass import getpass

if os.environ.get("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

Next, define a function to format a prompt template and invoke an OpenAI model on an example.

from openai import AsyncOpenAI
from phoenix.experiments import Example

openai_client = AsyncOpenAI()


async def summarize_article_openai(example: Example, prompt_template: str, model: str) -> str:
    formatted_prompt_template = prompt_template.format(article=example.input["article"])
    response = await openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "assistant", "content": formatted_prompt_template},
        ],
    )
    assert response.choices
    return response.choices[0].message.content

From this function, you can use functools.partial to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example.

import textwrap
from functools import partial

template = """
Summarize the article in two to four sentences:

ARTICLE
=======
{article}

SUMMARY
=======
"""
gpt_4o = "gpt-4o-2024-05-13"
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
test_example = dataset.examples[0]
print(textwrap.fill(await task(test_example), width=100))

Define Your Evaluators

Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity (a small worked example follows the list):

  • ROUGE-1 precision is the proportion of tokens in the generated summary that also appear in the reference summary (number of overlapping tokens / number of tokens in the generated summary)

  • ROUGE-1 recall is the proportion of tokens in the reference summary that also appear in the generated summary (number of overlapping tokens / number of tokens in the reference summary)

  • ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.
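As a small worked example of these definitions: if a generated summary contains 10 tokens, 6 of which also appear in a 20-token reference summary, then ROUGE-1 precision is 6/10 = 0.6, recall is 6/20 = 0.3, and F1 is 2 × (0.6 × 0.3) / (0.6 + 0.3) = 0.4 (the numbers are chosen purely for illustration).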

Since we also care about conciseness, you'll also define an evaluator to count the number of tokens in each generated summary.

Note that you can use any third-party library you like while defining evaluators (in your case, rouge and tiktoken).

import tiktoken
from rouge import Rouge


# convenience functions
def _rouge_1(hypothesis: str, reference: str) -> Dict[str, Any]:
    scores = Rouge().get_scores(hypothesis, reference)
    return scores[0]["rouge-1"]


def _rouge_1_f1_score(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["f"]


def _rouge_1_precision(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["p"]


def _rouge_1_recall(hypothesis: str, reference: str) -> float:
    return _rouge_1(hypothesis, reference)["r"]


# evaluators
def rouge_1_f1_score(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_f1_score(hypothesis=output, reference=expected["summary"])


def rouge_1_precision(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_precision(hypothesis=output, reference=expected["summary"])


def rouge_1_recall(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1_recall(hypothesis=output, reference=expected["summary"])


def num_tokens(output: str) -> int:
    encoding = tiktoken.encoding_for_model(gpt_4o)
    return len(encoding.encode(output))


EVALUATORS = [rouge_1_f1_score, rouge_1_precision, rouge_1_recall, num_tokens]

Run Experiments and Iterate on Your Prompt Template

Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.

from phoenix.experiments import run_experiment

experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="initial-template",
    experiment_description="first experiment using a simple prompt template",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

Our initial prompt template contained little guidance. It resulted in a ROUGE-1 F1 score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.

template = """
Summarize the article in two to four sentences. Be concise and include only the most important information.

ARTICLE
=======
{article}

SUMMARY
=======
"""
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="concise-template",
    experiment_description="explicitly instuct the llm to be concise",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

Inspecting the experiment results, you'll notice that the average num_tokens has indeed decreased, but the generated summaries are still far more verbose than the reference summaries.

Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.

# examples to include (not included in the uploaded dataset)
train_df = (
    hf_ds["train"]
    .to_pandas()
    .sample(n=5, random_state=42)
    .head()
    .rename(columns={"highlights": "summary"})
)

example_template = """
ARTICLE
=======
{article}

SUMMARY
=======
{summary}
"""

examples = "\n".join(
    [
        example_template.format(article=row["article"], summary=row["summary"])
        for _, row in train_df.iterrows()
    ]
)

template = """
Summarize the article in two to four sentences. Be concise and include only the most important information, as in the examples below.

EXAMPLES
========

{examples}


Now summarize the following article.

ARTICLE
=======
{article}

SUMMARY
=======
"""

template = template.format(
    examples=examples,
    article="{article}",
)
print(template)

Now run the experiment.

task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="few-shot-template",
    experiment_description="include examples",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.

Compare With Another Model (Optional)

⚠️ This section requires an Anthropic API key.

Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.

First, enter your Anthropic API key if it is not already present.

import os
from getpass import getpass

if os.environ.get("ANTHROPIC_API_KEY") is None:
    os.environ["ANTHROPIC_API_KEY"] = getpass("🔑 Enter your Anthropic API key: ")

Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.

from anthropic import AsyncAnthropic

client = AsyncAnthropic()


async def summarize_article_anthropic(example: Example, prompt_template: str, model: str) -> str:
    formatted_prompt_template = prompt_template.format(article=example.input["article"])
    message = await client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": formatted_prompt_template}],
    )
    return message.content[0].text


claude_35_sonnet = "claude-3-5-sonnet-20240620"
task = partial(summarize_article_anthropic, prompt_template=template, model=claude_35_sonnet)

experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="anthropic-few-shot",
    experiment_description="anthropic",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)

If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:

Here is a concise 3-sentence summary of the article...

See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.
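As one possible tweak (a sketch; the exact wording is up to you), you could append an explicit no-preamble instruction to the few-shot template and re-run the Anthropic experiment:

no_preamble_template = template.replace(
    "as in the examples below.",
    "as in the examples below. Respond with the summary text only; do not include any preamble.",
)

task = partial(summarize_article_anthropic, prompt_template=no_preamble_template, model=claude_35_sonnet)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="anthropic-few-shot-no-preamble",
    experiment_description="anthropic, explicitly forbid a preamble",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)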

Synopsis and Next Steps

Congrats! In this tutorial, you have:

  • Created a Phoenix dataset

  • Defined an experimental task and custom evaluators

  • Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall

ReAct Prompting

ReAct (Reasoning + Acting) is a prompting technique that enables LLMs to think step-by-step before taking action. Unlike traditional prompting, where a model directly provides an answer, ReAct prompts guide the model to reason through a problem first, then decide which tools or actions are necessary to reach the best solution.

ReAct is ideal for situations that require multi-step problem-solving with external tools. It also improves transparency by clearly showing the reasoning behind each tool choice, making it easier to understand and refine the model's actions.

In this tutorial, you will:

  • Learn how to craft prompts, tools, and evaluators in Phoenix

  • Refine your prompts to understand the power of ReAct prompting

  • Leverage Phoenix and LLM as a Judge techniques to evaluate accuracy at each step, gaining insight into the model's thought process.

  • Learn how to apply ReAct prompting in real-world scenarios for improved task execution and problem-solving.

⚠️ You'll need an OpenAI Key for this tutorial.

Let’s get started! 🚀

Set up Dependencies and Keys

Instrument Application

Load Dataset Into Phoenix

This dataset contains 20 customer service questions that a customer might ask a store's chatbot. As we dive into ReAct prompting, we'll use these questions to guide the LLM in selecting the appropriate tools.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

After running this cell, the dataset will appear under the Datasets tab in Phoenix.

Define Tools

Next, let's define the tools available for the LLM to use. We have five tools at our disposal, each serving a specific purpose: Product Comparison, Product Details, Discounts, Customer Support, and Track Package.

Depending on the customer's question, the LLM will determine the optimal sequence of tools to use.

Initial Prompt

Let's start by defining a simple prompt that instructs the system to utilize the available tools to answer the questions. The choice of which tools to use, and how to apply them, is left to the model's discretion based on the context of each customer query.

At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.

Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.

This prompt is provided to the LLM-as-Judge model, which takes in both the user's query and the tools the system has selected. The model then uses reasoning to assess how effectively the chosen tools addressed the query, providing an explanation for its evaluation.

In the following cells, we will define a task for the experiment. Then, in the evaluate_response function, we define our LLM as a Judge evaluator. Finally, we run our experiment.

Experiment

After running our experiment and evaluation, we can dive deeper into the results. By clicking into the experiment, we can explore the tools that the LLM selected for the specific input. Next, if we click on the trace for the evaluation, we can see the reasoning behind the score assigned by LLM as a Judge for the output.

ReAct Prompt

Next, we iterate on our system prompt using ReAct Prompting techniques. We emphasize that the model should think through the problem step-by-step, break it down logically, and then determine which tools to use and in what order. The model is instructed to output the relevant tools along with their corresponding parameters.

This approach differs from our initial prompt because it encourages reasoning before action, guiding the model to select the best tools and parameters based on the specific context of the query, rather than simply using predefined actions.

In the Prompts tab, you will see the updated prompt. As you iterate, you can build a version history.

Just like above, we define our task, construct the evaluator, and run the experiment.

Experiment

With our updated ReAct prompt, we can observe that the LLM as a Judge Evaluator rated more outputs as correct. By clicking into the traces, we can gain insights into the reasons behind this improvement. By prompting our LLM to be more thoughtful and purposeful, we can see the reasoning and acting aspects of ReAct.

You can explore the evaluator's outputs to better understand the improvements in detail.

Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.

To refine and test these prompts against other datasets, experiment with alternative techniques like Chain of Thought (CoT) prompting to assess how they complement or contrast with ReAct in your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using both the TypeScript and Python Clients.

OpenAI Agents SDK Cookbook

This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:

  • Create an agent using the OpenAI agents SDK

  • Trace the agent activity

  • Create a dataset to benchmark performance

  • Run an experiment to evaluate agent performance using LLM as a judge

  • Learn how to evaluate traces in production

Initial setup

Install Libraries

Setup Dependencies and Keys

Setup Tracing

Create your first agent with the OpenAI SDK

Here we've set up a basic agent that can solve math problems. We have a function tool that can solve math equations, and an agent that can use this tool.

We'll use the Runner class to run the agent and get the final output.

Now that we have a basic agent, let's evaluate whether the agent responded correctly!

Evaluating our agent

Agents can go awry for a variety of reasons.

  1. Tool call accuracy - did our agent choose the right tool with the right arguments?

  2. Tool call results - did the tool respond with the right results?

  3. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?

Let's set up our evaluation by defining our task function, our evaluator, and our dataset.

Next, we create our evaluator.

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent.

Experiment in Development

During development, experimentation helps iterate quickly by revealing agent failures during evaluation. You can test against datasets to refine prompts, logic, and tool usage before deploying.

In this section, we run our agent against the dataset defined above and evaluate for correctness using LLM as Judge.

Create an experiment

With the dataset of questions we generated above, we can use the experiments feature to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

View Traces in Phoenix

Evaluating in Production

In production, evaluation provides real-time insights into how agents perform on user data.

This section simulates a live production setting, showing how you can collect traces, model outputs, and evaluation results in real time.

Another option is to pull traces from completed production runs and batch process evaluations on them. You can then log the results of those evaluations in Phoenix.
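A hedged sketch of that batch pattern, reusing the llm_classify judge pattern from this cookbook (the project name, attribute column names, and template below are illustrative and may vary with your Phoenix version):

import phoenix as px
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.trace import SpanEvaluations

# Pull completed spans from the production project into a dataframe
# (the dataframe is indexed by span id, which log_evaluations expects).
spans_df = px.Client().get_spans_dataframe(project_name="openai-agents-cookbook")

# Rename the flattened attribute columns to match the eval template below.
eval_input_df = spans_df.rename(
    columns={"attributes.input.value": "question", "attributes.output.value": "response"}
)

BATCH_EVAL_TEMPLATE = """
You are evaluating whether a math problem was solved correctly.
[Question]: {question}
[Response]: {response}
Answer with a single word, "correct" or "incorrect".
"""

evals_df = llm_classify(
    data=eval_input_df,
    template=BATCH_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)

# Log the batch results back to Phoenix so they appear alongside the traces.
px.Client().log_evaluations(SpanEvaluations(eval_name="Correctness", dataframe=evals_df))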

After importing the necessary libraries, we set up a tracer object to enable span creation for tracing our task function.

Next, we update our correctness evaluator to return both a label and an explanation, enabling metadata to be captured during tracing.

We also revise the task function to include with clauses that generate structured spans in Phoenix. These spans capture key details such as input values, output values, and the results of the evaluation.
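A minimal sketch of what those with blocks might look like, using the tracer_provider returned by register() in the setup code and OpenInference-style attribute names (the span name and attributes here are illustrative):

tracer = tracer_provider.get_tracer(__name__)


async def traced_solve_math_problem(dataset_row: dict):
    question = dataset_row.get("question")
    # Wrap the agent call in a span so the input and output are captured on
    # the trace in Phoenix; evaluation results can be attached the same way
    # with additional span.set_attribute(...) calls.
    with tracer.start_as_current_span("solve_math_problem") as span:
        span.set_attribute("input.value", question)
        result = await Runner.run(agent, question)
        span.set_attribute("output.value", result.final_output)
    return {"final_output": result.final_output, "messages": result.to_input_list()}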

Finally, we run an experiment to simulate traces in production.

View Traces and Evaluator Results in Phoenix as Traces Populate

Download your data from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM-generated summaries.

Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 1/2 are considered excellent, and a model fine-tuned on this particular dataset achieved a ROUGE score of ~0.44.

As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the OpenAI fine-tuning API to train a bespoke model for your needs.

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

Next, we will define the Tool Calling Prompt Template. In this step, we use LLM as a Judge to evaluate the output. LLM as a Judge is a technique where one LLM assesses the performance of another LLM.

From here, you can check out more examples on Phoenix, and if you haven't already, ⭐️ please give us a star on GitHub!

Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

We'll set up a simple evaluator that will check if the agent's response is correct; you can read about different types of agent evals in the Phoenix documentation.
!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
import nest_asyncio
import pandas as pd
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from openinference.instrumentation.openai import OpenAIInstrumentor

import phoenix as px
from phoenix.client import Client as PhoenixClient
from phoenix.client.types import PromptVersion
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments import run_experiment
from phoenix.otel import register

nest_asyncio.apply()
tracer_provider = register(
    project_name="ReAct-examples", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
from datasets import load_dataset

ds = load_dataset("syeddula/customer_questions")["train"]
ds = ds.to_pandas()
ds.head()
import uuid

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Questions"],
    dataset_name=f"customer-questions-{unique_id}",
)
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A.",
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B.",
                    },
                },
                "required": ["product_a_id", "product_b_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Get detailed features on one product.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The unique identifier of the Product.",
                    }
                },
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Checks for discounts and promotions. Applies a discount code to an order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The unique identifier of the order.",
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply.",
                    },
                },
                "required": ["order_id", "discount_code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support).",
                    }
                },
                "required": ["issue_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
]
params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="auto",
    messages=[
        {
            "role": "system",
            "content": """You are a helpful customer service agent.
            Your task is to determine the best tools to use to answer a customer's question.
            Output the tools and pick 3 tools at maximum.
            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support",
    version=PromptVersion.from_openai(params),
)
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would reasonably help answer the question.
The tool calls have been generated by a separate agent, chosen from the list of
tools provided below. Your job is to decide whether that agent's response was relevant to solving the customer's question.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_calls}
    [END DATA]

Your response must be one of the following:
1. **"correct"** – The chosen tool(s) would sufficiently answer the question.
2. **"mostly_correct"** – The tool(s) are helpful, but a better selection could have been made (at most 1 missing or unnecessary tool).
3. **"incorrect"** – The tool(s) would not meaningfully help answer the question.

Explain why you made your choice.

    [Tool Definitions]:
    product_comparison: Compare features of two products.
    product_details: Get detailed features on one product.
    apply_discount_code: Applies a discount code to an order.
    customer_support: Get contact information for customer support regarding an issue.
    track_package: Track the status of a package based on the tracking number.
"""
def prompt_task(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score
initial_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="initial-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="required",
    messages=[
        {
            "role": "system",
            "content": """
              You are a helpful customer service agent. Carefully analyze the customer’s question to fully understand their request.
              Step 1: Think step-by-step. Identify the key pieces of information needed to answer the question. Consider any dependencies between these pieces of information.
              Step 2: Decide which tools to use. Choose up to 3 tools that will best retrieve the required information. If multiple tools are needed, determine the correct order to call them.
              Step 3: Output the chosen tools and any relevant parameters.

            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support ReAct Prompt",
    version=PromptVersion.from_openai(params),
)
def prompt_task(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score
initial_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="improved-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
!pip install -q "arize-phoenix>=8.0.0" openinference-instrumentation-openai-agents openinference-instrumentation-openai --upgrade
!pip install -q openai nest_asyncio openai-agents
import os

import nest_asyncio

nest_asyncio.apply()

from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
from phoenix.otel import register

# Setup Tracing
tracer_provider = register(
    project_name="openai-agents-cookbook",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    auto_instrument=True,
)
from agents import Runner, function_tool


@function_tool
def solve_equation(equation: str) -> str:
    """Use python to evaluate the math equation, instead of thinking about it yourself.

    Args:
       equation: string which to pass into eval() in python
    """
    return str(eval(equation))
from agents import Agent

agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)
result = await Runner.run(agent, "what is 15 + 28?")

# Run Result object
print(result)

# Get the final output
print(result.final_output)

# Get the entire list of messages recorded to generate the final output
print(result.to_input_list())
import asyncio


# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.
async def solve_math_problem(dataset_row: dict):
    result = await Runner.run(agent, dataset_row.get("question"))
    return {
        "final_output": result.final_output,
        "messages": result.to_input_list(),
    }


dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify


def correctness_eval(input, output):
    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Assess if the answer to the math problem is correct. First work out the correct answer yourself,
    then compare with the provided response. Consider that there may be different ways to express the same answer
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").

    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=rails,
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    return score
MATH_GEN_TEMPLATE = """
You are an assistant that generates diverse math problems for testing a math solver agent.
The problems should include:

Basic Operations: Simple addition, subtraction, multiplication, division problems.
Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.
Exponents and Roots: Problems involving powers, square roots, and other nth roots.
Percentages: Problems involving calculating percentages of numbers or finding percentage changes.
Fractions: Problems with addition, subtraction, multiplication, or division of fractions.
Algebra: Simple algebraic expressions that can be evaluated with specific values.
Sequences: Finding sums, products, or averages of number sequences.
Word Problems: Converting word problems into mathematical equations.

Do not include any solutions in your generated problems.

Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.
Generate 25 diverse math problems. Ensure there are no duplicate problems.
"""
import nest_asyncio

nest_asyncio.apply()
pd.set_option("display.max_colwidth", 500)

# Initialize the model
model = OpenAIModel(model="gpt-4o", max_tokens=1300)

# Generate math problems
resp = model(MATH_GEN_TEMPLATE)

# Create DataFrame
split_response = resp.strip().split("\n")
math_problems_df = pd.DataFrame(split_response, columns=["question"])
print(math_problems_df.head())
import uuid

import phoenix as px

unique_id = uuid.uuid4()

dataset_name = "math-questions-" + str(uuid.uuid4())[:5]

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=math_problems_df,
    input_keys=["question"],
    dataset_name=f"math-questions-{unique_id}",
)
print(dataset)
from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    dataset,
    task=solve_math_problem,
    evaluators=[correctness_eval],
    experiment_description="Solve Math Problems",
    experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)
!pip install openinference-instrumentation
from opentelemetry.trace import StatusCode, format_span_id

from phoenix.trace import SpanEvaluations
tracer = tracer_provider.get_tracer(__name__)
# This is our modified correctness evaluator.
def correctness_eval(input, output):
    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Assess if the answer to the math problem is correct. First work out the correct answer yourself,
    then compare with the provided response. Consider that there may be different ways to express the same answer
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").

    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=rails,
        provide_explanation=True,
    )

    return eval_df
# This is our modified task function.
async def solve_math_problem(dataset_row: dict):
    with tracer.start_as_current_span(name="agent", openinference_span_kind="agent") as agent_span:
        question = dataset_row.get("question")
        agent_span.set_input(question)
        agent_span.set_status(StatusCode.OK)

        result = await Runner.run(agent, question)
        agent_span.set_output(result.final_output)

        task_result = {
            "final_output": result.final_output,
            "messages": result.to_input_list(),
        }

        # Evaluation span for correctness
        with tracer.start_as_current_span(
            "correctness-evaluator",
            openinference_span_kind="evaluator",
        ) as eval_span:
            evaluation_result = correctness_eval(dataset_row, task_result)
            eval_span.set_attribute("eval.label", evaluation_result["label"][0])
            eval_span.set_attribute("eval.explanation", evaluation_result["explanation"][0])

        # Logging our evaluation
        span_id = format_span_id(eval_span.get_span_context().span_id)
        score = 1 if evaluation_result["label"][0] == "correct" else 0
        eval_data = {
            "span_id": span_id,
            "label": evaluation_result["label"][0],
            "score": score,
            "explanation": evaluation_result["explanation"][0],
        }
        df = pd.DataFrame([eval_data])
        px.Client().log_evaluations(
            SpanEvaluations(
                dataframe=df,
                eval_name="correctness",
            ),
        )

    return task_result


dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)
from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    dataset,
    task=solve_math_problem,
    experiment_description="Solve Math Problems",
    experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)

Text2SQL

Let's work through a Text2SQL use case where we are starting from scratch without a nice and clean dataset of questions, SQL queries, or expected responses.

pip install 'arize-phoenix>=4.6.0' openai duckdb datasets pyarrow pydantic nest_asyncio openinference-instrumentation-openai --quiet

Let's first start a phoenix server. Note that this is not necessary if you have a phoenix server running already.

import phoenix as px

px.launch_app()

Let's also set up tracing for OpenAI, as we will be using their API to perform the synthesis.

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Let's make sure we can run async code in the notebook.

import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure we have our OpenAI API key set up.

import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

Download Data

We are going to use an NBA dataset that contains information from 2014 to 2018. We will use DuckDB as our database.

import duckdb
from datasets import load_dataset

data = load_dataset("suzyanil/nba-data")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())

conn.query("SELECT * FROM nba LIMIT 5").to_df().to_dict(orient="records")[0]

Implement Text2SQL

Let's start by implementing some simple text2sql logic.

import os

import openai

client = openai.AsyncClient()

columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}


system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n"
    f"{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n"
    "Write a SQL query corresponding to the user's request. Return just the query text, "
    "with no formatting (backticks, markdown, etc.)."
)


async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content
query = await generate_query("Who won the most games?")
print(query)

Awesome, looks like the LLM is producing SQL! Let's try running the query and see if we get the expected results.

def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


execute_query(query)

Evaluation

Evaluation consists of three parts — data, task, and scores. We'll start with data.

questions = [
    "Which team won the most games?",
    "Which team won the most games in 2015?",
    "Who led the league in 3 point shots?",
    "Which team had the biggest difference in records across two consecutive years?",
    "What is the average number of free throws per year?",
]

Let's store the data above as a versioned dataset in phoenix.

import pandas as pd

ds = px.Client().upload_dataset(
    dataset_name="nba-questions",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="nba-questions")

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

async def text2sql(question):
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.

# Test if there are no sql execution errors
def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0

Now let's run the evaluation experiment.

import phoenix as px
from phoenix.experiments import run_experiment


# Define the task to run text2sql on the input question
def task(input):
    return text2sql(input["question"])


experiment = run_experiment(
    ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

Ok! It looks like 3/5 of our queries are valid.

Interpreting the results

Now that we've run the initial evaluation, it looks like two of the results are valid, two produce SQL errors, and one is incorrect.

  • The incorrect query didn't seem to get the date format correct. That could probably be improved by showing the model a sample of the data (e.g., a few-shot example).

  • There is also a binder error, which may have to do with not understanding the data format.

Let's try to improve the prompt with few-shot examples and see if we can get better results.

samples = conn.query("SELECT * FROM nba LIMIT 1").to_df().to_dict(orient="records")[0]
sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}"
    for column in columns
)
system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{sample_rows}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content


print(await generate_query("Which team won the most games in 2015?"))

Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.
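
One way to sketch such an evaluator is shown below. The matches_expected_results helper is hypothetical and isn't wired into the run that follows (which still uses only no_error and has_results); it assumes the dataset example carries an expected payload with a "results" key, as the synthetic golden dataset created later in this tutorial does:

# Hypothetical evaluator: compare executed results against the expected results
# stored on the dataset example. Hand-written questions without expected results
# simply score 0.0 here.
def matches_expected_results(output, expected):
    expected_results = (expected or {}).get("results")
    if expected_results is None:
        return 0.0
    return 1.0 if output.get("results") == expected_results else 0.0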

experiment = run_experiment(
    ds, task=task, evaluators=[has_results, no_error], experiment_metadata=CONFIG
)

Amazing. It looks like we removed one of the errors, and got a result for the incorrect query. Let's try out using LLM as a judge to see how well it can assess the results.

from phoenix.evals.models import OpenAIModel
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators.llm_evaluators import LLMCriteriaEvaluator

llm_evaluator = LLMCriteriaEvaluator(
    name="is_sql",
    criteria="is_sql",
    description="the output is a valid SQL query and that it executes without errors",
    model=OpenAIModel(),
)

evaluate_experiment(experiment, evaluators=[llm_evaluator])

Sure enough, the LLM agrees with our scoring. Pretty neat trick! This can come in useful when it's difficult to define a scoring function.

We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples for a golden dataset to use in regression testing!

Generating more data

Now that we have a basic flow in place, let's generate some data. We're going to use the dataset itself to generate expected queries, and have a model describe the queries. This is a slightly more robust method than having it generate queries, because we'd expect a model to describe a query more accurately than generate one from scratch.

import json

from pydantic import BaseModel


class Question(BaseModel):
    sql: str
    question: str


class Questions(BaseModel):
    questions: list[Question]


synthetic_data_prompt = f"""\
You are a SQL expert, and you are given a single table named nba with the following columns:

Column | Type | Example
-------|------|--------
{"\n".join(f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}" for column in columns)}

Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as the
question that the query answers."""

response = await client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": synthetic_data_prompt,
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "generate_questions",
                "description": "Generate SQL queries that would be interesting to ask about this table.",
                "parameters": Questions.model_json_schema(),
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "generate_questions"}},
)

generated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[
    "questions"
]
generated_questions[0]
generated_dataset = []
for q in generated_questions:
    try:
        result = execute_query(q["sql"])
        generated_dataset.append(
            {
                "input": q["question"],
                "expected": {
                    "results": result,
                    "error": None,
                    "query": q["sql"],
                },
                "metadata": {
                    "category": "Generated",
                },
            }
        )
    except duckdb.Error as e:
        print(f"Query failed: {q['sql']}", e)
        print("Skipping...")

generated_dataset[0]

Awesome, let's create a dataset with the new synthetic data.

synthetic_dataset = px.Client().upload_dataset(
    dataset_name="nba-golden-synthetic",
    inputs=[{"question": example["input"]} for example in generated_dataset],
    outputs=[example["expected"] for example in generated_dataset],
);
run_experiment(
    synthetic_dataset, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

Amazing! Now we have a rich dataset to work with and some failures to debug. From here, you could try to investigate whether some of the generated data needs improvement, or try tweaking the prompt to improve accuracy, or maybe even something more adventurous, like feed the errors back to the model and have it iterate on a better query. Most importantly, we have a good workflow in place to iterate on both the application and dataset.
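
As a rough sketch of that last idea, you could feed a DuckDB error back to the model and ask for a corrected query. The text2sql_with_retry helper below and its retry prompt wording are illustrative rather than part of the cookbook; it reuses the async client, system_prompt, and helper functions defined above:

async def text2sql_with_retry(question, max_retries=1):
    # Generate an initial query, then retry with the error message on failure
    query = await generate_query(question)
    error = None
    for attempt in range(max_retries + 1):
        try:
            return {"query": query, "results": execute_query(query), "error": None}
        except duckdb.Error as e:
            error = str(e)
            if attempt == max_retries:
                break
            # Feed the error back and ask the model for a corrected query
            response = await client.chat.completions.create(
                model=TASK_MODEL,
                temperature=0,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": query},
                    {
                        "role": "user",
                        "content": f"That query failed with: {error}. "
                        "Return only a corrected DuckDB SQL query, with no formatting.",
                    },
                ],
            )
            query = response.choices[0].message.content
    return {"query": query, "results": None, "error": error}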

Trying a smaller model

Just for fun, let's wrap things up by trying out GPT-3.5-turbo. All we need to do is switch the model name and run our experiment again.

TASK_MODEL = "gpt-3.5-turbo"

experiment = run_experiment(
    synthetic_dataset,
    task=task,
    evaluators=[no_error, has_results],
    experiment_metadata={"model": TASK_MODEL},
)

Interesting! It looks like the smaller model does decently well, but we might want to make sure it follows instructions as closely as a larger model. We can actually grab all the LLM spans from our previous GPT-4o runs and use them to generate an OpenAI fine-tuning JSONL file!
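
A rough sketch of that export is below. The span filter and the flattened attribute columns follow OpenInference conventions, but treat the exact column names and message keys as assumptions and verify them against your own spans dataframe first:

import json

# Pull the traced LLM spans from Phoenix and write them out as an OpenAI
# fine-tuning JSONL file (one {"messages": [...]} record per span).
llm_spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")

with open("finetune.jsonl", "w") as f:
    for _, span in llm_spans.iterrows():
        input_msgs = span.get("attributes.llm.input_messages") or []
        output_msgs = span.get("attributes.llm.output_messages") or []
        messages = [
            {"role": m.get("message.role"), "content": m.get("message.content")}
            for m in list(input_msgs) + list(output_msgs)
        ]
        f.write(json.dumps({"messages": messages}) + "\n")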

Conclusion

In this example, we walked through the process of building a dataset for a text2sql application. We started with a few handwritten examples, and iterated on the dataset by using an LLM to generate more examples. We used the eval framework to track our progress, and iterated on the model and dataset to improve the results. Finally, we tried out a less powerful model to see if we could save cost or improve latency.

Happy evaluations!

Chain-of-Thought Prompting

LLMs excel at text generation, but their reasoning abilities depend on how we prompt them. Chain of Thought (CoT) prompting enhances logical reasoning by guiding the model to think step by step, improving accuracy in tasks like math, logic, and multi-step problem solving.

In this tutorial, you will:

  • Examine how different prompting techniques influence reasoning by evaluating model performance on a dataset.

  • Refine prompting strategies, progressing from basic approaches to structured reasoning.

  • Utilize Phoenix to assess accuracy at each stage and explore the model's thought process.

  • Learn how to apply CoT prompting effectively in real-world tasks.

⚠️ You'll need an OpenAI Key for this tutorial.

Let’s dive in! 🚀

Set up Dependencies and Keys

!pip install -qqqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset Into Phoenix

This dataset includes math word problems, step-by-step explanations, and their corresponding answers. As we refine our prompt, we'll test it against the dataset to measure and track improvements in performance.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

import uuid

from datasets import load_dataset

import phoenix as px
from phoenix.client import Client as PhoenixClient

ds = load_dataset("syeddula/math_word_problems")["train"]
ds = ds.to_pandas()
ds.head()

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Word Problem"],
    output_keys=["Answer"],
    dataset_name=f"wordproblems-{unique_id}",
)

Zero-Shot Prompting - Baseline

Zero-shot prompting is the simplest way to interact with a language model—it involves asking a question without providing any examples or reasoning steps. The model generates an answer based solely on its pre-trained knowledge.

This serves as our baseline for comparison. By evaluating its performance on our dataset, we can see how well the model solves math word problems without explicit guidance. In later sections, we’ll introduce structured reasoning techniques like Chain of Thought (CoT) to measure improvements in accuracy and answers.

from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

from phoenix.client.types import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "You are an evaluator who outputs the answer to a math word problem. Only respond with the integer answer. Be sure not include words, explanations, symbols, labels, or units and round all decimals answers.",
        },
        {"role": "user", "content": "{{Problem}}"},
    ],
)

prompt_identifier = "wordproblems"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="A prompt for computing answers to word problems.",
    version=PromptVersion.from_openai(params),
)

At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.

Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.

Next, we will define a task and evaluator for the experiment. Then, we run our experiment.

Because our dataset has ground truth labels, we can use a simple function to extract the answer and check if the calculated answer matches the expected output.

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()


def zero_shot_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"Problem": input["Word Problem"]})
    )
    return resp.choices[0].message.content.strip()


def evaluate_response(output, expected):
    if not output.isdigit():
        return False
    return int(output) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=zero_shot_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot Prompt",
    experiment_name="zero-shot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

We can review the results of the experiment in Phoenix. We achieved ~75% accuracy in this run. In the following sections, we will iterate on this prompt and see how our evaluation changes!

Note: Throughout this tutorial, you will encounter various evaluator outcomes. At times, you may notice a decline in performance compared to the initial experiment. However, this is not necessarily a flaw. Variations in results can arise due to factors such as the choice of LLM, inherent model behaviors, and randomness.

Zero-Shot CoT Prompting

Zero-shot prompting provides a direct answer, but it often struggles with complex reasoning. Zero-Shot Chain of Thought (CoT) prompting improves this by explicitly instructing the model to think step by step before arriving at a final answer.

By adding a simple instruction like “Let’s think through this step by step,” we encourage the model to break down the problem logically. This structured reasoning can lead to more accurate answers, especially for multi-step math problems.

In this section, we'll compare Zero-Shot CoT against our baseline to evaluate its impact on performance. First, let's create the prompt.

zero_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem.

You must always think through the problem logically before providing an answer.

First, show some of your reasoning.

Then output the integer answer ONLY on a final new line. In this final answer, be sure not to include words, commas, labels, or units, and round all decimal answers.

"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": zero_shot_COT_template},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

zero_shot_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Zero Shot COT prompt",
    version=PromptVersion.from_openai(params),
)

This updated prompt now lives in Phoenix as a new prompt version.

Next, we run our task and evaluation by extracting the answer from the output of our LLM.

import re


def zero_shot_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **zero_shot_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=zero_shot_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Zero-Shot COT Prompt",
    experiment_name="zero-shot-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + zero_shot_COT.id},
)

By clicking into the experiment in Phoenix, you can take a look at the steps the model took to reach the answer. By telling the model to think through the problem and output its reasoning, we see a performance improvement.

Self-Consistency CoT Prompting

Even with Chain of Thought prompting, a single response may not always be reliable. Self-Consistency CoT enhances accuracy by generating multiple reasoning paths and selecting the most common answer. Instead of relying on one response, we sample multiple outputs and aggregate them, reducing errors caused by randomness or flawed reasoning steps.

This method improves robustness, especially for complex problems where initial reasoning steps might vary. In this section, we'll compare Self-Consistency CoT to our previous prompts to see how aggregating multiple responses impacts overall performance.

Let's repeat the same process as above with a new prompt and evaluate the outcome.

consistency_COT_template = """

You are an evaluator who outputs the answer to a math word problem.

Follow these steps:
1. Solve the problem **multiple times independently**, thinking through the solution carefully each time.
2. Show some of your reasoning for each independent attempt.
3. Identify the integer answer that appears most frequently across your attempts.
4. On a **new line**, output only this majority answer as a plain integer with **no words, commas, labels, units, or special characters**.
"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": consistency_COT_template},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

self_consistency_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="self consistency COT prompt",
    version=PromptVersion.from_openai(params),
)
def self_consistency_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **self_consistency_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


initial_experiment = run_experiment(
    dataset,
    task=self_consistency_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Self Consistency COT Prompt",
    experiment_name="self-consistency-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + self_consistency_COT.id},
)

We've observed a significant improvement in performance! Since the prompt instructs the model to compute the answer multiple times independently, you may notice that the experiment takes slightly longer to run. You can click into the experiment to explore the independent computations the model performed for each problem.

Few Shot CoT Prompting

Few-shot CoT prompting enhances reasoning by providing worked examples before asking the model to solve a new problem. By demonstrating step-by-step solutions, the model learns to apply similar logical reasoning to unseen questions.

This method leverages in-context learning, allowing the model to generalize patterns from the examples.

In this final section, we’ll compare Few-Shot CoT against our previous prompts.

First, let's construct our prompt by sampling examples from a test dataset.

ds = load_dataset("syeddula/math_word_problems")["test"]
few_shot_examples = ds.to_pandas().sample(5)
few_shot_examples

We will now construct our final prompt, run the experiment, and view the results. Under the Prompts tab in Phoenix, you can track the version history of your prompt and see which random examples were chosen.

few_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem. You must always think through the problem logically before providing an answer. Show some of your reasoning.

Finally, output the integer answer ONLY on a final new line. In this final answer, be sure not to include words, commas, labels, or units, and round all decimal answers.

Here are some examples of word problems, step by step explanations, and solutions to guide your reasoning:

{examples}
"""
params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": few_shot_COT_template.format(examples=few_shot_examples)},
        {"role": "user", "content": "{{Problem}}"},
    ],
)

few_shot_COT = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Few Shot COT prompt",
    version=PromptVersion.from_openai(params),
)
def few_shot_COT_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **few_shot_COT.format(variables={"Problem": input["Word Problem"]})
    )
    response_text = resp.choices[0].message.content.strip()
    lines = response_text.split("\n")
    final_answer = lines[-1].strip()
    final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
    return {"full_response": response_text, "final_answer": final_answer}


def evaluate_response(output, expected):
    final_answer = output["final_answer"]
    if not final_answer.isdigit():
        return False
    return int(final_answer) == int(expected["Answer"])


import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset,
    task=few_shot_COT_prompt,
    evaluators=[evaluate_response],
    experiment_description="Few-Shot COT Prompt",
    experiment_name="few-shot-cot-prompt",
    experiment_metadata={"prompt": "prompt_id=" + few_shot_COT.id},
)

Final Results

After running all of your experiments, you can compare the performance of different prompting techniques. Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.

You can review your prompt version history in the Prompts tab and explore the Playground to iterate further and run additional experiments.

To refine and test these prompts against other datasets, experiment with Chain of Thought (CoT) prompting to see its relevance to your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using the TypeScript and Python Clients.

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️


LLM as a Judge Prompt Optimization

LLM as a Judge

An LLM as a Judge refers to using an LLM as a tool for evaluating and scoring responses based on predefined criteria.

While LLMs are powerful tools for evaluation, their performance can be inconsistent. Factors like ambiguity in the prompt, biases in the model, or a lack of clear guidelines can lead to unreliable results. By fine-tuning your LLM as a Judge prompts, you can improve the model's consistency, fairness, and accuracy, ensuring it delivers more reliable evaluations.

In this tutorial, you will:

  • Generate an LLM as a Judge evaluation prompt and test it against a dataset

  • Learn about various optimization techniques to improve the template, measuring accuracy at each step using Phoenix evaluations

  • Understand how to apply these techniques together for better evaluation across your specific use cases

Set Up Dependencies and Keys

!pip install -q "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset into Phoenix

In this tutorial, we will focus on creating an LLM as a Judge prompt designed to assess empathy and emotional intelligence in chatbot responses. This is especially useful for use cases like mental health chatbots or customer support interactions.

We will start by loading a dataset containing 30 chatbot responses, each with a score for empathy and emotional intelligence (out of 10). Throughout the tutorial, we’ll use our prompt to evaluate these responses and compare the output to the ground-truth labels. This will allow us to assess how well our prompt performs.

from datasets import load_dataset

ds = load_dataset("syeddula/empathy_scores")["test"]
ds = ds.to_pandas()
ds.head()

import uuid

import phoenix as px

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["AI_Response", "EI_Empathy_Score"],
    output_keys=["EI_Empathy_Score"],
    dataset_name=f"empathy-{unique_id}",
)

Generate LLM as a Judge Template using Meta Prompting

Before iterating on our template, we need to establish a prompt. Running the cell below will generate an LLM as a Judge prompt specifically for evaluating empathy and emotional intelligence. When generating this template, we emphasize:

  • Picking evaluation criteria (e.g., empathy, emotional support, emotional intelligence).

  • Defining a clear scoring system (1-10 scale with defined descriptions).

  • Setting response formatting guidelines for clarity and consistency.

  • Including an explanation for why the LLM selects a given score.

from openai import OpenAI

client = OpenAI()


def generate_eval_template():
    meta_prompt = """
    You are an expert in AI evaluation and emotional intelligence assessment. Your task is to create a structured evaluation template for assessing the emotional intelligence and empathy of AI responses to user inputs.

    ### Task Overview:
    Generate a detailed evaluation template that measures the AI’s ability to recognize user emotions, respond empathetically, and provide emotionally appropriate responses. The template should:
    - Include 3 to 5 distinct evaluation criteria that assess different aspects of emotional intelligence.
    - Define a scoring system on a scale of 1 to 10, ensuring a broad distribution of scores across different responses.
    - Provide clear, tiered guidelines for assigning scores, distinguishing weak, average, and strong performance.
    - Include a justification section requiring evaluators to explain the assigned score with specific examples.
    - Ensure the scoring rubric considers complexity and edge cases, preventing generic or uniform scores.

    ### Format:
    Return the evaluation template as plain text, structured with headings, criteria, and a detailed scoring rubric. The template should be easy to follow and apply to real-world datasets.

    ### Scoring Guidelines:
    - The scoring system must be on a **scale of 1 to 10** and encourage a full range of scores.
    - Differentiate between strong, average, and weak responses using specific, well-defined levels.
    - Require evaluators to justify scores

    Do not include any concluding remarks such as 'End of Template' or similar statements. The template should end naturally after the final section.

    """

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": meta_prompt}],
            temperature=0.9,  # High temperature for more creativity
        )

        return response.choices[0].message.content
    except Exception as e:
        return {"error": str(e)}


print("Generating new evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE = generate_eval_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE)

Testing Our Initial Prompt

Instrument the application to send traces to Phoenix:

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register(
    project_name="LLM-as-a-Judge", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Now that we have our baseline prompt, we need to set up two key components:

  • Task: The LLM as a Judge evaluation, where the model scores chatbot responses based on empathy and emotional intelligence.

  • Evaluator: A function that compares the LLM as a Judge output to the ground-truth labels from our dataset

Finally, we run our experiment. With this setup, we can measure how well our prompt initially performs.

import pandas as pd

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)


def initial_prompt(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


def evaluate_response(input, output):
    expected_score = input["EI_Empathy_Score"]
    predicted_score = output
    return abs(expected_score - predicted_score) <= 2
import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset, task=initial_prompt, evaluators=[evaluate_response], experiment_name="initial_prompt"
)

Iteration 1: Improve Accuracy

If you find that your LLM as a Judge prompt has low accuracy, you can make adjustments to the prompt to improve it. In this section, we explore two techniques for this: few-shot examples and keeping a human in the loop.

Technique 1: Few Shot Examples

Few-shot examples help improve the accuracy of an LLM as a Judge prompt by providing clear reference points for evaluation. Instead of relying solely on general instructions, the model learns from labeled examples that demonstrate correct scoring and reasoning.

By including a mix of high, medium, and low-scoring responses, we help the model:

  • Understand nuanced criteria like empathy and emotional intelligence.

  • Reduce inconsistencies by aligning with real-world judgments.

few_shot_examples = """
---
To help guide your reasoning, compare the text to the following examples and scorings:

Example 1: "You should try to focus and get your work done. Everyone has tough days."
  Score: 2
  Explanation: No acknowledgment of the user’s emotions, dismissive and offers no real support

Example 2: "That sounds really stressful. Have you considered talking to a supervisor or taking breaks?"
  Score: 5
  Explanation: Acknowledges stress, but in a generic way. Provides advice, but not very personal. Could be warmer in tone.

Example 3: "I’m really sorry you’re feeling this way. It’s completely understandable to feel overwhelmed. You’re not alone in this. Have you had a chance to take a break or talk to someone who can support you?"
  Score: 9
  Explanation: Validates emotions, reassures the user, and offers support
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples
def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset,
    task=llm_as_a_judge,
    evaluators=[evaluate_response],
    experiment_name="few_shot_examples",
)

Technique 2: Human in the Loop

Keeping a human in the loop improves the accuracy of an LLM as a Judge by providing oversight, validation, and corrections where needed. In Phoenix, we can do this with annotations. While LLMs can evaluate responses based on predefined criteria, human reviewers help:

  • Catch edge cases and biases that the model may overlook.

  • Refine scoring guidelines by identifying inconsistencies in LLM outputs.

  • Continuously improve the prompt by analyzing where the model struggles and adjusting instructions accordingly.

However, human review can be costly and time-intensive, making full-scale annotation impractical. Fortunately, even a small number of human-labeled examples can significantly enhance accuracy.
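
Even without a full annotation workflow, a minimal sketch of folding a few human-reviewed rows into the template could look like the following. The variable names are illustrative, and it simply treats a few rows of the dataset loaded above (which already carry ground-truth empathy scores) as the human-labeled examples:

# Hypothetical example: append a handful of human-scored responses to the
# evaluation template as calibration examples.
human_labeled = ds.sample(3, random_state=0)

human_examples = "\n---\nHuman-reviewed examples:\n" + "\n".join(
    f'Response: "{row.AI_Response}"\n  Score: {row.EI_Empathy_Score}'
    for row in human_labeled.itertuples()
)

EMPATHY_EVALUATION_PROMPT_TEMPLATE_HUMAN = EMPATHY_EVALUATION_PROMPT_TEMPLATE + human_examples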

Iteration 2: Reduce Bias

Style Invariant Evaluation

One common bias in LLM as a Judge evaluations is favoring certain writing styles over others. For example, the model might unintentionally rate formal, structured responses higher than casual or concise ones, even if both convey the same level of empathy or intelligence.

To reduce this bias, we focus on style-invariant evaluation, ensuring that the LLM judges responses based on content rather than phrasing or tone. This can be achieved by:

  • Providing diverse few-shot examples that include different writing styles.

  • Testing for bias by evaluating responses with varied phrasing and ensuring consistent scoring.

By making evaluations style-agnostic, we create a more robust scoring system that doesn’t unintentionally penalize certain tones.

style_invariant = """
----
To help guide your reasoning, below is an example of how different response styles and tones can achieve similar scores:

#### Scenario: Customer Support Handling a Late Order
User: "My order is late, and I needed it for an important event. This is really frustrating."

Response A (Formal): "I sincerely apologize for the delay..."
Response B (Casual): "Oh no, that’s really frustrating!..."
Response C (Direct): "Sorry about that. I’ll check..."
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + style_invariant
def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="style_invariant"
)

Iteration 3: Reduce Cost and Latency

Longer prompts increase computation costs and response times, making evaluations slower and more expensive. To optimize efficiency, we focus on condensing the prompt while preserving clarity and effectiveness. This is done by:

  • Removing redundant instructions and simplifying wording.

  • Using bullet points or structured formats for concise guidance.

  • Eliminating unnecessary explanations while keeping critical evaluation criteria intact.

A well-optimized prompt reduces token count, leading to faster, more cost-effective evaluations without sacrificing accuracy or reliability.

def generate_condensed_template():
    meta_prompt = """
    You are an expert in prompt engineering and LLM evaluation. Your task is to optimize a given LLM-as-a-judge prompt by reducing its word count significantly while maintaining all essential information, including evaluation criteria, scoring system, and purpose.

    Requirements:
    Preserve all key details such as metrics, scoring guidelines, and judgment criteria.

    Eliminate redundant phrasing and unnecessary explanations.

    Ensure clarity and conciseness without losing meaning.

    Maintain the prompt’s effectiveness for consistent evaluations.

    Output Format:
    Return only the optimized prompt as plain text, with no explanations or commentary.

    """

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "user",
                    "content": "Provided LLM-as-a-judge prompt"
                    + EMPATHY_EVALUATION_PROMPT_TEMPLATE,
                },
                {"role": "user", "content": meta_prompt},
            ],
            temperature=0.9,  # High temperature for more creativity
        )

        return response.choices[0].message.content
    except Exception as e:
        return {"error": str(e)}


print("Generating condensed evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED = generate_condensed_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED)
def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="condensed_prompt"
)

Iteration 4: Self-Refinement (Iterative LLM as Judge)

Self-refinement allows a Judge to improve its own evaluations by critically analyzing and adjusting its initial judgments. Instead of providing a static score, the model engages in an iterative process:

  • Generate an initial score based on the evaluation criteria.

  • Reflect on its reasoning, checking for inconsistencies or biases.

  • Refine the score if needed, ensuring alignment with the evaluation guidelines.

By incorporating this style of reasoning, the model can justify its decisions and self-correct errors.

refinement_text = """
---
After you have done the evaluation, follow these two steps:
1. Self-Critique
Review your initial score:
- Was it too harsh or lenient?
- Did it consider the full context?
- Would others agree with your score?
Explain any inconsistencies briefly.

2. Final Refinement
Based on your critique, adjust your score if necessary.
- Only output a number (1-10)
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + refinement_text
def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="self_refinement"
)

Iteration 5: Combining Techniques

To maximize the accuracy and fairness of our Judge, we will combine multiple optimization techniques. In this example, we will incorporate few-shot examples and style-invariant evaluation to ensure the model focuses on content rather than phrasing or tone.

By applying these techniques together, we aim to create a more reliable evaluation framework.

EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = (
    EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples + style_invariant
)
def llm_as_a_judge(input):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
        template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
        model=OpenAIModel(model="gpt-4"),
        rails=list(map(str, range(1, 11))),
        provide_explanation=True,
    )
    score = response_classifications.iloc[0]["label"]
    return int(score)


experiment = run_experiment(
    dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="combined"
)

Final Results

Techniques like few-shot examples, self-refinement, style-invariant evaluation, and prompt condensation each offer unique benefits, but their effectiveness will vary depending on the task.

Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM and other inherent model behaviors.

By systematically testing and combining these approaches, you can refine your evaluation framework.

Phoenix offers many pre-built evaluation templates for LLM as a Judge, but often, you may need to build a custom evaluator for specific use cases.


Prompt Optimization

This tutorial will use Phoenix to compare the performance of different prompt optimization techniques.

You'll start by creating an experiment in Phoenix that can house the results of each of your prompts. Next you'll use a series of prompt optimization techniques to improve the performance of a jailbreak classification task. Each technique will be applied to the same base prompt, and the results will be compared using Phoenix.

The techniques you'll use are:

  • Few Shot Examples: Adding a few examples to the prompt to help the model understand the task.

  • Meta Prompting: Prompting a model to generate a better prompt based on previous inputs, outputs, and expected outputs.

  • Prompt Gradients: Using the gradient of the prompt to optimize individual components of the prompt using embeddings.

  • DSPy Prompt Tuning: Using DSPy, an automated prompt tuning library, to optimize the prompt.

⚠️ This tutorial requires an OpenAI API key.

Let's get started!

Setup Dependencies & Keys

!pip install -q "arize-phoenix>=8.0.0" datasets
import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Load Dataset into Phoenix

Since we'll be running a series of experiments, we'll need a dataset of test cases that we can run each time. This dataset will be used to test the performance of each prompt optimization technique.

from datasets import load_dataset

ds = load_dataset("jackhhao/jailbreak-classification")["train"]
ds = ds.to_pandas().sample(50)
ds.head()
import uuid

import phoenix as px
from phoenix.client import Client as PhoenixClient

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["prompt"],
    output_keys=["type"],
    dataset_name=f"jailbreak-classification-{unique_id}",
)

Next, you can define a base template for the prompt. We'll also save this template to Phoenix, so it can be tracked, versioned, and reused across experiments.

from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

from phoenix.client.types import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output 'benign' or 'jailbreak', no other words.",
        },
        {"role": "user", "content": "{{prompt}}"},
    ],
)

prompt_identifier = "jailbreak-classification"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="A prompt for classifying whether a given prompt is a jailbreak risk.",
    version=PromptVersion.from_openai(params),
)

You should now see that prompt in Phoenix:

Next you'll need a task and evaluator for the experiment. A task is a function that will be run across each example in the dataset. The task is also the piece of your code that you'll change between each run of the experiment. To start off, the task is simply a call to GPT 3.5 Turbo with a basic prompt.

You'll also need an evaluator that will be used to test the performance of the task. The evaluator will be run across each example in the dataset after the task has been run. Here, because you have ground truth labels, you can use a simple function to check if the output of the task matches the expected output.

def test_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(**prompt.format(variables={"prompt": input["prompt"]}))
    return resp.choices[0].message.content.strip()


def evaluate_response(output, expected):
    return output.lower() == expected["type"].lower()

You can also instrument your code to send all model calls to Phoenix. This isn't necessary for the experiment to run, but it does mean all your experiment task runs will be tracked in Phoenix. The overall experiment score and evaluator runs will be tracked regardless of whether you instrument your code or not.

from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name="prompt-optimization")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Now you can run the initial experiment. This will be the base prompt that you'll be optimizing.

import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

initial_experiment = run_experiment(
    dataset,
    task=test_prompt,
    evaluators=[evaluate_response],
    experiment_description="Initial base prompt",
    experiment_name="initial-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

You should now see the initial experiment results in Phoenix:

Prompt Optimization Technique #1: Few Shot Examples

One common prompt optimization technique is to use few shot examples to guide the model's behavior.

Here you can add few shot examples to the prompt to help improve performance. Conveniently, the jailbreak-classification dataset you loaded in the first step also includes a test split that you can use for this purpose.

from datasets import load_dataset

ds_test = load_dataset("jackhhao/jailbreak-classification")[
    "test"
]  # this time, load in the test set instead of the training set
few_shot_examples = ds_test.to_pandas().sample(10)

Define a new prompt that includes the few shot examples. Prompts in Phoenix are automatically versioned, so saving the prompt with the same name will create a new version that can be used.

few_shot_template = """
You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output "benign" or "jailbreak", no other words.

Here are some examples of prompts and responses:

{examples}
"""

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": few_shot_template.format(examples=few_shot_examples)},
        {"role": "user", "content": "{{prompt}}"},
    ],
)

few_shot_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Few shot prompt",
    version=PromptVersion.from_openai(params),
)

You'll notice you now have a new version of the prompt in Phoenix:

Define a new task with your new prompt:

def test_prompt(input):
    client = OpenAI()
    prompt_vars = {"prompt": input["prompt"]}
    resp = client.chat.completions.create(**few_shot_prompt.format(variables=prompt_vars))
    return resp.choices[0].message.content.strip()

Now you can run another experiment with the new prompt. The dataset of test cases and the evaluator will be the same as the previous experiment.

few_shot_experiment = run_experiment(
    dataset,
    task=test_prompt,
    evaluators=[evaluate_response],
    experiment_description="Prompt Optimization Technique #1: Few Shot Examples",
    experiment_name="few-shot-examples",
    experiment_metadata={"prompt": "prompt_id=" + few_shot_prompt.id},
)

Prompt Optimization Technique #2: Meta Prompting

Meta prompting involves prompting a model to generate a better prompt, based on previous inputs, outputs, and expected outputs.

The experiment from round 1 serves as a great starting point for this technique, since it has each of those components.

# Access the experiment results from the first round as a dataframe
ground_truth_df = initial_experiment.as_dataframe()

# Take the first 10 examples to use as meta prompting examples
ground_truth_df = ground_truth_df[:10]

# Create a new column with the examples in a single string
ground_truth_df["example"] = ground_truth_df.apply(
    lambda row: f"Input: {row['input']}\nOutput: {row['output']}\nExpected Output: {row['expected']}",
    axis=1,
)
ground_truth_df.head()

Now construct a new prompt that will be used to generate a new prompt.

meta_prompt = """
You are an expert prompt engineer. You are given a prompt, and a list of examples.

Your job is to generate a new prompt that will improve the performance of the model.

Here are the examples:

{examples}

Here is the original prompt:

{prompt}

Here is the new prompt:
"""

original_base_prompt = (
    prompt.format(variables={"prompt": "example prompt"}).get("messages")[0].get("content")
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": meta_prompt.format(
                prompt=original_base_prompt, examples=ground_truth_df["example"].to_string()
            ),
        }
    ],
)
new_prompt = response.choices[0].message.content.strip()
new_prompt

Now save that as a prompt in Phoenix:

if r"\{examples\}" in new_prompt:
    new_prompt = new_prompt.format(examples=few_shot_examples)

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": new_prompt},
        {"role": "user", "content": "{{prompt}}"},
    ],
)

meta_prompt_result = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Meta prompt result",
    version=PromptVersion.from_openai(params),
)

Run this new prompt through the same experiment

Redefine the task, using the new prompt.

def test_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **meta_prompt_result.format(variables={"prompt": input["prompt"]})
    )
    return resp.choices[0].message.content.strip()
meta_prompting_experiment = run_experiment(
    dataset,
    task=test_prompt,
    evaluators=[evaluate_response],
    experiment_description="Prompt Optimization Technique #2: Meta Prompting",
    experiment_name="meta-prompting",
    experiment_metadata={"prompt": "prompt_id=" + meta_prompt_result.id},
)

Prompt Optimization Technique #3: Prompt Gradient Optimization

Prompt gradient optimization is a technique that uses embeddings to estimate a "gradient" direction along which to improve individual components of the prompt. It involves:

  1. Converting the prompt into an embedding.

  2. Comparing the outputs of successful and failed prompts to find the gradient direction.

  3. Moving in the gradient direction to optimize the prompt.

Here you'll define a function to get embeddings for prompts, and then use that function to calculate the gradient direction between successful and failed prompts.

import numpy as np


# First we'll define a function to get embeddings for prompts
def get_embedding(text):
    client = OpenAI()
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding


# Function to calculate gradient direction between successful and failed prompts
def calculate_prompt_gradient(successful_prompts, failed_prompts):
    # Get embeddings for successful and failed prompts
    successful_embeddings = [get_embedding(p) for p in successful_prompts]
    failed_embeddings = [get_embedding(p) for p in failed_prompts]

    # Calculate average embeddings
    avg_successful = np.mean(successful_embeddings, axis=0)
    avg_failed = np.mean(failed_embeddings, axis=0)

    # Calculate gradient direction
    gradient = avg_successful - avg_failed
    return gradient / np.linalg.norm(gradient)


# Get successful and failed examples from our dataset
# Extract the expected label from each example's expected output dict
expected_labels = ground_truth_df["expected"].apply(lambda x: x["type"])

successful_examples = (
    ground_truth_df[ground_truth_df["output"] == expected_labels]["input"]
    .apply(lambda x: x["prompt"])
    .tolist()
)
failed_examples = (
    ground_truth_df[ground_truth_df["output"] != expected_labels]["input"]
    .apply(lambda x: x["prompt"])
    .tolist()
)

# Calculate the gradient direction
gradient = calculate_prompt_gradient(successful_examples[:5], failed_examples[:5])


# Function to optimize a prompt using the gradient
def optimize_prompt(base_prompt, gradient, step_size=0.1):
    # Get base embedding
    base_embedding = get_embedding(base_prompt)

    # Move in gradient direction
    optimized_embedding = base_embedding + step_size * gradient

    # Use GPT to convert the optimized embedding back to text
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You are helping to optimize prompts. Given the original prompt and its embedding, generate a new version that maintains the core meaning but moves in the direction of the optimized embedding.",
            },
            {
                "role": "user",
                "content": f"Original prompt: {base_prompt}\nOptimized embedding direction: {optimized_embedding[:10]}...\nPlease generate an improved version that moves in this embedding direction.",
            },
        ],
    )
    return response.choices[0].message.content.strip()


# Test the gradient-based optimization
gradient_prompt = optimize_prompt(original_base_prompt, gradient)
gradient_prompt
if r"\{examples\}" in gradient_prompt:
    gradient_prompt = gradient_prompt.format(examples=few_shot_examples)

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": gradient_prompt,
        },  # if your meta prompt includes few shot examples, make sure to include them here
        {"role": "user", "content": "{{prompt}}"},
    ],
)

gradient_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Gradient prompt result",
    version=PromptVersion.from_openai(params),
)

Run experiment with gradient-optimized prompt

Redefine the task, using the new prompt.

def test_gradient_prompt(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **gradient_prompt.format(variables={"prompt": input["prompt"]})
    )
    return resp.choices[0].message.content.strip()
gradient_experiment = run_experiment(
    dataset,
    task=test_gradient_prompt,
    evaluators=[evaluate_response],
    experiment_description="Prompt Optimization Technique #3: Prompt Gradients",
    experiment_name="gradient-optimization",
    experiment_metadata={"prompt": "prompt_id=" + gradient_prompt.id},
)

Prompt Optimization Technique #4: Prompt Tuning with DSPy

Finally, you can use an optimization library to optimize the prompt. DSPy supports each of the techniques you've used so far, and more.

!pip install -q dspy openinference-instrumentation-dspy

DSPy makes a series of calls to optimize the prompt, and it can be helpful to see these calls in action. To do this, you can instrument the DSPy library using the OpenInference SDK, which will send all calls to Phoenix. This step is optional, but the traces are useful to have.

from openinference.instrumentation.dspy import DSPyInstrumentor

DSPyInstrumentor().instrument(tracer_provider=tracer_provider)

Now you'll setup the DSPy language model and define a prompt classification task.

# Import DSPy and set up the language model
import dspy

# Configure DSPy to use OpenAI
turbo = dspy.LM(model="gpt-3.5-turbo")
dspy.settings.configure(lm=turbo)


# Define the prompt classification task
class PromptClassifier(dspy.Signature):
    """Classify if a prompt is benign or jailbreak."""

    prompt = dspy.InputField()
    label = dspy.OutputField(desc="either 'benign' or 'jailbreak'")


# Create the basic classifier
classifier = dspy.Predict(PromptClassifier)

Your classifier can now be used to make predictions as you would a normal LLM. It will expect a prompt input and will output a label prediction.

classifier(prompt=ds.iloc[0].prompt)

However, DSPy really shines when it comes to optimizing prompts. By defining a metric to measure successful runs, along with a training set of examples, you can use one of many different optimizers built into the library.

In this case, you'll use the MIPROv2 optimizer to find the best prompt for your task.

def validate_classification(example, prediction, trace=None):
    return example["label"] == prediction["label"]


# Prepare training data from previous examples
train_data = []
for _, row in ground_truth_df.iterrows():
    example = dspy.Example(
        prompt=row["input"]["prompt"], label=row["expected"]["type"]
    ).with_inputs("prompt")
    train_data.append(example)

tp = dspy.MIPROv2(metric=validate_classification, auto="light")
optimized_classifier = tp.compile(classifier, trainset=train_data)

DSPy takes care of the prompt for you in this case; however, you can still save the resulting prompt value in Phoenix:

params = CompletionCreateParamsBase(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": optimized_classifier.signature.instructions,
        },  # if your meta prompt includes few shot examples, make sure to include them here
        {"role": "user", "content": "{{prompt}}"},
    ],
)

dspy_prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="DSPy prompt result",
    version=PromptVersion.from_openai(params),
)

Run experiment with DSPy-optimized classifier

Redefine the task, using the new prompt.

# Create evaluation function using optimized classifier
def test_dspy_prompt(input):
    result = optimized_classifier(prompt=input["prompt"])
    return result.label
# Run experiment with DSPy-optimized classifier
dspy_experiment = run_experiment(
    dataset,
    task=test_dspy_prompt,
    evaluators=[evaluate_response],
    experiment_description="Prompt Optimization Technique #4: DSPy Prompt Tuning",
    experiment_name="dspy-optimization",
    experiment_metadata={"prompt": "prompt_id=" + dspy_prompt.id},
)

Prompt Optimization Technique #5: DSPy with GPT-4o

In the last example, you used GPT-3.5 Turbo both to run your pipeline and to optimize the prompt. However, you can use one model to optimize the prompt and a different model to run your pipeline.

It can be useful to use a more powerful model for your optimization step, and a cheaper or faster model for your pipeline.

Here you'll use GPT-4o to optimize the prompt, and keep GPT-3.5 Turbo as your pipeline model.

prompt_gen_lm = dspy.LM("gpt-4o")
tp = dspy.MIPROv2(
    metric=validate_classification, auto="light", prompt_model=prompt_gen_lm, task_model=turbo
)
optimized_classifier_using_gpt_4o = tp.compile(classifier, trainset=train_data)

Run experiment with DSPy-optimized classifier using GPT-4o

Redefine the task, using the new prompt.

# Create evaluation function using optimized classifier
def test_dspy_prompt(input):
    result = optimized_classifier_using_gpt_4o(prompt=input["prompt"])
    return result.label
# Run experiment with DSPy-optimized classifier
dspy_experiment_using_gpt_4o = run_experiment(
    dataset,
    task=test_dspy_prompt,
    evaluators=[evaluate_response],
    experiment_description="Prompt Optimization Technique #5: DSPy Prompt Tuning with GPT-4o",
    experiment_name="dspy-optimization-gpt-4o",
    experiment_metadata={"prompt": "prompt_id=" + dspy_prompt.id},
)

Results

And just like that, you've run a series of prompt optimization techniques to improve the performance of a jailbreak classification task, and compared the results using Phoenix.

You should have a set of experiments that looks like this:
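
If you'd also like to compare the runs programmatically, below is a minimal sketch (not part of the original notebook) that recomputes accuracy locally from each experiment's dataframe. It assumes the experiment objects and the evaluate_response function from the cells above are still in scope.

# Sketch: summarize accuracy for each experiment run in this notebook.
# Column names ("output", "expected") match the dataframe returned by initial_experiment.as_dataframe() earlier.
experiments = {
    "initial-prompt": initial_experiment,
    "few-shot-examples": few_shot_experiment,
    "meta-prompting": meta_prompting_experiment,
    "gradient-optimization": gradient_experiment,
    "dspy-optimization": dspy_experiment,
    "dspy-optimization-gpt-4o": dspy_experiment_using_gpt_4o,
}

for name, experiment in experiments.items():
    df = experiment.as_dataframe()
    accuracy = df.apply(
        lambda row: evaluate_response(row["output"] or "", row["expected"]), axis=1
    ).mean()
    print(f"{name}: {accuracy:.0%} accuracy")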

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️

Evaluate RAG

Building a RAG pipeline and evaluating it with Phoenix Evals.

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the following sections:

  1. Understanding Retrieval Augmented Generation (RAG).

  2. Building RAG (with the help of a framework such as LlamaIndex).

  3. Evaluating RAG with Phoenix Evals.

Retrieval Augmented Generation (RAG)

LLMs are trained on vast amounts of data, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.

RAG is a critical component for building applications such as chatbots or agents, and you will want to know RAG techniques for getting data into your application.

Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger RAG application.

  • Loading: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.

  • Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

  • Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.

  • Querying: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.

  • Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.

Build a RAG system

Now that we have understood the stages of RAG, let's build a pipeline. We will use LlamaIndex for RAG and Phoenix Evals for evaluation.

!pip install -qq "arize-phoenix[experimental,llama-index]>=2.0"
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

import os
from getpass import getpass

import pandas as pd
import phoenix as px
from llama_index import SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.

px.launch_app()
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))

LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation.

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
Let's use an essay by Paul Graham to build our RAG pipeline.

!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

Load Data and Build an Index

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Define an LLM
llm = OpenAI(model="gpt-4")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

Build a QueryEngine and start querying.

query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What did the author do growing up?")

Check the response that you get from the query.

response_vector.response
'The author wrote short stories and worked on programming, specifically on an IBM 1401 computer in 9th grade.'

By default LlamaIndex retrieves the two most similar nodes/chunks. You can modify that with vector_index.as_query_engine(similarity_top_k=k).
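
For example, to have the query engine retrieve a larger pool of context per query, you could request the top five chunks instead (five is an arbitrary value for illustration; the rest of this tutorial keeps the default):

# Retrieve the 5 most similar chunks instead of the default 2
query_engine_top_5 = vector_index.as_query_engine(similarity_top_k=5)
response_top_5 = query_engine_top_5.query("What did the author do growing up?")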

Let's check the text in each of these retrieved nodes.

# First retrieved node
response_vector.source_nodes[0].get_text()
'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.\n\nI was puzzled by the 1401. I couldn\'t figure out what to do with it. And in retrospect there\'s not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn\'t have any data stored on punched cards. The only other option was to do things that didn\'t rely on any input, like calculate approximations of pi, but I didn\'t know enough math to do anything interesting of that type. So I\'m not surprised I can\'t remember any programs I wrote, because they can\'t have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn\'t. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager\'s expression made clear.\n\nWith microcomputers, everything changed.'
# Second retrieved node
response_vector.source_nodes[1].get_text()
"It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n[2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione). It's the everyday words that differ. So if you string together a lot of abstract concepts with a few simple verbs, you can make a little Italian go a long way.\n\n[3] I lived at Piazza San Felice 4, so my walk to the Accademia went straight down the spine of old Florence: past the Pitti, across the bridge, past Orsanmichele, between the Duomo and the Baptistery, and then up Via Ricasoli to Piazza San Marco."

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.

print("phoenix URL", px.active_session().url)

We can access the traces by directly pulling the spans from the phoenix session.

spans_df = px.active_session().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()
| context.span_id | name | span_kind | attributes.input.value | attributes.retrieval.documents |
| --- | --- | --- | --- | --- |
| 6aba9eee-91c9-4ee2-81e9-1bdae2eb435d | llm | LLM | NaN | NaN |
| cc9feb6a-30ba-4f32-af8d-8c62dd1b1b23 | synthesize | CHAIN | What did the author do growing up? | NaN |
| 8202dbe5-d17e-4939-abd8-153cad08bdca | embedding | EMBEDDING | NaN | NaN |
| aeadad73-485f-400b-bd9d-842abfaa460b | retrieve | RETRIEVER | What did the author do growing up? | [{'document.content': 'What I Worked On\n\nFebru... |
| 9e25c528-5e2f-4719-899a-8248bab290ec | query | CHAIN | What did the author do growing up? | NaN |

Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.

spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()
| context.span_id | attributes.input.value | attributes.retrieval.documents |
| --- | --- | --- |
| aeadad73-485f-400b-bd9d-842abfaa460b | What did the author do growing up? | [{'document.content': 'What I Worked On\n\nFebru... |

We have built a RAG pipeline and instrumented it using Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.

While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

  • Retrieval Evaluation: To assess the accuracy and relevance of the documents that were retrieved

  • Response Evaluation: Measure the appropriateness of the response generated by the system when the context was provided.

Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's llm_generate to help us create the question-context pairs.

First, let's create a dataframe of all the document chunks that we have indexed.

# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df.head()
|   | text |
| --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... |
| 1 | I was puzzled by the 1401. I couldn't figure o... |
| 2 | I remember vividly how impressed and envious I... |
| 3 | I couldn't have put this into words when I was... |
| 4 | This was more like it; this was what I had exp... |

Now that we have the document chunks, let's prompt an LLM to generate us 3 questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.

generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""
import json

from phoenix.evals import OpenAIModel, llm_generate


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model_name="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)
questions_df.head()
|   | question_1 | question_2 | question_3 |
| --- | --- | --- | --- |
| 0 | What were the two main things the author worke... | What was the language the author used to write... | What was the author's clearest memory regardin... |
| 1 | What were the limitations of the 1401 computer... | How did microcomputers change the author's exp... | Why did the author's father buy a TRS-80 compu... |
| 2 | What was the author's first experience with co... | Why did the author decide to switch from study... | What were the two things that influenced the a... |
| 3 | What were the two things that inspired the aut... | What programming language did the author learn... | What was the author's undergraduate thesis about? |
| 4 | What was the author's undergraduate thesis about? | Which three grad schools did the author apply to? | What realization did the author have during th... |

# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]

The LLM has generated three questions per chunk. Let's take a quick look.

questions_with_document_chunk_df.head(10)
|   | text | question |
| --- | --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... | What were the two main things the author worke... |
| 1 | I was puzzled by the 1401. I couldn't figure o... | What were the limitations of the 1401 computer... |
| 2 | I remember vividly how impressed and envious I... | What was the author's first experience with co... |
| 3 | I couldn't have put this into words when I was... | What were the two things that inspired the aut... |
| 4 | This was more like it; this was what I had exp... | What was the author's undergraduate thesis about? |
| 5 | Only Harvard accepted me, so that was where I ... | What realization did the author have during th... |
| 6 | So I decided to focus on Lisp. In fact, I deci... | What motivated the author to write a book abou... |
| 7 | Anyone who wanted one to play around with coul... | What realization did the author have while vis... |
| 8 | I knew intellectually that people made art — t... | What was the author's initial perception of pe... |
| 9 | Then one day in April 1990 a crack appeared in... | What was the author's initial plan for their d... |

Retrieval Evaluation

We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether or not the correct context is retrieved.

# First things first, let's reset phoenix
px.close_app()
px.launch_app()
🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix





<phoenix.session.session.ThreadSession at 0x2c6c785b0>
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")
Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the docs.)

from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.active_session())
retrieved_documents_df
| context.span_id | document_position | context.trace_id | input | reference | document_score |
| --- | --- | --- | --- | --- | --- |
| b375be95-8e5e-4817-a29f-e18f7aaa3e98 | 0 | 20e0f915-e089-4e8e-8314-b68ffdffd7d1 | How does leaving YC affect the author's relati... | On one of them I realized I was ready to hand ... | 0.820411 |
|  | 1 | 20e0f915-e089-4e8e-8314-b68ffdffd7d1 | How does leaving YC affect the author's relati... | That was what it took for Rtm to offer unsolic... | 0.815969 |
| e4e68b51-dbc9-4154-85a4-5cc69382050d | 0 | 4ad14fd2-0950-4b3f-9613-e1be5e51b5a4 | Why did YC become a fund for a couple of years... | For example, one thing Julian had done for us ... | 0.860981 |
|  | 1 | 4ad14fd2-0950-4b3f-9613-e1be5e51b5a4 | Why did YC become a fund for a couple of years... | They were an impressive group. That first batc... | 0.849695 |
| 27ba6b6f-828b-4732-bfcc-3262775cd71f | 0 | d62fb8e8-4247-40ac-8808-818861bfb059 | Why did the author choose the name 'Y Combinat... | Screw the VCs who were taking so long to make ... | 0.868981 |
| ... | ... | ... | ... | ... | ... |
| 353f152c-44ce-4f3e-a323-0caa90f4c078 | 1 | 6b7bebf6-bed3-45fd-828a-0730d8f358ba | What was the author's first experience with co... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.877719 |
| 16de2060-dd9b-4622-92a1-9be080564a40 | 0 | 6ce5800d-7186-414e-a1cf-1efb8d39c8d4 | What were the limitations of the 1401 computer... | I was puzzled by the 1401. I couldn't figure o... | 0.847688 |
|  | 1 | 6ce5800d-7186-414e-a1cf-1efb8d39c8d4 | What were the limitations of the 1401 computer... | I remember vividly how impressed and envious I... | 0.836979 |
| e996c90f-4ea9-4f7c-b145-cf461de7d09b | 0 | a328a85a-aadd-44f5-b49a-2748d0bd4d2f | What were the two main things the author worke... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.843280 |
|  | 1 | a328a85a-aadd-44f5-b49a-2748d0bd4d2f | What were the two main things the author worke... | Then one day in April 1990 a crack appeared in... | 0.822055 |

Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query. Note that we've turned on explanations, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.

from phoenix.evals import (
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
retrieved_documents_relevance_df.head()

We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.

documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df
Let's compute Normalized Discounted Cumulative Gain (NDCG) at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure the effectiveness of search engine algorithms and related applications.

import numpy as np
from sklearn.metrics import ndcg_score


def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)
ndcg_at_2
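
To build intuition for what this metric measures, here is a small toy example (not from the original notebook) that scores a hypothetical retrieval of three documents:

from sklearn.metrics import ndcg_score

# Hypothetical binary relevance labels for three retrieved documents, plus the retriever's similarity scores.
# The retriever ranked an irrelevant document second, so NDCG@2 drops below 1.
toy_relevance = [[1, 0, 1]]
toy_scores = [[0.9, 0.8, 0.7]]
print(ndcg_score(toy_relevance, toy_scores, k=2))  # ≈ 0.61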

Let's also compute precision at 2 for all our retrieval steps.

precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)
precision_at_2

Lastly, let's compute whether or not a correct document was retrieved at all for each query (i.e., a hit).

hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)

Let's now view the results in a combined dataframe.

retrievals_df = px.active_session().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ndcg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
rag_evaluation_dataframe

Observations

Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.

# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results
ndcg@2_score         0.913450
precision@2_score    0.804598
hit                  0.936782
dtype: float64

As we can see from the above numbers, our RAG system is not perfect: there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results, but irrelevant information is included as well. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.
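
As a rough sketch of what that could look like with the LlamaIndex version used in this tutorial (the reranker model, similarity_top_k, and top_n values below are illustrative assumptions, the import path may differ between llama_index releases, and sentence-transformers must be installed):

from llama_index.postprocessor import SentenceTransformerRerank

# Retrieve a larger candidate pool, then let a cross-encoder reranker keep the best 2 chunks
reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=2)
reranking_query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
reranked_response = reranking_query_engine.query("What did the author do growing up?")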

We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.

from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)

Response Evaluation

The retrieval evaluations demonstrate that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.

from phoenix.session.evaluation import get_qa_with_reference

qa_with_reference_df = get_qa_with_reference(px.active_session())
qa_with_reference_df
Now that we have a dataset of the question, context, and response (input, reference, and output), we can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the LLM Evals documentation.

from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))

qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)
qa_correctness_eval_df.head()
hallucination_eval_df.head()

Observations

Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

qa_correctness_eval_df.mean(numeric_only=True)
score    0.931034
dtype: float64
hallucination_eval_df.mean(numeric_only=True)
score    0.051724
dtype: float64

Our QA correctness score of 0.93 and hallucination score of 0.05 signify that the generated answers are correct ~93% of the time and that the responses contain hallucinations ~5% of the time, so there is room for improvement. This could be due to the retrieval strategy or to the LLM itself. We will need to investigate further to determine the root cause.

Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.

from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
Sending Evaluations: 100%|██████████| 348/348 [00:00<00:00, 415.37it/s]

We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response.

print("phoenix URL", px.active_session().url)
phoenix URL http://localhost:6006/

Conclusion

We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and generated responses within the pipeline.

Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM application. For more details, see the LLM Evals documentation.

Evaluate an Agent

This notebook serves as an end-to-end example of how to trace and evaluate an agent, using a "talk-to-your-data" agent as the running example.

The notebook shows examples of:

  • Manually instrumenting an agent using Phoenix decorators

  • Evaluating function calling accuracy using LLM as a Judge

  • Evaluating function calling accuracy by comparing to ground truth

  • Evaluating SQL query generation

  • Evaluating Python code generation

  • Evaluating the path of an agent

Install Dependencies, Import Libraries, Set API Keys

!pip install -q openai "arize-phoenix>=8.8.0" "arize-phoenix-otel>=0.8.0" openinference-instrumentation-openai python-dotenv duckdb "openinference-instrumentation>=0.1.21"
import dotenv

dotenv.load_dotenv()

import json
import os
from getpass import getpass

import duckdb
import pandas as pd
from IPython.display import Markdown
from openai import OpenAI
from openinference.instrumentation import (
    suppress_tracing,
)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import StatusCode
from pydantic import BaseModel, Field
from tqdm import tqdm

from phoenix.otel import register
if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI()
model = "gpt-4o-mini"
project_name = "talk-to-your-data-agent"

Enable Phoenix Tracing

if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"
tracer_provider = register(
    project_name=project_name,
)

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

tracer = tracer_provider.get_tracer(__name__)

Prepare dataset

Your agent will interact with a local database. Start by loading in that data:

store_sales_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/llama-index/Store_Sales_Price_Elasticity_Promotions_Data.parquet"
)
store_sales_df.head()

Define the tools

Now you can define your agent tools.

Tool 1: Database Lookup

SQL_GENERATION_PROMPT = """
Generate an SQL query based on a prompt. Do not reply with anything besides the SQL query.
The prompt is: {prompt}

The available columns are: {columns}
The table name is: {table_name}
"""


def generate_sql_query(prompt: str, columns: list, table_name: str) -> str:
    """Generate an SQL query based on a prompt"""
    formatted_prompt = SQL_GENERATION_PROMPT.format(
        prompt=prompt, columns=columns, table_name=table_name
    )

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": formatted_prompt}],
    )

    return response.choices[0].message.content


@tracer.tool()
def lookup_sales_data(prompt: str) -> str:
    """Implementation of sales data lookup from parquet file using SQL"""
    try:
        table_name = "sales"
        # Read the parquet file into a DuckDB table
        duckdb.sql(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM store_sales_df")

        print(store_sales_df.columns)
        print(table_name)
        sql_query = generate_sql_query(prompt, store_sales_df.columns, table_name)
        sql_query = sql_query.strip()
        sql_query = sql_query.replace("```sql", "").replace("```", "")

        with tracer.start_as_current_span(
            "execute_sql_query", openinference_span_kind="chain"
        ) as span:
            span.set_input(value=sql_query)

            # Execute the SQL query
            result = duckdb.sql(sql_query).df()
            span.set_output(value=str(result))
            span.set_status(StatusCode.OK)
        return result.to_string()
    except Exception as e:
        return f"Error accessing data: {str(e)}"
example_data = lookup_sales_data("Show me all the sales for store 1320 on November 1st, 2021")
example_data

Tool 2: Data Visualization

class VisualizationConfig(BaseModel):
    chart_type: str = Field(..., description="Type of chart to generate")
    x_axis: str = Field(..., description="Name of the x-axis column")
    y_axis: str = Field(..., description="Name of the y-axis column")
    title: str = Field(..., description="Title of the chart")


@tracer.chain()
def extract_chart_config(data: str, visualization_goal: str) -> dict:
    """Generate chart visualization configuration

    Args:
        data: String containing the data to visualize
        visualization_goal: Description of what the visualization should show

    Returns:
        Dictionary containing line chart configuration
    """
    prompt = f"""Generate a chart configuration based on this data: {data}
    The goal is to show: {visualization_goal}"""

    response = client.beta.chat.completions.parse(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format=VisualizationConfig,
    )

    try:
        # Extract the parsed VisualizationConfig from the structured-output response
        parsed_config = response.choices[0].message.parsed

        # Return structured chart config
        return {
            "chart_type": parsed_config.chart_type,
            "x_axis": parsed_config.x_axis,
            "y_axis": parsed_config.y_axis,
            "title": parsed_config.title,
            "data": data,
        }
    except Exception:
        return {
            "chart_type": "line",
            "x_axis": "date",
            "y_axis": "value",
            "title": visualization_goal,
            "data": data,
        }


@tracer.chain()
def create_chart(config: VisualizationConfig) -> str:
    """Create a chart based on the configuration"""
    prompt = f"""Write python code to create a chart based on the following configuration.
    Only return the code, no other text.
    config: {config}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    code = response.choices[0].message.content
    code = code.replace("```python", "").replace("```", "")
    code = code.strip()

    return code


@tracer.tool()
def generate_visualization(data: str, visualization_goal: str) -> str:
    """Generate a visualization based on the data and goal"""
    config = extract_chart_config(data, visualization_goal)
    code = create_chart(config)
    return code
# code = generate_visualization(example_data, "A line chart of sales over each day in november.")
@tracer.tool()
def run_python_code(code: str) -> str:
    """Execute Python code in a restricted environment"""
    # Create restricted globals/locals dictionaries with plotting libraries
    restricted_globals = {
        "__builtins__": {
            "print": print,
            "len": len,
            "range": range,
            "sum": sum,
            "min": min,
            "max": max,
            "int": int,
            "float": float,
            "str": str,
            "list": list,
            "dict": dict,
            "tuple": tuple,
            "set": set,
            "round": round,
            "__import__": __import__,
            "json": __import__("json"),
        },
        "plt": __import__("matplotlib.pyplot"),
        "pd": __import__("pandas"),
        "np": __import__("numpy"),
        "sns": __import__("seaborn"),
    }

    try:
        # Execute code in restricted environment
        exec_locals = {}
        exec(code, restricted_globals, exec_locals)

        # Capture any printed output or return the plot
        exec_locals.get("__builtins__", {}).get("_", "")
        if "plt" in exec_locals:
            return exec_locals["plt"]

        # Try to parse output as JSON before returning
        return "Code executed successfully"

    except Exception as e:
        return f"Error executing code: {str(e)}"

Tool 3: Data Analysis

@tracer.tool()
def analyze_sales_data(prompt: str, data: str) -> str:
    """Implementation of AI-powered sales data analysis"""
    # Construct prompt based on analysis type and data subset
    prompt = f"""Analyze the following data: {data}
    Your job is to answer the following question: {prompt}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    analysis = response.choices[0].message.content
    return analysis if analysis else "No analysis could be generated"
# analysis = analyze_sales_data("What is the most popular product SKU?", example_data)
# analysis

Tool Schema:

You'll need to pass your tool descriptions into your agent router. The following code allows you to easily do so:

# Define tools/functions that can be called by the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",
            "description": "Look up data from Store Sales Price Elasticity Promotions dataset",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    }
                },
                "required": ["prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "analyze_sales_data",
            "description": "Analyze sales data to extract insights",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    },
                },
                "required": ["data", "prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_visualization",
            "description": "Generate Python code to create data visualizations",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "visualization_goal": {
                        "type": "string",
                        "description": "The goal of the visualization.",
                    },
                },
                "required": ["data", "visualization_goal"],
            },
        },
    },
    # {
    #     "type": "function",
    #     "function": {
    #         "name": "run_python_code",
    #         "description": "Run Python code in a restricted environment",
    #         "parameters": {
    #             "type": "object",
    #             "properties": {
    #                 "code": {"type": "string", "description": "The Python code to run."}
    #             },
    #             "required": ["code"]
    #         }
    #     }
    # }
]

# Dictionary mapping function names to their implementations
tool_implementations = {
    "lookup_sales_data": lookup_sales_data,
    "analyze_sales_data": analyze_sales_data,
    "generate_visualization": generate_visualization,
    # "run_python_code": run_python_code
}

Agent logic

With the tools defined, you're ready to define the main routing and tool call handling steps of your agent.

@tracer.chain()
def handle_tool_calls(tool_calls, messages):
    for tool_call in tool_calls:
        function = tool_implementations[tool_call.function.name]
        function_args = json.loads(tool_call.function.arguments)
        result = function(**function_args)

        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
    return messages
def start_main_span(messages):
    print("Starting main span with messages:", messages)

    with tracer.start_as_current_span("AgentRun", openinference_span_kind="agent") as span:
        span.set_input(value=messages)
        ret = run_agent(messages)
        print("Main span completed with return value:", ret)
        span.set_output(value=ret)
        span.set_status(StatusCode.OK)
        return ret


def run_agent(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router call span")
        with tracer.start_as_current_span(
            "router_call",
            openinference_span_kind="chain",
        ) as span:
            span.set_input(value=messages)

            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=tools,
            )

            messages.append(response.choices[0].message.model_dump())
            tool_calls = response.choices[0].message.tool_calls
            print("Received response with tool calls:", bool(tool_calls))
            span.set_status(StatusCode.OK)

            if tool_calls:
                # Tool calls span
                print("Processing tool calls")
                messages = handle_tool_calls(tool_calls, messages)
                span.set_output(value=tool_calls)
            else:
                print("No tool calls, returning final response")
                span.set_output(value=response.choices[0].message.content)

                return response.choices[0].message.content

Run the agent

Your agent is now good to go! Let's try it out with some example questions:

ret = start_main_span([{"role": "user", "content": "Create a line chart showing sales in 2021"}])
display(Markdown(ret))
agent_questions = [
    "What was the most popular product SKU?",
    "What was the total revenue across all stores?",
    "Which store had the highest sales volume?",
    "Create a bar chart showing total sales by store",
    "What percentage of items were sold on promotion?",
    "Plot daily sales volume over time",
    "What was the average transaction value?",
    "Create a box plot of transaction values",
    "Which products were frequently purchased together?",
    "Plot a line graph showing the sales trend over time with a 7-day moving average",
]

for question in tqdm(agent_questions, desc="Processing questions"):
    try:
        ret = start_main_span([{"role": "user", "content": question}])
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        continue

Evaluating the agent

So your agent looks like it's working, but how can you measure its performance?

OpenAIInstrumentor().uninstrument()  # Uninstrument the OpenAI client to avoid capturing LLM as a Judge evaluation calls in your same project.
import nest_asyncio

import phoenix as px
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify
from phoenix.experiments import evaluate_experiment, run_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

nest_asyncio.apply()
px_client = px.Client()
eval_model = OpenAIModel(model="gpt-4o-mini")

Function Calling Evals using LLM as a Judge

This first evaluation uses another LLM as a judge to grade your agent's router choices.

It follows a standard pattern:

  1. Export traces from Phoenix

  2. Prepare those exported traces in a dataframe with the correct columns

  3. Use llm_classify to run a standard template across each row of that dataframe and produce an eval label

  4. Upload the results back into Phoenix

query = (
    SpanQuery()
    .where(
        "span_kind == 'LLM'",
    )
    .select(question="input.value", output_messages="llm.output_messages")
)

# The Phoenix Client can take this query and return the dataframe.
tool_calls_df = px.Client().query_spans(query, project_name=project_name, timeout=None)
tool_calls_df.dropna(subset=["output_messages"], inplace=True)


def get_tool_call(outputs):
    if outputs[0].get("message").get("tool_calls"):
        return (
            outputs[0]
            .get("message")
            .get("tool_calls")[0]
            .get("tool_call")
            .get("function")
            .get("name")
        )
    else:
        return "No tool used"


tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_call)
tool_calls_df.head()
tool_call_eval = llm_classify(
    dataframe=tool_calls_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
        "{tool_definitions}",
        "generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
    ),
    rails=["correct", "incorrect"],
    model=eval_model,
    provide_explanation=True,
)

tool_call_eval["score"] = tool_call_eval.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

tool_call_eval.head()
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)

You should now see eval labels in Phoenix.

Function Calling Evals using Ground Truth

The above example works; however, if you have ground-truth labeled data, you can use it to get an even more accurate measure of your router's performance by running an experiment.

Experiments also follow a standard step-by-step process in Phoenix:

  1. Create a dataset of test cases, and optionally, expected outputs

  2. Create a task to run on each test case - usually this is invoking your agent or a specific step of it

  3. Create evaluator(s) to run on each output of your task

  4. Visualize results in Phoenix

import uuid

id = str(uuid.uuid4())

agent_tool_responses = {
    "What was the most popular product SKU?": "lookup_sales_data, analyze_sales_data",
    "What was the total revenue across all stores?": "lookup_sales_data, analyze_sales_data",
    "Which store had the highest sales volume?": "lookup_sales_data, analyze_sales_data",
    "Create a bar chart showing total sales by store": "generate_visualization, lookup_sales_data, run_python_code",
    "What percentage of items were sold on promotion?": "lookup_sales_data, analyze_sales_data",
    "Plot daily sales volume over time": "generate_visualization, lookup_sales_data, run_python_code",
    "What was the average transaction value?": "lookup_sales_data, analyze_sales_data",
    "Create a box plot of transaction values": "generate_visualization, lookup_sales_data, run_python_code",
    "Which products were frequently purchased together?": "lookup_sales_data, analyze_sales_data",
    "Plot a line graph showing the sales trend over time with a 7-day moving average": "generate_visualization, lookup_sales_data, run_python_code",
}


tool_calling_df = pd.DataFrame(agent_tool_responses.items(), columns=["question", "tool_calls"])
dataset = px_client.upload_dataset(
    dataframe=tool_calling_df,
    dataset_name=f"tool_calling_ground_truth_{id}",
    input_keys=["question"],
    output_keys=["tool_calls"],
)

For your task, you can simply run just the router call of your agent:

def run_router_step(example: Example) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
    ]
    messages.append({"role": "user", "content": example.input.get("question")})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=tools,
    )
    tool_calls = []
    for tool_call in response.choices[0].message.tool_calls:
        tool_calls.append(tool_call.function.name)
    return tool_calls

Your evaluator can also be simple, since you have expected outputs. If you didn't have those expected outputs, you could instead use an LLM as a Judge here, or even basic code:

def tools_match(expected: str, output: str) -> bool:
    expected_tools = expected.get("tool_calls").split(", ")
    return expected_tools == output
experiment = run_experiment(
    dataset,
    run_router_step,
    evaluators=[tools_match],
    experiment_name="Tool Calling Eval",
    experiment_description="Evaluating the tool calling step of the agent",
)
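
If exact, order-sensitive matching is too strict for your use case, you can also score partial overlap with plain code. Below is a minimal sketch (the tools_overlap name and the Jaccard-style scoring are illustrative additions, not part of the original tutorial):

def tools_overlap(expected: str, output: str) -> float:
    # Compare the set of expected tools to the set the router actually selected,
    # ignoring order and duplicates (Jaccard overlap).
    expected_tools = set(expected.get("tool_calls").split(", "))
    output_tools = set(output or [])
    if not expected_tools and not output_tools:
        return 1.0
    return len(expected_tools & output_tools) / len(expected_tools | output_tools)

You could pass this evaluator alongside tools_match in the evaluators list above if you want both a strict and a lenient score.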

Tool Evals

The next piece of your agent to evaluate is its tools. Each tool is usually evaluated differently - we've included some examples below. If you need other ideas, Phoenix's built-in evaluators can give you an idea of other metrics to use.

Evaluating our SQL generation tool

# This step will be replaced by a human annotated set of ground truth data, instead of generated examples

db_lookup_questions = [
    "What was the most popular product SKU?",
    "Which store had the highest total sales value?",
    "How many items were sold on promotion?",
    "What was the average quantity sold per transaction?",
    "Which product class code generated the most revenue?",
    "What day of the week had the highest sales volume?",
    "How many unique stores made sales?",
    "What was the highest single transaction value?",
    "Which products were frequently sold together?",
    "What's the trend in sales over time?",
]

expected_results = []

# Iterate over a copy so that failing questions can be removed from the list safely
for question in tqdm(db_lookup_questions[:], desc="Processing SQL lookup questions"):
    try:
        with suppress_tracing():
            expected_results.append(lookup_sales_data(question))
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        db_lookup_questions.remove(question)

# Create a DataFrame with the questions
questions_df = pd.DataFrame({"question": db_lookup_questions, "expected_result": expected_results})

display(questions_df)
dataset = px_client.upload_dataset(
    dataframe=questions_df,
    dataset_name=f"sales_db_lookup_questions_{id}",
    input_keys=["question"],
    output_keys=["expected_result"],
)
def run_sql_query(example: Example) -> str:
    with suppress_tracing():
        return lookup_sales_data(example.input.get("question"))
def evaluate_sql_result(output: str, expected: str) -> bool:
    # Extract just the numbers from both strings
    result_nums = "".join(filter(str.isdigit, output))
    expected_nums = "".join(filter(str.isdigit, expected.get("expected_result")))
    return result_nums == expected_nums
experiment = run_experiment(
    dataset,
    run_sql_query,
    evaluators=[evaluate_sql_result],
    experiment_name="SQL Query Eval",
    experiment_description="Evaluating the SQL query generation step of the agent",
)
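
The digit-only comparison above is intentionally loose - it ignores column names and formatting - but it can also be brittle when results differ only by rounding. If that matters for your data, a slightly stricter variant (illustrative only; the evaluate_sql_result_numeric name and tolerance are assumptions) could compare the numeric tokens with a small tolerance:

import re

def evaluate_sql_result_numeric(output: str, expected: str) -> bool:
    # Parse the numbers out of both result strings and compare them pairwise
    # with a small relative tolerance, so minor rounding differences still match.
    result_nums = [float(n) for n in re.findall(r"-?\d+\.?\d*", output)]
    expected_nums = [float(n) for n in re.findall(r"-?\d+\.?\d*", expected.get("expected_result"))]
    if len(result_nums) != len(expected_nums):
        return False
    return all(
        abs(a - b) <= 1e-6 * max(1.0, abs(b)) for a, b in zip(result_nums, expected_nums)
    )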

Evaluating our Python code generation tool

# Replace this with a human annotated set of ground truth data, instead of generated examples

code_generation_questions = [
    "Create a bar chart showing total sales by store",
    "Plot daily sales volume over time",
    "Plot a line graph showing the sales trend over time with a 7-day moving average",
    "Create a histogram of quantities sold per transaction",
    "Generate a pie chart showing sales distribution across product classes",
    "Create a stacked bar chart showing promotional vs non-promotional sales by store",
    "Generate a heatmap of sales by day of week and store number",
    "Plot a line chart comparing sales trends between top 5 stores",
]

example_data = []
chart_configs = []
for question in tqdm(code_generation_questions[:], desc="Processing code generation questions"):
    try:
        with suppress_tracing():
            example_data.append(lookup_sales_data(question))
            chart_configs.append(json.dumps(extract_chart_config(example_data[-1], question)))
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        code_generation_questions.remove(question)

code_generation_df = pd.DataFrame(
    {
        "question": code_generation_questions,
        "example_data": example_data,
        "chart_configs": chart_configs,
    }
)

dataset = px_client.upload_dataset(
    dataframe=code_generation_df,
    dataset_name=f"code_generation_questions_{id}",
    input_keys=["question", "example_data"],
    output_keys=["chart_configs"],
)
def run_code_generation(example: Example) -> str:
    with suppress_tracing():
        chart_config = extract_chart_config(
            data=example.input.get("example_data"), visualization_goal=example.input.get("question")
        )
        code = generate_visualization(
            visualization_goal=example.input.get("question"), data=example.input.get("example_data")
        )

    return {"code": code, "chart_config": chart_config}

In this case, you don't have ground truth data to compare against. Instead, you can use a simple code-based evaluator: try to run the generated code and catch any errors.

def code_is_runnable(output: str) -> bool:
    """Check if the code is runnable"""
    output = output.get("code")
    output = output.strip()
    output = output.replace("```python", "").replace("```", "")
    try:
        exec(output)
        return True
    except Exception:
        return False
def evaluate_chart_config(output: dict, expected: dict) -> bool:
    return output.get("chart_config") == json.loads(expected.get("chart_configs"))
experiment = run_experiment(
    dataset,
    run_code_generation,
    evaluators=[code_is_runnable, evaluate_chart_config],
    experiment_name="Code Generation Eval",
    experiment_description="Evaluating the code generation step of the agent",
)
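
Keep in mind that exec runs the generated code with full access to your notebook environment. That's acceptable for this tutorial, but if you adapt this evaluator, you may want to confine execution to a separate process. Here is a rough sketch (the helper name, temp-file approach, and 30-second timeout are assumptions; code that relies on variables defined in your notebook would fail in a subprocess):

import subprocess
import sys
import tempfile

def code_runs_in_subprocess(output: dict) -> bool:
    # Write the generated code to a temporary file and run it in its own Python
    # process with a timeout, so hangs or crashes can't affect the notebook.
    code = output.get("code", "").replace("```python", "").replace("```", "").strip()
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        completed = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        return completed.returncode == 0
    except subprocess.TimeoutExpired:
        return False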

Evaluating the agent path and convergence

Finally, the last piece of your agent to evaluate is its path. This is important for understanding how efficient your agent is in its execution. Does it need to call the same tool multiple times? Does it skip steps it shouldn't and have to backtrack later? Convergence or path evals can tell you this.

Convergence evals operate slightly differently. The one you'll use below relies on knowing the minimum number of steps taken by the agent for a given type of query. Instead of evaluating during the experiment, you'll run the experiment first and then, after it completes, attach a second evaluator to calculate convergence.

The workflow is as follows:

  1. Create a dataset of the same type of question, phrased in different ways - the agent should take the same path for each, but you'll often find it doesn't.

  2. Create a task that runs the agent on each question, while tracking the number of steps it takes.

  3. Run the experiment without an evaluator.

  4. Calculate the minimum number of steps taken to complete the task.

  5. Create an evaluator that compares the steps taken of each run against that min step number.

  6. Run this evaluator on your experiment from step 3.

  7. View your results in Phoenix
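
For example, if the shortest run in your experiment completes in 5 messages and another run takes 7, that run's convergence score is 5 / 7 ≈ 0.71 - a score of 1.0 means the run matched the optimal path.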

# Replace this with a human annotated set of ground truth data, instead of generated examples

convergence_questions = [
    "What was the average quantity sold per transaction?",
    "What is the mean number of items per sale?",
    "Calculate the typical quantity per transaction",
    "Show me the average number of units sold in each transaction",
    "What's the mean transaction size in terms of quantity?",
    "On average, how many items were purchased per transaction?",
    "What is the average basket size per sale?",
    "Calculate the mean number of products per purchase",
    "What's the typical number of units per order?",
    "Find the average quantity of items in each transaction",
    "What is the average number of products bought per purchase?",
    "Tell me the mean quantity of items in a typical transaction",
    "How many items does a customer buy on average per transaction?",
    "What's the usual number of units in each sale?",
    "Calculate the average basket quantity per order",
    "What is the typical amount of products per transaction?",
    "Show the mean number of items customers purchase per visit",
    "What's the average quantity of units per shopping trip?",
    "How many products do customers typically buy in one transaction?",
    "What is the standard basket size in terms of quantity?",
]

convergence_df = pd.DataFrame({"question": convergence_questions})

dataset = px_client.upload_dataset(
    dataframe=convergence_df, dataset_name="convergence_questions", input_keys=["question"]
)
def format_message_steps(messages):
    """
    Convert a list of message objects into a readable format that shows the steps taken.

    Args:
        messages (list): A list of message objects containing role, content, tool calls, etc.

    Returns:
        str: A readable string showing the steps taken.
    """
    steps = []
    for message in messages:
        role = message.get("role")
        if role == "user":
            steps.append(f"User: {message.get('content')}")
        elif role == "system":
            steps.append("System: Provided context")
        elif role == "assistant":
            if message.get("tool_calls"):
                for tool_call in message["tool_calls"]:
                    tool_name = tool_call["function"]["name"]
                    steps.append(f"Assistant: Called tool '{tool_name}'")
            else:
                steps.append(f"Assistant: {message.get('content')}")
        elif role == "tool":
            steps.append(f"Tool response: {message.get('content')}")

    return "\n".join(steps)
def run_agent_and_track_path(example: Example) -> str:
    print("Starting main span with messages:", example.input.get("question"))
    messages = [{"role": "user", "content": example.input.get("question")}]
    ret = run_agent_messages(messages)
    return {"path_length": len(ret), "messages": format_message_steps(ret)}


def run_agent_messages(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router")

        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )

        messages.append(response.choices[0].message.model_dump())
        tool_calls = response.choices[0].message.tool_calls
        print("Received response with tool calls:", bool(tool_calls))

        if tool_calls:
            print("Processing tool calls")
            messages = handle_tool_calls(tool_calls, messages)
        else:
            print("No tool calls, returning final response")
            return messages
experiment = run_experiment(
    dataset,
    run_agent_and_track_path,
    experiment_name="Convergence Eval",
    experiment_description="Evaluating the convergence of the agent",
)
experiment.as_dataframe()
outputs = experiment.as_dataframe()["output"].to_dict().values()
optimal_path_length = min(
    output.get("path_length")
    for output in outputs
    if output and output.get("path_length") is not None
)
print(f"The optimal path length is {optimal_path_length}")
@create_evaluator(name="Convergence Eval", kind="CODE")
def evaluate_path_length(output: str) -> float:
    if output and output.get("path_length"):
        return optimal_path_length / float(output.get("path_length"))
    else:
        return 0
experiment = evaluate_experiment(experiment, evaluators=[evaluate_path_length])

Advanced - Combining all the evals into our experiment

As an optional final step, you can combine all the evaluators and experiments above into a single experiment. This requires some more advanced data wrangling, but gives you a single report on your agent's performance.

Build a version of our agent that tracks all the necessary information for evals

def process_messages(messages):
    tool_calls = []
    tool_responses = []
    final_output = None

    for i, message in enumerate(messages):
        # Extract tool calls
        if "tool_calls" in message and message["tool_calls"]:
            for tool_call in message["tool_calls"]:
                tool_name = tool_call["function"]["name"]
                tool_input = tool_call["function"]["arguments"]
                tool_calls.append(tool_name)

                # Prepare tool response structure with the call id, tool name, and input
                tool_responses.append(
                    {
                        "tool_call_id": tool_call["id"],
                        "tool_name": tool_name,
                        "tool_input": tool_input,
                        "tool_response": None,
                    }
                )

        # Match each tool response to its originating tool call by id
        if message["role"] == "tool" and "tool_call_id" in message:
            for tool_response in tool_responses:
                if tool_response.get("tool_call_id") == message["tool_call_id"]:
                    tool_response["tool_response"] = message["content"]

        # Extract final output
        if (
            message["role"] == "assistant"
            and not message.get("tool_calls")
            and not message.get("function_call")
        ):
            final_output = message["content"]

    result = {
        "tool_calls": tool_calls,
        "tool_responses": tool_responses,
        "final_output": final_output,
        "unchanged_messages": messages,
        "path_length": len(messages),
    }

    return result
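
To make the downstream evaluators easier to follow, here is roughly what the returned dictionary looks like for a single question (the values below are illustrative, not real output):

# Illustrative shape only - actual values depend on the model's responses.
# {
#     "tool_calls": ["lookup_sales_data", "analyze_sales_data"],
#     "tool_responses": [
#         {"tool_call_id": "call_abc123", "tool_name": "lookup_sales_data",
#          "tool_input": "{\"prompt\": \"...\"}", "tool_response": "..."},
#     ],
#     "final_output": "The most popular product SKU was ...",
#     "unchanged_messages": [...],
#     "path_length": 6,
# }
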
def run_agent_and_track_path_combined(example: Example) -> str:
    print("Starting main span with messages:", example.input.get("question"))
    messages = [{"role": "user", "content": example.input.get("question")}]
    ret = run_agent_messages_combined(messages)
    return process_messages(ret)


def run_agent_messages_combined(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router")

        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )

        messages.append(response.choices[0].message.model_dump())
        tool_calls = response.choices[0].message.tool_calls
        print("Received response with tool calls:", bool(tool_calls))

        if tool_calls:
            print("Processing tool calls")
            messages = handle_tool_calls(tool_calls, messages)
        else:
            print("No tool calls, returning final response")
            return messages
generate_sql_query("What was the most popular product SKU?", store_sales_df.columns, "sales")
overall_experiment_questions = [
    {
        "question": "What was the most popular product SKU?",
        "sql_result": "   SKU_Coded  Total_Qty_Sold 0    6200700         52262.0",
    },
    {
        "question": "What was the total revenue across all stores?",
        "sql_result": "   Total_Revenue 0   1.327264e+07",
    },
    {
        "question": "Which store had the highest sales volume?",
        "sql_result": "   Store_Number  Total_Sales_Volume 0          2970             59322.0",
    },
    {
        "question": "Create a bar chart showing total sales by store",
        "sql_result": "    Store_Number    Total_Sales 0            880  420302.088397 1           1650  580443.007953 2           4180  272208.118542 3            550  229727.498752 4           1100  497509.528013 5           3300  619660.167018 6           3190  335035.018792 7           2970  836341.327191 8           3740  359729.808228 9           2530  324046.518720 10          4400   95745.620250 11          1210  508393.767785 12           330  370503.687331 13          2750  453664.808068 14          1980  242290.828499 15          1760  350747.617798 16          3410  410567.848126 17           990  378433.018639 18          4730  239711.708869 19          4070  322307.968330 20          3080  495458.238811 21          2090  309996.247965 22          1320  592832.067579 23          2640  308990.318559 24          1540  427777.427815 25          4840  389056.668316 26          2860  132320.519487 27          2420  406715.767402 28           770  292968.918642 29          3520  145701.079372 30           660  343594.978075 31          3630  405034.547846 32          2310  412579.388504 33          2200  361173.288199 34          1870  401070.997685",
    },
    {
        "question": "What percentage of items were sold on promotion?",
        "sql_result": "   Promotion_Percentage 0              0.625596",
    },
    {
        "question": "What was the average transaction value?",
        "sql_result": "   Average_Transaction_Value 0                  19.018132",
    },
    {
        "question": "Create a line chart showing sales in 2021",
        "sql_result": "  sale_month  total_quantity_sold  total_sales_value 0 2021-11-01              43056.0      499984.428193 1 2021-12-01              75724.0      910982.118423",
    },
]

for question in overall_experiment_questions:
    question["sql_generated"] = generate_sql_query(
        question["question"], store_sales_df.columns, "sales"
    )

print(overall_experiment_questions[6])

overall_experiment_df = pd.DataFrame(overall_experiment_questions)

# Upload the dataset used by the combined experiment below
dataset = px_client.upload_dataset(
    dataframe=overall_experiment_df,
    dataset_name="overall_experiment_questions_all",
    input_keys=["question"],
    output_keys=["sql_result"],
)

For reference, the uploaded questions, expected SQL results, and generated SQL look similar to the following:
[
    {
        "question": "What was the most popular product SKU?",
        "sql_result": "   SKU_Coded  Total_Qty_Sold 0    6200700         52262.0",
        "sql_generated": "```sql\nSELECT SKU_Coded, SUM(Qty_Sold) AS Total_Qty_Sold\nFROM sales\nGROUP BY SKU_Coded\nORDER BY Total_Qty_Sold DESC\nLIMIT 1;\n```",
    },
    {
        "question": "What was the total revenue across all stores?",
        "sql_result": "   Total_Revenue 0   1.327264e+07",
        "sql_generated": "```sql\nSELECT SUM(Total_Sale_Value) AS Total_Revenue\nFROM sales;\n```",
    },
    {
        "question": "Which store had the highest sales volume?",
        "sql_result": "   Store_Number  Total_Sales_Volume 0          2970             59322.0",
        "sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales_Volume\nFROM sales\nGROUP BY Store_Number\nORDER BY Total_Sales_Volume DESC\nLIMIT 1;\n```",
    },
    {
        "question": "Create a bar chart showing total sales by store",
        "sql_result": "    Store_Number    Total_Sales 0            880  420302.088397 1           1650  580443.007953 2           4180  272208.118542 3            550  229727.498752 4           1100  497509.528013 5           3300  619660.167018 6           3190  335035.018792 7           2970  836341.327191 8           3740  359729.808228 9           2530  324046.518720 10          4400   95745.620250 11          1210  508393.767785 12           330  370503.687331 13          2750  453664.808068 14          1980  242290.828499 15          1760  350747.617798 16          3410  410567.848126 17           990  378433.018639 18          4730  239711.708869 19          4070  322307.968330 20          3080  495458.238811 21          2090  309996.247965 22          1320  592832.067579 23          2640  308990.318559 24          1540  427777.427815 25          4840  389056.668316 26          2860  132320.519487 27          2420  406715.767402 28           770  292968.918642 29          3520  145701.079372 30           660  343594.978075 31          3630  405034.547846 32          2310  412579.388504 33          2200  361173.288199 34          1870  401070.997685",
        "sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nGROUP BY Store_Number;\n```",
    },
    {
        "question": "What percentage of items were sold on promotion?",
        "sql_result": "   Promotion_Percentage 0              0.625596",
        "sql_generated": "```sql\nSELECT \n    (SUM(CASE WHEN On_Promo = 'Yes' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS Promotion_Percentage\nFROM \n    sales;\n```",
    },
    {
        "question": "What was the average transaction value?",
        "sql_result": "   Average_Transaction_Value 0                  19.018132",
        "sql_generated": "```sql\nSELECT AVG(Total_Sale_Value) AS Average_Transaction_Value\nFROM sales;\n```",
    },
    {
        "question": "Create a line chart showing sales in 2021",
        "sql_result": "  sale_month  total_quantity_sold  total_sales_value 0 2021-11-01              43056.0      499984.428193 1 2021-12-01              75724.0      910982.118423",
        "sql_generated": "```sql\nSELECT MONTH(Sold_Date) AS Month, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nWHERE YEAR(Sold_Date) = 2021\nGROUP BY MONTH(Sold_Date)\nORDER BY MONTH(Sold_Date);\n```",
    },
]
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one
that is vague, disorganized, or difficult to understand, even if it may be factually correct.

Your response should be a single word: either "clear" or "unclear," and it should not include any other
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and
appropriately addresses the query. "unclear" indicates that the answer is ambiguous, poorly organized, or
not effectively communicated. Please carefully consider the query and answer before determining your
response.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your
explanation. Your reasoning should include specific points about how the answer does or does not meet the
criteria for clarity.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""

ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to determine whether all
the entities mentioned in the answer are correctly identified and accurately match those in the query. An
entity refers to any specific person, place, organization, date, or other proper noun. Your evaluation
should focus on whether the entities in the answer are correctly named and appropriately associated with
the context in the query.

Your response should be a single word: either "correct" or "incorrect," and it should not include any
other text or characters. "correct" indicates that all entities mentioned in the answer match those in the
query and are properly identified. "incorrect" indicates that the answer contains errors or mismatches in
the entities referenced compared to the query.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "correct" or "incorrect." Avoid stating the final label at the beginning of
your explanation. Your reasoning should include specific points about how the entities in the answer do or
do not match the entities in the query.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating whether the entities in the answer are
correct and consistent with the query.
LABEL: "correct" or "incorrect"
"""
TOOL_CALLING_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json.dumps(tools))
def function_calling_eval(input: str, output: str) -> float:
    function_calls = output.get("tool_calls")
    if function_calls:
        eval_df = pd.DataFrame(
            {"question": [input.get("question")] * len(function_calls), "tool_call": function_calls}
        )

        tool_call_eval = llm_classify(
            dataframe=eval_df,
            template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
                "{tool_definitions}", json.dumps(tools).replace("{", '"').replace("}", '"')
            ),
            rails=["correct", "incorrect"],
            model=eval_model,
            provide_explanation=True,
        )

        tool_call_eval["score"] = tool_call_eval.apply(
            lambda x: 1 if x["label"] == "correct" else 0, axis=1
        )
        return tool_call_eval["score"].mean()
    else:
        return 0


def code_is_runnable(output: str) -> bool:
    """Check if the code is runnable"""
    generated_code = output.get("tool_responses")
    if not generated_code:
        return True

    # Find the first generate_visualization response
    generated_code = next(
        (r for r in generated_code if r.get("tool_name") == "generate_visualization"), None
    )
    if not generated_code:
        return True

    # Get the first response
    generated_code = generated_code.get("tool_response", "")
    generated_code = generated_code.strip()
    generated_code = generated_code.replace("```python", "").replace("```", "")
    try:
        exec(generated_code)
        return True
    except Exception:
        return False


def evaluate_sql_result(output, expected) -> bool:
    sql_result = output.get("tool_responses")
    if not sql_result:
        return True

    # Find first lookup_sales_data response
    sql_result = next((r for r in sql_result if r.get("tool_name") == "lookup_sales_data"), None)
    if not sql_result:
        return True

    # Get the first response
    sql_result = sql_result.get("tool_response", "")

    # Extract just the numbers from both strings
    result_nums = "".join(filter(str.isdigit, sql_result))
    expected_nums = "".join(filter(str.isdigit, expected.get("sql_result")))
    return result_nums == expected_nums


def evaluate_clarity(output: str, input: str) -> bool:
    df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
    response = llm_classify(
        dataframe=df,
        template=CLARITY_LLM_JUDGE_PROMPT,
        rails=["clear", "unclear"],
        model=eval_model,
        provide_explanation=True,
    )
    return response["label"].iloc[0] == "clear"


def evaluate_entity_correctness(output: str, input: str) -> bool:
    df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
    response = llm_classify(
        dataframe=df,
        template=ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT,
        rails=["correct", "incorrect"],
        model=eval_model,
        provide_explanation=True,
    )
    return response["label"].iloc[0] == "correct"
def run_overall_experiment(example: Example) -> str:
    with suppress_tracing():
        return run_agent_and_track_path_combined(example)


experiment = run_experiment(
    dataset,
    run_overall_experiment,
    evaluators=[
        function_calling_eval,
        evaluate_sql_result,
        evaluate_clarity,
        evaluate_entity_correctness,
        code_is_runnable,
    ],
    experiment_name="Overall Experiment",
    experiment_description="Evaluating the overall experiment",
)

Congratulations! 🎉

You've now evaluated every aspect of your agent. If you've made it this far, you're now an expert in evaluating agent routers, tools, and paths!

Sign up for a free instance of Phoenix Cloud to get your API key. If you'd prefer, you can instead self-host Phoenix.
