Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Use embeddings to explore lower-dimensional representations of your data, identifying clusters of high drift and performance degradation. Complement this with statistical analysis of structured data for A/B testing, temporal drift detection, and deeper performance insights.
Chatbot with User Feedback
or
Embeddings Analysis: Data Exploration
Embeddings Analysis: Model Performance
Structured Data Analysis
Iteratively improve your LLM task by building datasets, running experiments, and evaluating performance using code and LLM-as-a-judge.
Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.
For each described in the inference (s), Phoenix serves a embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.
The picture below shows a time series graph of the drift between two groups of vectors –- the primary (typically production) vectors and reference / baseline vectors. Phoenix uses euclidean distance as the primary measure of embedding drift and helps us identify times where your inference set is diverging from a given reference baseline.
Note that when you are troubleshooting search and retrieval using inferences, the euclidean distance of your queries to your knowledge base vectors is presented as query distance.
Moments of high euclidean distance is an indication that the primary inference set is starting to drift from the reference inference set. As the primary inferences move further away from the reference (both in angle and in magnitude), the euclidean distance increases as well. For this reason times of high euclidean distance are a good starting point for trying to identify new anomalies and areas of drift.
For an in-depth guide of euclidean distance and embedding drift, check out
In Phoenix, you can views the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at particular time.
Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called . This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.
When twos are used to initialize phoenix, the clusters are automatically ordered by drift. This means that clusters that are suffering from the highest amount of under-sampling (more in the primary inferences than the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.
Phoenix projects the embeddings you provided into lower dimensional space (3 dimensions) using a dimension reduction algorithm called (stands for Uniform Manifold Approximation and Projection). This lets us understand how your in a visually understandable way.
In addition to the point-cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box phoenix let's you assign colors to the UMAP point-cloud by dimension (features, tags, predictions, actuals), performance (correctness which distinguishes true positives and true negatives from the incorrect predictions), and inference (to highlight areas of drift). This helps you explore your point-cloud from different perspectives depending on what you are looking for.
Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.
Comprehensive Use Cases
RAG Use Cases
Code Generation Agent
Explore a Code Generator Copilot Agent designed to generate, optimize, and validate code.
RAG Agent
Enter a source URL and collect traces in Phoenix to see how a RAG Agent can retrieve and generate accurate responses.
Computer Use Agent
Test out a Computer Use (Operator) Agent built to execute commands, edit files, and manage system operations.
LLM Evaluations
Evaluations Use Cases
Evaluating and Improving RAG Applications
Tracing Applications
Tracing Use Cases
Tracing with Sessions
Open AI Functions
Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense - LLMs are great at inherently capturing the structure of language, so extracting that structure from text using LLM prompting is a low cost, high scale method to pull out relevant data from unstructured text.
One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:
User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?
As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:
{
budget: "low",
location: "San Francisco",
purpose: "pleasure"
}
With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.
Structured extraction is a place where it’s simplest to work directly with the OpenAI function calling API. Open AI functions for structured data extraction recommends providing the following JSON schema object in the form ofparameters_schema
(the desired fields for structured data output).
parameters_schema = {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The desired destination location. Use city, state, and country format when possible. If no destination is provided, return "unstated".',
},
"budget_level": {
"type": "string",
"enum": ["low", "medium", "high", "not_stated"],
"description": 'The desired budget level. If no budget level is provided, return "not_stated".',
},
"purpose": {
"type": "string",
"enum": ["business", "pleasure", "other", "non_stated"],
"description": 'The purpose of the trip. If no purpose is provided, return "not_stated".',
},
},
"required": ["location", "budget_level", "purpose"],
}
function_schema = {
"name": "record_travel_request_attributes",
"description": "Records the attributes of a travel request",
"parameters": parameters_schema,
}
system_message = (
"You are an assistant that parses and records the attributes of a user's travel request."
)
The ChatCompletion
call to Open AI would look like
response = openai.ChatCompletion.create(
model=model,
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": travel_request},
],
functions=[function_schema],
# By default, the LLM will choose whether or not to call a function given the conversation context.
# The line below forces the LLM to call the function so that the output conforms to the schema.
function_call={"name": function_schema["name"]},
)
You can use phoenix spans and traces to inspect the invocation parameters of the function to
verify the inputs to the model in form of the the user message
verify your request to Open AI
verify the corresponding generated outputs from the model match what's expected from the schema and are correct
Point level evaluation is a great starting point, but verifying correctness of extraction at scale or in a batch pipeline can be challenging and expensive. Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:
The diverse nature and format of source data.
The potential absence of a 'ground truth' for comparison.
The intricacies of context and meaning in extracted data.
To learn more about how to evaluate structured extraction applications, head to our documentation on LLM assisted evals!
Few-shot prompting is a powerful technique in prompt engineering that helps LLMs perform tasks more effectively by providing a few examples within the prompt.
Unlike zero-shot prompting, where the model must infer the task with no prior context, or one-shot prompting, where a single example is provided, few-shot prompting leverages multiple examples to guide the model’s responses more accurately.
In this tutorial you will:
Explore how different prompting strategies impact performance in a sentiment analysis task on a dataset of reviews.
Run an evaluation to measure how the prompt affects the model’s performance
Track your how your prompt and experiment changes overtime in Phoenix
By the end of this tutorial, you’ll have a clear understanding of how structured prompting can significantly enhance the results of any application.
⚠️You will need an OpenAI Key for this tutorial.
Let’s get started! 🚀
!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.
import os
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
This dataset contains reviews along with their corresponding sentiment labels. Throughout this notebook, we will use the same dataset to evaluate the impact of different prompting techniques, refining our approach with each iteration.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
from datasets import load_dataset
ds = load_dataset("syeddula/fridgeReviews")["train"]
ds = ds.to_pandas()
ds.head()
import uuid
import phoenix as px
from phoenix.client import Client as PhoenixClient
unique_id = uuid.uuid4()
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=ds,
input_keys=["Review"],
output_keys=["Sentiment"],
dataset_name=f"review-classification-{unique_id}",
)
Zero-shot prompting is a technique where a language model is asked to perform a task without being given any prior examples. Instead, the model relies solely on its pre-trained knowledge to generate a response. This approach is useful when you need quick predictions without providing specific guidance.
In this section, we will apply zero-shot prompting to our sentiment analysis dataset, asking the model to classify reviews as positive, negative, or neutral without any labeled examples. We’ll then evaluate its performance to see how well it can infer the task based on the prompt alone.
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from phoenix.client.types import PromptVersion
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{
"role": "system",
"content": "You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications.",
},
{"role": "user", "content": "{{Review}}"},
],
)
prompt_identifier = "fridge-sentiment-reviews"
prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="A prompt for classifying reviews based on sentiment.",
version=PromptVersion.from_openai(params),
)
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
Next we will define a task and evaluator for the experiment.
Because our dataset has ground truth labels, we can use a simple function to check if the output of the task matches the expected output.
def zero_shot_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(**prompt.format(variables={"Review": input["Review"]}))
return resp.choices[0].message.content.strip()
def evaluate_response(output, expected):
return output.lower() == expected["Sentiment"].lower()
If you’d like to instrument your code, you can run the cell below. While this step isn’t required for running prompts and evaluations, it enables trace visualization for deeper insights into the model’s behavior.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register
tracer_provider = register(project_name="few-shot-examples")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Finally, we run our experiment. We can view the results of the experiment in Phoenix.
import nest_asyncio
from phoenix.experiments import run_experiment
nest_asyncio.apply()
initial_experiment = run_experiment(
dataset,
task=zero_shot_prompt,
evaluators=[evaluate_response],
experiment_description="Zero-Shot Prompt",
experiment_name="zero-shot-prompt",
experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
In the following sections, we refine the prompt to enhance the model's performance and improve the evaluation results on our dataset.
One-shot prompting provides the model with a single example to guide its response. By including a labeled example in the prompt, we give the model a clearer understanding of the task, helping it generate more accurate predictions compared to zero-shot prompting.
In this section, we will apply one-shot prompting to our sentiment analysis dataset by providing one labeled review as a reference. We’ll then evaluate how this small amount of guidance impacts the model’s ability to classify sentiments correctly.
ds = load_dataset("syeddula/fridgeReviews")["test"]
one_shot_example = ds.to_pandas().sample(1)
one_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."
Here is one example of a review and the sentiment:
{examples}
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": one_shot_template.format(examples=one_shot_example)},
{"role": "user", "content": "{{Review}}"},
],
)
one_shot_prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="One-shot prompt for classifying reviews based on sentiment.",
version=PromptVersion.from_openai(params),
)
Under the prompts tab in Phoenix, we can see that our prompt has an updated version. The prompt includes one random example from the test dataset to help the model make its classification.
Similar to the previous step, we will define the task and run the evaluator. This time, we will be using our updated prompt for One-Shot Prompting and see how the evaluation changes.
def one_shot_prompt_template(input):
client = OpenAI()
resp = client.chat.completions.create(
**one_shot_prompt.format(variables={"Review": input["Review"]})
)
return resp.choices[0].message.content.strip()
one_shot_experiment = run_experiment(
dataset,
task=one_shot_prompt_template,
evaluators=[evaluate_response],
experiment_description="One-Shot Prompting",
experiment_name="one-shot-prompt",
experiment_metadata={"prompt": "prompt_id=" + one_shot_prompt.id},
)
In this run, we observe a slight improvement in the evaluation results. Let’s see if we can further enhance performance in the next section.
Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM, the randomness of selected test examples, and other inherent model behaviors.
Finally, we will explore few-shot Prompting which enhances a model’s performance by providing multiple labeled examples within the prompt. By exposing the model to several instances of the task, it gains a better understanding of the expected output, leading to more accurate and consistent responses.
In this section, we will apply few-shot prompting to our sentiment analysis dataset by including multiple labeled reviews as references. This approach helps the model recognize patterns and improves its ability to classify sentiments correctly. We’ll then evaluate its performance to see how additional examples impact accuracy compared to zero-shot and one-shot prompting.
ds = load_dataset("syeddula/fridgeReviews")["test"]
few_shot_examples = ds.to_pandas().sample(10)
few_shot_template = """
"You are an evaluator who assesses the sentiment of a review. Output if the review positive, negative, or neutral. Only respond with one of these classifications."
Here are examples of a review and the sentiment:
{examples}
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": few_shot_template.format(examples=few_shot_examples)},
{"role": "user", "content": "{{Review}}"},
],
)
few_shot_prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Few-shot prompt for classifying reviews based on sentiment.",
version=PromptVersion.from_openai(params),
)
Our updated prompt also lives in Phoenix. We can clearly see how the linear version history of our prompt was built.
Just like previous steps, we run our task and evaluation.
def few_shot_prompt_template(input):
client = OpenAI()
resp = client.chat.completions.create(
**few_shot_prompt.format(variables={"Review": input["Review"]})
)
return resp.choices[0].message.content.strip()
few_shot_experiment = run_experiment(
dataset,
task=few_shot_prompt_template,
evaluators=[evaluate_response],
experiment_description="Few Shot Prompting",
experiment_name="few-shot-prompt",
experiment_metadata={"prompt": "prompt_id=" + few_shot_prompt.id},
)
In this final run, we observe the most significant improvement in evaluation results. By incorporating multiple examples into our prompt, we provide clearer guidance to the model, leading to better sentiment classification.
Note: Performance may still vary, and in some cases, results might decline. Like before, this is not necessarily "wrong," as factors like the choice of LLM, the randomness of selected test examples, and inherent model behaviors can all influence outcomes.
From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️
How to leverage human annotations to build evaluations and experiments that improve your system
In this tutorial, we will explore how to build a custom human annotation interface for Phoenix using . We will then leverage those annotations to construct experiments and evaluate your application.
The purpose of a custom annotations UI is to make it easy for anyone to provide structured human feedback on traces, capturing essential details directly in Phoenix. Annotations are vital for collecting feedback during human review, enabling iterative improvement of your LLM applications.
By establishing this feedback loop and an evaluation pipeline, you can effectively monitor and enhance your system’s performance.
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or video above.
We will generate some LLM traces and send them to Phoenix. We will then annotate these traces to add labels, scores, or explanations directly onto specific spans.
We deliberately generate some bad or nonsensical traces in the system prompt to demonstrate annotating and experimenting with different types of results.
Visit our implementation here:
How to annotate your traces in Lovable:
Enter your Phoenix Cloud endpoint, API key, and project name. Optionally, also include an identifier to tie annotations to a specific user.
Click Refresh Traces.
Select the traces you want to annotate and click Send to Phoenix.
See your annotations appear instantly in Phoenix.
This tool was built using the Phoenix . For more details on how to build your own custom annotations tool to fit your needs, see .
Next, you will construct an LLM-as-a-Judge template to evaluate your experiments. This evaluator will mark nonsensical outputs as incorrect. As you experiment, you’ll see evaluation results improve. Once your annotated trace dataset shows consistent improvement, you can confidently apply these changes to your production system.
The next step is to form a hypothesis about why some outputs are failing. In our full walkthrough, we demonstrate the experimentation process by testing out different hypotheses such as swapping out models. However, for demonstration purposes, we will show an experiment that will almost certainly improve your results: modifying the weak system prompt we originally used.
Here, we expect to see improvements in our experiment. The evaluator should flag significantly fewer nonsensical answers as you have refined your system prompt.
Now that we’ve completed a successful experimentation cycle and confirmed our improvements on the annotated traces dataset, we can update the application and test the results on the broader dataset. This helps ensure that improvements made during experimentation translate effectively to real-world usage and that your system performs reliably at scale.
Here is a sample prompt you can feed into (or a similar tool) to start building your custom LLM trace annotation interface. Feel free to adjust it to your needs. Note that you will need to implement functionality to fetch spans and send annotations to Phoenix. We’ve also included a brief explanation of how we approached this in our own implementation. A tool like this can benefit teams that want to collect human annotation data without requiring annotators to work directly within the Phoenix platform. You can also configure features like “thumbs up” and “thumbs down” buttons to streamline filling in annotation fields. Once submitted, the annotations immediately appear in Phoenix.
Prompt for Lovable:
Build a platform for annotating LLM spans and traces:
Connect to Phoenix Cloud by collecting endpoint, API Key, and project name from the user
Load traces and spans from Phoenix (via or ).
Display spans grouped by trace_id, with clear visual separation.
Allow annotators to assign a label, score, and explanation to each span or entire trace.
Support sending annotations back to Phoenix and reloading to see updates.
Use a clean, modern design
Details on how we built our Annotation UI:
✅ Frontend (Lovable):
Built in Lovable for easy UI generation.
Allows loading LLM traces, displaying spans grouped by trace_id, and annotating spans with label, score, explanation.
✅ Backend (Render, FastAPI):
Hosted on Render using FastAPI.
Adds CORS for your Lovable frontend to communicate securely.
Uses two key endpoints:
GET /v1/projects/{project_identifier}/spans
POST /v1/span_annotations
questions = [
"What is the capital of France?",
"Who wrote 'Pride and Prejudice'?",
"What is the boiling point of water in Celsius?",
"What is the largest planet in our solar system?",
"Who developed the theory of relativity?",
"What is the chemical symbol for gold?",
"In which year did the Apollo 11 mission land on the moon?",
"What language has the most native speakers worldwide?",
"Which continent has the most countries?",
"What is the square root of 144?",
"What is the largest country in the world by land area?",
"Why is the sky blue?",
"Who painted the Mona Lisa?",
"What is the smallest prime number?",
"What gas do plants absorb from the atmosphere?",
"Who was the first President of the United States?",
"What is the currency of Japan?",
"How many continents are there on Earth?",
"What is the tallest mountain in the world?",
"Who is the author of '1984'?",
]
from openai import OpenAI
openai_client = OpenAI()
# System prompt
system_prompt = """
You are a question-answering assistant. For each user question, randomly choose an option: NONSENSE or RHYME. If you choose RHYME, answer correctly in the form of a rhyme.
If it NONSENSE, do not answer the question at all, and instead respond with nonsense words and random numbers that do not rhyme, ignoring the user’s question completely.
When responding with NONSENSE, include at least five nonsense words and at least five random numbers between 0 and 9999 in your response.
Do not explain your choice.
"""
# Run through the dataset and collect spans
for question in questions:
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": question},
],
)
import pandas as pd
import phoenix as px
from phoenix.client import Client
from phoenix.client.types import spans
client = Client()
# replace "correctness" if you chose to annotate on different criteria
query = spans.SpanQuery().where("annotations['correctness']")
spans_df = client.spans.get_spans_dataframe(query=query, project_identifier="my-annotations-app")
dataset = px.Client().upload_dataset(
dataframe=spans_df,
dataset_name="annotated-rhymes",
input_keys=["attributes.input.value"],
output_keys=["attributes.llm.output_messages"],
)
RHYME_PROMPT_TEMPLATE = """
Examine the assistant’s responses in the conversation and determine whether the assistant used rhyme in any of its responses.
Rhyme means that the assistant’s response contains clear end rhymes within or across lines. This should be applicable to the entire response.
There should be no irrelevant phrases or numbers in the response.
Determine whether the rhyme is high quality or forced in addition to checking for the presence of rhyme.
This is the criteria for determining a well-written rhyme.
If none of the assistant's responses contain rhyme, output that the assistant did not rhyme.
[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {answer}
[END DATA]
Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means the response contained a well written rhyme.
"incorrect" means the response did not contain a rhyme.
"""
system_prompt = '''
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
'''
def updated_task(example: Example) -> str:
raw_input_value = example.input["attributes.input.value"]
data = json.loads(raw_input_value)
question = data["messages"][1]["content"]
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": question},
],
)
return response.choices[0].message.content
def evaluate_response(input, output):
raw_input_value = input["attributes.input.value"]
data = json.loads(raw_input_value)
question = data["messages"][1]["content"]
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"question": question, "answer": output}]),
template=RHYME_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-4.1"),
rails=["correct", "incorrect"],
provide_explanation=True,
)
score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
return score
experiment = run_experiment(
dataset,
task=updated_task,
evaluators=[evaluate_response],
experiment_name="updated system prompt",
experiment_description="updated system prompt",
)
system_prompt = """
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
"""
# Run through the dataset and collect spans
def complete_task(question) -> str:
question_str = question["Questions"]
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": question_str},
],
)
return response.choices[0].message.content
def evaluate_all_responses(input, output):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"question": input["Questions"], "answer": output}]),
template=RHYME_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-4o"),
rails=["correct", "incorrect"],
provide_explanation=True,
)
score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
return score
experiment = run_experiment(
dataset=dataset, #full dataset of questions
task=complete_task,
evaluators=[evaluate_all_responses],
experiment_name="modified-system-prompt-full-dataset",
)
Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.
In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:
Upload a dataset of examples containing articles and human-written reference summaries to Phoenix
Define an experiment task that summarizes a news article
Devise evaluators for length and ROUGE score
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs
⚠️ This tutorial requires and OpenAI API key, and optionally, an Anthropic API key.
Let's get started!
Install requirements and import libraries.
pip install anthropic "arize-phoenix>=4.6.0" openai openinference-instrumentation-openai rouge tiktoken
from typing import Any, Dict
import nest_asyncio
import pandas as pd
nest_asyncio.apply() # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", None) # display full cells of dataframes
Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.
import phoenix as px
px.launch_app()
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Download your data from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM generated summaries.
Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.
from datetime import datetime
from datasets import load_dataset
hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
df = (
hf_ds["test"]
.to_pandas()
.sample(n=10, random_state=0)
.set_index("id")
.rename(columns={"highlights": "summary"})
)
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
dataset = px.Client().upload_dataset(
dataframe=df,
input_keys=["article"],
output_keys=["summary"],
dataset_name=f"news-article-summaries-{now}",
)
A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.
You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.
import os
from getpass import getpass
if os.environ.get("OPENAI_API_KEY") is None:
os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")
Next, define a function to format a prompt template and invoke an OpenAI model on an example.
from openai import AsyncOpenAI
from phoenix.experiments import Example
openai_client = AsyncOpenAI()
async def summarize_article_openai(example: Example, prompt_template: str, model: str) -> str:
formatted_prompt_template = prompt_template.format(article=example.input["article"])
response = await openai_client.chat.completions.create(
model=model,
messages=[
{"role": "assistant", "content": formatted_prompt_template},
],
)
assert response.choices
return response.choices[0].message.content
From this function, you can use functools.partial
to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example.
import textwrap
from functools import partial
template = """
Summarize the article in two to four sentences:
ARTICLE
=======
{article}
SUMMARY
=======
"""
gpt_4o = "gpt-4o-2024-05-13"
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
test_example = dataset.examples[0]
print(textwrap.fill(await task(test_example), width=100))
Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:
ROUGE-1 precision is the proportion of overlapping tokens (present in both reference and generated summaries) that are present in the generated summary (number of overlapping tokens / number of tokens in the generated summary)
ROUGE-1 recall is the proportion of overlapping tokens that are present in the reference summary (number of overlapping tokens / number of tokens in the reference summary)
ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.
Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 1 / 2 are considered excellent, and a model fine-tuned on this particular dataset achieved a rouge score of ~0.44.
Since we also care about conciseness, you'll also define an evaluator to count the number of tokens in each generated summary.
Note that you can use any third-party library you like while defining evaluators (in your case, rouge
and tiktoken
).
import tiktoken
from rouge import Rouge
# convenience functions
def _rouge_1(hypothesis: str, reference: str) -> Dict[str, Any]:
scores = Rouge().get_scores(hypothesis, reference)
return scores[0]["rouge-1"]
def _rouge_1_f1_score(hypothesis: str, reference: str) -> float:
return _rouge_1(hypothesis, reference)["f"]
def _rouge_1_precision(hypothesis: str, reference: str) -> float:
return _rouge_1(hypothesis, reference)["p"]
def _rouge_1_recall(hypothesis: str, reference: str) -> float:
return _rouge_1(hypothesis, reference)["r"]
# evaluators
def rouge_1_f1_score(output: str, expected: Dict[str, Any]) -> float:
return _rouge_1_f1_score(hypothesis=output, reference=expected["summary"])
def rouge_1_precision(output: str, expected: Dict[str, Any]) -> float:
return _rouge_1_precision(hypothesis=output, reference=expected["summary"])
def rouge_1_recall(output: str, expected: Dict[str, Any]) -> float:
return _rouge_1_recall(hypothesis=output, reference=expected["summary"])
def num_tokens(output: str) -> int:
encoding = tiktoken.encoding_for_model(gpt_4o)
return len(encoding.encode(output))
EVALUATORS = [rouge_1_f1_score, rouge_1_precision, rouge_1_recall, num_tokens]
Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.
from phoenix.experiments import run_experiment
experiment_results = run_experiment(
dataset,
task,
experiment_name="initial-template",
experiment_description="first experiment using a simple prompt template",
experiment_metadata={"vendor": "openai", "model": gpt_4o},
evaluators=EVALUATORS,
)
Our initial prompt template contained little guidance. It resulted in an ROUGE-1 F1-score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.
template = """
Summarize the article in two to four sentences. Be concise and include only the most important information.
ARTICLE
=======
{article}
SUMMARY
=======
"""
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
dataset,
task,
experiment_name="concise-template",
experiment_description="explicitly instuct the llm to be concise",
experiment_metadata={"vendor": "openai", "model": gpt_4o},
evaluators=EVALUATORS,
)
Inspecting the experiment results, you'll notice that the average num_tokens
has indeed increased, but the generated summaries are still far more verbose than the reference summaries.
Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.
# examples to include (not included in the uploaded dataset)
train_df = (
hf_ds["train"]
.to_pandas()
.sample(n=5, random_state=42)
.head()
.rename(columns={"highlights": "summary"})
)
example_template = """
ARTICLE
=======
{article}
SUMMARY
=======
{summary}
"""
examples = "\n".join(
[
example_template.format(article=row["article"], summary=row["summary"])
for _, row in train_df.iterrows()
]
)
template = """
Summarize the article in two to four sentences. Be concise and include only the most important information, as in the examples below.
EXAMPLES
========
{examples}
Now summarize the following article.
ARTICLE
=======
{article}
SUMMARY
=======
"""
template = template.format(
examples=examples,
article="{article}",
)
print(template)
Now run the experiment.
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
dataset,
task,
experiment_name="few-shot-template",
experiment_description="include examples",
experiment_metadata={"vendor": "openai", "model": gpt_4o},
evaluators=EVALUATORS,
)
By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.
⚠️ This section requires an Anthropic API key.
Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.
First, enter your Anthropic API key if it is not already present.
import os
from getpass import getpass
if os.environ.get("ANTHROPIC_API_KEY") is None:
os.environ["ANTHROPIC_API_KEY"] = getpass("🔑 Enter your Anthropic API key: ")
Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
async def summarize_article_anthropic(example: Example, prompt_template: str, model: str) -> str:
formatted_prompt_template = prompt_template.format(article=example.input["article"])
message = await client.messages.create(
model=model,
max_tokens=1024,
messages=[{"role": "user", "content": formatted_prompt_template}],
)
return message.content[0].text
claude_35_sonnet = "claude-3-5-sonnet-20240620"
task = partial(summarize_article_anthropic, prompt_template=template, model=claude_35_sonnet)
experiment_results = run_experiment(
dataset,
task,
experiment_name="anthropic-few-shot",
experiment_description="anthropic",
experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
evaluators=EVALUATORS,
)
If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:
Here is a concise 3-sentence summary of the article...
See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.
Congrats! In this tutorial, you have:
Created a Phoenix dataset
Defined an experimental task and custom evaluators
Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall
As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the OpenAI fine-tuning API to train a bespoke model for your needs.
ReAct (Reasoning + Acting) is a prompting technique that enables LLMs to think step-by-step before taking action. Unlike traditional prompting, where a model directly provides an answer, ReAct prompts guide the model to reason through a problem first, then decide which tools or actions are necessary to reach the best solution.
ReAct is ideal for situations that require multi-step problem-solving with external tools. It also improves transparency by clearly showing the reasoning behind each tool choice, making it easier to understand and refine the model's actions.
In this tutorial, you will:
Learn how to craft prompts, tools, and evaluators in Phoenix
Refine your prompts to understand the power of ReAct prompting
Leverage Phoenix and LLM as a Judge techniques to evaluate accuracy at each step, gaining insight into the model's thought process.
Learn how to apply ReAct prompting in real-world scenarios for improved task execution and problem-solving.
⚠️ You'll need an OpenAI Key for this tutorial.
Let’s get started! 🚀
!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.
import os
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
import nest_asyncio
import pandas as pd
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from openinference.instrumentation.openai import OpenAIInstrumentor
import phoenix as px
from phoenix.client import Client as PhoenixClient
from phoenix.client.types import PromptVersion
from phoenix.evals import (
TOOL_CALLING_PROMPT_RAILS_MAP,
OpenAIModel,
llm_classify,
)
from phoenix.experiments import run_experiment
from phoenix.otel import register
nest_asyncio.apply()
Instrument Application
tracer_provider = register(
project_name="ReAct-examples", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
This dataset contains 20 customer service questions that a customer might ask a store's chatbot. As we dive into ReAct prompting, we'll use these questions to guide the LLM in selecting the appropriate tools.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
After running this cell, the dataset should will be under the Datasets tab in Phoenix.
from datasets import load_dataset
ds = load_dataset("syeddula/customer_questions")["train"]
ds = ds.to_pandas()
ds.head()
import uuid
unique_id = uuid.uuid4()
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=ds,
input_keys=["Questions"],
dataset_name=f"customer-questions-{unique_id}",
)
Next, let's define the tools available for the LLM to use. We have five tools at our disposal, each serving a specific purpose: Product Comparison, Product Details, Discounts, Customer Support, and Track Package.
Depending on the customer's question, the LLM will determine the optimal sequence of tools to use.
tools = [
{
"type": "function",
"function": {
"name": "product_comparison",
"description": "Compare features of two products.",
"parameters": {
"type": "object",
"properties": {
"product_a_id": {
"type": "string",
"description": "The unique identifier of Product A.",
},
"product_b_id": {
"type": "string",
"description": "The unique identifier of Product B.",
},
},
"required": ["product_a_id", "product_b_id"],
},
},
},
{
"type": "function",
"function": {
"name": "product_details",
"description": "Get detailed features on one product.",
"parameters": {
"type": "object",
"properties": {
"product_id": {
"type": "string",
"description": "The unique identifier of the Product.",
}
},
"required": ["product_id"],
},
},
},
{
"type": "function",
"function": {
"name": "apply_discount_code",
"description": "Checks for discounts and promotions. Applies a discount code to an order.",
"parameters": {
"type": "object",
"properties": {
"order_id": {
"type": "integer",
"description": "The unique identifier of the order.",
},
"discount_code": {
"type": "string",
"description": "The discount code to apply.",
},
},
"required": ["order_id", "discount_code"],
},
},
},
{
"type": "function",
"function": {
"name": "customer_support",
"description": "Get contact information for customer support regarding an issue.",
"parameters": {
"type": "object",
"properties": {
"issue_type": {
"type": "string",
"description": "The type of issue (e.g., billing, technical support).",
}
},
"required": ["issue_type"],
},
},
},
{
"type": "function",
"function": {
"name": "track_package",
"description": "Track the status of a package based on the tracking number.",
"parameters": {
"type": "object",
"properties": {
"tracking_number": {
"type": "integer",
"description": "The tracking number of the package.",
}
},
"required": ["tracking_number"],
},
},
},
]
Let's start by defining a simple prompt that instructs the system to utilize the available tools to answer the questions. The choice of which tools to use, and how to apply them, is left to the model's discretion based on the context of each customer query.
params = CompletionCreateParamsBase(
model="gpt-4",
temperature=0.5,
tools=tools,
tool_choice="auto",
messages=[
{
"role": "system",
"content": """You are a helpful customer service agent.
Your task is to determine the best tools to use to answer a customer's question.
Output the tools and pick 3 tools at maximum.
""",
},
{"role": "user", "content": "{{questions}}"},
],
)
prompt_identifier = "customer-support"
prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Customer Support",
version=PromptVersion.from_openai(params),
)
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
Next, we will define the Tool Calling Prompt Template. In this step, we use LLM as a Judge to evaluate the output. LLM as a Judge is a technique where one LLM assesses the performance of another LLM.
This prompt is provided to the LLM-as-Judge model, which takes in both the user's query and the tools the system has selected. The model then uses reasoning to assess how effectively the chosen tools addressed the query, providing an explanation for its evaluation.
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would reasonably help answer the question.
The tool calls have been generated by a separate agent, chosen from the list of
tools provided below. Your job is to decide whether that agent's response was relevant to solving the customer's question.
[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called]: {tool_calls}
[END DATA]
Your response must be one of the following:
1. **"correct"** – The chosen tool(s) would sufficiently answer the question.
2. **"mostly_correct"** – The tool(s) are helpful, but a better selection could have been made (at most 1 missing or unnecessary tool).
3. **"incorrect"** – The tool(s) would not meaningfully help answer the question.
Explain why you made your choice.
[Tool Definitions]:
product_comparison: Compare features of two products.
product_details: Get detailed features on one product.
apply_discount_code: Applies a discount code to an order.
customer_support: Get contact information for customer support regarding an issue.
track_package: Track the status of a package based on the tracking number.
"""
In the following cells, we will define a task for the experiment.
Then, in the evaluate_response
function, we define our LLM as a Judge evaluator. Finally, we run our experiment.
def prompt_task(input):
client = OpenAI()
resp = client.chat.completions.create(
**prompt.format(variables={"questions": input["Questions"]})
)
return resp
def evaluate_response(input, output):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
template=TOOL_CALLING_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-3.5-turbo"),
rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
provide_explanation=True,
)
score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
return score
initial_experiment = run_experiment(
dataset,
task=prompt_task,
evaluators=[evaluate_response],
experiment_description="Customer Support Prompt",
experiment_name="initial-prompt",
experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
After running our experiment and evaluation, we can dive deeper into the results. By clicking into the experiment, we can explore the tools that the LLM selected for the specific input. Next, if we click on the trace for the evaluation, we can see the reasoning behind the score assigned by LLM as a Judge for the output.
Next, we iterate on our system prompt using ReAct Prompting techniques. We emphasize that the model should think through the problem step-by-step, break it down logically, and then determine which tools to use and in what order. The model is instructed to output the relevant tools along with their corresponding parameters.
This approach differs from our initial prompt because it encourages reasoning before action, guiding the model to select the best tools and parameters based on the specific context of the query, rather than simply using predefined actions.
params = CompletionCreateParamsBase(
model="gpt-4",
temperature=0.5,
tools=tools,
tool_choice="required",
messages=[
{
"role": "system",
"content": """
You are a helpful customer service agent. Carefully analyze the customer’s question to fully understand their request.
Step 1: Think step-by-step. Identify the key pieces of information needed to answer the question. Consider any dependencies between these pieces of information.
Step 2: Decide which tools to use. Choose up to 3 tools that will best retrieve the required information. If multiple tools are needed, determine the correct order to call them.
Step 3: Output the chosen tools and any relevant parameters.
""",
},
{"role": "user", "content": "{{questions}}"},
],
)
prompt_identifier = "customer-support"
prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Customer Support ReAct Prompt",
version=PromptVersion.from_openai(params),
)
In the Prompts tab, you will see the updated prompt. As you iterate, you can build a version history.
Just like above, we define our task, construct the evaluator, and run the experiment.
def prompt_task(input):
client = OpenAI()
resp = client.chat.completions.create(
**prompt.format(variables={"questions": input["Questions"]})
)
return resp
def evaluate_response(input, output):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
template=TOOL_CALLING_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-3.5-turbo"),
rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
provide_explanation=True,
)
score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
return score
initial_experiment = run_experiment(
dataset,
task=prompt_task,
evaluators=[evaluate_response],
experiment_description="Customer Support Prompt",
experiment_name="improved-prompt",
experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
With our updated ReAct prompt, we can observe that the LLM as a Judge Evaluator rated more outputs as correct. By clicking into the traces, we can gain insights into the reasons behind this improvement. By prompting our LLM to be more thoughtful and purposeful, we can see the reasoning and acting aspects of ReAct.
You can explore the evaluators outputs to better understand the improvements in detail.
Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.
To refine and test these prompts against other datasets, experiment with alternative techniques like Chain of Thought (CoT) prompting to assess how they complement or contrast with ReAct in your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using both the TypeScript and Python Clients.
From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️
LLMs excel at text generation, but their reasoning abilities depend on how we prompt them. Chain of Thought (CoT) prompting enhances logical reasoning by guiding the model to think step by step, improving accuracy in tasks like math, logic, and multi-step problem solving.
In this tutorial, you will:
Examine how different prompting techniques influence reasoning by evaluating model performance on a dataset.
Refine prompting strategies, progressing from basic approaches to structured reasoning.
Utilize Phoenix to assess accuracy at each stage and explore the model's thought process.
Learn how to apply CoT prompting effectively in real-world tasks.
⚠️ You'll need an OpenAI Key for this tutorial.
Let’s dive in! 🚀
!pip install -qqqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.
import os
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
This dataset includes math word problems, step-by-step explanations, and their corresponding answers. As we refine our prompt, we'll test it against the dataset to measure and track improvements in performance.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
import uuid
from datasets import load_dataset
import phoenix as px
from phoenix.client import Client as PhoenixClient
ds = load_dataset("syeddula/math_word_problems")["train"]
ds = ds.to_pandas()
ds.head()
unique_id = uuid.uuid4()
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=ds,
input_keys=["Word Problem"],
output_keys=["Answer"],
dataset_name=f"wordproblems-{unique_id}",
)
Zero-shot prompting is the simplest way to interact with a language model—it involves asking a question without providing any examples or reasoning steps. The model generates an answer based solely on its pre-trained knowledge.
This serves as our baseline for comparison. By evaluating its performance on our dataset, we can see how well the model solves math word problems without explicit guidance. In later sections, we’ll introduce structured reasoning techniques like Chain of Thought (CoT) to measure improvements in accuracy and answers.
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from phoenix.client.types import PromptVersion
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{
"role": "system",
"content": "You are an evaluator who outputs the answer to a math word problem. Only respond with the integer answer. Be sure not include words, explanations, symbols, labels, or units and round all decimals answers.",
},
{"role": "user", "content": "{{Problem}}"},
],
)
prompt_identifier = "wordproblems"
prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="A prompt for computing answers to word problems.",
version=PromptVersion.from_openai(params),
)
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
Next, we will define a task and evaluator for the experiment. Then, we run our experiment.
Because our dataset has ground truth labels, we can use a simple function to extract the answer and check if the calculated answer matches the expected output.
import nest_asyncio
from phoenix.experiments import run_experiment
nest_asyncio.apply()
def zero_shot_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**prompt.format(variables={"Problem": input["Word Problem"]})
)
return resp.choices[0].message.content.strip()
def evaluate_response(output, expected):
if not output.isdigit():
return False
return int(output) == int(expected["Answer"])
initial_experiment = run_experiment(
dataset,
task=zero_shot_prompt,
evaluators=[evaluate_response],
experiment_description="Zero-Shot Prompt",
experiment_name="zero-shot-prompt",
experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
We can review the results of the experiment in Phoenix. We achieved ~75% accuracy in this run. In the following sections, we will iterate on this prompt and see how our evaluation changes!
Note: Throughout this tutorial, you will encounter various evaluator outcomes. At times, you may notice a decline in performance compared to the initial experiment. However, this is not necessarily a flaw. Variations in results can arise due to factors such as the choice of LLM, inherent model behaviors, and randomness.
Zero-shot prompting provides a direct answer, but it often struggles with complex reasoning. Zero-Shot Chain of Thought (CoT) prompting improves this by explicitly instructing the model to think step by step before arriving at a final answer.
By adding a simple instruction like “Let’s think through this step by step,” we encourage the model to break down the problem logically. This structured reasoning can lead to more accurate answers, especially for multi-step math problems.
In this section, we'll compare Zero-Shot CoT against our baseline to evaluate its impact on performance. First, let's create the prompt.
zero_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem.
You must always think through the problem logically before providing an answer.
First, show some of your reasoning.
Then output the integer answer ONLY on a final new line. In this final answer, be sure not include words, commas, labels, or units and round all decimals answers.
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": zero_shot_COT_template},
{"role": "user", "content": "{{Problem}}"},
],
)
zero_shot_COT = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Zero Shot COT prompt",
version=PromptVersion.from_openai(params),
)
This updated prompt is now lives in Phoenix as a new prompt version.
Next, we run our task and evaluation by extracting the answer from the output of our LLM.
import re
def zero_shot_COT_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**zero_shot_COT.format(variables={"Problem": input["Word Problem"]})
)
response_text = resp.choices[0].message.content.strip()
lines = response_text.split("\n")
final_answer = lines[-1].strip()
final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
return {"full_response": response_text, "final_answer": final_answer}
def evaluate_response(output, expected):
final_answer = output["final_answer"]
if not final_answer.isdigit():
return False
return int(final_answer) == int(expected["Answer"])
initial_experiment = run_experiment(
dataset,
task=zero_shot_COT_prompt,
evaluators=[evaluate_response],
experiment_description="Zero-Shot COT Prompt",
experiment_name="zero-shot-cot-prompt",
experiment_metadata={"prompt": "prompt_id=" + zero_shot_COT.id},
)
By clicking into the experiment in Phoenix, you can take a look at the steps the model took the reach the answer. By telling the model to think through the problem and output reasoning, we see a performance improvement.
Even with Chain of Thought prompting, a single response may not always be reliable. Self-Consistency CoT enhances accuracy by generating multiple reasoning paths and selecting the most common answer. Instead of relying on one response, we sample multiple outputs and aggregate them, reducing errors caused by randomness or flawed reasoning steps.
This method improves robustness, especially for complex problems where initial reasoning steps might vary. In this section, we'll compare Self-Consistency CoT to our previous prompts to see how using multiple responses impacts overall performance.
Let's repeat the same process as above with a new prompt and evaluate the outcome.
consistency_COT_template = """
You are an evaluator who outputs the answer to a math word problem.
Follow these steps:
1. Solve the problem **multiple times independently**, thinking through the solution carefully each time.
2. Show some of your reasoning for each independent attempt.
3. Identify the integer answer that appears most frequently across your attempts.
4. On a **new line**, output only this majority answer as a plain integer with **no words, commas, labels, units, or special characters**.
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": consistency_COT_template},
{"role": "user", "content": "{{Problem}}"},
],
)
self_consistency_COT = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="self consistency COT prompt",
version=PromptVersion.from_openai(params),
)
def self_consistency_COT_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**self_consistency_COT.format(variables={"Problem": input["Word Problem"]})
)
response_text = resp.choices[0].message.content.strip()
lines = response_text.split("\n")
final_answer = lines[-1].strip()
final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
return {"full_response": response_text, "final_answer": final_answer}
def evaluate_response(output, expected):
final_answer = output["final_answer"]
if not final_answer.isdigit():
return False
return int(final_answer) == int(expected["Answer"])
initial_experiment = run_experiment(
dataset,
task=self_consistency_COT_prompt,
evaluators=[evaluate_response],
experiment_description="Self Consistency COT Prompt",
experiment_name="self-consistency-cot-prompt",
experiment_metadata={"prompt": "prompt_id=" + self_consistency_COT.id},
)
We've observed a significant improvement in performance! Since the prompt instructs the model to compute the answer multiple times independently, you may notice that the experiment takes slightly longer to run. You can click into the experiment explore to view the independent computations the model performed for each problem.
Few-shot CoT prompting enhances reasoning by providing worked examples before asking the model to solve a new problem. By demonstrating step-by-step solutions, the model learns to apply similar logical reasoning to unseen questions.
This method leverages in-context learning, allowing the model to generalize patterns from the examples.
In this final section, we’ll compare Few-Shot CoT against our previous prompts.
First, let's construct our prompt by sampling examples from a test dataset.
ds = load_dataset("syeddula/math_word_problems")["test"]
few_shot_examples = ds.to_pandas().sample(5)
few_shot_examples
We now will construct our final prompt, run the experiment, and view the results. Under the Prompts tab in Phoenix, you can track the version history of your prompt and see what random examples were chosen.
few_shot_COT_template = """
You are an evaluator who outputs the answer to a math word problem. You must always think through the problem logically before providing an answer. Show some of your reasoning.
Finally, output the integer answer ONLY on a final new line. In this final answer, be sure not include words, commas, labels, or units and round all decimals answers.
Here are some examples of word problems, step by step explanations, and solutions to guide your reasoning:
{examples}
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": few_shot_COT_template.format(examples=few_shot_examples)},
{"role": "user", "content": "{{Problem}}"},
],
)
few_shot_COT = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Few Shot COT prompt",
version=PromptVersion.from_openai(params),
)
def few_shot_COT_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**few_shot_COT.format(variables={"Problem": input["Word Problem"]})
)
response_text = resp.choices[0].message.content.strip()
lines = response_text.split("\n")
final_answer = lines[-1].strip()
final_answer = re.sub(r"^\*\*(\d+)\*\*$", r"\1", final_answer)
return {"full_response": response_text, "final_answer": final_answer}
def evaluate_response(output, expected):
final_answer = output["final_answer"]
if not final_answer.isdigit():
return False
return int(final_answer) == int(expected["Answer"])
import nest_asyncio
from phoenix.experiments import run_experiment
nest_asyncio.apply()
initial_experiment = run_experiment(
dataset,
task=few_shot_COT_prompt,
evaluators=[evaluate_response],
experiment_description="Few-Shot COT Prompt",
experiment_name="few-shot-cot-prompt",
experiment_metadata={"prompt": "prompt_id=" + few_shot_COT.id},
)
After running all of your experiments, you can compare the performance of different prompting techniques. Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.
You can review your prompt version history in the Prompts tab and explore the Playground to iterate further and run additional experiments.
To refine and test these prompts against other datasets, experiment with Chain of Thought (CoT) prompting to see its relevance to your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using the TypeScript and Python Clients.
From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub! ⭐️
This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:
Create an agent using the OpenAI agents SDK
Trace the agent activity
Create a dataset to benchmark performance
Run an experiment to evaluate agent performance using LLM as a judge
Learn how to evaluate traces in production
!pip install -q "arize-phoenix>=8.0.0" openinference-instrumentation-openai-agents openinference-instrumentation-openai --upgrade
!pip install -q openai nest_asyncio openai-agents
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.
import os
import nest_asyncio
nest_asyncio.apply()
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
from phoenix.otel import register
# Setup Tracing
tracer_provider = register(
project_name="openai-agents-cookbook",
endpoint="https://app.phoenix.arize.com/v1/traces",
auto_instrument=True,
)
Here we've setup a basic agent that can solve math problems. We have a function tool that can solve math equations, and an agent that can use this tool.
We'll use the Runner
class to run the agent and get the final output.
from agents import Runner, function_tool
@function_tool
def solve_equation(equation: str) -> str:
"""Use python to evaluate the math equation, instead of thinking about it yourself.
Args:
equation: string which to pass into eval() in python
"""
return str(eval(equation))
from agents import Agent
agent = Agent(
name="Math Solver",
instructions="You solve math problems by evaluating them with python and returning the result",
tools=[solve_equation],
)
result = await Runner.run(agent, "what is 15 + 28?")
# Run Result object
print(result)
# Get the final output
print(result.final_output)
# Get the entire list of messages recorded to generate the final output
print(result.to_input_list())
Now we have a basic agent, let's evaluate whether the agent responded correctly!
Agents can go awry for a variety of reasons.
Tool call accuracy - did our agent choose the right tool with the right arguments?
Tool call results - did the tool respond with the right results?
Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
We'll setup a simple evaluator that will check if the agent's response is correct, you can read about different types of agent evals here.
Let's setup our evaluation by defining our task function, our evaluator, and our dataset.
import asyncio
# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.
async def solve_math_problem(dataset_row: dict):
result = await Runner.run(agent, dataset_row.get("question"))
return {
"final_output": result.final_output,
"messages": result.to_input_list(),
}
dataset_row = {"question": "What is 15 + 28?"}
result = asyncio.run(solve_math_problem(dataset_row))
print(result)
Next, we create our evaluator.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
def correctness_eval(input, output):
# Template for evaluating math problem solutions
MATH_EVAL_TEMPLATE = """
You are evaluating whether a math problem was solved correctly.
[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {response}
[END DATA]
Assess if the answer to the math problem is correct. First work out the correct answer yourself,
then compare with the provided response. Consider that there may be different ways to express the same answer
(e.g., "43" vs "The answer is 43" or "5.0" vs "5").
Your answer must be a single word, either "correct" or "incorrect"
"""
# Run the evaluation
rails = ["correct", "incorrect"]
eval_df = llm_classify(
data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
template=MATH_EVAL_TEMPLATE,
model=OpenAIModel(model="gpt-4.1"),
rails=rails,
provide_explanation=True,
)
label = eval_df["label"][0]
score = 1 if label == "correct" else 0
return score
Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent.
MATH_GEN_TEMPLATE = """
You are an assistant that generates diverse math problems for testing a math solver agent.
The problems should include:
Basic Operations: Simple addition, subtraction, multiplication, division problems.
Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.
Exponents and Roots: Problems involving powers, square roots, and other nth roots.
Percentages: Problems involving calculating percentages of numbers or finding percentage changes.
Fractions: Problems with addition, subtraction, multiplication, or division of fractions.
Algebra: Simple algebraic expressions that can be evaluated with specific values.
Sequences: Finding sums, products, or averages of number sequences.
Word Problems: Converting word problems into mathematical equations.
Do not include any solutions in your generated problems.
Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.
Generate 25 diverse math problems. Ensure there are no duplicate problems.
"""
import nest_asyncio
nest_asyncio.apply()
pd.set_option("display.max_colwidth", 500)
# Initialize the model
model = OpenAIModel(model="gpt-4o", max_tokens=1300)
# Generate math problems
resp = model(MATH_GEN_TEMPLATE)
# Create DataFrame
split_response = resp.strip().split("\n")
math_problems_df = pd.DataFrame(split_response, columns=["question"])
print(math_problems_df.head())
During development, experimentation helps iterate quickly by revealing agent failures during evaluation. You can test against datasets to refine prompts, logic, and tool usage before deploying.
In this section, we run our agent against the dataset defined above and evaluate for correctness using LLM as Judge.
With our dataset of questions we generated above, we can use our experiment feature to track changes across models, prompts, parameters for our agent.
Let's create this dataset and upload it into the platform.
import uuid
import phoenix as px
unique_id = uuid.uuid4()
dataset_name = "math-questions-" + str(uuid.uuid4())[:5]
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=math_problems_df,
input_keys=["question"],
dataset_name=f"math-questions-{unique_id}",
)
print(dataset)
from phoenix.experiments import run_experiment
initial_experiment = run_experiment(
dataset,
task=solve_math_problem,
evaluators=[correctness_eval],
experiment_description="Solve Math Problems",
experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)
In production, evaluation provides real-time insights into how agents perform on user data.
This section simulates a live production setting, showing how you can collect traces, model outputs, and evaluation results in real time.
Another option is to pull traces from completed production runs and batch process evaluations on them. You can then log the results of those evaluations in Phoenix.
!pip install openinference-instrumentation
from opentelemetry.trace import StatusCode, format_span_id
from phoenix.trace import SpanEvaluations
After importing the necessary libraries, we set up a tracer object to enable span creation for tracing our task function.
tracer = tracer_provider.get_tracer(__name__)
Next, we update our correctness evaluator to return both a label and an explanation, enabling metadata to be captured during tracing.
We also revise the task function to include with
clauses that generate structured spans in Phoenix. These spans capture key details such as input values, output values, and the results of the evaluation.
# This is our modified correctness evaluator.
def correctness_eval(input, output):
# Template for evaluating math problem solutions
MATH_EVAL_TEMPLATE = """
You are evaluating whether a math problem was solved correctly.
[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {response}
[END DATA]
Assess if the answer to the math problem is correct. First work out the correct answer yourself,
then compare with the provided response. Consider that there may be different ways to express the same answer
(e.g., "43" vs "The answer is 43" or "5.0" vs "5").
Your answer must be a single word, either "correct" or "incorrect"
"""
# Run the evaluation
rails = ["correct", "incorrect"]
eval_df = llm_classify(
data=pd.DataFrame([{"question": input["question"], "response": output["final_output"]}]),
template=MATH_EVAL_TEMPLATE,
model=OpenAIModel(model="gpt-4.1"),
rails=rails,
provide_explanation=True,
)
return eval_df
# This is our modified task function.
async def solve_math_problem(dataset_row: dict):
with tracer.start_as_current_span(name="agent", openinference_span_kind="agent") as agent_span:
question = dataset_row.get("question")
agent_span.set_input(question)
agent_span.set_status(StatusCode.OK)
result = await Runner.run(agent, question)
agent_span.set_output(result.final_output)
task_result = {
"final_output": result.final_output,
"messages": result.to_input_list(),
}
# Evaluation span for correctness
with tracer.start_as_current_span(
"correctness-evaluator",
openinference_span_kind="evaluator",
) as eval_span:
evaluation_result = correctness_eval(dataset_row, task_result)
eval_span.set_attribute("eval.label", evaluation_result["label"][0])
eval_span.set_attribute("eval.explanation", evaluation_result["explanation"][0])
# Logging our evaluation
span_id = format_span_id(eval_span.get_span_context().span_id)
score = 1 if evaluation_result["label"][0] == "correct" else 0
eval_data = {
"span_id": span_id,
"label": evaluation_result["label"][0],
"score": score,
"explanation": evaluation_result["explanation"][0],
}
df = pd.DataFrame([eval_data])
px.Client().log_evaluations(
SpanEvaluations(
dataframe=df,
eval_name="correctness",
),
)
return task_result
dataset_row = {"question": "What is 15 + 28?"}
result = asyncio.run(solve_math_problem(dataset_row))
print(result)
Finally, we run an experiment to simulate traces in production.
from phoenix.experiments import run_experiment
initial_experiment = run_experiment(
dataset,
task=solve_math_problem,
experiment_description="Solve Math Problems",
experiment_name=f"solve-math-questions-{str(uuid.uuid4())[:5]}",
)
Building effective text-to-SQL systems requires rigorous evaluation and systematic experimentation. In this tutorial, we'll walk through the complete evaluation-driven development process, starting from scratch without pre-existing datasets of questions or expected responses.
We'll use a movie database containing recent titles, ratings, box office performance, and metadata to demonstrate how to build, evaluate, and systematically improve a text-to-SQL system using Phoenix's experimentation framework. Think of Phoenix as your scientific laboratory, meticulously recording every experiment to help you build better AI systems.
!pip install "arize-phoenix>=11.0.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio openinference-instrumentation-openai --quiet
Let's first start a phoenix server to act as our evaluation dashboard and experiment tracker. This will be our central hub for observing, measuring, and improving our text-to-SQL system.
Note: this step is not necessary if you already have a Phoenix server running.
import phoenix as px
px.launch_app().view()
Let's also setup tracing for OpenAI. Tracing is crucial for evaluation-driven development - it allows Phoenix to observe every step of our text-to-SQL pipeline, capturing inputs, outputs, and metrics like latency and cost that we'll use to systematically improve our system.
from phoenix.otel import register
tracer_provider = register(
endpoint="http://localhost:6006/v1/traces", auto_instrument=True, verbose=False
) # Instruments all OpenAI calls
tracer = tracer_provider.get_tracer(__name__)
Let's make sure we can run async code in the notebook.
import nest_asyncio
nest_asyncio.apply()
Lastly, let's make sure we have our OpenAI API key set up.
import os
from getpass import getpass
if not os.getenv("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")
We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data.
import duckdb
from datasets import load_dataset
data = load_dataset("wykonos/movies")["train"]
conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("movies", data.to_pandas())
records = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")
for record in records:
print(record)
Let's start by implementing a simple text2sql logic.
import os
import openai
client = openai.AsyncClient()
columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")
# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}
system_prompt = (
"You are a SQL expert, and you are given a single table named movies with the following columns:\n"
f'{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n'
"Write a SQL query corresponding to the user's request. Return just the query text, "
"with no formatting (backticks, markdown, etc.)."
)
@tracer.chain
async def generate_query(input):
response = await client.chat.completions.create(
model=TASK_MODEL,
temperature=0,
messages=[
{
"role": "system",
"content": system_prompt,
},
{
"role": "user",
"content": input,
},
],
)
return response.choices[0].message.content
query = await generate_query("what was the most popular movie?")
print(query)
Awesome, looks like the we are producing SQL! let's try running the query and see if we get the expected results.
@tracer.tool
def execute_query(query):
return conn.query(query).fetchdf().to_dict(orient="records")
execute_query(query)
Effective AI evaluation rests on three fundamental pillars:
Data: Curated examples that represent real-world use cases
Task: The actual function or workflow being evaluated
Evaluators: Quantitative measures of performance
Let's start by creating our data - a set of movie-related questions that we want our text-to-SQL system to handle correctly.
questions = [
"Which Brad Pitt movie received the highest rating?",
"What is the top grossing Marvel movie?",
"What foreign-language fantasy movie was the most popular?",
"what are the best sci-fi movies of 2017?",
"What anime topped the box office in the 2010s?",
"Recommend a romcom that stars Paul Rudd.",
]
Let's store the data above as a versioned dataset in phoenix.
import pandas as pd
ds = px.Client().upload_dataset(
dataset_name="movie-example-questions",
dataframe=pd.DataFrame([{"question": question} for question in questions]),
input_keys=["question"],
output_keys=[],
)
# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="movie-example-questions")
Next, we'll define the task. The task is to generate SQL queries from natural language questions.
@tracer.chain
async def text2sql(question):
query = await generate_query(question)
results = None
error = None
try:
results = execute_query(query)
except duckdb.Error as e:
error = str(e)
return {
"query": query,
"results": results,
"error": error,
}
Finally, we'll define the evaluation scores. We'll use the following simple functions to see if the generated SQL queries are correct. Note that has_results
is a good metric here because we know that all the questions we added to the dataset can be answered via SQL.
# Test if there are no sql execution errors
def no_error(output):
return 1.0 if output.get("error") is None else 0.0
# Test if the query has results
def has_results(output):
results = output.get("results")
has_results = results is not None and len(results) > 0
return 1.0 if has_results else 0.0
Now let's run the evaluation experiment.
import phoenix as px
from phoenix.experiments import run_experiment
# Define the task to run text2sql on the input question
def task(input):
return text2sql(input["question"])
experiment = run_experiment(
ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)
Great! Let's see how our baseline model performed on the movie questions. We can analyze both successful queries and any failures to understand where improvements are needed.
Now that we ran the initial evaluation, let's analyze what might be causing any failures.
From looking at the query where there are no results, genre-related queries might fail because the model doesn't know how genres are stored (e.g., "Sci-Fi" vs "Science Fiction")
These types of issues would probably be improved by showing a sample of the data to the model (few-shot examples) since the data will show the LLM what is queryable.
Let's try to improve the prompt with few-shot examples and see if we can get better results.
samples = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")
example_row = "\n".join(
f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
for column in columns
)
column_header = " | ".join(column["column_name"] for column in columns)
few_shot_examples = "\n".join(
" | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples
)
system_prompt = (
"You are a SQL expert, and you are given a single table named `movies` with the following columns:\n\n"
"Column | Type | Example\n"
"-------|------|--------\n"
f"{example_row}\n"
"\n"
"Examples:\n"
f"{column_header}\n"
f"{few_shot_examples}\n"
"\n"
"Write a DuckDB SQL query corresponding to the user's request. "
"Return just the query text, with no formatting (backticks, markdown, etc.)."
)
async def generate_query(input):
response = await client.chat.completions.create(
model=TASK_MODEL,
temperature=0,
messages=[
{
"role": "system",
"content": system_prompt,
},
{
"role": "user",
"content": input,
},
],
)
return response.choices[0].message.content
print(await generate_query("what are the best sci-fi movies in the 2000s?"))
Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.
experiment = run_experiment(
ds, task=task, evaluators=[has_results, no_error], experiment_metadata=CONFIG
)
Amazing. It looks like the LLM is generating a valid query for all questions. Let's try out using LLM as a judge to see how well it can assess the results.
import json
from openai import OpenAI
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import EvaluationResult
openai_client = OpenAI()
judge_instructions = """
You are a judge that determines if a given question can be answered with the provided SQL query and results.
Make sure to ensure that the SQL query maps to the question accurately.
Provide the label `correct` if the SQL query and results accurately answer the question.
Provide the label `invalid` if the SQL query does not map to the question or is not valid.
"""
@create_evaluator(name="qa_correctness", kind="llm")
def qa_correctness(input, output):
question = input.get("question")
query = output.get("query")
results = output.get("results")
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": judge_instructions},
{
"role": "user",
"content": f"Question: {question}\nSQL Query: {query}\nSQL Results: {results}",
},
],
tool_choice="required",
tools=[
{
"type": "function",
"function": {
"name": "qa_correctness",
"description": "Determine if the SQL query and results accurately answer the question.",
"parameters": {
"type": "object",
"properties": {
"explanation": {
"type": "string",
"description": "Explain why the label is correct or invalid.",
},
"label": {"type": "string", "enum": ["correct", "invalid"]},
},
},
},
}
],
)
if response.choices[0].message.tool_calls is None:
raise ValueError("No tool call found in response")
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
label = args["label"]
explanation = args["explanation"]
score = 1 if label == "correct" else 0
return EvaluationResult(score=score, label=label, explanation=explanation)
evaluate_experiment(experiment, evaluators=[qa_correctness])
The LLM judge's scoring closely matches our manual evaluation, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement.
The LLM judge also shows an advantage in nuanced understanding - for example, it correctly identifies that 'Anime' and 'Animation' are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation.
We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in Golden Dataset for regression testing as well.
Let's generate some training data by having the model describe existing SQL queries from our dataset
import json
from typing import List
from pydantic import BaseModel
class Question(BaseModel):
sql: str
question: str
class Questions(BaseModel):
questions: List[Question]
sample_rows = "\n".join(
f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
for column in columns
)
synthetic_data_prompt = f"""You are a SQL expert, and you are given a single table named movies with the following columns:
Column | Type | Example
-------|------|--------
{sample_rows}
Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as the
question that the query answers. Keep the questions bounded so that they are not too broad or too narrow."""
response = await client.chat.completions.create(
model="gpt-4o",
temperature=0,
messages=[
{
"role": "user",
"content": synthetic_data_prompt,
}
],
tools=[
{
"type": "function",
"function": {
"name": "generate_questions",
"description": "Generate SQL queries that would be interesting to ask about this table.",
"parameters": Questions.model_json_schema(),
},
}
],
tool_choice={"type": "function", "function": {"name": "generate_questions"}},
)
assert response.choices[0].message.tool_calls is not None
generated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[
"questions"
]
print("Generated N questions: ", len(generated_questions))
print("First question: ", generated_questions[0])
generated_dataset = []
for q in generated_questions:
try:
result = execute_query(q["sql"])
example = {
"input": q["question"],
"expected": {
"results": result or [],
"query": q["sql"],
},
"metadata": {
"category": "Generated",
},
}
print(example)
generated_dataset.append(example)
except duckdb.Error as e:
print(f"Query failed: {q['sql']}", e)
print("Skipping...")
generated_dataset[0]
Awesome, let's create a dataset with the new synthetic data.
synthetic_dataset = px.Client().upload_dataset(
dataset_name="movies-golden-synthetic",
inputs=[{"question": example["input"]} for example in generated_dataset],
outputs=[example["expected"] for example in generated_dataset],
);
exp = run_experiment(
synthetic_dataset, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)
exp.as_dataframe()
Great! We now have more data to work with. Here are some ways to improve it:
Review the generated data for issues
Refine the prompt
Show errors to the model
This gives us a process to keep improving our system.
In this tutorial, we built a text-to-SQL system for querying movie data. We started with basic examples and evaluators, then improved performance by adding few-shot examples as well as using an LLM judge for evaluation.
Key takeaways:
Start with simple evaluators to catch basic issues
Use few-shot examples to improve accuracy
Generate more training data using LLMs
Track progress with Phoenix's experiments
You can further improve this system by adding better evaluators or handling edge cases.
An LLM as a Judge refers to using an LLM as a tool for evaluating and scoring responses based on predefined criteria.
While LLMs are powerful tools for evaluation, their performance can be inconsistent. Factors like ambiguity in the prompt, biases in the model, or a lack of clear guidelines can lead to unreliable results. By fine-tuning your LLM as a Judveprompts, you can improve the model's consistency, fairness, and accuracy, ensuring it delivers more reliable evaluations.
In this tutorial, you will:
Generate an LLM as a Judge evaluation prompt and test it against a datset
Learn about various optimization techniques to improve the template, measuring accuracy at each step using Phoenix evaluations
Understand how to apply these techniques together for better evaluation across your specific use cases
!pip install -q "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.
import os
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
Phoenix offers many pre-built evaluation templates for LLM as a Judge, but often, you may need to build a custom evaluator for specific use cases.
In this tutorial, we will focus on creating an LLM as a Judge prompt designed to assess empathy and emotional intelligence in chatbot responses. This is especially useful for use cases like mental health chatbots or customer support interactions.
We will start by loading a dataset containing 30 chatbot responses, each with a score for empathy and emotional intelligence (out of 10). Throughout the tutorial, we’ll use our prompt to evaluate these responses and compare the output to the ground-truth labels. This will allow us to assess how well our prompt performs.
from datasets import load_dataset
ds = load_dataset("syeddula/empathy_scores")["test"]
ds = ds.to_pandas()
ds.head()
import uuid
import phoenix as px
unique_id = uuid.uuid4()
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=ds,
input_keys=["AI_Response", "EI_Empathy_Score"],
output_keys=["EI_Empathy_Score"],
dataset_name=f"empathy-{unique_id}",
)
Before iterating on our template, we need to establish a prompt. Running the cell below will generate an LLM as a Judge prompt specifically for evaluating empathy and emotional intelligence. When generating this template, we emphasize:
Picking evaluation criteria (e.g., empathy, emotional support, emotional intelligence).
Defining a clear scoring system (1-10 scale with defined descriptions).
Setting response formatting guidelines for clarity and consistency.
Including an explanation for why the LLM selects a given score.
from openai import OpenAI
client = OpenAI()
def generate_eval_template():
meta_prompt = """
You are an expert in AI evaluation and emotional intelligence assessment. Your task is to create a structured evaluation template for assessing the emotional intelligence and empathy of AI responses to user inputs.
### Task Overview:
Generate a detailed evaluation template that measures the AI’s ability to recognize user emotions, respond empathetically, and provide emotionally appropriate responses. The template should:
- Include 3 to 5 distinct evaluation criteria that assess different aspects of emotional intelligence.
- Define a scoring system on a scale of 1 to 10, ensuring a broad distribution of scores across different responses.
- Provide clear, tiered guidelines for assigning scores, distinguishing weak, average, and strong performance.
- Include a justification section requiring evaluators to explain the assigned score with specific examples.
- Ensure the scoring rubric considers complexity and edge cases, preventing generic or uniform scores.
### Format:
Return the evaluation template as plain text, structured with headings, criteria, and a detailed scoring rubric. The template should be easy to follow and apply to real-world datasets.
### Scoring Guidelines:
- The scoring system must be on a **scale of 1 to 10** and encourage a full range of scores.
- Differentiate between strong, average, and weak responses using specific, well-defined levels.
- Require evaluators to justify scores
Do not include any concluding remarks such as 'End of Template' or similar statements. The template should end naturally after the final section.
"""
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": meta_prompt}],
temperature=0.9, # High temperature for more creativity
)
return response.choices[0].message.content
except Exception as e:
return {"error": str(e)}
print("Generating new evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE = generate_eval_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE)
Instrument the application to send traces to Phoenix:
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register
tracer_provider = register(
project_name="LLM-as-a-Judge", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Now that we have our baseline prompt, we need to set up two key components:
Task: The LLM as a Judge evaluation, where the model scores chatbot responses based on empathy and emotional intelligence.
Evaluator: A function that compares the LLM as a Judge output to the ground-truth labels from our dataset
Finally, we run our experiment. With this setup, we can measure how well our prompt initially performs.
import pandas as pd
from phoenix.evals import (
OpenAIModel,
llm_classify,
)
def initial_prompt(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
def evaluate_response(input, output):
expected_score = input["EI_Empathy_Score"]
predicted_score = output
return abs(expected_score - predicted_score) <= 2
import nest_asyncio
from phoenix.experiments import run_experiment
nest_asyncio.apply()
initial_experiment = run_experiment(
dataset, task=initial_prompt, evaluators=[evaluate_response], experiment_name="initial_prompt"
)
If you find that your LLM as a Judge prompt has low accuracy, we can make adjustmenets to the prompt to improve that. In this section, we explore 2 techniques for this: few shot examples and keeping a human in the loop.
Few-shot examples help improve the accuracy of an LLM as a Judge prompt by providing clear reference points for evaluation. Instead of relying solely on general instructions, the model learns from labeled examples that demonstrate correct scoring and reasoning.
By including a mix of high, medium, and low-scoring responses, we help the model:
Understand nuanced criteria like empathy and emotional intelligence.
Reduce inconsistencies by aligning with real-world judgments.
few_shot_examples = """
---
To help guide your reasoning, compare the text to the following examples and scorings:
Example 1: "You should try to focus and get your work done. Everyone has tough days."
Score: 2
Explanation: No acknowledgment of the user’s emotions, dismissive and offers no real support
Example 2: "That sounds really stressful. Have you considered talking to a supervisor or taking breaks?"
Score: 5
Explanation: Acknowledges stress, but in a generic way. Provides advice, but not very personal. Could be warmer in tone.
Example 3: "I’m really sorry you’re feeling this way. It’s completely understandable to feel overwhelmed. You’re not alone in this. Have you had a chance to take a break or talk to someone who can support you?"
Score: 9
Explanation: Validates emotions, reassures the user, and offers support
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples
def llm_as_a_judge(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
experiment = run_experiment(
dataset,
task=llm_as_a_judge,
evaluators=[evaluate_response],
experiment_name="few_shot_examples",
)
Keeping a human in the loop improves the accuracy of an LLM as a Judge by providing oversight, validation, and corrections where needed. In Phoenix, we can do this with annotations. While LLMs can evaluate responses based on predefined criteria, human reviewers help:
Catch edge cases and biases that the model may overlook.
Refine scoring guidelines by identifying inconsistencies in LLM outputs.
Continuously improve the prompt by analyzing where the model struggles and adjusting instructions accordingly.
However, human review can be costly and time-intensive, making full-scale annotation impractical. Fortunately, even a small number of human-labeled examples can significantly enhance accuracy.
One common bias in LLM as a Judge evaluations is favoring certain writing styles over others. For example, the model might unintentionally rate formal, structured responses higher than casual or concise ones, even if both convey the same level of empathy or intelligence.
To reduce this bias, we focus on style-invariant evaluation, ensuring that the LLM judges responses based on content rather than phrasing or tone. This can be achieved by:
Providing diverse few-shot examples that include different writing styles.
Testing for bias by evaluating responses with varied phrasing and ensuring consistent scoring.
By making evaluations style-agnostic, we create a more robust scoring system that doesn’t unintentionally penalize certain tones.
style_invariant = """
----
To help guide your reasoning, below is an example of how different response styles and tones can achieve similar scores:
#### Scenario: Customer Support Handling a Late Order
User: "My order is late, and I needed it for an important event. This is really frustrating."
Response A (Formal): "I sincerely apologize for the delay..."
Response B (Casual): "Oh no, that’s really frustrating!..."
Response C (Direct): "Sorry about that. I’ll check..."
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + style_invariant
def llm_as_a_judge(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
experiment = run_experiment(
dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="style_invariant"
)
Longer prompts increase computation costs and response times, making evaluations slower and more expensive. To optimize efficiency, we focus on condensing the prompt while preserving clarity and effectiveness. This is done by:
Removing redundant instructions and simplifying wording.
Using bullet points or structured formats for concise guidance.
Eliminating unnecessary explanations while keeping critical evaluation criteria intact.
A well-optimized prompt reduces token count, leading to faster, more cost-effective evaluations without sacrificing accuracy or reliability.
def generate_condensed_template():
meta_prompt = """
You are an expert in prompt engineering and LLM evaluation. Your task is to optimize a given LLM-as-a-judge prompt by reducing its word count significantly while maintaining all essential information, including evaluation criteria, scoring system, and purpose.
Requirements:
Preserve all key details such as metrics, scoring guidelines, and judgment criteria.
Eliminate redundant phrasing and unnecessary explanations.
Ensure clarity and conciseness without losing meaning.
Maintain the prompt’s effectiveness for consistent evaluations.
Output Format:
Return only the optimized prompt as plain text, with no explanations or commentary.
"""
try:
response = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "user",
"content": "Provided LLM-as-a-judge prompt"
+ EMPATHY_EVALUATION_PROMPT_TEMPLATE,
},
{"role": "user", "content": meta_prompt},
],
temperature=0.9, # High temperature for more creativity
)
return response.choices[0].message.content
except Exception as e:
return {"error": str(e)}
print("Generating condensed evaluation template...")
EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED = generate_condensed_template()
print("Template generated successfully!")
print(EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED)
def llm_as_a_judge(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
experiment = run_experiment(
dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="condensed_prompt"
)
Self-refinement allows a Judge to improve its own evaluations by critically analyzing and adjusting its initial judgments. Instead of providing a static score, the model engages in an iterative process:
Generate an initial score based on the evaluation criteria.
Reflect on its reasoning, checking for inconsistencies or biases.
Refine the score if needed, ensuring alignment with the evaluation guidelines.
By incorporating this style of reasoning, the model can justify its decisions and self-correct errors.
refinement_text = """
---
After you have done the evaluation, follow these two steps:
1. Self-Critique
Review your initial score:
- Was it too harsh or lenient?
- Did it consider the full context?
- Would others agree with your score?
Explain any inconsistencies briefly.
2. Final Refinement
Based on your critique, adjust your score if necessary.
- Only output a number (1-10)
"""
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + refinement_text
def llm_as_a_judge(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
experiment = run_experiment(
dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="self_refinement"
)
To maximize the accuracy and fairness of our Judge, we will combine multiple optimization techniques. In this example, we will incorporate few-shot examples and style-invariant evaluation to ensure the model focuses on content rather than phrasing or tone.
By applying these techniques together, we aim to create a more reliable evaluation framework.
EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = (
EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples + style_invariant
)
def llm_as_a_judge(input):
response_classifications = llm_classify(
dataframe=pd.DataFrame([{"AI_Response": input["AI_Response"]}]),
template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,
model=OpenAIModel(model="gpt-4"),
rails=list(map(str, range(1, 11))),
provide_explanation=True,
)
score = response_classifications.iloc[0]["label"]
return int(score)
experiment = run_experiment(
dataset, task=llm_as_a_judge, evaluators=[evaluate_response], experiment_name="combined"
)
Techniques like few-shot examples, self-refinement, style-invariant evaluation, and prompt condensation each offer unique benefits, but their effectiveness will vary depending on the task.
By systematically testing and combining these approaches, you can refine your evaluation framework.
This tutorial will use Phoenix to compare the performance of different prompt optimization techniques.
You'll start by creating an experiment in Phoenix that can house the results of each of your resulting prompts. Next you'll use a series of prompt optimization techniques to improve the performance of a jailbreak classification task. Each technique will be applied to the same base prompt, and the results will be compared using Phoenix.
The techniques you'll use are:
Few Shot Examples: Adding a few examples to the prompt to help the model understand the task.
Meta Prompting: Prompting a model to generate a better prompt based on previous inputs, outputs, and expected outputs.
Prompt Gradients: Using the gradient of the prompt to optimize individual components of the prompt using embeddings.
DSPy Prompt Tuning: Using DSPy, an automated prompt tuning library, to optimize the prompt.
⚠️ This tutorial requires and OpenAI API key.
Let's get started!
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
Since we'll be running a series of experiments, we'll need a dataset of test cases that we can run each time. This dataset will be used to test the performance of each prompt optimization technique.
Next, you can define a base template for the prompt. We'll also save this template to Phoenix, so it can be tracked, versioned, and reused across experiments.
You should now see that prompt in Phoenix:
Next you'll need a task and evaluator for the experiment. A task is a function that will be run across each example in the dataset. The task is also the piece of your code that you'll change between each run of the experiment. To start off, the task is simply a call to GPT 3.5 Turbo with a basic prompt.
You'll also need an evaluator that will be used to test the performance of the task. The evaluator will be run across each example in the dataset after the task has been run. Here, because you have ground truth labels, you can use a simple function to check if the output of the task matches the expected output.
You can also instrument your code to send all models calls to Phoenix. This isn't necessary for the experiment to run, but it does mean all your experiment task runs will be tracked in Phoenix. The overall experiment score and evaluator runs will be tracked regardless of whether you instrument your code or not.
Now you can run the initial experiment. This will be the base prompt that you'll be optimizing.
You should now see the initial experiment results in Phoenix:
One common prompt optimization technique is to use few shot examples to guide the model's behavior.
Here you can add few shot examples to the prompt to help improve performance. Conviently, the dataset you uploaded in the last step contains a test set that you can use for this purpose.
Define a new prompt that includes the few shot examples. Prompts in Phoenix are automatically versioned, so saving the prompt with the same name will create a new version that can be used.
You'll notice you now have a new version of the prompt in Phoenix:
Define a new task with your new prompt:
Now you can run another experiment with the new prompt. The dataset of test cases and the evaluator will be the same as the previous experiment.
Meta prompting involves prompting a model to generate a better prompt, based on previous inputs, outputs, and expected outputs.
The experiment from round 1 serves as a great starting point for this technique, since it has each of those components.
Now construct a new prompt that will be used to generate a new prompt.
Now save that as a prompt in Phoenix:
Redefine the task, using the new prompt.
Prompt gradient optimization is a technique that uses the gradient of the prompt to optimize individual components of the prompt using embeddings. It involves:
Converting the prompt into an embedding.
Comparing the outputs of successful and failed prompts to find the gradient direction.
Moving in the gradient direction to optimize the prompt.
Here you'll define a function to get embeddings for prompts, and then use that function to calculate the gradient direction between successful and failed prompts.
Redefine the task, using the new prompt.
Finally, you can use an optimization library to optimize the prompt, like DSPy. supports each of the techniques you've used so far, and more.
DSPy makes a series of calls to optimize the prompt. It can be useful to see these calls in action. To do this, you can instrument the DSPy library using the OpenInference SDK, which will send all calls to Phoenix. This is optional, but it can be useful to have.
Now you'll setup the DSPy language model and define a prompt classification task.
Your classifier can now be used to make predictions as you would a normal LLM. It will expect a prompt
input and will output a label
prediction.
However, DSPy really shines when it comes to optimizing prompts. By defining a metric to measure successful runs, along with a training set of examples, you can use one of many different optimizers built into the library.
In this case, you'll use the MIPROv2
optimizer to find the best prompt for your task.
DSPy takes care of our prompts in this case, however you could still save the resulting prompt value in Phoenix:
Redefine the task, using the new prompt.
In the last example, you used GPT-3.5 Turbo to both run your pipeline, and optimize the prompt. However, you can also use a different model to optimize the prompt, and a different model to run your pipeline.
It can be useful to use a more powerful model for your optimization step, and a cheaper or faster model for your pipeline.
Here you'll use GPT-4o to optimize the prompt, and keep GPT-3.5 Turbo as your pipeline model.
Redefine the task, using the new prompt.
And just like that, you've run a series of prompt optimization techniques to improve the performance of a jailbreak classification task, and compared the results using Phoenix.
You should have a set of experiments that looks like this:
From here, you can check out more , and if you haven't already, ⭐️
!pip install -q "arize-phoenix>=8.0.0" datasets
import os
from getpass import getpass
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
from datasets import load_dataset
ds = load_dataset("jackhhao/jailbreak-classification")["train"]
ds = ds.to_pandas().sample(50)
ds.head()
import uuid
import phoenix as px
from phoenix.client import Client as PhoenixClient
unique_id = uuid.uuid4()
# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
dataframe=ds,
input_keys=["prompt"],
output_keys=["type"],
dataset_name=f"jailbreak-classification-{unique_id}",
)
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from phoenix.client.types import PromptVersion
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{
"role": "system",
"content": "You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output 'benign' or 'jailbreak', no other words.",
},
{"role": "user", "content": "{{prompt}}"},
],
)
prompt_identifier = "jailbreak-classification"
prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="A prompt for classifying whether a given prompt is a jailbreak risk.",
version=PromptVersion.from_openai(params),
)
def test_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(**prompt.format(variables={"prompt": input["prompt"]}))
return resp.choices[0].message.content.strip()
def evaluate_response(output, expected):
return output.lower() == expected["type"].lower()
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register
tracer_provider = register(project_name="prompt-optimization")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
import nest_asyncio
from phoenix.experiments import run_experiment
nest_asyncio.apply()
initial_experiment = run_experiment(
dataset,
task=test_prompt,
evaluators=[evaluate_response],
experiment_description="Initial base prompt",
experiment_name="initial-prompt",
experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)
from datasets import load_dataset
ds_test = load_dataset("jackhhao/jailbreak-classification")[
"test"
] # this time, load in the test set instead of the training set
few_shot_examples = ds_test.to_pandas().sample(10)
few_shot_template = """
You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output "benign" or "jailbreak", no other words.
Here are some examples of prompts and responses:
{examples}
"""
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": few_shot_template.format(examples=few_shot_examples)},
{"role": "user", "content": "{{prompt}}"},
],
)
few_shot_prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Few shot prompt",
version=PromptVersion.from_openai(params),
)
def test_prompt(input):
client = OpenAI()
prompt_vars = {"prompt": input["prompt"]}
resp = client.chat.completions.create(**few_shot_prompt.format(variables=prompt_vars))
return resp.choices[0].message.content.strip()
few_shot_experiment = run_experiment(
dataset,
task=test_prompt,
evaluators=[evaluate_response],
experiment_description="Prompt Optimization Technique #1: Few Shot Examples",
experiment_name="few-shot-examples",
experiment_metadata={"prompt": "prompt_id=" + few_shot_prompt.id},
)
# Access the experiment results from the first round as a dataframe
ground_truth_df = initial_experiment.as_dataframe()
# Sample 10 examples to use as meta prompting examples
ground_truth_df = ground_truth_df[:10]
# Create a new column with the examples in a single string
ground_truth_df["example"] = ground_truth_df.apply(
lambda row: f"Input: {row['input']}\nOutput: {row['output']}\nExpected Output: {row['expected']}",
axis=1,
)
ground_truth_df.head()
meta_prompt = """
You are an expert prompt engineer. You are given a prompt, and a list of examples.
Your job is to generate a new prompt that will improve the performance of the model.
Here are the examples:
{examples}
Here is the original prompt:
{prompt}
Here is the new prompt:
"""
original_base_prompt = (
prompt.format(variables={"prompt": "example prompt"}).get("messages")[0].get("content")
)
client = OpenAI()
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "user",
"content": meta_prompt.format(
prompt=original_base_prompt, examples=ground_truth_df["example"].to_string()
),
}
],
)
new_prompt = response.choices[0].message.content.strip()
new_prompt
if r"\{examples\}" in new_prompt:
new_prompt = new_prompt.format(examples=few_shot_examples)
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{"role": "system", "content": new_prompt},
{"role": "user", "content": "{{prompt}}"},
],
)
meta_prompt_result = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Meta prompt result",
version=PromptVersion.from_openai(params),
)
def test_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**meta_prompt_result.format(variables={"prompt": input["prompt"]})
)
return resp.choices[0].message.content.strip()
meta_prompting_experiment = run_experiment(
dataset,
task=test_prompt,
evaluators=[evaluate_response],
experiment_description="Prompt Optimization Technique #2: Meta Prompting",
experiment_name="meta-prompting",
experiment_metadata={"prompt": "prompt_id=" + meta_prompt_result.id},
)
import numpy as np
# First we'll define a function to get embeddings for prompts
def get_embedding(text):
client = OpenAI()
response = client.embeddings.create(model="text-embedding-ada-002", input=text)
return response.data[0].embedding
# Function to calculate gradient direction between successful and failed prompts
def calculate_prompt_gradient(successful_prompts, failed_prompts):
# Get embeddings for successful and failed prompts
successful_embeddings = [get_embedding(p) for p in successful_prompts]
failed_embeddings = [get_embedding(p) for p in failed_prompts]
# Calculate average embeddings
avg_successful = np.mean(successful_embeddings, axis=0)
avg_failed = np.mean(failed_embeddings, axis=0)
# Calculate gradient direction
gradient = avg_successful - avg_failed
return gradient / np.linalg.norm(gradient)
# Get successful and failed examples from our dataset
successful_examples = (
ground_truth_df[ground_truth_df["output"] == ground_truth_df["expected"].get("type")]["input"]
.apply(lambda x: x["prompt"])
.tolist()
)
failed_examples = (
ground_truth_df[ground_truth_df["output"] != ground_truth_df["expected"].get("type")]["input"]
.apply(lambda x: x["prompt"])
.tolist()
)
# Calculate the gradient direction
gradient = calculate_prompt_gradient(successful_examples[:5], failed_examples[:5])
# Function to optimize a prompt using the gradient
def optimize_prompt(base_prompt, gradient, step_size=0.1):
# Get base embedding
base_embedding = get_embedding(base_prompt)
# Move in gradient direction
optimized_embedding = base_embedding + step_size * gradient
# Use GPT to convert the optimized embedding back to text
client = OpenAI()
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "system",
"content": "You are helping to optimize prompts. Given the original prompt and its embedding, generate a new version that maintains the core meaning but moves in the direction of the optimized embedding.",
},
{
"role": "user",
"content": f"Original prompt: {base_prompt}\nOptimized embedding direction: {optimized_embedding[:10]}...\nPlease generate an improved version that moves in this embedding direction.",
},
],
)
return response.choices[0].message.content.strip()
# Test the gradient-based optimization
gradient_prompt = optimize_prompt(original_base_prompt, gradient)
gradient_prompt
if r"\{examples\}" in gradient_prompt:
gradient_prompt = gradient_prompt.format(examples=few_shot_examples)
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{
"role": "system",
"content": gradient_prompt,
}, # if your meta prompt includes few shot examples, make sure to include them here
{"role": "user", "content": "{{prompt}}"},
],
)
gradient_prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="Gradient prompt result",
version=PromptVersion.from_openai(params),
)
def test_gradient_prompt(input):
client = OpenAI()
resp = client.chat.completions.create(
**gradient_prompt.format(variables={"prompt": input["prompt"]})
)
return resp.choices[0].message.content.strip()
gradient_experiment = run_experiment(
dataset,
task=test_gradient_prompt,
evaluators=[evaluate_response],
experiment_description="Prompt Optimization Technique #3: Prompt Gradients",
experiment_name="gradient-optimization",
experiment_metadata={"prompt": "prompt_id=" + gradient_prompt.id},
)
!pip install -q dspy openinference-instrumentation-dspy
from openinference.instrumentation.dspy import DSPyInstrumentor
DSPyInstrumentor().instrument(tracer_provider=tracer_provider)
# Import DSPy and set up the language model
import dspy
# Configure DSPy to use OpenAI
turbo = dspy.LM(model="gpt-3.5-turbo")
dspy.settings.configure(lm=turbo)
# Define the prompt classification task
class PromptClassifier(dspy.Signature):
"""Classify if a prompt is benign or jailbreak."""
prompt = dspy.InputField()
label = dspy.OutputField(desc="either 'benign' or 'jailbreak'")
# Create the basic classifier
classifier = dspy.Predict(PromptClassifier)
classifier(prompt=ds.iloc[0].prompt)
def validate_classification(example, prediction, trace=None):
return example["label"] == prediction["label"]
# Prepare training data from previous examples
train_data = []
for _, row in ground_truth_df.iterrows():
example = dspy.Example(
prompt=row["input"]["prompt"], label=row["expected"]["type"]
).with_inputs("prompt")
train_data.append(example)
tp = dspy.MIPROv2(metric=validate_classification, auto="light")
optimized_classifier = tp.compile(classifier, trainset=train_data)
params = CompletionCreateParamsBase(
model="gpt-3.5-turbo",
temperature=0,
messages=[
{
"role": "system",
"content": optimized_classifier.signature.instructions,
}, # if your meta prompt includes few shot examples, make sure to include them here
{"role": "user", "content": "{{prompt}}"},
],
)
dspy_prompt = PhoenixClient().prompts.create(
name=prompt_identifier,
prompt_description="DSPy prompt result",
version=PromptVersion.from_openai(params),
)
# Create evaluation function using optimized classifier
def test_dspy_prompt(input):
result = optimized_classifier(prompt=input["prompt"])
return result.label
# Run experiment with DSPy-optimized classifier
dspy_experiment = run_experiment(
dataset,
task=test_dspy_prompt,
evaluators=[evaluate_response],
experiment_description="Prompt Optimization Technique #4: DSPy Prompt Tuning",
experiment_name="dspy-optimization",
experiment_metadata={"prompt": "prompt_id=" + dspy_prompt.id},
)
prompt_gen_lm = dspy.LM("gpt-4o")
tp = dspy.MIPROv2(
metric=validate_classification, auto="light", prompt_model=prompt_gen_lm, task_model=turbo
)
optimized_classifier_using_gpt_4o = tp.compile(classifier, trainset=train_data)
# Create evaluation function using optimized classifier
def test_dspy_prompt(input):
result = optimized_classifier_using_gpt_4o(prompt=input["prompt"])
return result.label
# Run experiment with DSPy-optimized classifier
dspy_experiment_using_gpt_4o = run_experiment(
dataset,
task=test_dspy_prompt,
evaluators=[evaluate_response],
experiment_description="Prompt Optimization Technique #5: DSPy Prompt Tuning with GPT-4o",
experiment_name="dspy-optimization-gpt-4o",
experiment_metadata={"prompt": "prompt_id=" + dspy_prompt.id},
)
Building a RAG pipeline and evaluating it with Phoenix Evals.
In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.
It has the the following sections:
Understanding Retrieval Augmented Generation (RAG).
Building RAG (with the help of a framework such as LlamaIndex).
Evaluating RAG with Phoenix Evals.
LLMs are trained on vast amounts of data, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.
In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.
RAG is a critical component for building applications such a chatbots or agents and you will want to know RAG techniques on how to get data into your application.
There are five key stages within RAG, which will in turn be a part of any larger RAG application.
Loading: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
Querying: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
Now that we have understood the stages of RAG, let's build a pipeline. We will use for RAG and for evaluation.
During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.
For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation.
Let's use an to build our RAG pipeline.
Build a QueryEngine and start querying.
Check the response that you get from the query.
By default LlamaIndex retrieves two similar nodes/ chunks. You can modify that in vector_index.as_query_engine(similarity_top_k=k)
.
Let's check the text in each of these retrieved nodes.
Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.
We can access the traces by directly pulling the spans from the phoenix session.
Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.
We have built a RAG pipeline and also have instrumented it using Phoenix Tracing. We now need to evaluate it's performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.
Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.
While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.
In a RAG system, evaluation focuses on two critical aspects:
Retrieval Evaluation: To assess the accuracy and relevance of the documents that were retrieved
Response Evaluation: Measure the appropriateness of the response generated by the system when the context was provided.
For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.
For this tutorial, let's use Phoenix's llm_generate
to help us create the question-context pairs.
First, let's create a dataframe of all the document chunks that we have indexed.
Now that we have the document chunks, let's prompt an LLM to generate us 3 questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.
The LLM has generated three questions per chunk. Let's take a quick look.
We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether or not that the correct context is retrieved.
Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the ).
Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regards to the query. Note, we've turned on explanations
which prompts the LLM to explain it's reasoning. This can be useful for debugging and for figuring out potential corrective actions.
We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.
Let's compute Normalized Discounted Cumulative Gain at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure effectiveness of search engine algorithms and related applications.
Let's also compute precision at 2 for all our retrieval steps.
Lastly, let's compute whether or not a correct document was retrieved at all for each query (e.g. a hit)
Let's now view the results in a combined dataframe.
Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.
As we can see from the above numbers, our RAG system is not perfect, there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results but non-relevant information is also included in the context. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.
We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.
The retrieval evaluations demonstrates that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.
Now that we have a dataset of the question, context, and response (input, reference, and output), we now can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the .
Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.
Our QA Correctness score of 0.91
and a Hallucinations score 0.05
signifies that the generated answers are correct ~91% of the time and that the responses contain hallucinations 5% of the time - there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.
Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.
We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response.
We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and generated responses within the pipelines.
Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the documentation.
!pip install -qq "arize-phoenix[experimental,llama-index]>=2.0"
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio
nest_asyncio.apply()
import os
from getpass import getpass
import pandas as pd
import phoenix as px
from llama_index import SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
px.launch_app()
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# Define an LLM
llm = OpenAI(model="gpt-4")
# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What did the author do growing up?")
response_vector.response
'The author wrote short stories and worked on programming, specifically on an IBM 1401 computer in 9th grade.'
# First retrieved node
response_vector.source_nodes[0].get_text()
'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.\n\nI was puzzled by the 1401. I couldn\'t figure out what to do with it. And in retrospect there\'s not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn\'t have any data stored on punched cards. The only other option was to do things that didn\'t rely on any input, like calculate approximations of pi, but I didn\'t know enough math to do anything interesting of that type. So I\'m not surprised I can\'t remember any programs I wrote, because they can\'t have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn\'t. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager\'s expression made clear.\n\nWith microcomputers, everything changed.'
# Second retrieved node
response_vector.source_nodes[1].get_text()
"It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n[2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione). It's the everyday words that differ. So if you string together a lot of abstract concepts with a few simple verbs, you can make a little Italian go a long way.\n\n[3] I lived at Piazza San Felice 4, so my walk to the Accademia went straight down the spine of old Florence: past the Pitti, across the bridge, past Orsanmichele, between the Duomo and the Baptistery, and then up Via Ricasoli to Piazza San Marco."
print("phoenix URL", px.active_session().url)
spans_df = px.active_session().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()
6aba9eee-91c9-4ee2-81e9-1bdae2eb435d
llm
LLM
NaN
NaN
cc9feb6a-30ba-4f32-af8d-8c62dd1b1b23
synthesize
CHAIN
What did the author do growing up?
NaN
8202dbe5-d17e-4939-abd8-153cad08bdca
embedding
EMBEDDING
NaN
NaN
aeadad73-485f-400b-bd9d-842abfaa460b
retrieve
RETRIEVER
What did the author do growing up?
[{'document.content': 'What I Worked On
Febru...
9e25c528-5e2f-4719-899a-8248bab290ec
query
CHAIN
What did the author do growing up?
NaN
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()
aeadad73-485f-400b-bd9d-842abfaa460b
What did the author do growing up?
[{'document.content': 'What I Worked On
Febru...
# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df.head()
0
What I Worked On\n\nFebruary 2021\n\nBefore co...
1
I was puzzled by the 1401. I couldn't figure o...
2
I remember vividly how impressed and envious I...
3
I couldn't have put this into words when I was...
4
This was more like it; this was what I had exp...
generate_questions_template = """\
Context information is below.
---------------------
{text}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
You are a Teacher/ Professor. Your task is to setup \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
Output the questions in JSON format with the keys question_1, question_2, question_3.
"""
import json
from phoenix.evals import OpenAIModel, llm_generate
def output_parser(response: str, index: int):
try:
return json.loads(response)
except json.JSONDecodeError as e:
return {"__error__": str(e)}
questions_df = llm_generate(
dataframe=document_chunks_df,
template=generate_questions_template,
model=OpenAIModel(
model_name="gpt-3.5-turbo",
),
output_parser=output_parser,
concurrency=20,
)
questions_df.head()
0
What were the two main things the author worke...
What was the language the author used to write...
What was the author's clearest memory regardin...
1
What were the limitations of the 1401 computer...
How did microcomputers change the author's exp...
Why did the author's father buy a TRS-80 compu...
2
What was the author's first experience with co...
Why did the author decide to switch from study...
What were the two things that influenced the a...
3
What were the two things that inspired the aut...
What programming language did the author learn...
What was the author's undergraduate thesis about?
4
What was the author's undergraduate thesis about?
Which three grad schools did the author apply to?
What realization did the author have during th...
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
questions_with_document_chunk_df["question"].notnull()
]
questions_with_document_chunk_df.head(10)
0
What I Worked On\n\nFebruary 2021\n\nBefore co...
What were the two main things the author worke...
1
I was puzzled by the 1401. I couldn't figure o...
What were the limitations of the 1401 computer...
2
I remember vividly how impressed and envious I...
What was the author's first experience with co...
3
I couldn't have put this into words when I was...
What were the two things that inspired the aut...
4
This was more like it; this was what I had exp...
What was the author's undergraduate thesis about?
5
Only Harvard accepted me, so that was where I ...
What realization did the author have during th...
6
So I decided to focus on Lisp. In fact, I deci...
What motivated the author to write a book abou...
7
Anyone who wanted one to play around with coul...
What realization did the author have while vis...
8
I knew intellectually that people made art — t...
What was the author's initial perception of pe...
9
Then one day in April 1990 a crack appeared in...
What was the author's initial plan for their d...
# First things first, let's reset phoenix
px.close_app()
px.launch_app()
🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix
<phoenix.session.session.ThreadSession at 0x2c6c785b0>
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
question = row["question"]
response_vector = query_engine.query(question)
print(f"Question: {question}\nAnswer: {response_vector.response}\n")
from phoenix.session.evaluation import get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.active_session())
retrieved_documents_df
context.span_id
document_position
b375be95-8e5e-4817-a29f-e18f7aaa3e98
0
20e0f915-e089-4e8e-8314-b68ffdffd7d1
How does leaving YC affect the author's relati...
On one of them I realized I was ready to hand ...
0.820411
1
20e0f915-e089-4e8e-8314-b68ffdffd7d1
How does leaving YC affect the author's relati...
That was what it took for Rtm to offer unsolic...
0.815969
e4e68b51-dbc9-4154-85a4-5cc69382050d
0
4ad14fd2-0950-4b3f-9613-e1be5e51b5a4
Why did YC become a fund for a couple of years...
For example, one thing Julian had done for us ...
0.860981
1
4ad14fd2-0950-4b3f-9613-e1be5e51b5a4
Why did YC become a fund for a couple of years...
They were an impressive group. That first batc...
0.849695
27ba6b6f-828b-4732-bfcc-3262775cd71f
0
d62fb8e8-4247-40ac-8808-818861bfb059
Why did the author choose the name 'Y Combinat...
Screw the VCs who were taking so long to make ...
0.868981
...
...
...
...
...
...
353f152c-44ce-4f3e-a323-0caa90f4c078
1
6b7bebf6-bed3-45fd-828a-0730d8f358ba
What was the author's first experience with co...
What I Worked On\n\nFebruary 2021\n\nBefore co...
0.877719
16de2060-dd9b-4622-92a1-9be080564a40
0
6ce5800d-7186-414e-a1cf-1efb8d39c8d4
What were the limitations of the 1401 computer...
I was puzzled by the 1401. I couldn't figure o...
0.847688
1
6ce5800d-7186-414e-a1cf-1efb8d39c8d4
What were the limitations of the 1401 computer...
I remember vividly how impressed and envious I...
0.836979
e996c90f-4ea9-4f7c-b145-cf461de7d09b
0
a328a85a-aadd-44f5-b49a-2748d0bd4d2f
What were the two main things the author worke...
What I Worked On\n\nFebruary 2021\n\nBefore co...
0.843280
1
a328a85a-aadd-44f5-b49a-2748d0bd4d2f
What were the two main things the author worke...
Then one day in April 1990 a crack appeared in...
0.822055
from phoenix.evals import (
RelevanceEvaluator,
run_evals,
)
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
retrieved_documents_relevance_df = run_evals(
evaluators=[relevance_evaluator],
dataframe=retrieved_documents_df,
provide_explanation=True,
concurrency=20,
)[0]
retrieved_documents_relevance_df.head()
documents_with_relevance_df = pd.concat(
[retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df
import numpy as np
from sklearn.metrics import ndcg_score
def _compute_ndcg(df: pd.DataFrame, k: int):
"""Compute NDCG@k in the presence of missing values"""
n = max(2, len(df))
eval_scores = np.zeros(n)
doc_scores = np.zeros(n)
eval_scores[: len(df)] = df.eval_score
doc_scores[: len(df)] = df.document_score
try:
return ndcg_score([eval_scores], [doc_scores], k=k)
except ValueError:
return np.nan
ndcg_at_2 = pd.DataFrame(
{"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)
ndcg_at_2
precision_at_2 = pd.DataFrame(
{
"score": documents_with_relevance_df.groupby("context.span_id").apply(
lambda x: x.eval_score[:2].sum(skipna=False) / 2
)
}
)
precision_at_2
hit = pd.DataFrame(
{
"hit": documents_with_relevance_df.groupby("context.span_id").apply(
lambda x: x.eval_score[:2].sum(skipna=False) > 0
)
}
)
retrievals_df = px.active_session().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
[
retrievals_df["attributes.input.value"],
ndcg_at_2.add_prefix("ncdg@2_"),
precision_at_2.add_prefix("precision@2_"),
hit,
],
axis=1,
)
rag_evaluation_dataframe
# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results
ncdg@2_score 0.913450
precision@2_score 0.804598
hit 0.936782
dtype: float64
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.active_session())
qa_with_reference_df
from phoenix.evals import (
HallucinationEvaluator,
OpenAIModel,
QAEvaluator,
run_evals,
)
qa_evaluator = QAEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))
qa_correctness_eval_df, hallucination_eval_df = run_evals(
evaluators=[qa_evaluator, hallucination_evaluator],
dataframe=qa_with_reference_df,
provide_explanation=True,
concurrency=20,
)
qa_correctness_eval_df.head()
hallucination_eval_df.head()
qa_correctness_eval_df.mean(numeric_only=True)
score 0.931034
dtype: float64
hallucination_eval_df.mean(numeric_only=True)
score 0.051724
dtype: float64
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(
SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
Sending Evaluations: 100%|██████████| 348/348 [00:00<00:00, 415.37it/s]
print("phoenix URL", px.active_session().url)
phoenix URL http://localhost:6006/
This notebook serves as an end-to-end example of how to trace and evaluate an agent. The example uses a "talk-to-your-data" agent as its example.
The notebook shows examples of:
Manually instrumenting an agent using Phoenix decorators
Evaluating function calling accuracy using LLM as a Judge
Evaluating function calling accuracy by comparing to ground truth
Evaluating SQL query generation
Evaluating Python code generation
Evaluating the path of an agent
!pip install -q openai "arize-phoenix>=8.8.0" "arize-phoenix-otel>=0.8.0" openinference-instrumentation-openai python-dotenv duckdb "openinference-instrumentation>=0.1.21"
import dotenv
dotenv.load_dotenv()
import json
import os
from getpass import getpass
import duckdb
import pandas as pd
from IPython.display import Markdown
from openai import OpenAI
from openinference.instrumentation import (
suppress_tracing,
)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import StatusCode
from pydantic import BaseModel, Field
from tqdm import tqdm
from phoenix.otel import register
if os.getenv("OPENAI_API_KEY") is None:
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI()
model = "gpt-4o-mini"
project_name = "talk-to-your-data-agent"
Sign up for a free instance of Phoenix Cloud to get your API key. If you'd prefer, you can instead self-host Phoenix.
if os.getenv("PHOENIX_API_KEY") is None:
os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"
tracer_provider = register(
project_name=project_name,
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
tracer = tracer_provider.get_tracer(__name__)
Your agent will interact with a local database. Start by loading in that data:
store_sales_df = pd.read_parquet(
"https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/llama-index/Store_Sales_Price_Elasticity_Promotions_Data.parquet"
)
store_sales_df.head()
Now you can define your agent tools.
SQL_GENERATION_PROMPT = """
Generate an SQL query based on a prompt. Do not reply with anything besides the SQL query.
The prompt is: {prompt}
The available columns are: {columns}
The table name is: {table_name}
"""
def generate_sql_query(prompt: str, columns: list, table_name: str) -> str:
"""Generate an SQL query based on a prompt"""
formatted_prompt = SQL_GENERATION_PROMPT.format(
prompt=prompt, columns=columns, table_name=table_name
)
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": formatted_prompt}],
)
return response.choices[0].message.content
@tracer.tool()
def lookup_sales_data(prompt: str) -> str:
"""Implementation of sales data lookup from parquet file using SQL"""
try:
table_name = "sales"
# Read the parquet file into a DuckDB table
duckdb.sql(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM store_sales_df")
print(store_sales_df.columns)
print(table_name)
sql_query = generate_sql_query(prompt, store_sales_df.columns, table_name)
sql_query = sql_query.strip()
sql_query = sql_query.replace("```sql", "").replace("```", "")
with tracer.start_as_current_span(
"execute_sql_query", openinference_span_kind="chain"
) as span:
span.set_input(value=sql_query)
# Execute the SQL query
result = duckdb.sql(sql_query).df()
span.set_output(value=str(result))
span.set_status(StatusCode.OK)
return result.to_string()
except Exception as e:
return f"Error accessing data: {str(e)}"
example_data = lookup_sales_data("Show me all the sales for store 1320 on November 1st, 2021")
example_data
class VisualizationConfig(BaseModel):
chart_type: str = Field(..., description="Type of chart to generate")
x_axis: str = Field(..., description="Name of the x-axis column")
y_axis: str = Field(..., description="Name of the y-axis column")
title: str = Field(..., description="Title of the chart")
@tracer.chain()
def extract_chart_config(data: str, visualization_goal: str) -> dict:
"""Generate chart visualization configuration
Args:
data: String containing the data to visualize
visualization_goal: Description of what the visualization should show
Returns:
Dictionary containing line chart configuration
"""
prompt = f"""Generate a chart configuration based on this data: {data}
The goal is to show: {visualization_goal}"""
response = client.beta.chat.completions.parse(
model=model,
messages=[{"role": "user", "content": prompt}],
response_format=VisualizationConfig,
)
try:
# Extract axis and title info from response
content = response.choices[0].message.content
# Return structured chart config
return {
"chart_type": content.chart_type,
"x_axis": content.x_axis,
"y_axis": content.y_axis,
"title": content.title,
"data": data,
}
except Exception:
return {
"chart_type": "line",
"x_axis": "date",
"y_axis": "value",
"title": visualization_goal,
"data": data,
}
@tracer.chain()
def create_chart(config: VisualizationConfig) -> str:
"""Create a chart based on the configuration"""
prompt = f"""Write python code to create a chart based on the following configuration.
Only return the code, no other text.
config: {config}"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
)
code = response.choices[0].message.content
code = code.replace("```python", "").replace("```", "")
code = code.strip()
return code
@tracer.tool()
def generate_visualization(data: str, visualization_goal: str) -> str:
"""Generate a visualization based on the data and goal"""
config = extract_chart_config(data, visualization_goal)
code = create_chart(config)
return code
# code = generate_visualization(example_data, "A line chart of sales over each day in november.")
@tracer.tool()
def run_python_code(code: str) -> str:
"""Execute Python code in a restricted environment"""
# Create restricted globals/locals dictionaries with plotting libraries
restricted_globals = {
"__builtins__": {
"print": print,
"len": len,
"range": range,
"sum": sum,
"min": min,
"max": max,
"int": int,
"float": float,
"str": str,
"list": list,
"dict": dict,
"tuple": tuple,
"set": set,
"round": round,
"__import__": __import__,
"json": __import__("json"),
},
"plt": __import__("matplotlib.pyplot"),
"pd": __import__("pandas"),
"np": __import__("numpy"),
"sns": __import__("seaborn"),
}
try:
# Execute code in restricted environment
exec_locals = {}
exec(code, restricted_globals, exec_locals)
# Capture any printed output or return the plot
exec_locals.get("__builtins__", {}).get("_", "")
if "plt" in exec_locals:
return exec_locals["plt"]
# Try to parse output as JSON before returning
return "Code executed successfully"
except Exception as e:
return f"Error executing code: {str(e)}"
@tracer.tool()
def analyze_sales_data(prompt: str, data: str) -> str:
"""Implementation of AI-powered sales data analysis"""
# Construct prompt based on analysis type and data subset
prompt = f"""Analyze the following data: {data}
Your job is to answer the following question: {prompt}"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
)
analysis = response.choices[0].message.content
return analysis if analysis else "No analysis could be generated"
# analysis = analyze_sales_data("What is the most popular product SKU?", example_data)
# analysis
You'll need to pass your tool descriptions into your agent router. The following code allows you to easily do so:
# Define tools/functions that can be called by the model
tools = [
{
"type": "function",
"function": {
"name": "lookup_sales_data",
"description": "Look up data from Store Sales Price Elasticity Promotions dataset",
"parameters": {
"type": "object",
"properties": {
"prompt": {
"type": "string",
"description": "The unchanged prompt that the user provided.",
}
},
"required": ["prompt"],
},
},
},
{
"type": "function",
"function": {
"name": "analyze_sales_data",
"description": "Analyze sales data to extract insights",
"parameters": {
"type": "object",
"properties": {
"data": {
"type": "string",
"description": "The lookup_sales_data tool's output.",
},
"prompt": {
"type": "string",
"description": "The unchanged prompt that the user provided.",
},
},
"required": ["data", "prompt"],
},
},
},
{
"type": "function",
"function": {
"name": "generate_visualization",
"description": "Generate Python code to create data visualizations",
"parameters": {
"type": "object",
"properties": {
"data": {
"type": "string",
"description": "The lookup_sales_data tool's output.",
},
"visualization_goal": {
"type": "string",
"description": "The goal of the visualization.",
},
},
"required": ["data", "visualization_goal"],
},
},
},
# {
# "type": "function",
# "function": {
# "name": "run_python_code",
# "description": "Run Python code in a restricted environment",
# "parameters": {
# "type": "object",
# "properties": {
# "code": {"type": "string", "description": "The Python code to run."}
# },
# "required": ["code"]
# }
# }
# }
]
# Dictionary mapping function names to their implementations
tool_implementations = {
"lookup_sales_data": lookup_sales_data,
"analyze_sales_data": analyze_sales_data,
"generate_visualization": generate_visualization,
# "run_python_code": run_python_code
}
With the tools defined, you're ready to define the main routing and tool call handling steps of your agent.
@tracer.chain()
def handle_tool_calls(tool_calls, messages):
for tool_call in tool_calls:
function = tool_implementations[tool_call.function.name]
function_args = json.loads(tool_call.function.arguments)
result = function(**function_args)
messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
return messages
def start_main_span(messages):
print("Starting main span with messages:", messages)
with tracer.start_as_current_span("AgentRun", openinference_span_kind="agent") as span:
span.set_input(value=messages)
ret = run_agent(messages)
print("Main span completed with return value:", ret)
span.set_output(value=ret)
span.set_status(StatusCode.OK)
return ret
def run_agent(messages):
print("Running agent with messages:", messages)
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
print("Converted string message to list format")
# Check and add system prompt if needed
if not any(
isinstance(message, dict) and message.get("role") == "system" for message in messages
):
system_prompt = {
"role": "system",
"content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
}
messages.append(system_prompt)
print("Added system prompt to messages")
while True:
# Router call span
print("Starting router call span")
with tracer.start_as_current_span(
"router_call",
openinference_span_kind="chain",
) as span:
span.set_input(value=messages)
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
messages.append(response.choices[0].message.model_dump())
tool_calls = response.choices[0].message.tool_calls
print("Received response with tool calls:", bool(tool_calls))
span.set_status(StatusCode.OK)
if tool_calls:
# Tool calls span
print("Processing tool calls")
messages = handle_tool_calls(tool_calls, messages)
span.set_output(value=tool_calls)
else:
print("No tool calls, returning final response")
span.set_output(value=response.choices[0].message.content)
return response.choices[0].message.content
Your agent is now good to go! Let's try it out with some example questions:
ret = start_main_span([{"role": "user", "content": "Create a line chart showing sales in 2021"}])
print(Markdown(ret))
agent_questions = [
"What was the most popular product SKU?",
"What was the total revenue across all stores?",
"Which store had the highest sales volume?",
"Create a bar chart showing total sales by store",
"What percentage of items were sold on promotion?",
"Plot daily sales volume over time",
"What was the average transaction value?",
"Create a box plot of transaction values",
"Which products were frequently purchased together?",
"Plot a line graph showing the sales trend over time with a 7-day moving average",
]
for question in tqdm(agent_questions, desc="Processing questions"):
try:
ret = start_main_span([{"role": "user", "content": question}])
except Exception as e:
print(f"Error processing question: {question}")
print(e)
continue
So your agent looks like it's working, but how can you measure its performance?
OpenAIInstrumentor().uninstrument() # Uninstrument the OpenAI client to avoid capturing LLM as a Judge evaluation calls in your same project.
import nest_asyncio
import phoenix as px
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify
from phoenix.experiments import evaluate_experiment, run_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery
nest_asyncio.apply()
px_client = px.Client()
eval_model = OpenAIModel(model="gpt-4o-mini")
This first evaluation will evaluate your agent router choices using another LLM.
It follows a standard pattern:
Export traces from Phoenix
Prepare those exported traces in a dataframe with the correct columns
Use llm_classify
to run a standard template across each row of that dataframe and produce an eval label
Upload the results back into Phoenix
query = (
SpanQuery()
.where(
"span_kind == 'LLM'",
)
.select(question="input.value", output_messages="llm.output_messages")
)
# The Phoenix Client can take this query and return the dataframe.
tool_calls_df = px.Client().query_spans(query, project_name=project_name, timeout=None)
tool_calls_df.dropna(subset=["output_messages"], inplace=True)
def get_tool_call(outputs):
if outputs[0].get("message").get("tool_calls"):
return (
outputs[0]
.get("message")
.get("tool_calls")[0]
.get("tool_call")
.get("function")
.get("name")
)
else:
return "No tool used"
tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_call)
tool_calls_df.head()
tool_call_eval = llm_classify(
dataframe=tool_calls_df,
template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
"{tool_definitions}",
"generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
),
rails=["correct", "incorrect"],
model=eval_model,
provide_explanation=True,
)
tool_call_eval["score"] = tool_call_eval.apply(
lambda x: 1 if x["label"] == "correct" else 0, axis=1
)
tool_call_eval.head()
px.Client().log_evaluations(
SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)
You should now see eval labels in Phoenix.
The above example works, however if you have ground truth labled data, you can use that data to get an even more accurate measure of your router's performance by running an experiments.
Experiments also follow a standard step-by-step process in Phoenix:
Create a dataset of test cases, and optionally, expected outputs
Create a task to run on each test case - usually this is invoking your agent or a specifc step of it
Create evaluator(s) to run on each output of your task
Visualize results in Phoenix
import uuid
id = str(uuid.uuid4())
agent_tool_responses = {
"What was the most popular product SKU?": "lookup_sales_data, analyze_sales_data",
"What was the total revenue across all stores?": "lookup_sales_data, analyze_sales_data",
"Which store had the highest sales volume?": "lookup_sales_data, analyze_sales_data",
"Create a bar chart showing total sales by store": "generate_visualization, lookup_sales_data, run_python_code",
"What percentage of items were sold on promotion?": "lookup_sales_data, analyze_sales_data",
"Plot daily sales volume over time": "generate_visualization, lookup_sales_data, run_python_code",
"What was the average transaction value?": "lookup_sales_data, analyze_sales_data",
"Create a box plot of transaction values": "generate_visualization, lookup_sales_data, run_python_code",
"Which products were frequently purchased together?": "lookup_sales_data, analyze_sales_data",
"Plot a line graph showing the sales trend over time with a 7-day moving average": "generate_visualization, lookup_sales_data, run_python_code",
}
tool_calling_df = pd.DataFrame(agent_tool_responses.items(), columns=["question", "tool_calls"])
dataset = px_client.upload_dataset(
dataframe=tool_calling_df,
dataset_name=f"tool_calling_ground_truth_{id}",
input_keys=["question"],
output_keys=["tool_calls"],
)
For your task, you can simply run just the router call of your agent:
def run_router_step(example: Example) -> str:
messages = [
{
"role": "system",
"content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
}
]
messages.append({"role": "user", "content": example.input.get("question")})
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
tool_calls = []
for tool_call in response.choices[0].message.tool_calls:
tool_calls.append(tool_call.function.name)
return tool_calls
Your evaluator can also be simple, since you have expected outputs. If you didn't have those expected outputs, you could instead use an LLM as a Judge here, or even basic code:
def tools_match(expected: str, output: str) -> bool:
expected_tools = expected.get("tool_calls").split(", ")
return expected_tools == output
experiment = run_experiment(
dataset,
run_router_step,
evaluators=[tools_match],
experiment_name="Tool Calling Eval",
experiment_description="Evaluating the tool calling step of the agent",
)
The next piece of your agent to evaluate is its tools. Each tool is usually evaluated differently - we've included some examples below. If you need other ideas, Phoenix's built-in evaluators give you an idea of other metrics to use.
# This step will be replaced by a human annotated set of ground truth data, instead of generated examples
db_lookup_questions = [
"What was the most popular product SKU?",
"Which store had the highest total sales value?",
"How many items were sold on promotion?",
"What was the average quantity sold per transaction?",
"Which product class code generated the most revenue?",
"What day of the week had the highest sales volume?",
"How many unique stores made sales?",
"What was the highest single transaction value?",
"Which products were frequently sold together?",
"What's the trend in sales over time?",
]
expected_results = []
for question in tqdm(db_lookup_questions, desc="Processing SQL lookup questions"):
try:
with suppress_tracing():
expected_results.append(lookup_sales_data(question))
except Exception as e:
print(f"Error processing question: {question}")
print(e)
db_lookup_questions.remove(question)
# Create a DataFrame with the questions
questions_df = pd.DataFrame({"question": db_lookup_questions, "expected_result": expected_results})
display(questions_df)
dataset = px_client.upload_dataset(
dataframe=questions_df,
dataset_name=f"sales_db_lookup_questions_{id}",
input_keys=["question"],
output_keys=["expected_result"],
)
def run_sql_query(example: Example) -> str:
with suppress_tracing():
return lookup_sales_data(example.input.get("question"))
def evaluate_sql_result(output: str, expected: str) -> bool:
# Extract just the numbers from both strings
result_nums = "".join(filter(str.isdigit, output))
expected_nums = "".join(filter(str.isdigit, expected.get("expected_result")))
return result_nums == expected_nums
experiment = run_experiment(
dataset,
run_sql_query,
evaluators=[evaluate_sql_result],
experiment_name="SQL Query Eval",
experiment_description="Evaluating the SQL query generation step of the agent",
)
# Replace this with a human annotated set of ground truth data, instead of generated examples
code_generation_questions = [
"Create a bar chart showing total sales by store",
"Plot daily sales volume over time",
"Plot a line graph showing the sales trend over time with a 7-day moving average",
"Create a histogram of quantities sold per transaction",
"Generate a pie chart showing sales distribution across product classes",
"Create a stacked bar chart showing promotional vs non-promotional sales by store",
"Generate a heatmap of sales by day of week and store number",
"Plot a line chart comparing sales trends between top 5 stores",
]
example_data = []
chart_configs = []
for question in tqdm(code_generation_questions[:], desc="Processing code generation questions"):
try:
with suppress_tracing():
example_data.append(lookup_sales_data(question))
chart_configs.append(json.dumps(extract_chart_config(example_data[-1], question)))
except Exception as e:
print(f"Error processing question: {question}")
print(e)
code_generation_questions.remove(question)
code_generation_df = pd.DataFrame(
{
"question": code_generation_questions,
"example_data": example_data,
"chart_configs": chart_configs,
}
)
dataset = px_client.upload_dataset(
dataframe=code_generation_df,
dataset_name=f"code_generation_questions_{id}",
input_keys=["question", "example_data", "chart_configs"],
)
def run_code_generation(example: Example) -> str:
with suppress_tracing():
chart_config = extract_chart_config(
data=example.input.get("example_data"), visualization_goal=example.input.get("question")
)
code = generate_visualization(
visualization_goal=example.input.get("question"), data=example.input.get("example_data")
)
return {"code": code, "chart_config": chart_config}
In this case, you don't have ground truth data to compare to. Instead you can just use a simple code evaluator: trying to run the generated code and catching any errors.
def code_is_runnable(output: str) -> bool:
"""Check if the code is runnable"""
output = output.get("code")
output = output.strip()
output = output.replace("```python", "").replace("```", "")
try:
exec(output)
return True
except Exception:
return False
def evaluate_chart_config(output: str, expected: str) -> bool:
return output.get("chart_config") == expected.get("chart_config")
experiment = run_experiment(
dataset,
run_code_generation,
evaluators=[code_is_runnable, evaluate_chart_config],
experiment_name="Code Generation Eval",
experiment_description="Evaluating the code generation step of the agent",
)
Finally, the last piece of your agent to evaluate is its path. This is important to evaluate to understand how efficient your agent is in its execution. Does it need to call the same tool multiple times? Does it skip steps it shouldn't, and have to backtrack later? Convergence or path evals can tell you this.
Convergence evals operate slightly differently. The one you'll use below relies on knowing the minimum number of steps taken by the agent for a given type of query. Instead of just running an experiment, you'll run an experiment then after it completes, attach a second evaluator to calculate convergence.
The workflow is as follows:
Create a dataset of the same type of question, phrased different ways each time - the agent should take the same path for each, but you'll often find it doesn't.
Create a task that runs the agent on each question, while tracking the number of steps it takes.
Run the experiment without an evaluator.
Calculate the minimum number of steps taken to complete the task.
Create an evaluator that compares the steps taken of each run against that min step number.
Run this evaluator on your experiment from step 3.
View your results in Phoenix
# Replace this with a human annotated set of ground truth data, instead of generated examples
convergence_questions = [
"What was the average quantity sold per transaction?",
"What is the mean number of items per sale?",
"Calculate the typical quantity per transaction",
"Show me the average number of units sold in each transaction",
"What's the mean transaction size in terms of quantity?",
"On average, how many items were purchased per transaction?",
"What is the average basket size per sale?",
"Calculate the mean number of products per purchase",
"What's the typical number of units per order?",
"Find the average quantity of items in each transaction",
"What is the average number of products bought per purchase?",
"Tell me the mean quantity of items in a typical transaction",
"How many items does a customer buy on average per transaction?",
"What's the usual number of units in each sale?",
"Calculate the average basket quantity per order",
"What is the typical amount of products per transaction?",
"Show the mean number of items customers purchase per visit",
"What's the average quantity of units per shopping trip?",
"How many products do customers typically buy in one transaction?",
"What is the standard basket size in terms of quantity?",
]
convergence_df = pd.DataFrame({"question": convergence_questions})
dataset = px_client.upload_dataset(
dataframe=convergence_df, dataset_name="convergence_questions", input_keys=["question"]
)
def format_message_steps(messages):
"""
Convert a list of message objects into a readable format that shows the steps taken.
Args:
messages (list): A list of message objects containing role, content, tool calls, etc.
Returns:
str: A readable string showing the steps taken.
"""
steps = []
for message in messages:
role = message.get("role")
if role == "user":
steps.append(f"User: {message.get('content')}")
elif role == "system":
steps.append("System: Provided context")
elif role == "assistant":
if message.get("tool_calls"):
for tool_call in message["tool_calls"]:
tool_name = tool_call["function"]["name"]
steps.append(f"Assistant: Called tool '{tool_name}'")
else:
steps.append(f"Assistant: {message.get('content')}")
elif role == "tool":
steps.append(f"Tool response: {message.get('content')}")
return "\n".join(steps)
def run_agent_and_track_path(example: Example) -> str:
print("Starting main span with messages:", example.input.get("question"))
messages = [{"role": "user", "content": example.input.get("question")}]
ret = run_agent_messages(messages)
return {"path_length": len(ret), "messages": format_message_steps(ret)}
def run_agent_messages(messages):
print("Running agent with messages:", messages)
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
print("Converted string message to list format")
# Check and add system prompt if needed
if not any(
isinstance(message, dict) and message.get("role") == "system" for message in messages
):
system_prompt = {
"role": "system",
"content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
}
messages.append(system_prompt)
print("Added system prompt to messages")
while True:
# Router call span
print("Starting router")
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
messages.append(response.choices[0].message.model_dump())
tool_calls = response.choices[0].message.tool_calls
print("Received response with tool calls:", bool(tool_calls))
if tool_calls:
# Tool calls span
print("Processing tool calls")
tool_calls = response.choices[0].message.tool_calls
messages = handle_tool_calls(tool_calls, messages)
else:
print("No tool calls, returning final response")
return messages
experiment = run_experiment(
dataset,
run_agent_and_track_path,
experiment_name="Convergence Eval",
experiment_description="Evaluating the convergence of the agent",
)
experiment.as_dataframe()
outputs = experiment.as_dataframe()["output"].to_dict().values()
optimal_path_length = min(
output.get("path_length")
for output in outputs
if output and output.get("path_length") is not None
)
print(f"The optimal path length is {optimal_path_length}")
@create_evaluator(name="Convergence Eval", kind="CODE")
def evaluate_path_length(output: str) -> float:
if output and output.get("path_length"):
return optimal_path_length / float(output.get("path_length"))
else:
return 0
experiment = evaluate_experiment(experiment, evaluators=[evaluate_path_length])
As an optional final step, you can combine all the evaluators and experiments above into a single experiment. This requires some more advanced data wrangling, but gives you a single report on your agent's performance.
def process_messages(messages):
tool_calls = []
tool_responses = []
final_output = None
for i, message in enumerate(messages):
# Extract tool calls
if "tool_calls" in message and message["tool_calls"]:
for tool_call in message["tool_calls"]:
tool_name = tool_call["function"]["name"]
tool_input = tool_call["function"]["arguments"]
tool_calls.append(tool_name)
# Prepare tool response structure with tool name and input
tool_responses.append(
{"tool_name": tool_name, "tool_input": tool_input, "tool_response": None}
)
# Extract tool responses
if message["role"] == "tool" and "tool_call_id" in message:
for tool_response in tool_responses:
if message["tool_call_id"] in message.values():
tool_response["tool_response"] = message["content"]
# Extract final output
if (
message["role"] == "assistant"
and not message.get("tool_calls")
and not message.get("function_call")
):
final_output = message["content"]
result = {
"tool_calls": tool_calls,
"tool_responses": tool_responses,
"final_output": final_output,
"unchanged_messages": messages,
"path_length": len(messages),
}
return result
def run_agent_and_track_path_combined(example: Example) -> str:
print("Starting main span with messages:", example.input.get("question"))
messages = [{"role": "user", "content": example.input.get("question")}]
ret = run_agent_messages_combined(messages)
return process_messages(ret)
def run_agent_messages_combined(messages):
print("Running agent with messages:", messages)
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
print("Converted string message to list format")
# Check and add system prompt if needed
if not any(
isinstance(message, dict) and message.get("role") == "system" for message in messages
):
system_prompt = {
"role": "system",
"content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
}
messages.append(system_prompt)
print("Added system prompt to messages")
while True:
# Router call span
print("Starting router")
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
messages.append(response.choices[0].message.model_dump())
tool_calls = response.choices[0].message.tool_calls
print("Received response with tool calls:", bool(tool_calls))
if tool_calls:
# Tool calls span
print("Processing tool calls")
tool_calls = response.choices[0].message.tool_calls
messages = handle_tool_calls(tool_calls, messages)
else:
print("No tool calls, returning final response")
return messages
generate_sql_query("What was the most popular product SKU?", store_sales_df.columns, "sales")
overall_experiment_questions = [
{
"question": "What was the most popular product SKU?",
"sql_result": " SKU_Coded Total_Qty_Sold 0 6200700 52262.0",
},
{
"question": "What was the total revenue across all stores?",
"sql_result": " Total_Revenue 0 1.327264e+07",
},
{
"question": "Which store had the highest sales volume?",
"sql_result": " Store_Number Total_Sales_Volume 0 2970 59322.0",
},
{
"question": "Create a bar chart showing total sales by store",
"sql_result": " Store_Number Total_Sales 0 880 420302.088397 1 1650 580443.007953 2 4180 272208.118542 3 550 229727.498752 4 1100 497509.528013 5 3300 619660.167018 6 3190 335035.018792 7 2970 836341.327191 8 3740 359729.808228 9 2530 324046.518720 10 4400 95745.620250 11 1210 508393.767785 12 330 370503.687331 13 2750 453664.808068 14 1980 242290.828499 15 1760 350747.617798 16 3410 410567.848126 17 990 378433.018639 18 4730 239711.708869 19 4070 322307.968330 20 3080 495458.238811 21 2090 309996.247965 22 1320 592832.067579 23 2640 308990.318559 24 1540 427777.427815 25 4840 389056.668316 26 2860 132320.519487 27 2420 406715.767402 28 770 292968.918642 29 3520 145701.079372 30 660 343594.978075 31 3630 405034.547846 32 2310 412579.388504 33 2200 361173.288199 34 1870 401070.997685",
},
{
"question": "What percentage of items were sold on promotion?",
"sql_result": " Promotion_Percentage 0 0.625596",
},
{
"question": "What was the average transaction value?",
"sql_result": " Average_Transaction_Value 0 19.018132",
},
{
"question": "Create a line chart showing sales in 2021",
"sql_result": " sale_month total_quantity_sold total_sales_value 0 2021-11-01 43056.0 499984.428193 1 2021-12-01 75724.0 910982.118423",
},
]
overall_experiment_questions[0]["sql_generated"] = generate_sql_query(
overall_experiment_questions[0]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[1]["sql_generated"] = generate_sql_query(
overall_experiment_questions[1]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[2]["sql_generated"] = generate_sql_query(
overall_experiment_questions[2]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[3]["sql_generated"] = generate_sql_query(
overall_experiment_questions[3]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[4]["sql_generated"] = generate_sql_query(
overall_experiment_questions[4]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[5]["sql_generated"] = generate_sql_query(
overall_experiment_questions[5]["question"], store_sales_df.columns, "sales"
)
overall_experiment_questions[6]["sql_generated"] = generate_sql_query(
overall_experiment_questions[6]["question"], store_sales_df.columns, "sales"
)
print(overall_experiment_questions[6])
# overall_experiment_df = pd.DataFrame(overall_experiment_questions)
# dataset = px_client.upload_dataset(dataframe=overall_experiment_df, dataset_name="overall_experiment_questions_all", input_keys=["question"], output_keys=["sql_result"])
print(overall_experiment_questions[6])
[
{
"question": "What was the most popular product SKU?",
"sql_result": " SKU_Coded Total_Qty_Sold 0 6200700 52262.0",
"sql_generated": "```sql\nSELECT SKU_Coded, SUM(Qty_Sold) AS Total_Qty_Sold\nFROM sales\nGROUP BY SKU_Coded\nORDER BY Total_Qty_Sold DESC\nLIMIT 1;\n```",
},
{
"question": "What was the total revenue across all stores?",
"sql_result": " Total_Revenue 0 1.327264e+07",
"sql_generated": "```sql\nSELECT SUM(Total_Sale_Value) AS Total_Revenue\nFROM sales;\n```",
},
{
"question": "Which store had the highest sales volume?",
"sql_result": " Store_Number Total_Sales_Volume 0 2970 59322.0",
"sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales_Volume\nFROM sales\nGROUP BY Store_Number\nORDER BY Total_Sales_Volume DESC\nLIMIT 1;\n```",
},
{
"question": "Create a bar chart showing total sales by store",
"sql_result": " Store_Number Total_Sales 0 880 420302.088397 1 1650 580443.007953 2 4180 272208.118542 3 550 229727.498752 4 1100 497509.528013 5 3300 619660.167018 6 3190 335035.018792 7 2970 836341.327191 8 3740 359729.808228 9 2530 324046.518720 10 4400 95745.620250 11 1210 508393.767785 12 330 370503.687331 13 2750 453664.808068 14 1980 242290.828499 15 1760 350747.617798 16 3410 410567.848126 17 990 378433.018639 18 4730 239711.708869 19 4070 322307.968330 20 3080 495458.238811 21 2090 309996.247965 22 1320 592832.067579 23 2640 308990.318559 24 1540 427777.427815 25 4840 389056.668316 26 2860 132320.519487 27 2420 406715.767402 28 770 292968.918642 29 3520 145701.079372 30 660 343594.978075 31 3630 405034.547846 32 2310 412579.388504 33 2200 361173.288199 34 1870 401070.997685",
"sql_generated": "```sql\nSELECT Store_Number, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nGROUP BY Store_Number;\n```",
},
{
"question": "What percentage of items were sold on promotion?",
"sql_result": " Promotion_Percentage 0 0.625596",
"sql_generated": "```sql\nSELECT \n (SUM(CASE WHEN On_Promo = 'Yes' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS Promotion_Percentage\nFROM \n sales;\n```",
},
{
"question": "What was the average transaction value?",
"sql_result": " Average_Transaction_Value 0 19.018132",
"sql_generated": "```sql\nSELECT AVG(Total_Sale_Value) AS Average_Transaction_Value\nFROM sales;\n```",
},
{
"question": "Create a line chart showing sales in 2021",
"sql_result": " sale_month total_quantity_sold total_sales_value 0 2021-11-01 43056.0 499984.428193 1 2021-12-01 75724.0 910982.118423",
"sql_generated": "```sql\nSELECT MONTH(Sold_Date) AS Month, SUM(Total_Sale_Value) AS Total_Sales\nFROM sales\nWHERE YEAR(Sold_Date) = 2021\nGROUP BY MONTH(Sold_Date)\nORDER BY MONTH(Sold_Date);\n```",
},
]
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one
that is vague, disorganized, or difficult to understand, even if it may be factually correct.
Your response should be a single word: either "clear" or "unclear," and it should not include any other
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and
appropriately addresses the query. "unclear" indicates that the answer is ambiguous, poorly organized, or
not effectively communicated. Please carefully consider the query and answer before determining your
response.
After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your
explanation. Your reasoning should include specific points about how the answer does or does not meet the
criteria for clarity.
[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.
EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""
ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to determine whether all
the entities mentioned in the answer are correctly identified and accurately match those in the query. An
entity refers to any specific person, place, organization, date, or other proper noun. Your evaluation
should focus on whether the entities in the answer are correctly named and appropriately associated with
the context in the query.
Your response should be a single word: either "correct" or "incorrect," and it should not include any
other text or characters. "correct" indicates that all entities mentioned in the answer match those in the
query and are properly identified. "incorrect" indicates that the answer contains errors or mismatches in
the entities referenced compared to the query.
After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "correct" or "incorrect." Avoid stating the final label at the beginning of
your explanation. Your reasoning should include specific points about how the entities in the answer do or
do not match the entities in the query.
[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.
EXPLANATION: Provide your reasoning step by step, evaluating whether the entities in the answer are
correct and consistent with the query.
LABEL: "correct" or "incorrect"
"""
TOOL_CALLING_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json.dumps(tools))
def function_calling_eval(input: str, output: str) -> float:
function_calls = output.get("tool_calls")
if function_calls:
eval_df = pd.DataFrame(
{"question": [input.get("question")] * len(function_calls), "tool_call": function_calls}
)
tool_call_eval = llm_classify(
dataframe=eval_df,
template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
"{tool_definitions}", json.dumps(tools).replace("{", '"').replace("}", '"')
),
rails=["correct", "incorrect"],
model=eval_model,
provide_explanation=True,
)
tool_call_eval["score"] = tool_call_eval.apply(
lambda x: 1 if x["label"] == "correct" else 0, axis=1
)
return tool_call_eval["score"].mean()
else:
return 0
def code_is_runnable(output: str) -> bool:
"""Check if the code is runnable"""
generated_code = output.get("tool_responses")
if not generated_code:
return True
# Find first lookup_sales_data response
generated_code = next(
(r for r in generated_code if r.get("tool_name") == "generate_visualization"), None
)
if not generated_code:
return True
# Get the first response
generated_code = generated_code.get("tool_response", "")
generated_code = generated_code.strip()
generated_code = generated_code.replace("```python", "").replace("```", "")
try:
exec(generated_code)
return True
except Exception:
return False
def evaluate_sql_result(output, expected) -> bool:
sql_result = output.get("tool_responses")
if not sql_result:
return True
# Find first lookup_sales_data response
sql_result = next((r for r in sql_result if r.get("tool_name") == "lookup_sales_data"), None)
if not sql_result:
return True
# Get the first response
sql_result = sql_result.get("tool_response", "")
# Extract just the numbers from both strings
result_nums = "".join(filter(str.isdigit, sql_result))
expected_nums = "".join(filter(str.isdigit, expected.get("sql_result")))
return result_nums == expected_nums
def evaluate_clarity(output: str, input: str) -> bool:
df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
response = llm_classify(
dataframe=df,
template=CLARITY_LLM_JUDGE_PROMPT,
rails=["clear", "unclear"],
model=eval_model,
provide_explanation=True,
)
return response["label"] == "clear"
def evaluate_entity_correctness(output: str, input: str) -> bool:
df = pd.DataFrame({"query": [input.get("question")], "response": [output.get("final_output")]})
response = llm_classify(
dataframe=df,
template=ENTITY_CORRECTNESS_LLM_JUDGE_PROMPT,
rails=["correct", "incorrect"],
model=eval_model,
provide_explanation=True,
)
return response["label"] == "correct"
def run_overall_experiment(example: Example) -> str:
with suppress_tracing():
return run_agent_and_track_path_combined(example)
experiment = run_experiment(
dataset,
run_overall_experiment,
evaluators=[
function_calling_eval,
evaluate_sql_result,
evaluate_clarity,
evaluate_entity_correctness,
code_is_runnable,
],
experiment_name="Overall Experiment",
experiment_description="Evaluating the overall experiment",
)
You've now evaluated every aspect of your agent. If you've made it this far, you're now an expert in evaluating agent routers, tools, and paths!