Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Example agents are fully instrumented with OpenInference and utilize end-to-end tracing with Phoenix for comprehensive performance analysis. Enter your Phoenix and OpenAI keys to view traces.
Leverage the power of large language models to evaluate your generative model or application for hallucinations, toxicity, relevance of retrieved documents, and more.
Iteratively improve your LLM task by building datasets, running experiments, and evaluating performance using code and LLM-as-a-judge.
Use embeddings to explore lower-dimensional representations of your data, identifying clusters of high drift and performance degradation. Complement this with statistical analysis of structured data for A/B testing, temporal drift detection, and deeper performance insights.
The picture below shows a time series graph of the drift between two groups of vectors –- the primary (typically production) vectors and reference / baseline vectors. Phoenix uses euclidean distance as the primary measure of embedding drift and helps us identify times where your inference set is diverging from a given reference baseline.
Moments of high euclidean distance is an indication that the primary inference set is starting to drift from the reference inference set. As the primary inferences move further away from the reference (both in angle and in magnitude), the euclidean distance increases as well. For this reason times of high euclidean distance are a good starting point for trying to identify new anomalies and areas of drift.
In Phoenix, you can views the drift of a particular embedding in a time series graph at the top of the page. To diagnose the cause of the drift, click on the graph at different times to view a breakdown of the embeddings at particular time.
When twos are used to initialize phoenix, the clusters are automatically ordered by drift. This means that clusters that are suffering from the highest amount of under-sampling (more in the primary inferences than the reference) are bubbled to the top. You can click on these clusters to view the details of the points contained in each cluster.
In addition to the point-cloud, another dimension we have at our disposal is color (and in some cases shape). Out of the box phoenix let's you assign colors to the UMAP point-cloud by dimension (features, tags, predictions, actuals), performance (correctness which distinguishes true positives and true negatives from the incorrect predictions), and inference (to highlight areas of drift). This helps you explore your point-cloud from different perspectives depending on what you are looking for.
Trace through the execution of your LLM application to understand its internal structure and to troubleshoot issues with retrieval, tool execution, LLM calls, and more.
For each described in the inference (s), Phoenix serves a embeddings troubleshooting view to help you identify areas of drift and performance degradation. Let's start with embedding drift.
Note that when you are troubleshooting search and retrieval using inferences, the euclidean distance of your queries to your knowledge base vectors is presented as query distance.
For an in-depth guide of euclidean distance and embedding drift, check out
Phoenix automatically breaks up your embeddings into groups of inferences using a clustering algorithm called . This is particularly useful if you are trying to identify areas of your embeddings that are drifting or performing badly.
Phoenix projects the embeddings you provided into lower dimensional space (3 dimensions) using a dimension reduction algorithm called (stands for Uniform Manifold Approximation and Projection). This lets us understand how your in a visually understandable way.
Open AI Functions
Data extraction tasks using LLMs, such as scraping text from documents or pulling key information from paragraphs, are on the rise. Using an LLM for this task makes sense - LLMs are great at inherently capturing the structure of language, so extracting that structure from text using LLM prompting is a low cost, high scale method to pull out relevant data from unstructured text.
One approach is using a flattened schema. Let's say you're dealing with extracting information for a trip planning application. The query may look something like:
User: I need a budget-friendly hotel in San Francisco close to the Golden Gate Bridge for a family vacation. What do you recommend?
As the application designer, the schema you may care about here for downstream usage could be a flattened representation looking something like:
With the above extracted attributes, your downstream application can now construct a structured query to find options that might be relevant to the user.
The ChatCompletion
call to Open AI would look like
You can use phoenix spans and traces to inspect the invocation parameters of the function to
verify the inputs to the model in form of the the user message
verify your request to Open AI
verify the corresponding generated outputs from the model match what's expected from the schema and are correct
Point level evaluation is a great starting point, but verifying correctness of extraction at scale or in a batch pipeline can be challenging and expensive. Evaluating data extraction tasks performed by LLMs is inherently challenging due to factors like:
The diverse nature and format of source data.
The potential absence of a 'ground truth' for comparison.
The intricacies of context and meaning in extracted data.
OpenAI Agents SDK Cookbook
Create an agent with the OpenAI Agents SDK, trace its activity, benchmark with datasets, run experiments, and evaluate traces in production.
Evaluate an Agent
Trace and evaluate a "talk-to-your-data" agent. Includes evaluations for function calling accuracy, SQL query generation, code generation, and agent execution path.
OpenAI Agents SDK Cookbook
Evaluating an Agent
Structured extraction is a place where it’s simplest to work directly with the . Open AI functions for structured data extraction recommends providing the following JSON schema object in the form ofparameters_schema
(the desired fields for structured data output).
To learn more about how to evaluate structured extraction applications, !
Agents Cookbook
Chatbot with User Feedback
or
Embeddings Analysis: Data Exploration
RAG Use Cases
Evaluations Use Cases
Evaluating and Improving RAG Applications
Common Evaluations
Structured Data Analysis
Embeddings Analysis: Model Performance
Comprehensive Use Cases
Tracing with Sessions
Tracing Applications
Tracing Use Cases
Few-shot prompting is a powerful technique in prompt engineering that helps LLMs perform tasks more effectively by providing a few examples within the prompt.
Unlike zero-shot prompting, where the model must infer the task with no prior context, or one-shot prompting, where a single example is provided, few-shot prompting leverages multiple examples to guide the model’s responses more accurately.
In this tutorial you will:
Explore how different prompting strategies impact performance in a sentiment analysis task on a dataset of reviews.
Run an evaluation to measure how the prompt affects the model’s performance
Track your how your prompt and experiment changes overtime in Phoenix
By the end of this tutorial, you’ll have a clear understanding of how structured prompting can significantly enhance the results of any application.
⚠️You will need an OpenAI Key for this tutorial.
Let’s get started! 🚀
This dataset contains reviews along with their corresponding sentiment labels. Throughout this notebook, we will use the same dataset to evaluate the impact of different prompting techniques, refining our approach with each iteration.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
Zero-shot prompting is a technique where a language model is asked to perform a task without being given any prior examples. Instead, the model relies solely on its pre-trained knowledge to generate a response. This approach is useful when you need quick predictions without providing specific guidance.
In this section, we will apply zero-shot prompting to our sentiment analysis dataset, asking the model to classify reviews as positive, negative, or neutral without any labeled examples. We’ll then evaluate its performance to see how well it can infer the task based on the prompt alone.
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
Next we will define a task and evaluator for the experiment.
Because our dataset has ground truth labels, we can use a simple function to check if the output of the task matches the expected output.
If you’d like to instrument your code, you can run the cell below. While this step isn’t required for running prompts and evaluations, it enables trace visualization for deeper insights into the model’s behavior.
Finally, we run our experiement. We can view the results of the experiement in Phoenix.
In the following sections, we refine the prompt to enhance the model's performance and improve the evaluation results on our dataset.
One-shot prompting provides the model with a single example to guide its response. By including a labeled example in the prompt, we give the model a clearer understanding of the task, helping it generate more accurate predictions compared to zero-shot prompting.
In this section, we will apply one shot prompting to our sentiment analysis dataset by providing one labeled review as a reference. We’ll then evaluate how this small amount of guidance impacts the model’s ability to classify sentiments correctly.
Under the prompts tab in Phoenix, we can see that our prompt has an updated version. The prompt includes one random example from the test dataset to help the model make its classification.
Similar to the previous step, we will define the task and run the evaluator. This time, we will be using our updated prompt for One Shot Prompting and see how the evaluation changes.
In this run, we observe a slight improvement in the evaluation results. Let’s see if we can further enhance performance in the next section.
Note: You may sometimes see a decline in performance, which is not necessarily "wrong." Results can vary due to factors such as the choice of LLM, the randomness of selected test examples, and other inherent model behaviors.
Finally, we will explore few-shot Prompting which enhances a model’s performance by providing multiple labeled examples within the prompt. By exposing the model to several instances of the task, it gains a better understanding of the expected output, leading to more accurate and consistent responses.
In this section, we will apply few-shot prompting to our sentiment analysis dataset by including multiple labeled reviews as references. This approach helps the model recognize patterns and improves its ability to classify sentiments correctly. We’ll then evaluate its performance to see how additional examples impact accuracy compared to zero-shot and one-shot prompting.
Our updated prompt also lives in Phoenix. We can clearly see how the linear version history of our prompt was built.
Just like previous steps, we run our task and evaluation.
In this final run, we observe the most significant improvement in evaluation results. By incorporating multiple examples into our prompt, we provide clearer guidance to the model, leading to better sentiment classification.
Note: Performance may still vary, and in some cases, results might decline. Like before, this is not necessarily "wrong," as factors like the choice of LLM, the randomness of selected test examples, and inherent model behaviors can all influence outcomes.
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
From here, you can check out more , and if you haven't already, ⭐️
Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.
In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:
Upload a dataset of examples containing articles and human-written reference summaries to Phoenix
Define an experiment task that summarizes a news article
Devise evaluators for length and ROUGE score
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs
⚠️ This tutorial requires and OpenAI API key, and optionally, an Anthropic API key.
Let's get started!
Install requirements and import libraries.
Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.
Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.
A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.
You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.
Next, define a function to format a prompt template and invoke an OpenAI model on an example.
From this function, you can use functools.partial
to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example.
Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:
ROUGE-1 precision is the proportion of overlapping tokens (present in both reference and generated summaries) that are present in the generated summary (number of overlapping tokens / number of tokens in the generated summary)
ROUGE-1 recall is the proportion of overlapping tokens that are present in the reference summary (number of overlapping tokens / number of tokens in the reference summary)
ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.
Since we also care about conciseness, you'll also define an evaluator to count the number of tokens in each generated summary.
Note that you can use any third-party library you like while defining evaluators (in your case, rouge
and tiktoken
).
Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.
Our initial prompt template contained little guidance. It resulted in an ROUGE-1 F1-score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.
Inspecting the experiment results, you'll notice that the average num_tokens
has indeed increased, but the generated summaries are still far more verbose than the reference summaries.
Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.
Now run the experiment.
By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.
⚠️ This section requires an Anthropic API key.
Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.
First, enter your Anthropic API key if it is not already present.
Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.
If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:
See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.
Congrats! In this tutorial, you have:
Created a Phoenix dataset
Defined an experimental task and custom evaluators
Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall
ReAct (Reasoning + Acting) is a prompting technique that enables LLMs to think step-by-step before taking action. Unlike traditional prompting, where a model directly provides an answer, ReAct prompts guide the model to reason through a problem first, then decide which tools or actions are necessary to reach the best solution.
ReAct is ideal for situations that require multi-step problem-solving with external tools. It also improves transparency by clearly showing the reasoning behind each tool choice, making it easier to understand and refine the model's actions.
In this tutorial, you will:
Learn how to craft prompts, tools, and evaluators in Phoenix
Refine your prompts to understand the power of ReAct prompting
Leverage Phoenix and LLM as a Judge techniques to evaluate accuracy at each step, gaining insight into the model's thought process.
Learn how to apply ReAct prompting in real-world scenarios for improved task execution and problem-solving.
⚠️ You'll need an OpenAI Key for this tutorial.
Let’s get started! 🚀
Instrument Application
This dataset contains 20 customer service questions that a customer might ask a store's chatbot. As we dive into ReAct prompting, we'll use these questions to guide the LLM in selecting the appropriate tools.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
After running this cell, the dataset should will be under the Datasets tab in Phoenix.
Next, let's define the tools available for the LLM to use. We have five tools at our disposal, each serving a specific purpose: Product Comparison, Product Details, Discounts, Customer Support, and Track Package.
Depending on the customer's question, the LLM will determine the optimal sequence of tools to use.
Let's start by defining a simple prompt that instructs the system to utilize the available tools to answer the questions. The choice of which tools to use, and how to apply them, is left to the model's discretion based on the context of each customer query.
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
This prompt is provided to the LLM-as-Judge model, which takes in both the user's query and the tools the system has selected. The model then uses reasoning to assess how effectively the chosen tools addressed the query, providing an explanation for its evaluation.
In the following cells, we will define a task for the experiment.
Then, in the evaluate_response
function, we define our LLM as a Judge evaluator. Finally, we run our experiment.
After running our experiment and evaluation, we can dive deeper into the results. By clicking into the experiment, we can explore the tools that the LLM selected for the specific input. Next, if we click on the trace for the evaluation, we can see the reasoning behind the score assigned by LLM as a Judge for the output.
Next, we iterate on our system prompt using ReAct Prompting techniques. We emphasize that the model should think through the problem step-by-step, break it down logically, and then determine which tools to use and in what order. The model is instructed to output the relevant tools along with their corresponding parameters.
This approach differs from our initial prompt because it encourages reasoning before action, guiding the model to select the best tools and parameters based on the specific context of the query, rather than simply using predefined actions.
In the Prompts tab, you will see the updated prompt. As you iterate, you can build a version history.
Just like above, we define our task, construct the evaluator, and run the experiment.
With our updated ReAct prompt, we can observe that the LLM as a Judge Evaluator rated more outputs as correct. By clicking into the traces, we can gain insights into the reasons behind this improvement. By prompting our LLM to be more thoughtful and purposeful, we can see the reasoning and acting aspects of ReAct.
You can explore the evaluators outputs to better understand the improvements in detail.
Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.
To refine and test these prompts against other datasets, experiment with alternative techniques like Chain of Thought (CoT) prompting to assess how they complement or contrast with ReAct in your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using both the TypeScript and Python Clients.
This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:
Create an agent using the OpenAI agents SDK
Trace the agent activity
Create a dataset to benchmark performance
Run an experiment to evaluate agent performance using LLM as a judge
Learn how to evaluate traces in production
Here we've setup a basic agent that can solve math problems. We have a function tool that can solve math equations, and an agent that can use this tool.
We'll use the Runner
class to run the agent and get the final output.
Now we have a basic agent, let's evaluate whether the agent responded correctly!
Agents can go awry for a variety of reasons.
Tool call accuracy - did our agent choose the right tool with the right arguments?
Tool call results - did the tool respond with the right results?
Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
Let's setup our evaluation by defining our task function, our evaluator, and our dataset.
Next, we create our evaluator.
Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent.
During development, experimentation helps iterate quickly by revealing agent failures during evaluation. You can test against datasets to refine prompts, logic, and tool usage before deploying.
In this section, we run our agent against the dataset defined above and evaluate for correctness using LLM as Judge.
With our dataset of questions we generated above, we can use our experiment feature to track changes across models, prompts, parameters for our agent.
Let's create this dataset and upload it into the platform.
In production, evaluation provides real-time insights into how agents perform on user data.
This section simulates a live production setting, showing how you can collect traces, model outputs, and evaluation results in real time.
Another option is to pull traces from completed production runs and batch process evaluations on them. You can then log the results of those evaluations in Phoenix.
After importing the necessary libraries, we set up a tracer object to enable span creation for tracing our task function.
Next, we update our correctness evaluator to return both a label and an explanation, enabling metadata to be captured during tracing.
We also revise the task function to include with
clauses that generate structured spans in Phoenix. These spans capture key details such as input values, output values, and the results of the evaluation.
Finally, we run an experiment to simulate traces in production.
Download your from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM generated summaries.
Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 1 / 2 are considered excellent, and a .
As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the to train a bespoke model for your needs.
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
Next, we will define the Tool Calling Prompt Template. In this step, we use to evaluate the output. LLM as a Judge is a technique where one LLM assesses the performance of another LLM.
From here, you can check out more , and if you haven't already, ⭐️
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
We'll setup a simple evaluator that will check if the agent's response is correct, you can read about different types of agent evals .
Let's work through a Text2SQL use case where we are starting from scratch without a nice and clean dataset of questions, SQL queries, or expected responses.
Let's first start a phoenix server. Note that this is not necessary if you have a phoenix server running already.
Let's also setup tracing for OpenAI as we will be using their API to perform the synthesis.
Let's make sure we can run async code in the notebook.
Lastly, let's make sure we have our openai API key set up.
We are going to use the NBA dataset that information from 2014 - 2018. We will use DuckDB as our database.
Let's start by implementing a simple text2sql logic.
Awesome, looks like the LLM is producing SQL! let's try running the query and see if we get the expected results.
Evaluation consists of three parts — data, task, and scores. We'll start with data.
Let's store the data above as a versioned dataset in phoenix.
Next, we'll define the task. The task is to generate SQL queries from natural language questions.
Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.
Now let's run the evaluation experiment.
Ok! It looks like 3/5 of our queries are valid.
Now that we ran the initial evaluation, it looks like two of the results are valid, two produce SQL errors, and one is incorrect.
The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model (e.g. few shot example).
There are is a binder error, which may also have to do with not understanding the data format.
Let's try to improve the prompt with few-shot examples and see if we can get better results.
Looking much better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results.
Amazing. It looks like we removed one of the errors, and got a result for the incorrect query. Let's try out using LLM as a judge to see how well it can assess the results.
Sure enough the LLM agrees with our scoring. Pretty neat trick! This can come in useful when it's difficult to define a scoring function.
We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in Golden Dataset for regression testing!
Now that we have a basic flow in place, let's generate some data. We're going to use the dataset itself to generate expected queries, and have a model describe the queries. This is a slightly more robust method than having it generate queries, because we'd expect a model to describe a query more accurately than generate one from scratch.
Awesome, let's crate a dataset with the new synthetic data.
Amazing! Now we have a rich dataset to work with and some failures to debug. From here, you could try to investigate whether some of the generated data needs improvement, or try tweaking the prompt to improve accuracy, or maybe even something more adventurous, like feed the errors back to the model and have it iterate on a better query. Most importantly, we have a good workflow in place to iterate on both the application and dataset.
Just for fun, let's wrap things up by trying out GPT-3.5-turbo. All we need to do is switch the model name, and run our Eval() function again.
Interesting! It looks like the smaller model is able to do decently well but we might want to ensure it follows instructions as well as a larger model. We can actually grab all the LLM spans from our previous GPT40 runs and use them to generate a OpenAI fine-tuning JSONL file!
In this example, we walked through the process of building a dataset for a text2sql application. We started with a few handwritten examples, and iterated on the dataset by using an LLM to generate more examples. We used the eval framework to track our progress, and iterated on the model and dataset to improve the results. Finally, we tried out a less powerful model to see if we could save cost or improve latency.
Happy evaluations!
LLMs excel at text generation, but their reasoning abilities depend on how we prompt them. Chain of Thought (CoT) prompting enhances logical reasoning by guiding the model to think step by step, improving accuracy in tasks like math, logic, and multi-step problem solving.
In this tutorial, you will:
Examine how different prompting techniques influence reasoning by evaluating model performance on a dataset.
Refine prompting strategies, progressing from basic approaches to structured reasoning.
Utilize Phoenix to assess accuracy at each stage and explore the model's thought process.
Learn how to apply CoT prompting effectively in real-world tasks.
⚠️ You'll need an OpenAI Key for this tutorial.
Let’s dive in! 🚀
This dataset includes math word problems, step-by-step explanations, and their corresponding answers. As we refine our prompt, we'll test it against the dataset to measure and track improvements in performance.
Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.
Zero-shot prompting is the simplest way to interact with a language model—it involves asking a question without providing any examples or reasoning steps. The model generates an answer based solely on its pre-trained knowledge.
This serves as our baseline for comparison. By evaluating its performance on our dataset, we can see how well the model solves math word problems without explicit guidance. In later sections, we’ll introduce structured reasoning techniques like Chain of Thought (CoT) to measure improvements in accuracy and answers.
At this stage, this initial prompt is now available in Phoenix under the Prompt tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.
Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
Next, we will define a task and evaluator for the experiment. Then, we run our experiment.
Because our dataset has ground truth labels, we can use a simple function to extract the answer and check if the calculated answer matches the expected output.
We can review the results of the experiment in Phoenix. We achieved ~75% accuracy in this run. In the following sections, we will iterate on this prompt and see how our evaluation changes!
Note: Throughout this tutorial, you will encounter various evaluator outcomes. At times, you may notice a decline in performance compared to the initial experiment. However, this is not necessarily a flaw. Variations in results can arise due to factors such as the choice of LLM, inherent model behaviors, and randomness.
Zero-shot prompting provides a direct answer, but it often struggles with complex reasoning. Zero-Shot Chain of Thought (CoT) prompting improves this by explicitly instructing the model to think step by step before arriving at a final answer.
By adding a simple instruction like “Let’s think through this step by step,” we encourage the model to break down the problem logically. This structured reasoning can lead to more accurate answers, especially for multi-step math problems.
In this section, we'll compare Zero-Shot CoT against our baseline to evaluate its impact on performance. First, let's create the prompt.
This updated prompt is now lives in Phoenix as a new prompt version.
Next, we run our task and evaluation by extracting the answer from the output of our LLM.
By clicking into the experiment in Phoenix, you can take a look at the steps the model took the reach the answer. By telling the model to think through the problem and output reasoning, we see a performance improvement.
Even with Chain of Thought prompting, a single response may not always be reliable. Self-Consistency CoT enhances accuracy by generating multiple reasoning paths and selecting the most common answer. Instead of relying on one response, we sample multiple outputs and aggregate them, reducing errors caused by randomness or flawed reasoning steps.
This method improves robustness, especially for complex problems where initial reasoning steps might vary. In this section, we'll compare Self-Consistency CoT to our previous promppts to see how using on multiple responses impacts overall performance.
Let's repeat the same process as above with a new prompt and evaluate the outcome.
We've observed a significant improvement in performance! Since the prompt instructs the model to compute the answer multiple times independently, you may notice that the experiment takes slightly longer to run. You can click into the experiement explore to view the independent computations the model performed for each problem.
Few-shot CoT prompting enhances reasoning by providing worked examples before asking the model to solve a new problem. By demonstrating step-by-step solutions, the model learns to apply similar logical reasoning to unseen questions.
This method leverages in-context learning, allowing the model to generalize patterns from the examples.
In this final section, we’ll compare Few-Shot CoT against our previous prompts.
First, let's construct our prompt by sampling examples from a test dataset.
We now will construct our final prompt, run the experiement, and view the results. Under the Prompts tab in Phoenix, you can track the version history of your prompt and see what random examples were chosen.
After running all of your experiments, you can compare the performance of different prompting techniques. Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.
You can review your prompt version history in the Prompts tab and explore the Playground to iterate further and run additional experiments.
To refine and test these prompts against other datasets, experiment with Chain of Thought (CoT) prompting to see its relevance to your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using the TypeScript and Python Clients.
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
From here, you can check out more , and if you haven't already, ⭐️
An LLM as a Judge refers to using an LLM as a tool for evaluating and scoring responses based on predefined criteria.
While LLMs are powerful tools for evaluation, their performance can be inconsistent. Factors like ambiguity in the prompt, biases in the model, or a lack of clear guidelines can lead to unreliable results. By fine-tuning your LLM as a Judveprompts, you can improve the model's consistency, fairness, and accuracy, ensuring it delivers more reliable evaluations.
In this tutorial, you will:
Generate an LLM as a Judge evaluation prompt and test it against a datset
Learn about various optimization techniques to improve the template, measuring accuracy at each step using Phoenix evaluations
Understand how to apply these techniques together for better evaluation across your specific use cases
In this tutorial, we will focus on creating an LLM as a Judge prompt designed to assess empathy and emotional intelligence in chatbot responses. This is especially useful for use cases like mental health chatbots or customer support interactions.
We will start by loading a dataset containing 30 chatbot responses, each with a score for empathy and emotional intelligence (out of 10). Throughout the tutorial, we’ll use our prompt to evaluate these responses and compare the output to the ground-truth labels. This will allow us to assess how well our prompt performs.
Before iterating on our template, we need to establish a prompt. Running the cell below will generate an LLM as a Judge prompt specifically for evaluating empathy and emotional intelligence. When generating this template, we emphasize:
Picking evaluation criteria (e.g., empathy, emotional support, emotional intelligence).
Defining a clear scoring system (1-10 scale with defined descriptions).
Setting response formatting guidelines for clarity and consistency.
Including an explanation for why the LLM selects a given score.
Instrument the application to send traces to Phoenix:
Now that we have our baseline prompt, we need to set up two key components:
Task: The LLM as a Judge evaluation, where the model scores chatbot responses based on empathy and emotional intelligence.
Evaluator: A function that compares the LLM as a Judge output to the ground-truth labels from our dataset
Finally, we run our experiment. With this setup, we can measure how well our prompt initially performs.
If you find that your LLM as a Judge prompt has low accuracy, we can make adjustmenets to the prompt to improve that. In this section, we explore 2 techniques for this: few shot examples and keeping a human in the loop.
Few-shot examples help improve the accuracy of an LLM as a Judge prompt by providing clear reference points for evaluation. Instead of relying solely on general instructions, the model learns from labeled examples that demonstrate correct scoring and reasoning.
By including a mix of high, medium, and low-scoring responses, we help the model:
Understand nuanced criteria like empathy and emotional intelligence.
Reduce inconsistencies by aligning with real-world judgments.
Catch edge cases and biases that the model may overlook.
Refine scoring guidelines by identifying inconsistencies in LLM outputs.
Continuously improve the prompt by analyzing where the model struggles and adjusting instructions accordingly.
However, human review can be costly and time-intensive, making full-scale annotation impractical. Fortunately, even a small number of human-labeled examples can significantly enhance accuracy.
One common bias in LLM as a Judge evaluations is favoring certain writing styles over others. For example, the model might unintentionally rate formal, structured responses higher than casual or concise ones, even if both convey the same level of empathy or intelligence.
To reduce this bias, we focus on style-invariant evaluation, ensuring that the LLM judges responses based on content rather than phrasing or tone. This can be achieved by:
Providing diverse few-shot examples that include different writing styles.
Testing for bias by evaluating responses with varied phrasing and ensuring consistent scoring.
By making evaluations style-agnostic, we create a more robust scoring system that doesn’t unintentionally penalize certain tones.
Longer prompts increase computation costs and response times, making evaluations slower and more expensive. To optimize efficiency, we focus on condensing the prompt while preserving clarity and effectiveness. This is done by:
Removing redundant instructions and simplifying wording.
Using bullet points or structured formats for concise guidance.
Eliminating unnecessary explanations while keeping critical evaluation criteria intact.
A well-optimized prompt reduces token count, leading to faster, more cost-effective evaluations without sacrificing accuracy or reliability.
Self-refinement allows a Judge to improve its own evaluations by critically analyzing and adjusting its initial judgments. Instead of providing a static score, the model engages in an iterative process:
Generate an initial score based on the evaluation criteria.
Reflect on its reasoning, checking for inconsistencies or biases.
Refine the score if needed, ensuring alignment with the evaluation guidelines.
By incorporating this style of reasoning, the model can justify its decisions and self-correct errors.
To maximize the accuracy and fairness of our Judge, we will combine multiple optimization techniques. In this example, we will incorporate few-shot examples and style-invariant evaluation to ensure the model focuses on content rather than phrasing or tone.
By applying these techniques together, we aim to create a more reliable evaluation framework.
Techniques like few-shot examples, self-refinement, style-invariant evaluation, and prompt condensation each offer unique benefits, but their effectiveness will vary depending on the task.
By systematically testing and combining these approaches, you can refine your evaluation framework.
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
Phoenix offers many for LLM as a Judge, but often, you may need to build a custom evaluator for specific use cases.
Keeping a human in the loop improves the accuracy of an LLM as a Judge by providing oversight, validation, and corrections where needed. In Phoenix, we can do this with . While LLMs can evaluate responses based on predefined criteria, human reviewers help:
This tutorial will use Phoenix to compare the performance of different prompt optimization techniques.
You'll start by creating an experiment in Phoenix that can house the results of each of your resulting prompts. Next you'll use a series of prompt optimization techniques to improve the performance of a jailbreak classification task. Each technique will be applied to the same base prompt, and the results will be compared using Phoenix.
The techniques you'll use are:
Few Shot Examples: Adding a few examples to the prompt to help the model understand the task.
Meta Prompting: Prompting a model to generate a better prompt based on previous inputs, outputs, and expected outputs.
Prompt Gradients: Using the gradient of the prompt to optimize individual components of the prompt using embeddings.
DSPy Prompt Tuning: Using DSPy, an automated prompt tuning library, to optimize the prompt.
⚠️ This tutorial requires and OpenAI API key.
Let's get started!
Since we'll be running a series of experiments, we'll need a dataset of test cases that we can run each time. This dataset will be used to test the performance of each prompt optimization technique.
Next, you can define a base template for the prompt. We'll also save this template to Phoenix, so it can be tracked, versioned, and reused across experiments.
You should now see that prompt in Phoenix:
Next you'll need a task and evaluator for the experiment. A task is a function that will be run across each example in the dataset. The task is also the piece of your code that you'll change between each run of the experiment. To start off, the task is simply a call to GPT 3.5 Turbo with a basic prompt.
You'll also need an evaluator that will be used to test the performance of the task. The evaluator will be run across each example in the dataset after the task has been run. Here, because you have ground truth labels, you can use a simple function to check if the output of the task matches the expected output.
You can also instrument your code to send all models calls to Phoenix. This isn't necessary for the experiment to run, but it does mean all your experiment task runs will be tracked in Phoenix. The overall experiment score and evaluator runs will be tracked regardless of whether you instrument your code or not.
Now you can run the initial experiment. This will be the base prompt that you'll be optimizing.
You should now see the initial experiment results in Phoenix:
One common prompt optimization technique is to use few shot examples to guide the model's behavior.
Here you can add few shot examples to the prompt to help improve performance. Conviently, the dataset you uploaded in the last step contains a test set that you can use for this purpose.
Define a new prompt that includes the few shot examples. Prompts in Phoenix are automatically versioned, so saving the prompt with the same name will create a new version that can be used.
You'll notice you now have a new version of the prompt in Phoenix:
Define a new task with your new prompt:
Now you can run another experiment with the new prompt. The dataset of test cases and the evaluator will be the same as the previous experiment.
Meta prompting involves prompting a model to generate a better prompt, based on previous inputs, outputs, and expected outputs.
The experiment from round 1 serves as a great starting point for this technique, since it has each of those components.
Now construct a new prompt that will be used to generate a new prompt.
Now save that as a prompt in Phoenix:
Redefine the task, using the new prompt.
Prompt gradient optimization is a technique that uses the gradient of the prompt to optimize individual components of the prompt using embeddings. It involves:
Converting the prompt into an embedding.
Comparing the outputs of successful and failed prompts to find the gradient direction.
Moving in the gradient direction to optimize the prompt.
Here you'll define a function to get embeddings for prompts, and then use that function to calculate the gradient direction between successful and failed prompts.
Redefine the task, using the new prompt.
DSPy makes a series of calls to optimize the prompt. It can be useful to see these calls in action. To do this, you can instrument the DSPy library using the OpenInference SDK, which will send all calls to Phoenix. This is optional, but it can be useful to have.
Now you'll setup the DSPy language model and define a prompt classification task.
Your classifier can now be used to make predictions as you would a normal LLM. It will expect a prompt
input and will output a label
prediction.
However, DSPy really shines when it comes to optimizing prompts. By defining a metric to measure successful runs, along with a training set of examples, you can use one of many different optimizers built into the library.
In this case, you'll use the MIPROv2
optimizer to find the best prompt for your task.
DSPy takes care of our prompts in this case, however you could still save the resulting prompt value in Phoenix:
Redefine the task, using the new prompt.
In the last example, you used GPT-3.5 Turbo to both run your pipeline, and optimize the prompt. However, you can also use a different model to optimize the prompt, and a different model to run your pipeline.
It can be useful to use a more powerful model for your optimization step, and a cheaper or faster model for your pipeline.
Here you'll use GPT-4o to optimize the prompt, and keep GPT-3.5 Turbo as your pipeline model.
Redefine the task, using the new prompt.
And just like that, you've run a series of prompt optimization techniques to improve the performance of a jailbreak classification task, and compared the results using Phoenix.
You should have a set of experiments that looks like this:
Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also if you'd prefer.
Finally, you can use an optimization library to optimize the prompt, like DSPy. supports each of the techniques you've used so far, and more.
From here, you can check out more , and if you haven't already, ⭐️
Building a RAG pipeline and evaluating it with Phoenix Evals.
In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.
It has the the following sections:
Understanding Retrieval Augmented Generation (RAG).
Building RAG (with the help of a framework such as LlamaIndex).
Evaluating RAG with Phoenix Evals.
LLMs are trained on vast amounts of data, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.
In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.
RAG is a critical component for building applications such a chatbots or agents and you will want to know RAG techniques on how to get data into your application.
There are five key stages within RAG, which will in turn be a part of any larger RAG application.
Loading: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
Querying: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.
For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation.
Build a QueryEngine and start querying.
Check the response that you get from the query.
By default LlamaIndex retrieves two similar nodes/ chunks. You can modify that in vector_index.as_query_engine(similarity_top_k=k)
.
Let's check the text in each of these retrieved nodes.
Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.
We can access the traces by directly pulling the spans from the phoenix session.
6aba9eee-91c9-4ee2-81e9-1bdae2eb435d
llm
LLM
NaN
NaN
cc9feb6a-30ba-4f32-af8d-8c62dd1b1b23
synthesize
CHAIN
What did the author do growing up?
NaN
8202dbe5-d17e-4939-abd8-153cad08bdca
embedding
EMBEDDING
NaN
NaN
aeadad73-485f-400b-bd9d-842abfaa460b
retrieve
RETRIEVER
What did the author do growing up?
[{'document.content': 'What I Worked On
Febru...
9e25c528-5e2f-4719-899a-8248bab290ec
query
CHAIN
What did the author do growing up?
NaN
Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.
aeadad73-485f-400b-bd9d-842abfaa460b
What did the author do growing up?
[{'document.content': 'What I Worked On
Febru...
We have built a RAG pipeline and also have instrumented it using Phoenix Tracing. We now need to evaluate it's performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.
Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.
While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.
In a RAG system, evaluation focuses on two critical aspects:
Retrieval Evaluation: To assess the accuracy and relevance of the documents that were retrieved
Response Evaluation: Measure the appropriateness of the response generated by the system when the context was provided.
For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.
For this tutorial, let's use Phoenix's llm_generate
to help us create the question-context pairs.
First, let's create a dataframe of all the document chunks that we have indexed.
0
What I Worked On\n\nFebruary 2021\n\nBefore co...
1
I was puzzled by the 1401. I couldn't figure o...
2
I remember vividly how impressed and envious I...
3
I couldn't have put this into words when I was...
4
This was more like it; this was what I had exp...
Now that we have the document chunks, let's prompt an LLM to generate us 3 questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.
0
What were the two main things the author worke...
What was the language the author used to write...
What was the author's clearest memory regardin...
1
What were the limitations of the 1401 computer...
How did microcomputers change the author's exp...
Why did the author's father buy a TRS-80 compu...
2
What was the author's first experience with co...
Why did the author decide to switch from study...
What were the two things that influenced the a...
3
What were the two things that inspired the aut...
What programming language did the author learn...
What was the author's undergraduate thesis about?
4
What was the author's undergraduate thesis about?
Which three grad schools did the author apply to?
What realization did the author have during th...
The LLM has generated three questions per chunk. Let's take a quick look.
0
What I Worked On\n\nFebruary 2021\n\nBefore co...
What were the two main things the author worke...
1
I was puzzled by the 1401. I couldn't figure o...
What were the limitations of the 1401 computer...
2
I remember vividly how impressed and envious I...
What was the author's first experience with co...
3
I couldn't have put this into words when I was...
What were the two things that inspired the aut...
4
This was more like it; this was what I had exp...
What was the author's undergraduate thesis about?
5
Only Harvard accepted me, so that was where I ...
What realization did the author have during th...
6
So I decided to focus on Lisp. In fact, I deci...
What motivated the author to write a book abou...
7
Anyone who wanted one to play around with coul...
What realization did the author have while vis...
8
I knew intellectually that people made art — t...
What was the author's initial perception of pe...
9
Then one day in April 1990 a crack appeared in...
What was the author's initial plan for their d...
We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether or not that the correct context is retrieved.
context.span_id
document_position
b375be95-8e5e-4817-a29f-e18f7aaa3e98
0
20e0f915-e089-4e8e-8314-b68ffdffd7d1
How does leaving YC affect the author's relati...
On one of them I realized I was ready to hand ...
0.820411
1
20e0f915-e089-4e8e-8314-b68ffdffd7d1
How does leaving YC affect the author's relati...
That was what it took for Rtm to offer unsolic...
0.815969
e4e68b51-dbc9-4154-85a4-5cc69382050d
0
4ad14fd2-0950-4b3f-9613-e1be5e51b5a4
Why did YC become a fund for a couple of years...
For example, one thing Julian had done for us ...
0.860981
1
4ad14fd2-0950-4b3f-9613-e1be5e51b5a4
Why did YC become a fund for a couple of years...
They were an impressive group. That first batc...
0.849695
27ba6b6f-828b-4732-bfcc-3262775cd71f
0
d62fb8e8-4247-40ac-8808-818861bfb059
Why did the author choose the name 'Y Combinat...
Screw the VCs who were taking so long to make ...
0.868981
...
...
...
...
...
...
353f152c-44ce-4f3e-a323-0caa90f4c078
1
6b7bebf6-bed3-45fd-828a-0730d8f358ba
What was the author's first experience with co...
What I Worked On\n\nFebruary 2021\n\nBefore co...
0.877719
16de2060-dd9b-4622-92a1-9be080564a40
0
6ce5800d-7186-414e-a1cf-1efb8d39c8d4
What were the limitations of the 1401 computer...
I was puzzled by the 1401. I couldn't figure o...
0.847688
1
6ce5800d-7186-414e-a1cf-1efb8d39c8d4
What were the limitations of the 1401 computer...
I remember vividly how impressed and envious I...
0.836979
e996c90f-4ea9-4f7c-b145-cf461de7d09b
0
a328a85a-aadd-44f5-b49a-2748d0bd4d2f
What were the two main things the author worke...
What I Worked On\n\nFebruary 2021\n\nBefore co...
0.843280
1
a328a85a-aadd-44f5-b49a-2748d0bd4d2f
What were the two main things the author worke...
Then one day in April 1990 a crack appeared in...
0.822055
Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regards to the query. Note, we've turned on explanations
which prompts the LLM to explain it's reasoning. This can be useful for debugging and for figuring out potential corrective actions.
We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.
Let's also compute precision at 2 for all our retrieval steps.
Lastly, let's compute whether or not a correct document was retrieved at all for each query (e.g. a hit)
Let's now view the results in a combined dataframe.
Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.
As we can see from the above numbers, our RAG system is not perfect, there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results but non-relevant information is also included in the context. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.
We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.
The retrieval evaluations demonstrates that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.
Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.
Our QA Correctness score of 0.91
and a Hallucinations score 0.05
signifies that the generated answers are correct ~91% of the time and that the responses contain hallucinations 5% of the time - there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.
Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.
We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response.
We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and generated responses within the pipelines.
Now that we have understood the stages of RAG, let's build a pipeline. We will use for RAG and for evaluation.
Let's use an to build our RAG pipeline.
Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the ).
Let's compute Normalized Discounted Cumulative Gain at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure effectiveness of search engine algorithms and related applications.
Now that we have a dataset of the question, context, and response (input, reference, and output), we now can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the .
Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the documentation.
This notebook serves as an end-to-end example of how to trace and evaluate an agent. The example uses a "talk-to-your-data" agent as its example.
The notebook shows examples of:
Manually instrumenting an agent using Phoenix decorators
Evaluating function calling accuracy using LLM as a Judge
Evaluating function calling accuracy by comparing to ground truth
Evaluating SQL query generation
Evaluating Python code generation
Evaluating the path of an agent
Your agent will interact with a local database. Start by loading in that data:
Now you can define your agent tools.
You'll need to pass your tool descriptions into your agent router. The following code allows you to easily do so:
With the tools defined, you're ready to define the main routing and tool call handling steps of your agent.
Your agent is now good to go! Let's try it out with some example questions:
So your agent looks like it's working, but how can you measure its performance?
This first evaluation will evaluate your agent router choices using another LLM.
It follows a standard pattern:
Export traces from Phoenix
Prepare those exported traces in a dataframe with the correct columns
Use llm_classify
to run a standard template across each row of that dataframe and produce an eval label
Upload the results back into Phoenix
You should now see eval labels in Phoenix.
The above example works, however if you have ground truth labled data, you can use that data to get an even more accurate measure of your router's performance by running an experiments.
Experiments also follow a standard step-by-step process in Phoenix:
Create a dataset of test cases, and optionally, expected outputs
Create a task to run on each test case - usually this is invoking your agent or a specifc step of it
Create evaluator(s) to run on each output of your task
Visualize results in Phoenix
For your task, you can simply run just the router call of your agent:
Your evaluator can also be simple, since you have expected outputs. If you didn't have those expected outputs, you could instead use an LLM as a Judge here, or even basic code:
In this case, you don't have ground truth data to compare to. Instead you can just use a simple code evaluator: trying to run the generated code and catching any errors.
Finally, the last piece of your agent to evaluate is its path. This is important to evaluate to understand how efficient your agent is in its execution. Does it need to call the same tool multiple times? Does it skip steps it shouldn't, and have to backtrack later? Convergence or path evals can tell you this.
Convergence evals operate slightly differently. The one you'll use below relies on knowing the minimum number of steps taken by the agent for a given type of query. Instead of just running an experiment, you'll run an experiment then after it completes, attach a second evaluator to calculate convergence.
The workflow is as follows:
Create a dataset of the same type of question, phrased different ways each time - the agent should take the same path for each, but you'll often find it doesn't.
Create a task that runs the agent on each question, while tracking the number of steps it takes.
Run the experiment without an evaluator.
Calculate the minimum number of steps taken to complete the task.
Create an evaluator that compares the steps taken of each run against that min step number.
Run this evaluator on your experiment from step 3.
View your results in Phoenix
As an optional final step, you can combine all the evaluators and experiments above into a single experiment. This requires some more advanced data wrangling, but gives you a single report on your agent's performance.
You've now evaluated every aspect of your agent. If you've made it this far, you're now an expert in evaluating agent routers, tools, and paths!
Sign up for a free instance of to get your API key. If you'd prefer, you can instead .
The next piece of your agent to evaluate is its tools. Each tool is usually evaluated differently - we've included some examples below. If you need other ideas, give you an idea of other metrics to use.