Before We Start
After completing the previous guides, you should have:
- Traces flowing into Phoenix
- At least one evaluation defined and logged
Follow along with code: This guide has a companion codebase with runnable code examples. Find it here.
Step 1: Create a Dataset from Failed Traces
We’ll start by grouping together traces that didn’t perform well. Datasets let us collect a specific set of traces so we can analyze them together and reuse them later for testing. In this guide, we’ll create a dataset from traces that received an incomplete evaluation label. This gives us a concrete set of failures to focus on and makes it easier to test whether future changes actually fix them. You can create datasets in code, but for this walkthrough we’ll use the Phoenix UI. If you’d like to create datasets programmatically, you can follow the Create Datasets guide or adapt the sketch after the steps below.
Create a Dataset in the UI
- Navigate to your project in Phoenix.
- Filter your traces by the incomplete evaluation label:
  evals['completeness'].label == 'incomplete'
- Select the traces you want to include.
- Click Create Dataset and give it a name.
- Add the selected traces to the dataset you just created.
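If you prefer to do this step in code, the sketch below pulls the spans that failed the completeness evaluation and uploads them as a dataset. It is a minimal sketch, assuming the arize-phoenix Python client and the default OpenInference span columns (attributes.input.value / attributes.output.value); the project and dataset names are placeholders, so adjust them to match your setup.

```python
import phoenix as px

client = px.Client()  # assumes Phoenix is reachable at the default endpoint

# Pull only the spans whose completeness eval was labeled "incomplete".
# The filter string is the same expression used in the UI above.
failed_spans = client.get_spans_dataframe(
    "evals['completeness'].label == 'incomplete'",
    project_name="my-project",  # placeholder -- replace with your project name
)

# Keep just the columns we want as dataset inputs/outputs.
# Column names assume the default OpenInference span attributes.
df = failed_spans[["attributes.input.value", "attributes.output.value"]].rename(
    columns={
        "attributes.input.value": "input",
        "attributes.output.value": "output",
    }
)

# Upload the failures as a reusable dataset.
dataset = client.upload_dataset(
    dataset_name="incomplete-reports",  # placeholder dataset name
    dataframe=df,
    input_keys=["input"],
    output_keys=["output"],
)
```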
Step 2: Save a Prompt from a Trace
We’ll start from a real prompt that was actually used by the application. Traces capture the exact prompts sent to the model, along with their context and outputs. Saving a prompt from a trace lets us iterate on something real, rather than starting from a blank page.
Save a Prompt from the Trace View
- Navigate to your project in Phoenix.
- Open the Traces view and click into a trace.
- Find a span that contains a prompt.
- Save the prompt to the Prompt Hub.
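Prompts can also be saved to the Prompt Hub from code. The snippet below is a sketch, assuming the prompts API exposed by the phoenix-client package in recent Phoenix releases (Client().prompts.create and PromptVersion); the prompt name, text, and model are placeholders, so paste in the prompt you copied from the span.

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()

# Placeholder content -- replace with the exact prompt text from the trace.
prompt = client.prompts.create(
    name="report-generator",  # hypothetical prompt name
    prompt_description="Baseline prompt captured from a production trace",
    version=PromptVersion(
        [
            {"role": "system", "content": "You are a financial analyst. Write a report for the given tickers."},
            {"role": "user", "content": "{{input}}"},
        ],
        model_name="gpt-4o",  # assumption: match the model your application used
    ),
)
```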
Step 3: Run the Prompt in Prompt Playground
Next, we’ll bring that saved prompt into the Prompt Playground. The playground lets us run prompts against a dataset of inputs so we can see how a prompt behaves across many examples, not just one.
Run the Prompt Against a Dataset
- Navigate to the Prompt Playground.
- Select the prompt you just saved from the Prompt Hub.
- Choose the dataset you just created.
- Set the User prompt to {{input}} so it uses the dataset inputs.
- Run the prompt across the dataset.
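The Playground is the quickest way to do this, but the same run can be reproduced in code as an experiment over the dataset. This is a sketch, assuming phoenix.experiments.run_experiment and the OpenAI SDK; the system prompt, model, and names are placeholders for whatever you saved in the earlier steps.

```python
import phoenix as px
from phoenix.experiments import run_experiment
from openai import OpenAI

openai_client = OpenAI()
dataset = px.Client().get_dataset(name="incomplete-reports")  # dataset from Step 1

# Placeholder for the prompt text you saved in Step 2.
SYSTEM_PROMPT = "You are a financial analyst. Write a report for the given tickers."

def task(input):
    # `input` is the example's input dict; the dataset above used the key "input",
    # which fills the same slot as {{input}} in the Playground.
    response = openai_client.chat.completions.create(
        model="gpt-4o",  # assumption: use the model your application uses
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": input["input"]},
        ],
    )
    return response.choices[0].message.content

baseline = run_experiment(
    dataset,
    task,
    experiment_name="baseline-prompt",  # placeholder experiment name
)
```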
Step 4: Create and Save a New Prompt Variant
Now that we have a baseline, we can make a change. In this step, we’ll modify the prompt in the playground to address issues we saw in previous runs, such as unclear instructions or missing constraints. To understand why an evaluation arrived at a particular score, click into a trace; the annotations column shows the explanation behind each evaluation. Using these explanations, we can see that runs were often labeled incomplete because the report lacked financial ratios, so we can add that requirement to our prompt.
Add a New Prompt Variant
- Update the prompt directly in the playground. Add this line:
  Make sure to include financial metrics for each ticker and use them in the analysis of the input focus.
- Run it to preview how outputs change.
- Save the new version as a separate prompt in the Prompt Hub.
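The same edit can be made in code by appending the new constraint to the baseline template and saving the result as a separate prompt, so both versions appear in the Prompt Hub. This sketch reuses the assumptions from the Step 2 snippet (phoenix-client prompts API; names and prompt text are placeholders).

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()

# Placeholder baseline text -- use the prompt you saved in Step 2.
BASELINE_SYSTEM_PROMPT = "You are a financial analyst. Write a report for the given tickers."

# Append the constraint that addresses the "incomplete" failures.
VARIANT_SYSTEM_PROMPT = (
    BASELINE_SYSTEM_PROMPT
    + "\nMake sure to include financial metrics for each ticker and use them "
    "in the analysis of the input focus."
)

client.prompts.create(
    name="report-generator-with-metrics",  # hypothetical variant name
    prompt_description="Adds a financial-metrics requirement to the baseline prompt",
    version=PromptVersion(
        [
            {"role": "system", "content": VARIANT_SYSTEM_PROMPT},
            {"role": "user", "content": "{{input}}"},
        ],
        model_name="gpt-4o",
    ),
)
```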
Step 5: Compare Prompts Using Experiments
Once we have multiple prompt versions, we want to compare them in a structured way to see how the results differ between prompts. Since you just ran the Prompt Playground with both of your prompts, you can see them side by side in the experiments view. In this step, we’ll open that view and take a look at the runs we made with the Prompt Playground in this guide.
- Navigate to the Datasets page and click on the dataset we made earlier in this guide.
- You should see three experiment runs: the first from Step 3 and the two most recent from our new prompt comparison run.
- Click on the second one, then at the top of the page, under Comparison, select experiment #3.
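If you produced your runs in code rather than the Playground, the comparison works the same way: run the variant as a second experiment against the same dataset, and the two experiments will appear side by side on the dataset’s experiments page. A short sketch, reusing the placeholders from the Step 3 and Step 4 snippets above.

```python
# Reuses `dataset` and `openai_client` from the Step 3 sketch,
# plus VARIANT_SYSTEM_PROMPT from the Step 4 sketch.
from phoenix.experiments import run_experiment

def variant_task(input):
    # Same task as before, with the variant prompt swapped in.
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": VARIANT_SYSTEM_PROMPT},
            {"role": "user", "content": input["input"]},
        ],
    )
    return response.choices[0].message.content

variant = run_experiment(
    dataset,
    variant_task,
    experiment_name="prompt-with-financial-metrics",  # placeholder experiment name
)
```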

