A prompt is the set of instructions and context sent to the model to produce an output. Up to this point, we've traced our agent runs and evaluated their outputs. In this guide, we'll focus on prompts: you'll start from real prompts captured during execution, group failing runs into a dataset, and use the Prompt Playground to iterate on prompt variants while measuring how those changes affect application quality. The Prompt Hub is used to save and reuse prompts across runs.

Before We Start

After completing previous guides, you should have:
  • Traces flowing into Phoenix
  • At least one evaluation defined and logged

Follow along with code: This guide has a companion codebase with runnable code examples. Find it here.

Step 1: Create a Dataset from Failed Traces

We’ll start by grouping together traces that didn’t perform well. Datasets let us collect a specific set of traces so we can analyze them together and reuse them later for testing. In this guide, we’ll create a dataset from traces that received an incomplete evaluation label. This gives us a concrete set of failures to focus on and makes it easier to test whether future changes actually fix them. You can create datasets in code, but for this walkthrough we’ll use the Phoenix UI. If you’d like to create datasets programmatically, you can follow the Create Datasets guide.

Create a Dataset in the UI

  1. Navigate to your project in Phoenix.
  2. Filter your traces by the incomplete evaluation label: evals['completeness'].label == 'incomplete'
  3. Select the traces you want to include.
  4. Click Create Dataset and give it a name.
  5. Add the selected traces to the dataset you just created.
This dataset now represents a concrete failure case for your application.
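
If you'd rather build the dataset programmatically, the sketch below shows one way to do it with the Phoenix Python client. It assumes the completeness evaluation from the previous guide, an illustrative project and dataset name, and input/output column names that may differ from yours; inspect the spans DataFrame for the exact columns.

```python
import phoenix as px

# Connect to the running Phoenix instance (uses the default endpoint / env vars).
client = px.Client()

# Pull spans whose completeness eval was labeled "incomplete".
# The filter mirrors the UI filter above; adjust the eval name if yours differs.
failed_spans = client.get_spans_dataframe(
    "evals['completeness'].label == 'incomplete'",
    project_name="my-agent-project",  # hypothetical project name
)

# Upload the failing traces as a dataset so they can be reused in experiments.
# Column names are assumptions; check failed_spans.columns for your data.
client.upload_dataset(
    dataset_name="incomplete-report-failures",
    dataframe=failed_spans,
    input_keys=["attributes.input.value"],
    output_keys=["attributes.output.value"],
)
```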

Step 2: Save a Prompt from a Trace

We’ll start from a real prompt that was actually used by the application. Traces capture the exact prompts sent to the model, along with their context and outputs. Saving a prompt from a trace lets us iterate on something real, rather than starting from a blank page.

Save a Prompt from the Trace View

  1. Navigate to your project in Phoenix.
  2. Open the Traces view and click into a trace.
  3. Find a span that contains a prompt.
  4. Save the prompt to the Prompt Hub.
This gives us a concrete starting point for prompt iteration.
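
You can also save a prompt to the Prompt Hub in code. The sketch below assumes the phoenix-client package and an illustrative prompt name and model; copy the actual prompt text and model from the span you found above.

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()  # assumes Phoenix is reachable at the default endpoint

# The system message below is a placeholder; paste the real prompt text from the span.
client.prompts.create(
    name="report-generator",  # hypothetical prompt name
    prompt_description="Prompt captured from a production trace",
    version=PromptVersion(
        [
            {"role": "system", "content": "You are a financial report writer. ..."},
            {"role": "user", "content": "{{input}}"},  # template variable filled at run time
        ],
        model_name="gpt-4o",  # match the model recorded on the span
    ),
)
```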

Step 3: Run the Prompt in Prompt Playground

Next, we’ll bring that saved prompt into the Prompt Playground. The playground lets us run prompts against a dataset of inputs so we can see how a prompt behaves across many examples, not just one.

Run the Prompt Against a Dataset

  1. Navigate to the Prompt Playground.
  2. Select the prompt you just saved from the Prompt Hub.
  3. Choose the dataset you just created.
  4. Set the User prompt to {{input}} so it pulls inputs from the dataset.
  5. Run the prompt across the dataset.
This gives us a baseline for how the current prompt performs.
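
The same baseline can be produced in code by running the saved prompt over the dataset as an experiment. This sketch assumes the dataset and prompt names from the earlier sketches, an OpenAI model, and the dataset column used in Step 1; adjust all of these to your setup.

```python
import phoenix as px
from openai import OpenAI
from phoenix.client import Client
from phoenix.experiments import run_experiment

openai_client = OpenAI()

# Names are assumptions carried over from the earlier sketches in this guide.
dataset = px.Client().get_dataset(name="incomplete-report-failures")
prompt = Client().prompts.get(prompt_identifier="report-generator")

def task(example):
    # Fill the {{input}} variable with the dataset row's input, then call the model.
    variables = {"input": example.input["attributes.input.value"]}
    response = openai_client.chat.completions.create(**prompt.format(variables=variables))
    return response.choices[0].message.content

# Each run is recorded as an experiment on the dataset, giving us a baseline to compare against.
run_experiment(dataset, task, experiment_name="baseline-prompt")
```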

Step 4: Create and Save a New Prompt Variant

Now that we have a baseline, we can make a change. In this step, we'll modify the prompt in the playground to address issues we saw in previous runs, such as unclear instructions or missing constraints. To understand why an evaluation arrived at a particular score, click into a trace and open the annotations column to read the evaluator's explanation. These explanations show that runs were often labeled incomplete because the report lacked financial ratios, so we can add that requirement to our prompt.

Add a New Prompt Variant

  1. Update the prompt directly in the playground. Add this line:
    Make sure to include financial metrics for each ticker and use them in the analysis of the input focus.
  2. Run it to preview how outputs change.
  3. Save the new version as a separate prompt in the Prompt Hub.
Saving prompt variants makes it easy to track changes and compare different approaches over time.
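
If you prefer to save the variant in code, the sketch below extends the hypothetical prompt from Step 2 by adding the new instruction and storing it under a separate name, mirroring the UI step.

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()

revised_system_prompt = (
    "You are a financial report writer. ...\n"  # placeholder for the original instructions
    "Make sure to include financial metrics for each ticker and use them "
    "in the analysis of the input focus."
)

# Saving under a new name keeps the variant separate, matching the UI step above;
# reusing the original name would instead add a new version to that prompt.
client.prompts.create(
    name="report-generator-with-metrics",  # hypothetical variant name
    prompt_description="Requires financial metrics for each ticker",
    version=PromptVersion(
        [
            {"role": "system", "content": revised_system_prompt},
            {"role": "user", "content": "{{input}}"},
        ],
        model_name="gpt-4o",
    ),
)
```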

Step 5: Compare Prompts Using Experiments

Once we have multiple prompt versions, we want to compare them in a structured way to see how their results differ. Because you ran the Prompt Playground with both prompts, the runs appear side by side in the experiment view. In this step, we'll navigate to the experiments page and review those runs.
  1. Navigate to the Datasets page and click on the dataset we made earlier in this guide.
  2. You should see three experiment runs: the first from Step 3 and the two most recent from the prompt comparison run in Step 4.
  3. Click into the second run, then at the top of the page select experiment #3 under the comparison option.
Now you can see the two prompts we just ran in the Prompt Playground side by side.
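
To reproduce this comparison in code, you could rerun the Step 3 sketch with the variant prompt and attach an evaluator, so both experiments land on the same dataset with scores to compare. The simple keyword check below is only a stand-in for the completeness evaluation from the previous guide.

```python
import phoenix as px
from openai import OpenAI
from phoenix.client import Client
from phoenix.experiments import run_experiment

openai_client = OpenAI()
dataset = px.Client().get_dataset(name="incomplete-report-failures")
variant = Client().prompts.get(prompt_identifier="report-generator-with-metrics")

def variant_task(example):
    variables = {"input": example.input["attributes.input.value"]}
    response = openai_client.chat.completions.create(**variant.format(variables=variables))
    return response.choices[0].message.content

def includes_metrics(output: str) -> bool:
    # Stand-in check for completeness; swap in your real LLM-as-a-judge evaluator.
    return "ratio" in output.lower()

# This run and the baseline attach to the same dataset, so they appear together
# on the dataset's Experiments page for side-by-side comparison.
run_experiment(
    dataset,
    variant_task,
    evaluators=[includes_metrics],
    experiment_name="prompt-with-metrics",
)
```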
Congratulations! You've iterated on prompts and run an A/B test to see the effects of your changes!

Learn More About Prompts

Now that you've iterated on a prompt, you can start incorporating prompt iteration directly into your development workflow. To learn how to test multiple changes to your system at once and compare the results, see the Experiments guide, which takes this iteration to the next level. You can use the Prompt Playground to test prompt changes across different datasets, compare prompt variants, and see how small changes affect outputs at scale. Saving prompts to the Prompt Hub helps you track versions and reuse prompts across experiments. The Prompt Playground and Prompt Hub guides go deeper into these workflows and show how to apply them as your application evolves.