Test Prompts at Scale

Measure and Edit Prompts at Scale

To truly improve a prompt, you first need visibility - and visibility comes from data. In Part 1, we identified a misclassification in our traces and edited our prompt to fix it. But validating on a single example isn't enough. A single trace can show you one mistake, but only a dataset can show you the pattern behind many.

Part 2 of this walkthrough focuses on using Phoenix to:

  1. Run our current prompt across a dataset of inputs

  2. Compute metrics to measure prompt performance

  3. Generate natural-language feedback to guide improvements

  4. Edit and retest prompts to build confidence in our changes

  5. Save and manage our prompt versions in Prompt Hub

Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 2: Test Prompts at Scale.

Step 1: Load Dataset of Inputs

Let's upload a dataset of support queries and run our new classification prompt against all of them. This lets us measure performance systematically before deploying to production.

  1. Download support_queries.csv here.

  2. Navigate to Datasets and Experiments, click Create Dataset, and upload the CSV.

  3. Select query for Input keys, as this is our input column.

  4. Select ground_truth for Output keys, as this is our ground truth output.

  5. Click Create Dataset and navigate to your new dataset.
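
If you prefer to do this step in code rather than through the UI, a minimal sketch looks roughly like the following. It assumes Phoenix is already running and reachable from your environment; the dataset name is arbitrary.

```python
import pandas as pd
import phoenix as px

# Load the downloaded CSV (adjust the path to wherever you saved it)
df = pd.read_csv("support_queries.csv")

# Upload to Phoenix, marking "query" as the input column
# and "ground_truth" as the expected output column
dataset = px.Client().upload_dataset(
    dataframe=df,
    dataset_name="support-queries",  # assumption: any name works
    input_keys=["query"],
    output_keys=["ground_truth"],
)
```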

Step 2: Run Experiment with Our Current Prompt

With a dataset in place, the next step is to measure how our prompt performs across many examples. This gives us a clear baseline for accuracy and helps surface the common failure patterns we’ll address next.

Define Task Function

The task function specifies how to generate output for every input in the dataset. For us, we generate output by asking our LLM to classify a support query.
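
A minimal sketch of such a task function is shown below. The prompt text and model name are placeholders (use the classification prompt you built in Part 1); the companion notebook contains the exact code.

```python
from openai import OpenAI

llm = OpenAI()

# Placeholder: paste the classification prompt from Part 1 here
CLASSIFICATION_PROMPT = (
    "You are a support ticket classifier. Classify the user's query into "
    "exactly one category and respond with the category name only."
)

def task(input):
    # Phoenix passes each dataset example's input dict, e.g. {"query": "..."}
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whichever model your prompt targets
        messages=[
            {"role": "system", "content": CLASSIFICATION_PROMPT},
            {"role": "user", "content": input["query"]},
        ],
    )
    # The returned string becomes the experiment output for this example
    return response.choices[0].message.content.strip()
```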

Define Evaluators

Running the model gives us raw predictions, but that alone doesn’t tell us much. Evaluators help turn those predictions into meaningful feedback by scoring performance and explaining why the model was right or wrong. This gives us a clearer picture of how our prompt is actually performing.

In this example, we’ll use two evaluators:

  • ground_truth_evaluator – Verifies whether the model’s predicted classification matches the ground truth.

  • output_evaluator – Uses an LLM to provide a richer, qualitative analysis of each classification, including:

    • explanation – Why the classification is correct or incorrect.

    • confusion_reason – If incorrect, why the model might have made the wrong choice.

    • error_type – If incorrect, what kind of error occurred (broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other).

    • evidence_span – The exact phrase in the query that supports the correct classification.

    • prompt_fix_suggestion – A clear instruction you could add to the classifier prompt to prevent this kind of error in the future.

See the full eval prompt we use for the output_evaluator in the Define Evaluators section of the notebook.

By leveraging the reasoning abilities of LLMs, we can automatically annotate our failure cases with rich diagnostic information—helping us identify weaknesses and iteratively improve our prompt.
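
For reference, here is a rough sketch of how these two evaluators could be wired up. The judge prompt is heavily abbreviated (the full version lives in the notebook), and the EvaluationResult import path and judge model are assumptions that may differ across Phoenix versions.

```python
import json
from openai import OpenAI
from phoenix.experiments.types import EvaluationResult  # assumption: path may vary by version

judge = OpenAI()

def ground_truth_evaluator(output, expected):
    # Score 1.0 when the predicted category exactly matches the labeled ground truth
    return float(output.strip().lower() == expected["ground_truth"].strip().lower())

# Abbreviated stand-in for the full eval prompt in the notebook
ANALYSIS_PROMPT = """You are reviewing a support-query classification.
Query: {query}
Predicted category: {output}
Correct category: {ground_truth}

Respond in JSON with keys: explanation, confusion_reason, error_type,
evidence_span, prompt_fix_suggestion."""

def output_evaluator(input, output, expected):
    response = judge.chat.completions.create(
        model="gpt-4o",  # assumption: any capable judge model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": ANALYSIS_PROMPT.format(
                query=input["query"],
                output=output,
                ground_truth=expected["ground_truth"],
            ),
        }],
    )
    analysis = json.loads(response.choices[0].message.content)
    correct = output.strip().lower() == expected["ground_truth"].strip().lower()
    # Surface the judge's structured feedback so it is visible in the Phoenix UI
    return EvaluationResult(
        score=float(correct),
        label="correct" if correct else analysis.get("error_type", "other"),
        explanation=json.dumps(analysis),
    )
```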

Run Experiment

With the task function and evaluators defined, we can now run the experiment across every example in the dataset. Phoenix executes the task on each input, applies both evaluators to the results, and records everything so we can inspect it in the UI.
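
A single call kicks everything off. Here is a minimal sketch, assuming the names defined above; fetching the dataset by name is an assumption, and you can instead reuse the object returned by upload_dataset.

```python
import phoenix as px
from phoenix.experiments import run_experiment

# Assumption: fetch the dataset uploaded in Step 1 by the name we gave it
dataset = px.Client().get_dataset(name="support-queries")

experiment = run_experiment(
    dataset,
    task,
    evaluators=[ground_truth_evaluator, output_evaluator],
    experiment_name="baseline-classification-prompt",
)
```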

Your stdout should look like this:

Step 3: Analyze Experiment Results

After collecting our outputs and evaluation results, the next step is to interpret them. This analysis helps us see where the prompt performs well, where it fails, and which types of errors occur most often - insights we can use to guide our next round of improvements.

After the experiment finishes running, it will show up in the Phoenix UI on the Datasets and Experiments page, under our support query dataset.

We see that our ground_truth_evaluator gave us a score of 0.53. This means that 53% of our LLM classifications correctly matched the ground truth, leaving lots of room for improvement!

But we don't just have that scalar score - we also have rich, natural-language feedback generated by our LLM evaluator. This helps guide us toward better prompts, grounded in our own data!

You can filter for all rows that had incorrect classifications with the following query:
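
The exact query is shown in the notebook and in the UI's filter bar; as an illustration, a condition along these lines (assuming the evaluator name from above - the syntax may differ slightly across Phoenix versions) selects the runs that scored 0:

```
evals["ground_truth_evaluator"].score == 0
```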

Now, hover over the output_evaluator to see the natural language feedback we generated. Here's one that stood out:

It seems we're hitting that same broad vs specific issue that we corrected for integration help/technical bug report in Part 1. Let's filter for all rows with broad_vs_specific error type.

It looks like we have a lot of broad_vs_specific errors (30 in total) - by far the largest share of our errors. Note that without our LLM evaluator, figuring this out would have been much harder and far more time-consuming!

Let's add a specific instruction to our prompt to address broad_vs_specific errors.
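
As an illustration of what such an instruction might look like (the wording here is an assumption; we'll settle on the exact text in Part 3), something along these lines could be appended to the classification prompt:

```
If a query mentions both a general topic and a specific, concrete problem,
choose the more specific category. For example, a query about an integration
that is currently failing is a technical bug report, not integration help.
```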

Summary

Congratulations! You’ve successfully validated your prompt at scale: running real experiments, collecting quantitative and qualitative feedback, and uncovering exactly where and why your model fails.

You used an LLM evaluator to analyze your application at scale, instead of manually reading every single input/output pair.

Next Steps

In Part 3, we’ll enhance our prompt by adding the new instruction and adjusting key model parameters such as model choice, temperature, and top_p. Then we’ll rerun experiments with the updated prompt and compare the results directly against our previous version. You’ll learn how to use Phoenix to experiment with and evaluate multiple prompt versions side by side, helping you identify which performs best.
