Compare Prompt Versions

Build New Prompt Versions and Compare

Our earlier experiment revealed the limits of our current prompt and settings. Now we'll iterate systematically, adjusting instructions, model choice, and generation hyperparameters to test how each change impacts accuracy.

Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 3: Compare Prompt Versions.

Build Two New Prompt Versions

In Test Prompts at Scale, our experiment gave us some insight into why our prompt was underperforming, achieving only 53% accuracy. In this section, we'll build new versions of the prompt based on that analysis.

Edit Prompt Template (Version 3)

The prompt template refers to the specific text passed to your LLM. In Test Prompts at Scale, we saw that 30 of the 71 errors came from the broad_vs_specific error type, so we wrote a custom instruction to address it:

When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback).

Let's upload a new prompt version with this instruction added in.
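Here's a minimal sketch of what this could look like with the Phoenix Python client. The base system prompt, the `{{ query }}` template variable, and the prompt description below are placeholders; use the template from the companion notebook, and note that the exact `PromptVersion` signature may differ across client versions.

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

# Placeholder system prompt: your original classification instructions,
# with the new "prefer the most specific category" instruction appended.
system_content = """\
You are a support query classifier. Respond with exactly one category name.

When classifying user queries, always prefer the most specific applicable
category over a broader one. If a query mentions a clear, concrete action or
object (e.g., subscription downgrade, invoice, profile name), classify it
under that specific intent rather than a general one (e.g., Billing Inquiry,
General Feedback).
"""

client = Client()
prompt_v3 = client.prompts.create(
    name="support-classifier",  # same prompt name as in earlier sections
    prompt_description="v3: prefer specific categories over broad ones",
    version=PromptVersion(
        [
            {"role": "system", "content": system_content},
            {"role": "user", "content": "{{ query }}"},  # assumed template variable
        ],
        model_name="gpt-4o-mini",  # model unchanged for this version
    ),
)
```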

Edit Prompt Parameters (Version 4)

In Phoenix, Prompt objects are more than just the prompt template: they include other parameters that can have a huge impact on the success of your prompt. In this section, we'll upload another Prompt Version, this one with adjusted model parameters, so we can test it later.

Here are common prompt parameters (a short request sketch follows the list):

  • Model Choice (GPT-4.1, Claude Sonnet 4.5, Gemini 3, etc.) – Different models vary in reasoning depth, instruction-following ability, speed, and cost; selecting the right one can dramatically affect accuracy, latency, and overall cost.

  • Temperature – Lower values make responses more consistent and deterministic; higher values increase variety and creativity.

  • Top-p / Top-k – Control how many token options the model considers when generating text; useful for balancing precision and diversity.

  • Frequency / Presence Penalties – Help reduce repetition or encourage mentioning new concepts.

  • Tool Descriptions – Clearly defined tools (like web search or dataset retrieval) help the model ground its outputs and choose the right action during generation.
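To make these knobs concrete, here's roughly how they map onto a single chat completion request (shown with the OpenAI Python SDK for illustration; parameter names differ slightly across providers):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",      # model choice
    messages=[
        {"role": "system", "content": "Classify the query into one support category."},
        {"role": "user", "content": "How do I downgrade my subscription?"},
    ],
    temperature=0.3,          # lower = more consistent, less random
    top_p=0.8,                # narrower sampling range
    frequency_penalty=0.0,    # raise above 0 to discourage repetition
    presence_penalty=0.0,     # raise above 0 to encourage new concepts
)
print(response.choices[0].message.content)
```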

Let's edit our parameters; a sketch of uploading the updated version follows the table below.

| Parameter | Current | New | Description |
| --- | --- | --- | --- |
| Model | gpt-4o-mini | gpt-4.1-mini | Slightly higher cost but improved reasoning and classification accuracy; better suited for nuanced intent detection. |
| Temperature | 1.0 | 0.3 | Lowering temperature makes outputs more consistent and less random, ideal for deterministic tasks like classification. |
| Top-p | 1.0 | 0.8 | Reduces the sampling range, encouraging the model to choose higher-probability tokens for more stable predictions. |
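As a sketch, Version 4 can be uploaded the same way as Version 3, just with the upgraded model. How sampling parameters like temperature and top-p attach to a Prompt Version depends on your Phoenix client version (they can also be set in the Prompt Playground), so treat the parameter handling below as an assumption rather than the definitive API:

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

# Placeholder: reuse the same system prompt as Version 3, including the
# prefer-the-most-specific-category instruction.
system_content = """(same system prompt as Version 3)"""

client = Client()
prompt_v4 = client.prompts.create(
    name="support-classifier",
    prompt_description="v4: specific-category instruction + gpt-4.1-mini, temperature 0.3, top_p 0.8",
    version=PromptVersion(
        [
            {"role": "system", "content": system_content},
            {"role": "user", "content": "{{ query }}"},
        ],
        model_name="gpt-4.1-mini",  # upgraded model
    ),
)

# If your client version doesn't persist sampling parameters on the version,
# apply them at invocation time instead, e.g.:
#   kwargs = dict(prompt_v4.format(variables={"query": "..."}))
#   kwargs.update(temperature=0.3, top_p=0.8)
#   OpenAI().chat.completions.create(**kwargs)
```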

Compare Prompt Versions

Now that we've created two new versions of our prompt, we need to test them on our dataset to see whether accuracy improved and which changes drove the biggest gains.

First, head to your support-classifier prompt in the Phoenix UI and copy the corresponding version IDs for Version 3 and Version 4.
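With those IDs copied, a sketch of re-running the experiment against each version might look like the following. The version IDs, dataset name, the `query` and `label` column names, and the exact-match evaluator are all assumptions; adapt them to your dataset (the companion notebook has the exact code):

```python
import phoenix as px
from openai import OpenAI
from phoenix.client import Client
from phoenix.experiments import run_experiment

oai = OpenAI()
dataset = px.Client().get_dataset(name="support-queries")  # assumed dataset name

def make_task(version_id: str):
    # Pull the stored prompt version (template + model) from Phoenix.
    prompt = Client().prompts.get(prompt_version_id=version_id)

    def classify(input):
        # format() fills the template and returns kwargs for the provider SDK.
        kwargs = prompt.format(variables={"query": input["query"]})  # assumed column name
        resp = oai.chat.completions.create(**kwargs)
        return resp.choices[0].message.content.strip()

    return classify

def exact_match(output, expected):
    # Assumed label column name on the dataset examples.
    return output == expected["label"]

for name, version_id in {
    "prompt-v3": "PASTE_VERSION_3_ID",
    "prompt-v4": "PASTE_VERSION_4_ID",
}.items():
    run_experiment(
        dataset,
        make_task(version_id),
        evaluators=[exact_match],
        experiment_name=name,
    )
```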

Once both experiments finish, let's take a look at our results in the Experiments tab of our support query dataset.

Awesome! Our new instruction improved accuracy to 61%, and combining it with updated hyperparameters and an upgraded model (gpt-4.1-mini) pushed accuracy even higher, up to 74%.

Summary

In this section, we translated our analysis into measurable improvement. We built two new prompt versions, ran them through experiments, and quantified the gains:

  • Custom instruction only: Accuracy improved from 53% → 61%

  • Instruction + tuned parameters + upgraded model: Accuracy climbed further to 74%

By refining our prompt and adjusting key model settings, we saw clear, data-backed progress. We now have a stronger prompt, a better-performing model, and a workflow for iterating with confidence inside Phoenix.

Next Steps

We're not done yet. There's still a lot of room for improvement!

In the next section, Optimize Prompts Automatically, we'll use Prompt Learning, an automated prompt optimization algorithm (developed by Arize), to improve our prompt even more.
