Compare Prompt Versions
Build New Prompt Versions and Compare
Our earlier experiment revealed the limits of our current prompt and settings. Now we'll iterate systematically, adjusting instructions, model choice, and generation hyperparameters to test how each change impacts accuracy.
Build Two New Prompt Versions
In Test Prompts at Scale, our experiment gave us insight into why our prompt was underperforming, achieving only 53% accuracy. In this section, we'll build new versions of the prompt based on that analysis.
Edit Prompt Template (Version 3)
The prompt template refers to the specific text passed to your LLM. In Test Prompts at Scale, we saw that 30/71 errors came from the broad_vs_specific error type, so we'll write a custom instruction based on that observation:
When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback).

Let's upload a new prompt version with this instruction added in.
from phoenix.client import Client
from phoenix.client.types.prompts import PromptVersion

px_client = Client()

# 1. New instruction targeting the broad_vs_specific error type
broad_vs_specific_instruction = """When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback)."""

# 2. Get the existing prompt
existing = px_client.prompts.get(prompt_identifier="support-classifier")

# 3. Modify the template
messages = existing._template["messages"]

# Append the new instruction to the system prompt, separated by a blank line
messages[0]["content"][0]["text"] += "\n\n" + broad_vs_specific_instruction

# 4. Create a new version with the modifications
new_version = PromptVersion(
    messages,
    model_name=existing._model_name,
    model_provider=existing._model_provider,
    template_format=existing._template_format,
    description="Added broad_vs_specific rule",
)

# 5. Save as a new version
created = px_client.prompts.create(
    name="support-classifier",  # Same name = new version on existing prompt
    version=new_version,
)

Edit Prompt Parameters (Version 4)
In Phoenix, a Prompt object is more than just the prompt template: it also includes model parameters that can have a huge impact on your prompt's success. In this section, we'll upload another prompt version, this time with adjusted model parameters, so we can test it later.
Here are common prompt parameters (a short sketch of how they map onto a model call follows this list):

- Model Choice (GPT-4.1, Claude Sonnet 4.5, Gemini 3, etc.) – Different models vary in reasoning depth, instruction-following ability, speed, and cost; selecting the right one can dramatically affect accuracy, latency, and overall cost.
- Temperature – Lower values make responses more consistent and deterministic; higher values increase variety and creativity.
- Top-p / Top-k – Control how many token options the model considers when generating text; useful for balancing precision and diversity.
- Frequency / Presence Penalties – Help reduce repetition or encourage mentioning new concepts.
- Tool Descriptions – Clearly defined tools (like web search or dataset retrieval) help the model ground its outputs and choose the right action during generation.
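To make these knobs concrete, here is a minimal sketch of how the sampling parameters above appear in a raw OpenAI chat-completions call. The model name and values are illustrative only; they are not the settings we choose below.

```python
from openai import OpenAI

oai_client = OpenAI()

# Illustrative values only -- each keyword below corresponds to a parameter above
response = oai_client.chat.completions.create(
    model="gpt-4o-mini",     # model choice
    temperature=0.3,         # lower = more consistent, deterministic outputs
    top_p=0.8,               # restrict sampling to the top 80% of probability mass
    frequency_penalty=0.0,   # values > 0 discourage repeated tokens
    presence_penalty=0.0,    # values > 0 encourage introducing new concepts
    messages=[
        {"role": "system", "content": "You are a support-ticket classifier."},
        {"role": "user", "content": "How do I downgrade my subscription?"},
    ],
)
print(response.choices[0].message.content)
```

Phoenix stores these invocation parameters alongside the prompt template in each prompt version, which is what lets us compare them in experiments later.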
Let's edit our parameters.
| Parameter | Current | New | Description |
| --- | --- | --- | --- |
| Model | gpt-4o-mini | gpt-4.1-mini | Slightly higher cost but improved reasoning and classification accuracy; better suited for nuanced intent detection. |
| Temperature | 1.0 | 0.3 | Lowering temperature makes outputs more consistent and less random—ideal for deterministic tasks like classification. |
| Top-p | 1.0 | 0.8 | Reduces the sampling range, encouraging the model to choose higher-probability tokens for more stable predictions. |
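One way to save Version 4 is to build it from an OpenAI-format request that carries both the updated template and the new invocation parameters. The sketch below is a rough illustration, not the only approach: it reuses px_client and the instruction-augmented messages from the Version 3 snippet, and it assumes your phoenix-client release exposes PromptVersion.from_openai and that messages is already in OpenAI chat format. Adjust to match your client's prompt API if it differs.

```python
# Rough sketch (assumptions noted above): create Version 4 with the upgraded model
# and tuned sampling parameters, reusing `px_client` and `messages` from the
# Version 3 snippet.
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from phoenix.client.types.prompts import PromptVersion

params = CompletionCreateParamsBase(
    model="gpt-4.1-mini",  # upgraded model
    temperature=0.3,       # more deterministic outputs for classification
    top_p=0.8,             # narrower sampling range
    messages=messages,     # template that already includes the broad_vs_specific rule
)

# PromptVersion.from_openai is assumed here -- it converts an OpenAI-style request,
# including its invocation parameters, into a Phoenix prompt version.
version_4 = PromptVersion.from_openai(params)

px_client.prompts.create(
    name="support-classifier",  # same name = another version on the existing prompt
    version=version_4,
)
```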
Compare Prompt Versions
Now that we've created two new versions of our prompt, we need to test them on our dataset to see whether accuracy improved and which changes led to the biggest gains.
First, head to your support-classifier prompt in the Phoenix UI and copy the corresponding version IDs for Version 3 and Version 4.
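With those IDs in hand, you can re-run the same experiment setup from Test Prompts at Scale once per version. The sketch below is one way to wire it up; the dataset name ("support-queries"), the "query" template variable, the "intent" expected-label column, and the placeholder version IDs are all assumptions you should swap for your own values.

```python
# Sketch of running one experiment per prompt version. Reuses `px_client` from the
# earlier snippet; dataset/column names and version IDs below are placeholders.
import phoenix as px
from openai import OpenAI
from phoenix.experiments import run_experiment

oai_client = OpenAI()
dataset = px.Client().get_dataset(name="support-queries")  # assumed dataset name

def make_task(version_id: str):
    """Build a task that classifies each example with a specific prompt version."""
    prompt = px_client.prompts.get(prompt_version_id=version_id)

    def task(example):
        # Fill the template with the example's query and call the model it specifies
        formatted = prompt.format(variables={"query": example.input["query"]})
        response = oai_client.chat.completions.create(**formatted)
        return response.choices[0].message.content.strip()

    return task

def exact_match(output, expected) -> bool:
    # Assumes the expected label is stored in an "intent" column; adjust to your schema
    return output == expected["intent"]

version_ids = {
    "version-3": "YOUR_V3_VERSION_ID",  # paste the IDs copied from the Phoenix UI
    "version-4": "YOUR_V4_VERSION_ID",
}

for label, version_id in version_ids.items():
    run_experiment(
        dataset,
        make_task(version_id),
        evaluators=[exact_match],
        experiment_name=f"support-classifier-{label}",
    )
```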
Let's take a look at our results in the Experiments tab of our support query dataset.
Awesome! Our new instruction improved accuracy to 61%, and combining it with updated hyperparameters and an upgraded model (gpt-4.1-mini) pushed accuracy even higher, up to 74%.
Summary
In this section, we translated our analysis into measurable improvement. We built two new prompt versions, ran them through experiments, and quantified the gains:
- Custom instruction only: Accuracy improved from 53% → 61%
- Instruction + tuned parameters + upgraded model: Accuracy climbed further to 74%
By refining our prompt and adjusting key model settings, we saw clear, data-backed progress. We now have a stronger prompt, a better-performing model, and a workflow for iterating with confidence inside Phoenix.
Next Steps
We're not done yet. There's still a lot of room for improvement!
In the next section, Optimize Prompts Automatically, we'll use Prompt Learning, an automated prompt optimization algorithm (developed by Arize), to improve our prompt even more.