Prompt Optimization via UI
Hand-engineering prompts is brittle: small changes can break behavior, and manual iteration doesn't scale. With Prompt Optimization tasks, you can optimize prompts in a few clicks using human or automated feedback loops and versioned releases. This brings prompt engineering into a reproducible, CI-friendly workflow instead of trial and error. Each auto-generated version of the prompt is committed to the Prompt Hub, so you can A/B test different versions and safely promote the winner.
Key Features
Auto-generate the best prompt from your labeled dataset
Promote the best auto-generated prompt in Prompt Hub as the production version
Evaluate the auto-generated prompt against the original using a side-by-side comparison on the experiments page
Quick setup
Prompt Optimization uses a task builder similar to the one used for Online Evals, so you can get started quickly. Pick the prompt you want to optimize from Prompt Hub. If you don't have one yet, click Create New Prompt to add it. Then choose a Training Dataset and set a Batch Size (defaults to 10). Finally, select one or more Feedback Columns that contain evaluation signals, which can be labels from human annotators or LLM-as-a-Judge evaluators, plus optional explanations.


Here's what you'll need to set up as you click through to trigger the task.
Task Name
Name the prompt optimization task.
Prompt To Optimize
Load in a prompt from your Prompt Hub. This is the prompt that will be optimized. Optimized versions will be automatically added to the prompt object through version control. If your prompt uses any columns from your data, make sure to include those column names in curly braces. For example, the system prompt could be:
You are a customer support agent. Answer the user's question: {input}. The user has ordered the following item: {item_name}.
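The substitution itself happens inside the task, but as a plain-Python illustration of how dataset columns fill those curly-brace variables:

# Illustration only: the platform fills the curly-brace variables from your
# dataset columns at run time, equivalent to a simple string format.
template = (
    "You are a customer support agent. Answer the user's question: {input}. "
    "The user has ordered the following item: {item_name}."
)
row = {
    "input": "Where's my package, it's been weeks.",
    "item_name": "Brown/Black Reversible Belt",
}
system_prompt = template.format(**row)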
Dataset
Your training data that the optimization will use, pulled from your AX Datasets.
Your training data should have:
Input columns - Inputs that are passed into the system prompt. For our example system prompt, you would need the columns input and item_name.
Output column - The output from the LLM application that used the system prompt.
Eval columns - Feedback channels for the prompt optimizer. This is additional, row-level information that serves as directional feedback to help the optimizer make good decisions.
In practice, we've seen the following types of evaluation produce the best optimization results:
Was the output correct/incorrect? (either through deterministic output testing or LLM-as-Judge)
Why was the output correct/incorrect? (LLM-as-Judge), for example:
Given the system prompt {system_prompt}, why do you think the LLM made the correct/incorrect choice?
"Where's my package, it’s been weeks."
Brown/Black Reversible Belt
Please wait a few more days; delays happen.
❌ Misclassified (should escalate to shipping)
❌ Dismissive
❌ Unresolved
“The shirt I ordered is too small. Can I exchange it?”
Blue T Shirt
Sure, I can help you process an exchange for a larger size.
✅ Correct
✅ Polite
✅ Resolved
“charged twice for the same item.”
Black Corduroy Jeans
You can check your billing statement for confirmation.
❌ Wrong action (should refund or escalate)
⚠️ Neutral
❌ Not resolved
“website keeps logging me out before checkout”
No Item Ordered
That sounds frustrating! Please clear your cookies and try again.
✅ Correct
✅ Empathetic
⚠️ Partial fix
“Customer support promised me refund!"
Brown Leather Jacket
Refunds may take a few weeks to process.
⚠️ Technically correct
❌ Cold
❌ Unresolved
Training Batch Size
You can set a batch size, which determines how many data rows are processed in each optimization round. For example, if you have 50 rows and set the batch size to 25, the optimization will run in 2 batches:
Batch 1: Version 2 is generated by optimizing Version 1 on the first 25 rows.
Batch 2: Version 3 is then generated by optimizing Version 2 on the next 25 rows.
Each batch builds upon the previous version, producing progressively refined prompts across iterations.
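To make the batching arithmetic concrete, here is a small plain-Python sketch of how rows map to batches and prompt versions. It is illustrative only; the optimization itself runs inside the task.

import math

rows = list(range(50))   # stand-in for 50 training rows
batch_size = 25          # the Training Batch Size setting

num_batches = math.ceil(len(rows) / batch_size)
prompt_version = 1       # Version 1 is the prompt loaded from Prompt Hub

for i in range(num_batches):
    batch = rows[i * batch_size : (i + 1) * batch_size]
    # Each round feeds the current prompt version plus this batch's outputs
    # and feedback to the optimizer, which proposes the next version.
    prompt_version += 1
    print(f"Batch {i + 1}: optimized Version {prompt_version - 1} -> Version {prompt_version} on {len(batch)} rows")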
Output Column
Name of the training dataset column that contains your output.
Feedback Columns
Names of training dataset columns that contain feedback/evals.
Meta Prompt
Prompt Optimization uses Meta Prompting to generate new prompts, feeding the unoptimized system prompt and training data into an LLM.
You can configure which model is used for meta prompting, as well as its parameters.
You can also customize the meta prompt itself. Make sure to keep {baseline_prompt}, {examples}, and the final note; otherwise, training data ingestion will fail.
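If you do customize it, a trimmed-down custom meta prompt might look roughly like the sketch below. The wording here is illustrative, not the actual default; the essential parts are the {baseline_prompt} and {examples} placeholders and a closing instruction (shown only as a stand-in for the actual final note).

You are an expert prompt engineer. Improve the prompt below so that future outputs score better on the feedback shown in the examples.

Current prompt:
{baseline_prompt}

Examples of inputs, outputs, and feedback:
{examples}

Return only the text of the improved prompt, with all template variables preserved.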
Track optimization process
The Task Logs page shows each batch as it runs, including what prompt was used, which examples were evaluated, and what feedback was generated.
After each batch, the optimizer proposes a new prompt candidate, which is automatically saved as a new version in the Prompt Hub.
To review changes, go to the prompt’s page in Prompt Hub and review the list of Versions.

Compare the optimized prompt against the original
From Prompt Hub, you can compare the final optimized prompt to the original by launching an experiment in the Playground.
Select an evaluation dataset, choose the two prompt versions, and review metrics side-by-side, including both high-level summary metrics and example-level outputs.
Coming soon: Prompt Optimization tasks will optionally trigger this experiment automatically.
Deploy with Human-in-the-Loop Oversight
Once you’ve reviewed the results, tag the winning prompt version as Production in Prompt Hub.
The Prompt Hub SDK will automatically pull the latest version with the "production" tag at inference time, keeping a human in the loop so you can ship with confidence.
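At inference time, that lookup might look roughly like the sketch below. The module, client, and method names are placeholders rather than the SDK's actual API; check the Prompt Hub SDK documentation for the real import path and signatures.

# Hypothetical names for illustration only; see the Prompt Hub SDK docs for
# the actual client, import path, and method signatures.
from prompt_hub_sdk import PromptHubClient  # placeholder module and class

client = PromptHubClient(api_key="YOUR_API_KEY")

# Pull whichever version currently carries the "production" tag, so promoting
# a new version in Prompt Hub is all it takes to roll it out.
prompt = client.pull_prompt(name="customer-support-agent", tag="production")

system_prompt = prompt.format(
    input="Where's my package, it's been weeks.",
    item_name="Brown/Black Reversible Belt",
)
# ...pass system_prompt to your LLM client.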