Prompt Optimization via UI

Hand-engineering prompts is brittle: small changes can break behavior, and manual iteration doesn't scale. With Prompt Optimization tasks, you can optimize prompts in a few clicks using human or automated feedback loops and versioned releases. This brings prompt engineering into a reproducible, CI-friendly workflow instead of trial and error. Each auto-generated version of the prompt is committed to the Prompt Hub, so you can A/B test different versions and safely promote the winner.

Key Features

  • Auto-generate the best prompt from your labeled dataset

  • Promote the best auto-generated prompt in the Prompt Hub as the production version

  • Evaluate the auto-generated prompt against the original using a side-by-side comparison on the experiments page

Quick setup

Prompt optimization uses a task builder similar to the one used for Online Evals, so you can get started quickly. Pick the prompt you want to optimize from the Prompt Hub. If you don't have one yet, click Create New Prompt to add it. Then choose a Training Dataset and set a Batch Size (defaults to 10). Finally, select one or more Feedback Columns that contain evaluation signals, which can be labels from human annotators or LLM-as-a-Judge evaluators, plus optional explanations.

From the Evals & Tasks page, click "+ New Task" and select "Prompt Optimization".
Select a prompt from the Prompt Hub, then choose a training dataset that contains LLM inputs, outputs, and feedback labels to guide the optimization process.

Here's what you'll need to configure as you click through the task setup.

Task Name

Name the prompt optimization task.

Prompt To Optimize

Load in a prompt from your Prompt Hub. This is the prompt that will be optimized. Optimized versions will be automatically added to the prompt object through version control. If your prompt uses any columns from your data, make sure to include those column names in curly braces. For example, the system prompt could be:

You are a customer support agent. Answer the user's question: {input}. The user has ordered the following item: {item_name}.
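
As an illustration (not part of the task setup itself), here's a minimal Python sketch of how the curly-brace placeholders map to dataset columns; the column names input and item_name come from the example prompt above.

```python
# Minimal illustrative sketch (plain Python, not a product API): each dataset
# row fills the curly-brace placeholders in the system prompt.
system_prompt_template = (
    "You are a customer support agent. Answer the user's question: {input}. "
    "The user has ordered the following item: {item_name}."
)

row = {
    "input": "The shirt I ordered is too small. Can I exchange it?",
    "item_name": "Blue T Shirt",
}

# The rendered prompt is what the LLM actually sees for this row.
rendered_prompt = system_prompt_template.format(**row)
print(rendered_prompt)
```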

Dataset

The training data that the optimization will use, pulled from your AX Datasets.

Your training data should have:

Input columns - Inputs that are passed into the system prompt. For our example system prompt, you would need the columns input and item_name.

Output column - Output from the LLM application that used the system prompt.

Eval columns - Feedback channels for the prompt optimizer. This is additional row-level information that serves as directional feedback to help the optimizer make good decisions.

In practice, we've seen the following types of evaluation produce the best optimization results:

  • Was the output correct/incorrect? (either through deterministic output testing or LLM-as-a-Judge)

  • Why was the output correct/incorrect? (LLM-as-a-Judge)

  • Given the system prompt {system_prompt}, why do you think the LLM made the correct/incorrect choice?

| input | item_name | output | eval #1 (Correct Classification) | eval #2 (Tone) | eval #3 (Resolution Quality) |
| --- | --- | --- | --- | --- | --- |
| "Where's my package, it’s been weeks." | Brown/Black Reversible Belt | Please wait a few more days; delays happen. | ❌ Misclassified (should escalate to shipping) | ❌ Dismissive | ❌ Unresolved |
| “The shirt I ordered is too small. Can I exchange it?” | Blue T Shirt | Sure, I can help you process an exchange for a larger size. | ✅ Correct | ✅ Polite | ✅ Resolved |
| “charged twice for the same item.” | Black Corduroy Jeans | You can check your billing statement for confirmation. | ❌ Wrong action (should refund or escalate) | ⚠️ Neutral | ❌ Not resolved |
| “website keeps logging me out before checkout” | No Item Ordered | That sounds frustrating! Please clear your cookies and try again. | ✅ Correct | ✅ Empathetic | ⚠️ Partial fix |
| “Customer support promised me refund!" | Brown Leather Jacket | Refunds may take a few weeks to process. | ⚠️ Technically correct | ❌ Cold | ❌ Unresolved |
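
If you assemble this data programmatically before uploading it to AX Datasets, it might look something like the sketch below. The column names mirror the example table and are illustrative, not required names.

```python
import pandas as pd

# Illustrative sketch: a training dataset shaped like the table above.
# Column names ("input", "item_name", "output", plus the eval columns) are
# examples; use whatever names match your own data.
rows = [
    {
        "input": "Where's my package, it's been weeks.",
        "item_name": "Brown/Black Reversible Belt",
        "output": "Please wait a few more days; delays happen.",
        "correct_classification": "incorrect - should escalate to shipping",
        "tone": "dismissive",
        "resolution_quality": "unresolved",
    },
    {
        "input": "The shirt I ordered is too small. Can I exchange it?",
        "item_name": "Blue T Shirt",
        "output": "Sure, I can help you process an exchange for a larger size.",
        "correct_classification": "correct",
        "tone": "polite",
        "resolution_quality": "resolved",
    },
]

pd.DataFrame(rows).to_csv("prompt_optimization_training_data.csv", index=False)
```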

Training Batch Size

You can set a batch size, which determines how many data rows are processed in each optimization round. For example, if you have 50 rows and set the batch size to 25, the optimization will run in 2 batches:

  • Batch 1: Version 2 is generated by optimizing Version 1 on the first 25 rows.

  • Batch 2: Version 3 is then generated by optimizing Version 2 on the next 25 rows.

Each batch builds upon the previous version, producing progressively refined prompts across iterations.
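
As a rough mental model (not the product's internal implementation), batched optimization behaves like the sketch below, where optimize() is a placeholder for one meta-prompting round.

```python
# Conceptual sketch of batched optimization. optimize() is a placeholder for
# one meta-prompting round; it is not a real SDK call.
def optimize(prompt: str, batch: list[dict]) -> str:
    """Placeholder: in the real task, an LLM rewrites the prompt using the batch's feedback."""
    return prompt + f" (refined on {len(batch)} rows)"

def run_batches(baseline_prompt: str, rows: list[dict], batch_size: int = 10) -> list[str]:
    versions = [baseline_prompt]  # Version 1 is the original prompt
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Version N+1 is produced by optimizing Version N on the next batch.
        versions.append(optimize(versions[-1], batch))
    return versions

# 50 rows with batch_size=25 -> two rounds, yielding Versions 2 and 3.
versions = run_batches("You are a customer support agent...", [{}] * 50, batch_size=25)
print(len(versions))  # 3 (original plus two optimized versions)
```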

Output Column

Name of the training dataset column that contains your output.

Feedback Columns

Names of training dataset columns that contain feedback/evals.

Meta Prompt

Prompt Optimization uses Meta Prompting to generate new prompts: the unoptimized system prompt and training data are fed into an LLM, which proposes an improved prompt.

You can configure which model is used for meta prompting, as well as its parameters.

You can also customize the meta prompt itself. Make sure to keep {baseline_prompt}, {examples}, and the final note - otherwise the training data ingestion will fail.
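
For instance, a customized meta prompt might read roughly like the sketch below (illustrative wording only). Keep the {baseline_prompt} and {examples} placeholders, and carry over the final note from the default template, which is not reproduced here.

```
You are an expert prompt engineer. Here is the current system prompt:

{baseline_prompt}

Below are example inputs, the outputs this prompt produced, and feedback on
those outputs:

{examples}

Rewrite the system prompt to address the recurring feedback while preserving
its original intent and any template variables.

[keep the final note from the default meta prompt here]
```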

Track the optimization process

The Task Logs page shows each batch as it runs, including what prompt was used, which examples were evaluated, and what feedback was generated.

After each batch, the optimizer proposes a new prompt candidate, which is automatically saved as a new version in the Prompt Hub.

To review changes, go to the prompt’s page in Prompt Hub and review the list of Versions.

Final prompt after two prompt optimization iterations.

Compare the optimized prompt against the original

From Prompt Hub, you can compare the final optimized prompt to the original by launching an experiment in the Playground.

Select an evaluation dataset, choose the two prompt versions, and review metrics side-by-side, including both high-level summary metrics and example-level outputs.

Coming soon: Prompt Optimization tasks will optionally trigger this experiment automatically.

Deploy with Human-in-the-Loop Oversight

Once you’ve reviewed the results, tag the winning prompt version as Production in Prompt Hub.

The Prompt Hub SDK will automatically pull the latest version with the "production" tag at inference time, keeping a human in the loop so you can ship with confidence.
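
In application code, that flow looks roughly like the sketch below. The import path, client class, and the tag argument are assumptions for illustration; check the Prompt Hub SDK reference for the exact interface.

```python
# Illustrative sketch only: the import path, class, and method/argument names
# are assumptions about the Prompt Hub SDK, not a verified API. The point is
# that the version tagged "production" is resolved at inference time, so
# promoting a new version in the UI changes what ships without a code deploy.
from arize.experimental.prompt_hub import ArizePromptClient  # assumed import path

client = ArizePromptClient(space_id="YOUR_SPACE_ID", api_key="YOUR_API_KEY")

# Hypothetical call: fetch whichever prompt version currently carries the
# "production" tag.
prompt = client.pull_prompt("customer-support-agent", tag="production")

# Fill the template variables from the current request before calling the LLM.
rendered = prompt.text.format(
    input="Where's my package, it's been weeks.",
    item_name="Brown/Black Reversible Belt",
)
```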
