Optimizing Prompts for LLM Classification - Prompt Learning

Using Prompt Learning to boost accuracy on a classification dataset

Skip to Prompt Learning for Classification if you want to see the notebook.

What is Prompt Learning?

Prompt Learning is an algorithm developed by Arize to optimize prompts based on data.

See our detailed blog on Prompt Learning for the full write-up, or read the quick summary of the algorithm below.

The pipeline, which uses Phoenix extensively, works as follows:

1. Upload a dataset of inputs/queries to Phoenix
2. Run a Phoenix experiment on the dataset with your unoptimized, base prompt
3. Build LLM evals with Phoenix or human annotations to return natural language feedback
   - e.g. explanations -> why this output was correct/incorrect (most powerful)
   - e.g. confusion reason -> why the model may have been confused
   - e.g. improvement suggestions -> where the prompt should be improved based on this input/output pair
4. Use meta-prompting to optimize the original prompt
   - feed prompt + inputs + outputs + evals + annotations to another LLM
   - ask it to generate an optimized prompt!
5. Run and evaluate new, optimized prompt with another Phoenix experiment
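
The pipeline above can be sketched as a plain Python loop. This is only an illustration of the control flow, not the Phoenix or Prompt Learning API: `prompt_learning_loop`, `run_model`, `evaluate`, and `optimize` are hypothetical names, and the stub functions stand in for real LLM calls so the sketch runs offline.

```python
from typing import Callable, Dict, List


def prompt_learning_loop(
    prompt: str,
    dataset: List[Dict[str, str]],
    run_model: Callable[[str, str], str],
    evaluate: Callable[[Dict[str, str], str], Dict[str, str]],
    optimize: Callable[[str, List[dict]], str],
    loops: int = 1,
) -> str:
    """Run `loops` rounds of: experiment -> evals -> meta-prompting."""
    for _ in range(loops):
        records = []
        for example in dataset:
            output = run_model(prompt, example["query"])
            feedback = evaluate(example, output)
            records.append({**example, "output": output, "feedback": feedback})
        # The optimizer sees the prompt plus inputs, outputs, and evals.
        prompt = optimize(prompt, records)
    return prompt


# Hypothetical stand-ins for the LLM calls.
def stub_model(prompt: str, query: str) -> str:
    # Pretend the model is only right once the prompt mentions the class.
    return "Shipping Delay" if "Shipping Delay" in prompt else "Order Status"


def stub_evaluate(example: Dict[str, str], output: str) -> Dict[str, str]:
    correct = output == example["label"]
    fix = "" if correct else f"Consider {example['label']} for: {example['query']}"
    return {
        "correctness": "correct" if correct else "incorrect",
        "prompt_fix_suggestion": fix,
    }


def stub_optimize(prompt: str, records: List[dict]) -> str:
    # A real optimizer would ask an LLM to rewrite the prompt;
    # this stub just appends the suggestions from failed examples.
    fixes = [
        r["feedback"]["prompt_fix_suggestion"]
        for r in records
        if r["feedback"]["correctness"] == "incorrect"
    ]
    return "\n".join([prompt, *fixes])


dataset = [{"query": "order say on the truck 2 days now lol", "label": "Shipping Delay"}]
new_prompt = prompt_learning_loop(
    "Classify the query.", dataset, stub_model, stub_evaluate, stub_optimize
)
print(new_prompt)
```

In the real pipeline, each of the three callables would be backed by a Phoenix experiment, an LLM-as-judge evaluator, and a meta-prompt call respectively.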

Prompt Learning for Classification

In this cookbook we use Prompt Learning to improve the accuracy of GPT-4o-mini at classifying support queries.

To view and run the notebook, first clone the Prompt Learning repository.

git clone https://github.com/Arize-ai/prompt-learning.git

Navigate to notebooks -> phoenix_support_query_classification.ipynb.

You can see the notebook here. Keep in mind that, to run it, you will have to clone the repository and run the notebook from within the notebooks folder.

Example Support Queries (our Dataset)

Our dataset contains 154 synthetically generated support queries, each mapped to one of 30 classes (also synthetically generated).

Signed up ages ago but never got around to logging in — now it says no account found. Do I start over?,Account Creation
where’s the night theme u promised, Feature Request
google calendar link keeps ‘erroring’, Integration Help

About one third of the queries/classes are deliberately ambiguous and not straightforward for GPT-4o-mini to classify, because we want to show how accuracy progresses through Prompt Learning. For example:

order say on the truck 2 days now lol, Shipping Delay
"my cc on file died, how fix", Payment Method Update
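
Since some queries contain commas, rows like these need CSV quoting to parse cleanly. The sketch below assumes a two-column `query, label` layout with no header row; the notebook's actual file layout and loading code may differ, and `load_examples` is a hypothetical helper.

```python
import csv
import io

# A few rows in the assumed "query, label" layout.
RAW = '''where’s the night theme u promised, Feature Request
google calendar link keeps ‘erroring’, Integration Help
"my cc on file died, how fix", Payment Method Update
'''


def load_examples(text: str) -> list:
    """Parse 'query, label' rows into dicts, skipping malformed rows."""
    # skipinitialspace drops the space that follows each comma delimiter.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"query": row[0].strip(), "label": row[1].strip()}
        for row in reader
        if len(row) == 2
    ]


examples = load_examples(RAW)
for example in examples:
    print(example["label"], "<-", example["query"])
```

The quoted third row keeps its internal comma as part of the query instead of being split into three fields.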

Base Prompt

Below is the base, unoptimized prompt we start off with.

You are given a support query:
support query: {query}

Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

Classify the query into one of the categories.
Return just the category, no other text.
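
As a rough sketch of how a template like this is used, the `{query}` placeholder is filled in for each support query and the result is sent to the model. The category list is abbreviated here for space, and `classify` and the `llm` callable are hypothetical stand-ins for whatever client you use to call GPT-4o-mini.

```python
from typing import Callable

# Abbreviated for the example; the real prompt lists all 30 categories.
CATEGORIES = ["Order Status", "Shipping Delay", "Payment Method Update"]

BASE_PROMPT = (
    "You are given a support query:\n"
    "support query: {query}\n\n"
    + "\n".join(CATEGORIES)
    + "\n\nClassify the query into one of the categories.\n"
    "Return just the category, no other text."
)


def classify(query: str, llm: Callable[[str], str]) -> str:
    """Fill the template, call the model, and trim stray whitespace."""
    return llm(BASE_PROMPT.format(query=query)).strip()


# Stand-in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return " Shipping Delay\n"


print(classify("order say on the truck 2 days now lol", fake_llm))
```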

Evaluator

In Prompt Learning, your evals and annotations make or break the optimization. Good evals give the meta-prompt LLM enough detail to work out which changes the prompt needs. Weak evals, such as bare correct/incorrect labels, do not guide the meta-prompt LLM toward effective prompt updates.

We build a multi-field evaluator to provide feedback for the prompt optimizer. Specifically, we use an LLM-as-judge to return the following eval types:

correctness: "correct" or "incorrect", based on whether the predicted classification matches the correct classification
explanation: Brief explanation of why the predicted classification is correct or incorrect, referencing the correct label if relevant.
confusion_reason: If incorrect, explain why the model may have made this choice instead of the correct classification. Focus on likely sources of confusion. If correct, say 'no confusion'.
evidence_span: Exact phrase(s) from the query that strongly indicate the correct classification.
prompt_fix_suggestion: One clear instruction to add to the classifier prompt to prevent this error.
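
One way to collect these five fields is to ask the judge for a JSON object and validate it before passing it to the optimizer. The JSON format and the `parse_judge_output` helper are assumptions made for this sketch; the notebook may structure its evaluator output differently.

```python
import json

EVAL_FIELDS = (
    "correctness",
    "explanation",
    "confusion_reason",
    "evidence_span",
    "prompt_fix_suggestion",
)


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check that all eval fields are present."""
    data = json.loads(raw)
    missing = [field for field in EVAL_FIELDS if field not in data]
    if missing:
        raise ValueError(f"judge output is missing fields: {missing}")
    if data["correctness"] not in ("correct", "incorrect"):
        raise ValueError(f"unexpected correctness value: {data['correctness']!r}")
    return {field: str(data[field]) for field in EVAL_FIELDS}


# Illustrative judge reply, written by hand for this example.
sample_reply = json.dumps({
    "correctness": "incorrect",
    "explanation": "Predicted Order Status, but the correct label is Shipping Delay.",
    "confusion_reason": "The query mentions an order without explicitly saying it is late.",
    "evidence_span": "on the truck 2 days now",
    "prompt_fix_suggestion": "If an order is stuck in transit, choose Shipping Delay.",
})

feedback = parse_judge_output(sample_reply)
print(feedback["prompt_fix_suggestion"])
```

Failing loudly on a malformed reply keeps incomplete feedback from silently reaching the meta-prompt step.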

Results - Accuracy

Here we see strong results after just one loop of optimization, and stronger results after 5 loops. This is characteristic of Prompt Learning: it is both data efficient and epoch efficient, allowing you to achieve strong results quickly!
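
Accuracy here is the fraction of queries whose predicted class exactly matches the label. A minimal sketch of that computation follows; the predictions are toy values chosen for illustration, not results from the notebook.

```python
def accuracy(predictions: list, labels: list) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    matches = sum(pred == label for pred, label in zip(predictions, labels))
    return matches / len(labels)


# Toy data for illustration only.
labels = ["Shipping Delay", "Payment Method Update", "Feature Request", "Integration Help"]
base_predictions = ["Order Status", "Billing Inquiry", "Feature Request", "Integration Help"]
optimized_predictions = ["Shipping Delay", "Billing Inquiry", "Feature Request", "Integration Help"]

base_accuracy = accuracy(base_predictions, labels)
optimized_accuracy = accuracy(optimized_predictions, labels)
print(f"base: {base_accuracy:.2f}, optimized: {optimized_accuracy:.2f}")
```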

Results - New Prompt

In our optimized prompt, Prompt Learning added:

descriptions for the classes
general rules (which generalize outside of the provided dataset)
common decision pivots
few shot guidance
a clearer overall structure, which makes for a much better prompt to the human eye

You are a customer-support ticket classifier.

INPUT
support query: {query}

TASK
Read the entire message, identify the user’s single primary intent, and output the one best-matching category name from the list below.
Return ONLY that name—no other words, numbers, or punctuation.

GENERAL RULES
1. Output must be one (and only one) name that appears verbatim in the Category List. Never invent or shorten names.
2. Choose the most specific class that solves the user’s main problem; prefer the child over its parent (pick 2FA over login issues if support query matches 2FA)
3. Use full-message meaning, not isolated keywords. If words conflict with context, trust the context.
4. When several issues are mentioned, pick the one the user wants fixed first (usually the obstacle blocking them now).
5. If intent is unclear after careful reading, pick the most probable class—not “General Feedback”.
6. Slang, typos, emojis, or missing words still map to their standard meaning.

COMMON DECISION PIVOTS
• RETURN / EXCHANGE
  – Item already received and user asks about sending it back, labels, packaging, or status of a sent-back item → Product Return.
  – User hasn’t returned anything yet but wants money back → Refund Request.

• CHARGES & MONEY
  – Duplicate, wrong, or unclear charges → Billing Inquiry.
  – Wants an invoice copy, correction, or number → Invoice Request (even if charge seems wrong).
  – Needs to add/remove/change credit-card or split payment → Payment Method Update.

• SUBSCRIPTIONS & PLANS
  – Any upgrade, downgrade, or unexpected change of plan/tier (even if money is also mentioned) → Subscription Upgrade/Downgrade.

• AUTHENTICATION
  – Missing, invalid, or stuck codes/OTP/2FA apps/texts → Two-Factor Authentication.
  – Reset links/emails or forgotten passwords → Password Reset.
  – Repeated or strange login prompts, generic inability to sign in (no code or reset focus) → Login Issues.

• ACCESS & PERMISSIONS
  – Greyed-out button or “not enough rights” → Permission/Access Issue.
  – Feature exists for others but not this user → Feature Access Issue.

• APP-SPECIFIC BUGS
  – Mobile-only malfunction → Mobile App Issue.
  – Desktop-only malfunction → Desktop App Issue.
  – Anything else broken or erroring → Technical Bug Report (unless it fits a more specific rule above).

• DOCUMENTS & DATA
  – Downloading/exporting user data/history → Data Export.
  – Questions on data retention/deletion/sharing → Privacy Policy Question.
  – Compliance with external regulations (GDPR, HIPAA, SOC 2, etc.) → Compliance Inquiry.

• MISC
  – Opinion with no request → General Feedback.
  – UI or capability improvement request → Feature Request.
  – Accessibility accommodation (font size, screen reader, colour contrast) → Accessibility Support.
  – Unauthorised activity / hacking fears → Security Concern.

FEW-SHOT GUIDANCE
(Queries are shortened for space; follow the mapping pattern.)

1. “Returned my headphones last month, still no refund.” → Product Return
2. “Got double charged again??” → Billing Inquiry
3. “Change card option just spins forever.” → Payment Method Update
4. “Little pop-up with the numbers never comes.” → Two-Factor Authentication
5. “Charged even after switching to free plan.” → Subscription Upgrade/Downgrade
6. “Delete my info if I leave?” → Privacy Policy Question
7. “Beta sign-up—nothing looks different yet.” → Beta Program Enrollment
8. “App freezes when uploading PNGs.” → Technical Bug Report

CATEGORY LIST
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Feature Access Issue
Permission/Access Issue
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

OUTPUT
One exact category name from the list above.

Keep in mind that the optimization is not deterministic. An LLM generates the optimized prompt at every epoch, so your new prompts (and their accuracies) will not match ours exactly, but based on our testing you should consistently see improvements.
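
Because every run can produce a different prompt, it may be worth sanity-checking the generated category list against your dataset's labels. For instance, the optimized prompt above lists Feature Access Issue and Permission/Access Issue, which are not among the 30 categories in the base prompt. The sketch below assumes the prompt uses `CATEGORY LIST` and `OUTPUT` header lines as in the example above; other runs may use different headers, and both helper functions are hypothetical.

```python
def category_list(prompt: str) -> list:
    """Return the non-empty lines between the CATEGORY LIST and OUTPUT headers."""
    lines = [line.strip() for line in prompt.splitlines()]
    start = lines.index("CATEGORY LIST") + 1
    end = lines.index("OUTPUT", start)
    return [line for line in lines[start:end] if line]


def unexpected_categories(prompt: str, dataset_labels) -> list:
    """Categories the prompt offers that never appear as dataset labels."""
    return sorted(set(category_list(prompt)) - set(dataset_labels))


# Abbreviated tail of an optimized prompt, for illustration.
PROMPT_TAIL = """CATEGORY LIST
Feature Request
Feature Access Issue
Permission/Access Issue
Integration Help

OUTPUT
One exact category name from the list above."""

dataset_labels = {"Feature Request", "Integration Help"}
extras = unexpected_categories(PROMPT_TAIL, dataset_labels)
print(extras)
```

If the check reports extra names, you can decide whether to remove them from the prompt or accept that predictions using them will be scored as incorrect.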
