Prompt Optimization

Eric Xiao | Published November 11, 2024

It’s easy to build prototypes using large language models (LLMs), as developers can bypass the need for collecting labeled data or training custom models by simply writing a prompt. However, while this approach provides a strong starting point, it doesn’t guarantee optimal results. You’ll need to do some prompt optimization.

LLMs are generalized models that aren’t fine-tuned for specific tasks. Although a standard prompt can make these applications demo impressively, they often fall short in production environments, where reliability is critical. This is where prompt optimization becomes essential: you should always be fine-tuning your prompts to ensure consistent and accurate performance in real-world use cases.

Level Up: Effective Prompt Optimization

This guide walks you through how to optimize your prompts to ship LLM applications that are reliable, accurate, and fast. We’ve broken down prompt optimization into seven levels, listed below. At each level, we add an additional optimization to an example prompt.

  • Level 1: Craft specific instructions
  • Level 2: Account for uncertainty
  • Level 3: Get LLM feedback
  • Level 4: Use A/B tests and experiment
  • Level 5: Hire LLMs as a judge
  • Level 6: Annotate with human feedback
  • Level 7: Build self-correcting AI

Level 1: Craft Specific Instructions

Throughout this guide, we’ll work with a finance news article about Netflix earnings. We’ll start with the most basic prompt we can think of:

Summarize this finance news article: {article} 

We can use the playground in Arize to test this prompt using GPT-4o or any LLM provider. Let’s look at the results using GPT-4o:

Testing our prompt in the Arize prompt playground.

You can get wildly different results based on the article, the information you give it, the format of the data, and the expected outcome. The foundation of good prompt optimization starts with being clear and specific about what you want the AI to do.
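Outside the playground, the `{article}` placeholder in these prompts is just a template slot. A minimal sketch of filling it in Python (the template text mirrors the basic prompt above; the article string is a made-up stand-in):

```python
# Minimal sketch: substituting the article text into a prompt template
# before sending it to an LLM API of your choice.
PROMPT_TEMPLATE = "Summarize this finance news article: {article}"

def build_prompt(article: str) -> str:
    """Fill the {article} placeholder with the article text."""
    return PROMPT_TEMPLATE.format(article=article)

prompt = build_prompt("Netflix reported Q3 revenue of $9.83 billion...")
```

The same pattern applies to every prompt variant in this guide; only the template string changes.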

Let’s start with the following principles, which you can find in prompt engineering resources like the Prompt Engineering Guide, Learn Prompting, and Lilian Weng’s blog post.

  • Clearly define the role of the model (e.g., teacher, assistant)
  • Assign a clear task (e.g., summarize, explain, analyze)
  • Set constraints and limitations on what you don’t want
  • Specify the response format, which includes fields, tone and length
  • Determine how many examples you need to add

Let’s apply these and write a new prompt:

You are an expert investment analyst. Your task is to summarize this finance news article in under 50 words. Focus on information that materially affects stock prices, such as revenue growth, net margin, free cash flow, new products, and industry trends. Use a formal neutral tone, and present the summary as a bullet-point list. If the article does not pertain to investing, return None.
[BEGIN ARTICLE]
{article}
[END ARTICLE]

Here’s the output:

- Netflix’s Q3 revenue of $9.83 billion exceeded expectations, marking a 15% year-over-year increase, driven by initiatives like the password-sharing crackdown and ad-supported plans.
- Q4 revenue guidance of $10.13 billion surpasses forecasts, with full-year 2025 revenue projected between $43-44 billion, up 11%-13% from 2024.
- Operating margin expected to rise to 27% for the year; Q3 operating margin approached 30%.
- Q3 EPS of $5.40 beat estimates, with Q4 guidance set at $4.23 versus expectations of $3.90.
- Over 5 million subscribers added in Q3, exceeding expectations, supported by popular content and new releases.
- Significant growth in ad tier, comprising over 50% of sign-ups in markets offered; ads membership rose 35% quarter-over-quarter.
- Projected price hikes may further impact stock, alongside current apprehension due to high recent stock performance.

This is much better, but still too long!

Let’s reduce the number of bullet points and words, and ask the LLM to include specific examples.

You are an expert investment analyst. Your task is to summarize this finance news article in under 25 words. Focus on information that materially affects stock prices, such as revenue growth, net margin, free cash flow, new products, and industry trends. Use a formal neutral tone, and present the summary as a bullet-point list. If the article does not pertain to investing, return None. Return at most 3 bullet points. For each bullet point, if available in the article, include specific examples.
[BEGIN ARTICLE]
{article}
[END ARTICLE]

Here’s the new output:

- Netflix Q3 revenue increased 15% to $9.83 billion, beating estimates; projected Q4 revenue of $10.13 billion exceeds expectations.
- Company added 5.07 million subscribers, surpassing expectations, driven by new content and initiatives like live events and ad-supported tiers.
- Full-year operating margin expected to hit 27%; EPS also beat estimates, guiding higher for Q4, with potential for price hikes impacting stock positively.

Level 2: Account for Uncertainty

Large language models are much like people: if you give them time to think during inference, they will give you a more comprehensive answer. When optimizing prompts, you always want to account for uncertainty.

Let’s instruct the model to break down its approach and outline its steps before summarizing the article:

You are an expert investment analyst. Your task is to summarize this finance news article in under 25 words. Focus on information that materially affects stock prices, such as revenue growth, net margin, free cash flow, new products, and industry trends. Use a formal neutral tone, and present the summary as a bullet-point list. If the article does not pertain to investing, return None. Return at most 3 bullet points. For each bullet point, if available in the article, include specific examples. Break down your approach and outline your steps before responding with the summary.
[BEGIN ARTICLE]
{article}
[END ARTICLE]

Here’s the output:


To summarize the finance news article, I will follow these steps: 
1. Identify the key financial metrics and developments that could materially affect Netflix's stock price, such as revenue growth, net margin improvements, subscriber numbers, new product launches, or strategic moves. 
2. Look for specific data points that underline changes or expectations in these areas, such as revenue and earnings per share (EPS) figures, subscriber additions, and future guidance. 
3. Note any significant industry trends or company strategies that might influence investor sentiment or stock performance. 
4. Condense the findings into a concise bullet-point list, limited to 25 words in total, which focuses on the most impactful financial news. 
Here is a summary based on the article: 
- Netflix Q3 revenue and EPS exceeded expectations; Q4 revenue guidance of $10.13 billion beats estimates. 
- Added 5.07 million subscribers, boosted by strong programming, and projects more growth with new content. 
- Ad tier growth and sports foray enhance strategic position; analysts anticipate potential price hikes by year-end.

Not bad!

The bullet points are shorter and easier to read. However, we lost the specific Q3 revenue figure. Another technique is to ask the LLM to reflect on the output and see if it can make corrections.

Here’s an example:


system
{output_message}
user
Did the summary meet the assignment? If so, why? If not, why not?

Here’s the output:


Yes, the summary met the assignment's requirements. Here's why: 
1. **Focus on Material Aspects**: The summary concentrated on key financial metrics and developments that are critical for investors, such as revenue and EPS exceeding expectations, subscriber additions, and future guidance. These aspects are likely to materially affect Netflix's stock price. 
2. **Conciseness**: The summary was presented in under 25 words across three bullet points, ensuring it was concise and to the point, as required by the assignment. 
3. **Inclusion of Specific Examples**: Where applicable, the summary included specific examples such as subscriber growth driven by strong programming and the impact of new content, as well as the anticipated price hikes. 
4. **Formal and Neutral Tone**: The tone remained formal and neutral, aligning well with the expected tone for a financial analysis. 
Hence, the summary effectively distills the article's key financial points in a manner that would be useful for an investor.

You can also break down the problem into multiple LLM calls. Here we asked it to outline its process and then provide the summary in one prompt. An alternative approach would be to use one prompt for extracting the key data points, and a second prompt to take those key data points and create a compelling summary.
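That two-prompt decomposition can be sketched as a short chain. The templates and the `call_llm` helper below are illustrative stand-ins, not a specific SDK; swap in whichever client you actually use:

```python
# Hedged sketch of splitting summarization into two LLM calls:
# first extract key data points, then summarize them.
# `call_llm` is a hypothetical placeholder for a real client call.
EXTRACT_PROMPT = (
    "List the key financial data points in this article "
    "(revenue, margin, EPS, subscribers):\n{article}"
)
SUMMARIZE_PROMPT = (
    "Write a 3-bullet investor summary from these data points:\n{points}"
)

def summarize_in_two_steps(article: str, call_llm) -> str:
    """Run extraction, then feed the extracted points into the summary call."""
    points = call_llm(EXTRACT_PROMPT.format(article=article))
    return call_llm(SUMMARIZE_PROMPT.format(points=points))
```

Keeping the two calls separate also lets you evaluate and debug each step independently.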

Level 3: Get LLM Feedback

The next level in effective prompt optimization has to do with getting LLM feedback. Once you’ve got a prompt that works well, you can ask the LLM to refine the prompt and give you feedback.

Here’s an example:

I am writing a prompt to create a summary for finance news. Is there any part of the prompt that I can improve?
[BEGIN PROMPT]
{prompt}
[END PROMPT] 

And here’s the output:

Your prompt is already well-structured and provides clear instructions, but there is always room for refinement to enhance precision and clarity. Here are a few suggestions for improving the prompt: 
1. **Clarifying Return Instructions**: The instruction "Return None" could be misunderstood. Consider being more explicit about returning "None" or any other specific output. 
2. **Example Requirement**: Specify that examples, such as numerical data or particular innovations, should be used wherever possible, but that summarization should not be overly reliant on minor details. 
3. **Consistency in Tone**: Reiterate the importance of maintaining the tone consistently across all bullet points, possibly by emphasizing the objectivity and factual nature of the information. 
4. **Logical Flow**: Introduce a step-by-step approach explicitly to ensure the methodology is clear. 
5. **Handling Exceptions**: Provide instructions on what to do with partial figures or data points that are detailed in the article. 
Here's a refined version of your prompt: 
As an expert investment analyst, your task is to create a concise summary of a finance news article, encompassing critical data points that could materially influence stock prices. These details may include revenue growth, net margin, free cash flow, new product launches, and emerging industry trends. Provide this in a formal, neutral tone using a bullet-point list. The summary should be:
- Limited to 25 words in total.
- No more than three bullet points.
- Focused on the most pertinent information, with specific examples when available.
- Consistent in tone and objective.
If the article lacks investment-relevant content, clearly state "None" as the output. Ensure every bullet point maintains relevance and clarity. Before summarizing, outline your approach, detailing the steps for extracting and distilling the essential elements from the text.
[BEGIN ARTICLE]
{article}
[END ARTICLE] 

You can also use Arize’s prompt optimization feature to get in-line suggestions. Here’s an example of this feature in action:

In-line prompt optimization suggestions in Arize.

Level 4: Use A/B Tests and Experiment

Now we have a pretty good prompt that gives a great summary for this one article about Netflix earnings. But how do we know if the prompt will actually work in real-world use cases? The next level of prompt optimization is using A/B tests and experimenting. You need data to prove whether it’s working, so we need to build a corpus of articles to test our summarization prompt against.

This is where datasets and experiments are instrumental in leveling up our reliability and accuracy. There are many ways to measure reliability; it’s not a single metric but spans every part of your application. You can look at what the user input contains, what document chunks you’re retrieving to put into the context window, what the response contains, and what the qualitative and quantitative characteristics of the response output are.

More ways to measure safety and relevance when optimizing prompts: what the input contains, how retrieval is evaluated, and whether the response is correct, toxic, hallucinated, impolite, or faithful.

Let’s create a corpus of 10 articles to summarize, spanning different companies and article types. The more data points you have the better, but you want to avoid overfitting by testing only one particular use case across all your data points.

Using Arize Phoenix, you can manage your datasets in the platform and compare different prompts, models, parameters, and evaluation scores using experiments.

You can load a dataset of example financial news articles from HuggingFace using the load_dataset function.

Now that we have a dataset of articles, we can run experiments. Below we have three examples, comparing the different results using prompts that we wrote above.

Optimized prompt, concise template, and initial template.

This is where you start with manual inspection of your summarizations to see what’s gone wrong. You’ll notice that the optimized prompt’s output still includes the “outlining of the approach” and “detailing of the steps.” In our actual application, we’ll need to extract the steps out as metadata and only include the final summary for evaluation.
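One hedged way to do that extraction, assuming (as in the outputs above) that summary lines start with a dash while reasoning lines do not:

```python
# Sketch: separate the model's step-by-step reasoning (kept as metadata)
# from the final bullet-point summary (what we actually evaluate).
# Assumes summary bullets begin with "-" or "–", per the outputs above.
def split_reasoning_and_summary(output: str):
    """Return (reasoning, summary) split out of a combined LLM response."""
    reasoning, bullets = [], []
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith(("-", "–")):
            bullets.append(stripped)
        elif stripped:
            reasoning.append(stripped)
    return "\n".join(reasoning), "\n".join(bullets)
```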

This brings us to our next step…

Level 5: Hire LLMs as a Judge

The more data points you add, the more difficult it becomes to evaluate your LLM application manually. In order to ensure that your application performs well, you need a way to judge the quality of your LLM outputs automatically.

This is where you can use LLMs to judge the qualitative characteristics of your output instead of relying solely on human annotations. Manually inspecting your outputs can go a long way at the beginning, so move to this stage only when you can no longer look at every single trace and need aggregate metrics to determine reliability.

LLM evaluation is extremely flexible, because you can specify the rules and criteria in mostly plain language, similar to how you would ask human evaluators to grade your responses.
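In practice, a judge is just another prompt plus a label parser. The rubric template below is illustrative (not Arize’s built-in eval), and the parser defaults to the failing label when the judge’s response can’t be parsed:

```python
# Illustrative LLM-as-a-judge setup: a plain-language rubric prompt and
# a parser that maps the judge's raw response to a binary label.
JUDGE_TEMPLATE = (
    "You are grading a finance summary for clarity.\n"
    "Summary:\n{summary}\n"
    "Answer with exactly one word: 'clear' or 'unclear'."
)

def parse_label(response: str) -> str:
    """Normalize the judge's response; default to 'unclear' if unparseable."""
    word = response.strip().lower().strip(".\"'")
    return word if word in ("clear", "unclear") else "unclear"
```

Run the filled-in template through your LLM of choice and pass the raw response through `parse_label` to get a metric you can aggregate.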

We have many articles and guides on how to get started with LLM as a judge, including this guide and this research discussion.

Level 6: Annotate with Human Feedback

Large language models were built with humans in the loop providing feedback, and your application will likely benefit from humans in the loop as well. We support annotations in Arize to give you easy ways to provide structured labels and scores for every LLM call.

Below is an example of judging a summary on clarity, with the labels clear and unclear:

As you label hundreds of spans with these annotations, you can use our filters to look for patterns and problems to continue improving the resiliency of your application:

Level 7: Build Self-Correcting AI

So now, you have a great prompt. You’ve accounted for uncertainty. You have real world use cases covered, and you have datasets. You’ve experimented across models and prompts and A/B tested the heck out of it. You have both LLM evaluations and human feedback.

Once you’ve gotten through all of those steps, you’re ready for the last one, which is building self-correction into your application.

There are several ways to do this:

  1. Ask LLMs to update your prompts when an example is evaluated as negative
  2. Improve the accuracy of your retriever when inserting documents into the context window
  3. Insert the right examples at LLM query time by re-labeling problematic inputs with the right outputs
  4. Use libraries like DSPy
  5. Fine-tune your model based on your dataset of examples

These are more advanced techniques; some are linked in articles above if you want to explore further.
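The first technique, asking an LLM to update the prompt when an example is evaluated as negative, can be sketched as a retry loop. Here `model`, `evaluate`, and `rewrite_prompt` are all hypothetical stand-ins: the model call, the LLM-as-a-judge verdict from Level 5, and an LLM call that proposes a revised prompt.

```python
# Sketch of self-correction: when the evaluator flags an output as
# negative, ask an LLM to rewrite the prompt and retry, up to a cap.
# `model`, `evaluate`, and `rewrite_prompt` are hypothetical stand-ins.
def self_correct(prompt: str, article: str, model, evaluate, rewrite_prompt,
                 max_rounds: int = 3) -> str:
    """Return the first output that passes evaluation, or the last attempt."""
    for _ in range(max_rounds):
        output = model(prompt.format(article=article))
        if evaluate(output):                      # e.g. an LLM judge verdict
            return output
        prompt = rewrite_prompt(prompt, output)   # LLM proposes a better prompt
    return output
```

The `max_rounds` cap matters: without it, a prompt the rewriter can’t fix would loop forever and burn tokens.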

Congratulations for making it this far! You are now a prompt optimization guru.