A Developer’s Guide To LLMOps (Large Language Model Operations): Operationalizing LLMs
Three Keys of Effective Large Language Model Operations
This blog is co-authored by Aparna Dhinakaran
Operationalizing machine learning models in production and monitoring them effectively has been an emerging topic in the past few years. Whenever a model starts to fail silently in a production environment, it is critical to have the right setup to understand the issue and troubleshoot the model in a timely manner. The use of GPT-4 as a replacement for various traditional model tasks is growing daily, and what many teams consider a model today may be just a prompt-and-response pair in the future. As teams deploy large language models to production, the same challenges around performance and task measurement still exist. Hence, LLMOps is essential to scale large language models and deploy them to production effectively.
What Is LLMOps?
Large language model operations (LLMOps) is a discipline that combines several techniques – such as prompt engineering, deploying LLM agents, and LLM observability – to optimize language models for specific contexts and make sure they provide the expected output to users.
In this article, we cover each of the techniques in detail and talk about the best practices to maintain LLMs in production.
Prompt Engineering
The concept of prompts and responses started to gain popularity after the introduction of large language models. A prompt is simply the specific task a user provides to a language model, and the response is the output of the language model that accomplishes the task. For example, a user might provide a medical report from a recent diagnosis and ask ChatGPT to summarize the document. In this case, the medical document and the action to summarize would define the prompt and the summary itself would define the response.
Prompt engineering can be simply defined as the ability to talk to and receive information from AI software like ChatGPT. The better you are at prompt engineering, the better you can communicate with large language models and get them to complete specific tasks. A carefully crafted prompt can guide the model toward producing the desired output, while a poorly crafted prompt may result in irrelevant or nonsensical results.
What Are the Prevailing Approaches for Prompt Engineering?
Common approaches to prompt engineering include few-shot prompting, instructor-based prompting, chain of thought prompting, and automatic prompt generation. Let’s dive into each.
Few-Shot Prompting
Few-shot prompting is a prompt engineering technique where the user provides a few examples of the task that the large language model should perform along with a description of the task. This technique is especially useful when you have concrete examples of a task, since the language model will closely adapt to the example format when generating its response.
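For illustration, here is a minimal sketch of a few-shot prompt for a sentiment classification task; the task and examples are our own, not from the original post.
few_shot_prompt = """
Classify the sentiment of a product review as Positive or Negative.
Review: "The battery lasts all day and charges quickly."
Sentiment: Positive
Review: "The screen cracked after one week of normal use."
Sentiment: Negative
Review: "Setup took five minutes and everything just worked."
Sentiment:
"""
# Sending few_shot_prompt to a language model, it tends to continue the
# established format and reply with a single label such as "Positive".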
Instructor-Based Prompting
Instructor-based prompting instructs the large language model to act as a specific persona while performing the desired task. For example, if you are trying to write a blog about a specific topic, a prompt could start with “I want you to act like a genius writer on the topic of…”. Framed this way, the response is tailored to that persona and tends to be of higher quality.
Chain of Thought Prompting / CoT Prompting
Chain of thought prompting (CoT prompting) is used to accomplish complex tasks: the user breaks a task down into smaller sub-tasks and instructs the language model to perform them in incremental order to reach the final desired outcome. You can combine CoT prompting with instructor-based and few-shot prompting to get the best results.
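As a hedged illustration (the task below is our own, not from the post), a chain of thought prompt can combine a persona with explicit sub-steps:
cot_prompt = """
I want you to act as a careful data analyst.
Question: A service handled 1,200 requests on Monday and 15% more on Tuesday.
How many requests did it handle across both days?
Let's think step by step:
1. Compute Tuesday's requests as 1,200 plus 15% of 1,200.
2. Add Monday's and Tuesday's totals.
3. State the final answer on its own line.
"""
# 15% of 1,200 is 180, so Tuesday is 1,380 and the expected final answer is 2,580.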
Automatic Prompt Generation
Finally, large language models can also be leveraged to generate prompts for a specific task. The user describes the task they want to accomplish in a few sentences and asks the language model to come up with different options. The user then evaluates the candidates and chooses the prompt that works best for them.
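A minimal sketch of such a meta-prompt (the wording and task are our own, purely illustrative):
meta_prompt = """
I need a prompt for a language model that summarizes customer support
tickets into two sentences for an internal dashboard.
Propose 5 different candidate prompts for this task, each on its own line,
and briefly note the strengths of each.
"""
# The developer reviews the generated candidates and keeps the one that performs best.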
Why Are Prompt Templates Important?
Beyond prompt engineering, using prompt templates is crucial for deploying task-specific LLMs into production. A prompt template is preamble text placed right before a user’s prompt. By using prompt templates, LLM developers can standardize output format and quality regardless of how simple the prompt provided by the user is. Prompt templates create a scalable and reproducible way to generate prompts and can contain instructions to a language model, few-shot examples, or different chains of actions to be executed. Let’s look at an example:
prompt_template = """
I want you to act as a branding expert for new companies.
You need to come up with names for certain tech startups. Here are some examples of good company names:
- search engine, Google
- social media, Facebook
- video sharing, YouTube
The name should be short, catchy and easy to remember. What is a good name for a company that makes {product}?
"""
The example above is a prompt template that uses few-shot and instruction-based language to prepare a language model for reproducible output. Using this template, we can deploy an LLM application that generates unique names for different sets of products. All the user has to do is enter the product type!
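To make that concrete, here is a hedged sketch of filling in the template above and sending it to a model. The call assumes the pre-1.0 openai Python SDK and GPT-4 access; newer SDK versions use a client object, so treat it as illustrative only.
import openai  # assumes the pre-1.0 openai Python SDK

# Fill the {product} placeholder from the template above with the user's input.
prompt = prompt_template.format(product="a cold brew coffee subscription")

# Illustrative call only; the model name and SDK surface may differ in your setup.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])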
How Can Prompts Be Managed At Scale?
Beyond prompt engineering and prompt templates, it is important to consider prompt management within a production environment. Your LLM application may have different prompt templates and user inputs running continuously, so it is very important to store your prompts and control their workflow. Hence, the ability to swap out production prompts or iterate on prompts during application development should be considered. For example, based on user feedback, you might want to run A/B tests with different prompt templates and track the performance of each prompt in real time.
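As a minimal sketch of that idea (the registry, template names, and assignment rule are hypothetical), prompt versions can live in a small registry and be split between users for an A/B test:
import random

# Hypothetical prompt registry: each template is versioned so production
# prompts can be swapped or rolled back without a code change.
PROMPT_REGISTRY = {
    "naming/v1": "I want you to act as a branding expert... {product}",
    "naming/v2": "You are a world-class naming consultant... {product}",
}

def assign_prompt_version(user_id: str) -> str:
    """Deterministically split users between two template versions for an A/B test."""
    return "naming/v1" if hash(user_id) % 2 == 0 else "naming/v2"

# Log the version alongside each prompt/response pair so per-version
# performance (user feedback, latency, etc.) can be compared later.
version = assign_prompt_version("user-42")
prompt = PROMPT_REGISTRY[version].format(product="a meal-kit delivery service")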
What are LLM Agents?
Apart from managing prompts effectively, developing specific LLM applications tailored to a particular context or task can prove to be a challenging endeavor. This typically involves collecting relevant data, utilizing various methods to process it, and tweaking the LLM to ensure that it can deliver optimal responses within your business context. Fortunately, there are several tools available that can help streamline the process and enable you to scale your applications more efficiently.
One of the most popular tools among LLM developers is the LLM agent. This tool assists users in generating responses quickly by creating a sequence of related prompts and answers in a logical order. LLM agents leverage the power of LLMs to determine which actions to take based on the user’s initial prompt. They utilize different tools that are designed to perform tasks such as searching a website or extracting information from a database to provide a comprehensive and detailed response for the user. Essentially, agents combine LLMs with prompt templates to create a series of prompt-response pairs that ultimately provide the user with a final answer.
Agents can act as a blend of experts, drawing context-specific data from various sources and utilizing the appropriate prompt templates to find the most valuable information for the user. One of the most prominent frameworks for building LLM agents is LangChain, which commonly employs the concept of retrieval augmented generation. This approach uses chunks of documents to identify the most relevant information that will answer a user’s query. A diagram of an LLM agent architecture is provided below:
Example of an LLM agent architecture for a product documentation chatbot
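To illustrate the retrieval augmented generation step that an agent like this relies on, here is a minimal sketch that ranks documentation chunks by similarity to a user question using the sentence-transformers library. The chunks, model name, and final prompt wording are assumptions for illustration, not part of the original post.
import numpy as np
from sentence_transformers import SentenceTransformer  # locally run embedding model

# Hypothetical documentation chunks the agent can retrieve from.
chunks = [
    "To rotate an API key, open Settings > API Keys and click Rotate.",
    "Webhooks can be configured per project under Integrations.",
    "Usage limits reset at midnight UTC on the first day of each month.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = model.encode(chunks)
chunk_vectors = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)

question = "How do I rotate my API key?"
question_vector = model.encode([question])[0]
question_vector = question_vector / np.linalg.norm(question_vector)

# Cosine similarity on normalized vectors; the highest score is the most relevant chunk.
scores = chunk_vectors @ question_vector
best_chunk = chunks[int(np.argmax(scores))]

# The retrieved chunk is placed into the prompt that the LLM finally answers.
final_prompt = f"Answer the question using only this context:\n{best_chunk}\n\nQuestion: {question}"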
What Is LLM Observability?
As mentioned, what many machine learning teams are trying to achieve might be accomplished with a chain of prompts or agents in the future. So just like traditional machine learning observability, LLM observability is a must for deploying any LLM application at scale.
LLM observability is the practice of making sure that all prompt templates, prompts, and responses are monitored in real time so that prompt engineers can understand and find the root cause of any negative feedback and improve their prompts.
What Data Is Collected By An LLM Observability System?
The above diagram shows what LLM observability looks like in the world of foundational models. The interface into and out of the system is a string: a prompt goes in and a response comes out. These inputs and outputs make up the data collected by the observability system, which includes the following (a sketch of a logged record follows the list):
- Prompt and response
- Prompt and response embedding
- Prompt templates
- Prompt token length
- Step in conversation
- Conversation ID
- Response token length
- Structured metadata, tagging groups of predictions
- Embedded metadata, additional metadata that is embedded
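As a hedged sketch (field names are our own, not a specific vendor schema), a single logged record covering the fields above might look like this:
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LLMRecord:
    """One prompt/response pair as an observability system might log it (illustrative schema)."""
    conversation_id: str
    step_in_conversation: int
    prompt_template: str
    prompt: str
    response: str
    prompt_token_length: int
    response_token_length: int
    prompt_embedding: List[float] = field(default_factory=list)
    response_embedding: List[float] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. task category, API latency, user feedback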
Embeddings
Embeddings are internal latent representations of information: they capture what a model is “thinking” and how it sees a specific piece of data. With a foundational model like GPT-4, teams do not have access to the model’s internal embeddings, but they can still generate embeddings using a separate embedding generator model. These can be locally run models such as GPT-J or BERT.
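Here is a minimal sketch of generating such embeddings with a locally run BERT model via the Hugging Face transformers library; mean pooling over token states is one common choice, not the only one, and the example prompts are invented.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["Summarize my latest blood test results.",
         "What does an elevated LDL value mean?"]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_states = model(**inputs).last_hidden_state  # (batch, tokens, hidden)

# Mean-pool token states (ignoring padding) to get one embedding vector per prompt.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_states * mask).sum(dim=1) / mask.sum(dim=1)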
These embeddings can then be monitored in real time across high-dimensional space, and any change in behavior or any negative feedback from users can indicate a problem within the LLM application. One method of finding problem responses involves clustering prompts and responses, then finding problem clusters by looking at evaluation metrics per cluster, drift per cluster, or user feedback – such as thumbs up / thumbs down – per cluster.
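As a hedged sketch of that workflow (the embeddings and feedback values below are synthetic), the response embeddings can be clustered with scikit-learn and each cluster ranked by its thumbs-down rate:
import numpy as np
from sklearn.cluster import KMeans

# response_embeddings: one vector per response; thumbs_down: 1 if the user
# flagged the response, 0 otherwise (synthetic data for illustration).
rng = np.random.default_rng(0)
response_embeddings = rng.normal(size=(500, 384))
thumbs_down = rng.integers(0, 2, size=500)

clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(response_embeddings)

# Rank clusters by negative-feedback rate; the worst ones are inspected first.
for cluster_id in range(8):
    mask = clusters == cluster_id
    rate = thumbs_down[mask].mean()
    print(f"cluster {cluster_id}: {mask.sum()} responses, thumbs-down rate {rate:.2f}")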
Troubleshooting Workflow and Example
The problems captured by these detections are shown above, where misleading responses with a similar format are grouped together and highlighted. These misleading responses can be fixed through a number of iterative workflows involving prompt engineering or fine-tuning.
Once you find a cluster of issues, understanding what specifically in that cluster is problematic can take some work. We recommend integrating an LLM to do the heavy lifting for you. Your LLM observability tool should provide a prompt template that feeds cluster data to an LLM for cluster analysis and comparisons against baseline datasets, with interactive workflows for EDA-style analysis.
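A minimal sketch of such a cluster-analysis prompt template (the wording is ours, not from a specific tool):
cluster_analysis_template = """
I want you to act as a data analyst reviewing an LLM application.
Here are {n_cluster} prompt/response pairs from a problem cluster:
{cluster_examples}
Here are {n_baseline} pairs sampled from the baseline dataset:
{baseline_examples}
Describe what the problem cluster has in common, how it differs from the
baseline, and suggest one change to the prompt template that might fix it.
"""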
In addition to cluster analysis on the full data stream, many teams want observability solutions to segment their data on structured metadata related to the prompt and response pairs. This metadata can be API latency information, enabling teams to zero in on the prompt/response pairs causing large latencies before drilling into clusters. Or they can dig in based on structured metadata provided by the production integration, such as pre-prompt task categories or any other metadata relevant to the prediction.
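For example, here is a hedged pandas sketch of zooming in on high-latency pairs before clustering; the column names and threshold are assumptions, not a specific product's schema.
import pandas as pd

# records: one row per prompt/response pair with structured metadata attached.
records = pd.DataFrame({
    "prompt": ["...", "...", "..."],
    "response": ["...", "...", "..."],
    "api_latency_ms": [420, 6300, 510],
    "task_category": ["naming", "summarization", "naming"],
})

# Segment on latency first, then inspect or cluster only the slow slice.
slow = records[records["api_latency_ms"] > 5000]
print(slow[["task_category", "api_latency_ms"]])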
Conclusion
In conclusion, the rapid growth of large language models across applications has necessitated effective operational strategies and the relatively new discipline of LLMOps to ensure these models perform optimally in production environments. Key components of LLMOps include prompt engineering and management, LLM agents, and LLM observability. By employing these techniques, developers can optimize their LLMs for specific tasks, manage prompts efficiently, and monitor model performance in real time. As the adoption of LLMs continues to expand, LLM observability enables iterative prompt engineering and fine-tuning workflows: by identifying problematic clusters of responses, developers can refine their prompts or fine-tune the model to enhance its performance. This iterative process ensures continuous improvement of the LLM application, leading to a better end-user experience.