LLM Retrieval Augmented Generation (RAG): Quickstart Essentials

Amber Roberts, Machine Learning Engineer | Published January 01, 2024

Understanding the steps required to get the most out of your search and retrieval use case

Creating chatbots that are customized for your specific business needs involves leveraging your unique knowledge base. With retrieval augmented generation (RAG), your chatbot is empowered to enhance its understanding by incorporating pertinent information from your own documents.

What is Retrieval Augmented Generation?

Retrieval augmented generation is a technique where the content produced by large language models (LLMs) is enriched or augmented through the addition of relevant material retrieved from external sources. The core strength of RAG lies in its method of data retrieval, providing LLMs with extra context that significantly enriches the content generation process.

The RAG roadmap here lays out a clear path through the complex processes that underpin RAG from data retrieval to response generation. In this article, we will explore these steps in detail and examine the differences between online and offline modes of RAG. Our journey through the RAG roadmap will not only highlight the technical aspects but also demonstrate the most effective ways to evaluate your search and retrieval results.

What Should You Know Before Using LLM Retrieval Augmented Generation?

As with any roadmap, you first need to know where you are going and how you will get there before you can map it out.

You should likely avoid RAG if you are trying to optimize for cost, want to avoid data leakage to a proprietary model, are new to prompt engineering, or are fine-tuning. More specifically, RAG may not be the right first step if you are:

  • Trying to reduce cost and risk. RAG is better suited to cases where you are trying to optimize performance.
  • Concerned about proprietary data. Sending secure data to a proprietary model risks leakage. If your LLM application does not require secure data to generate responses, you can prompt the LLM directly and use additional tools and agents to keep responses relevant.
  • New to prompt engineering. We recommend experimenting with prompt templates and prompt engineering to ask better questions of your LLMs and to structure your outputs well. However, prompts alone do not add context to your user’s queries.
  • Fine-tuning your LLM. We recommend fine-tuning your model to get better at specific tasks by providing your LLM with explicit examples. Fine-tuning should come after you have experimented with the performance improvements from prompt engineering and from adding relevant content via RAG. The reason is the speed and cost of iteration: keeping retrieval indices up-to-date is more efficient than continuously fine-tuning and retraining LLMs.

Adapted from an X thread between Andrej Karpathy and Aparna Dhinakaran, this charts task accuracy versus effort/complexity for foundation models.

If you are building a RAG system, you are adding recent knowledge to an LLM application in the hope that the retrieved knowledge will increase factuality and decrease model hallucinations in query responses.

What Are the Key Components of LLM RAG?

Key components of RAG include the retrieval engine to facilitate search, the augmentation engine to integrate the retrieved data with the query, and the generation engine where the response is formulated using a foundation model.

Retrieval augmented generation is an intricate system that blends the strengths of generative AI with the capabilities of a search engine. To fully understand RAG, it’s essential to break down its key components and how they function together to create a seamless AI experience.

Here is more detail on the important components of a RAG system.

  • Retrieval Engine: This is the first step in the RAG process. It involves searching through a vast database of information to find relevant data that corresponds to the input query. This engine uses sophisticated algorithms to ensure the data retrieved is the most relevant and up-to-date.
  • Augmentation Engine: Once the relevant data is retrieved, the augmentation engine comes into play. It integrates the retrieved data with the input query, enhancing the context and providing a more informed base for generating responses.
  • Generation Engine: This is where the actual response is formulated. Using the augmented input, the generation engine, typically a sophisticated language model, creates a coherent and contextually relevant response. This response is not just based on the model’s preexisting knowledge but is enhanced by the external data sourced by the retrieval engine.
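
The three components above can be sketched as three small functions. This is a minimal illustration rather than a reference implementation: every name here (`KNOWLEDGE_BASE`, `retrieve`, `augment`, `generate`, `answer`) is hypothetical, retrieval uses plain word overlap as a stand-in for a real search engine, and `llm` is any callable that wraps whichever foundation model you use.

```python
import re
from typing import Callable, List

# Hypothetical in-memory knowledge base; a real system would query an index.
KNOWLEDGE_BASE = [
    "RAG retrieves documents relevant to a query.",
    "Fine-tuning adapts model weights using labeled examples.",
    "Prompt templates structure the input sent to an LLM.",
]


def _tokens(text: str) -> set:
    """Lowercase word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Retrieval engine stand-in: rank documents by words shared with the query."""
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(doc)), doc) for doc in documents]
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the query.
    return [doc for score, doc in scored[:top_k] if score > 0]


def augment(query: str, context: List[str]) -> str:
    """Augmentation engine: combine retrieved passages with the original query."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Generation engine: delegate to a caller-supplied model function."""
    return llm(prompt)


def answer(query: str, llm: Callable[[str], str],
           documents: List[str] = KNOWLEDGE_BASE) -> str:
    """Chain the three components: retrieve, then augment, then generate."""
    return generate(augment(query, retrieve(query, documents)), llm)


if __name__ == "__main__":
    # Placeholder model that only reports the prompt size it received.
    fake_llm = lambda prompt: f"(model saw {len(prompt)} characters)"
    print(answer("What is RAG?", fake_llm))
```

Keeping the model behind a plain callable means the retrieval and augmentation logic can be tested without calling a real LLM.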

The RAG Roadmap

How Can You Ensure LLM RAG Systems Provide Accurate Answers Based On the Most Current and Relevant Information Available?

[Figure: how LLM RAG works]

To ensure accuracy and relevance, successful RAG applications leverage data indexing, input query processing, search and ranking, prompt augmentation, response generation, and evaluation.

  1. Data Indexing: Before RAG can retrieve information, the data must be aggregated and organized in an index. This index acts as a reference point for the retrieval engine.
  2. Input Query Processing: The user’s input is processed and understood by the system, forming the basis of the search query for the retrieval engine.
  3. Search and Ranking: The retrieval engine searches the indexed data and ranks the results in terms of relevance to the input query.
  4. Prompt Augmentation: The most relevant information is then combined with the original query. This augmented prompt serves as a richer source for response generation.
  5. Response Generation: Finally, the generation engine uses this augmented prompt to create an informed and contextually accurate response.
  6. Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on accuracy, faithfulness, and speed of responses.
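
Steps 1 through 3 can be illustrated with a small, dependency-free index. The sketch below assumes TF-IDF weighting with cosine similarity as the ranking function. That is a deliberate simplification, since production systems often use embedding vectors and a vector store instead. The names `chunk`, `TfidfIndex`, and `search` are hypothetical and do not come from any particular library.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Input query processing (and document processing): lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text: str, max_words: int = 40) -> List[str]:
    """Split a document into fixed-size word windows before indexing."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class TfidfIndex:
    """Data indexing: store one normalized TF-IDF vector per chunk."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = list(chunks)
        tokenized = [tokenize(c) for c in self.chunks]
        n = len(self.chunks)
        # Document frequency: number of chunks containing each term.
        df = Counter(term for toks in tokenized for term in set(toks))
        # Smoothed inverse document frequency.
        self.idf: Dict[str, float] = {
            term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()
        }
        self.vectors = [self._vectorize(toks) for toks in tokenized]

    def _vectorize(self, tokens: List[str]) -> Dict[str, float]:
        """Build a unit-length sparse vector; terms unseen at index time are ignored."""
        tf = Counter(tokens)
        vec = {t: f * self.idf[t] for t, f in tf.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """Search and ranking: cosine similarity between the query and each chunk."""
        q = self._vectorize(tokenize(query))
        scores = [
            (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
            for i, vec in enumerate(self.vectors)
        ]
        # Highest score first; ties broken by original chunk order.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.chunks[i], score) for score, i in scores[:top_k] if score > 0]


if __name__ == "__main__":
    # Illustrative placeholder documents, not real product data.
    index = TfidfIndex([
        "The smartwatch battery lasts two days.",
        "Our return policy allows refunds within 30 days.",
        "Headphones support noise cancellation.",
    ])
    for text, score in index.search("smartwatch battery"):
        print(f"{score:.3f}  {text}")
```

Because the vectors are normalized at index time, ranking reduces to a sparse dot product, and the same `search` interface could later be backed by an embedding model without changing callers.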

By following these steps, RAG systems can provide answers that are not just accurate but also reflect the most current and relevant information available. This process is adaptable to both online and offline modes, each with its unique applications and advantages.

[Figure: how an LLM RAG system works. Adapted from source.]

Retrieval Augmented Generation in Action: Example Application

To understand the practical application of RAG, let’s consider a scenario where RAG is employed in a chatbot designed to query a private knowledge base.

[Figure: how retrieval augmented generation works for search and retrieval]

Scenario: AI Chatbot in Customer Service

Imagine a customer service chatbot for a large electronics company. This chatbot is equipped with a RAG system to handle customer queries more effectively.

It handles:

  • Customer Query Processing: When a customer asks a question, such as “What are the latest updates to your smartwatch series?”, the chatbot processes this input to understand the query’s context.
  • Retrieval from Knowledge Base: The retrieval engine then searches the company’s up-to-date product database for information relevant to the latest smartwatch updates.
  • Augmenting the Query: The retrieved information about the smartwatch updates is combined with the original query, enhancing the context for the chatbot’s response.
  • Generating an Informed Response: The chatbot, using the RAG system’s generation engine, formulates a response that draws not only on the model’s preexisting knowledge but also on the latest information retrieved, such as new features or pricing.
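
The augmentation step in this scenario might look like the sketch below, which turns retrieved records into a prompt for the generation engine. The `Record` type, `PROMPT_TEMPLATE`, `build_prompt`, and the `max_chars` budget are all assumptions made for illustration, and the sample records are invented placeholders rather than real product data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """A retrieved knowledge-base entry (hypothetical shape)."""
    source: str
    text: str


PROMPT_TEMPLATE = """You are a customer service assistant.
Answer the question using only the context below. If the context does not contain the answer, say you do not know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, records: List[Record], max_chars: int = 1000) -> str:
    """Combine the customer's question with numbered, source-tagged context lines."""
    lines = []
    used = 0
    for number, record in enumerate(records, start=1):
        line = f"[{number}] ({record.source}) {record.text}"
        # Stop adding context once the character budget would be exceeded.
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line)
    context = "\n".join(lines) if lines else "(no relevant records found)"
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    # Invented placeholder records standing in for the retrieval engine's output.
    retrieved = [
        Record("release-notes", "Firmware 2.1 adds sleep tracking."),
        Record("pricing", "The base model costs less than the pro model."),
    ]
    print(build_prompt("What are the latest updates to your smartwatch series?", retrieved))
```

Numbering each context line and tagging it with its source makes it easier to ask the model to cite what it used, and it gives evaluators something concrete to check the response against.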

This scenario showcases how RAG can enhance the effectiveness of AI in customer service, providing responses that are both accurate and current. The integration of RAG allows the chatbot to offer information that might not be part of its initial programming, ideally making it more responsive to the user’s needs.

Why Is The Development of LLM RAG Significant to the Industry?

By seamlessly integrating search and retrieval capabilities with generative AI, LLM RAG systems offer a level of responsiveness and accuracy that is unparalleled in traditional language models.

The significance of RAG lies in its ability to enhance AI responses with real-time, externally sourced data, making AI interactions more relevant and informed. This has vast implications across various sectors, from improving customer service chatbots to aiding in complex research and data analysis tasks.

Now, if you could make your roadmap repeatable for everyone else who wants to take the same route you just took, wouldn’t you? This is where LLM observability comes into play.

How Should LLM RAG Systems Best Be Evaluated?

RAG evaluation uses retrieval metrics and response metrics to vet whether answers are accurate, relevant, timely, and free of toxicity, among other criteria. See our companion piece for more details.
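
As a rough illustration of the retrieval side, the sketch below computes two widely used ranking metrics, precision at k and mean reciprocal rank. The article does not prescribe these particular metrics, so treat them as one reasonable choice. The function names and the document-ID inputs are assumptions. Response metrics such as faithfulness usually need human labels or an LLM-based evaluator and are not covered here.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    scores: List[float] = [reciprocal_rank(ret, rel) for ret, rel in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical document IDs for two evaluation queries.
    runs = [
        (["doc_a", "doc_b", "doc_c"], {"doc_a", "doc_c"}),
        (["doc_x", "doc_a"], {"doc_a"}),
    ]
    print(precision_at_k(*runs[0], k=2))
    print(mean_reciprocal_rank(runs))
```

Tracking these numbers before and after a change to chunking, indexing, or ranking gives an objective signal of whether retrieval actually improved.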

Get Started

Using Arize-Phoenix, an open source solution for AI observability and large language model evaluation, we can map each step you took to create your RAG application system. Try it out for yourself!