
Retrieval-Augmented Generation – Paper Reading and Discussion

Published Jun 9, 2023

Sarah Welsh

Contributor

Introduction

In this paper reading, we discuss “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”

We know GPT-like LLMs are great at soaking up knowledge during pre-training and fine-tuning them can lead to some pretty great, specific results. But when it comes to tasks that really demand heavy knowledge lifting, they still fall short. Plus, it’s not exactly easy to figure out where their answers come from or how to update their knowledge.

Enter RAG models, a hybrid beast that combines the best of both worlds: the learning power of pre-trained models (the parametric part), and an explicit, non-parametric memory — imagine a searchable index of all of Wikipedia.

Join us every Wednesday as we discuss the latest technical papers, covering a range of topics including large language models (LLM), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.

Transcript

Aman Khan, Group Product Manager at Arize AI: So we’ve got a few folks in the community joining, some familiar names and faces. Cool. Why don’t we kick things off, just in the interest of time? I’ll briefly introduce myself. My name is Aman, I’m a Product Manager here at Arize. My job is really to hear customer feedback and customer pain about what’s going on when you’re observing or monitoring models, learn how to make the product better, and then feed that back into the engineering team and work with engineers to build out our product roadmap. I’m joined here by Michael, our CTO. Michael, if you want to give a brief introduction.

Michael Schiff, Chief Technology Officer of Arize AI: I’m Michael, I’m the CTO, I have a background in distributed systems and ML engineering and I’m excited to talk about this paper.

Aman Khan: So when we were pre-talking about this a little bit, the questions were: why did we select this paper, and what’s novel about this technique? Let me go ahead and share my screen so we’re looking at the same visual here. We’re going to be chatting about retrieval-augmented generation. This has been a pretty hot topic of discussion for a little while now, specifically when it comes to applying LLMs for actual useful applications on top of your own data. One interesting nuance to call out, which we’ll get into a little bit more, is that we’re specifically talking about knowledge-intensive tasks, which we’ll define and get into in more depth. We’re going to go through the canonical paper on retrieval-augmented generation for knowledge-intensive tasks and the RAG approach.

We’ll talk about how it works, why it works the way it does, and then kind of build off of that, maybe do something a little bit different than what we’ve been doing, which is, we might be hopping between different models in this paper read to look at what other companies are doing in industry and some of the interesting work on top of the foundation of this concept.

Please feel free to leave any comments or questions, and we’ll answer them in real time.

Michael Schiff: Will you swap over to slide show mode, up in the upper right?

Aman Khan: Yes. Well, we’ll be talking between the papers so we might be jumping between screens here. Can folks still see if I pull up a PDF? Does this still work?

Michael Schiff: I’m able to see okay. I think it’s just maybe a clearer view.

Aman Khan: Yeah. So we’ll be hopping between slides and the paper a little bit, with the slides just as raw notes. But we want to stick to the paper, and then we’ll jump to different slides as we go. Michael, do you want to kick us off with the abstract? What’s novel about this? I think yesterday you were like, well, okay, so what’s this model actually doing?

Michael Schiff: I came to the paper not realizing this was sort of the paper around retrieval-augmented models, and I was like, this is what people have been doing for weeks! But in the time we live in, a week feels like a glacial time period. This is a technique that’s really picked up in popularity for augmenting generative models, specifically by adding what the authors call non-parametric memory to parametric models. The insight they build off of is that sequence-to-sequence models, which we’ve all become very familiar with in the last couple of months, encode a lot of real-world factual information into their parameters. The parameters aren’t just encoding the structure of language and what makes up a sequence of speech or text that is plausibly human; the model has information that we believe to be true about the world encoded into its parameters. But not everything. And you hinted at this in the beginning, but this really matters for knowledge-intensive tasks. The authors define those as tasks that even humans wouldn’t be able to do without something like an encyclopedia or a lookup to go answer those questions.

We could synthesize a response given that external reference source, but we need it too; we don’t have it encoded into our parameters, either. So this is a technique for augmenting a parameterized sequence-to-sequence model with a non-parametric memory. And we’ll talk more about what that looks like.

Aman Khan: Yeah, awesome. So I feel like one of the takeaways was: what are parametric and non-parametric memory, and what does that mean? We’ve all experienced ChatGPT hallucinating, giving an inaccurate response, etc. What’s interesting about what we’re going to hop into is really the architecture of this type of model, and how it handles hallucinations when you’re actually retrieving on top of your own data.

So, kind of just jumping right in. I mean, I feel like this diagram really kind of conveys a lot of the architecture.

I feel like this takeaway is wild, and this development is exciting. Basically, it’s saying: okay, pre-trained language models can do a ton of stuff, and they can do it without access to external memory. But when it comes to actually answering questions about specific topics, how do pre-trained models perform as opposed to when you’re accessing memory? I think this diagram gives a lot of insight into this model’s architecture. So, Michael, do you want to take a stab at breaking down what’s going on here from an architectural perspective? I think there are a few components to talk about: the query encoder, the retriever, the document index, and the generator.

Michael Schiff: Yeah. So before going into the exact architecture, here is a high-level overview of the intuition behind the technique: given an encyclopedia, an external reference source, we want to select relevant things from it. If I were to ask you how to change the time on my car stereo, you don’t know the answer to that; you don’t have it memorized. But if I throw you my Honda manual, you’re going to be able to tell me the answer. You flip through and figure out which passages are relevant. You find the relevant passages, then you summarize and synthesize a response: you hold this button, and then you toggle through whatever to set the time. This architecture is a more specified version of that. The non-parametric retriever is the lookup on this factual index structure. The form that takes here, and in most of the implementations we’ve seen, is a vector database: a storage system that can store and retrieve vectors which correspond to passages of text from your index, and which can be retrieved with a semantic search. So you’re not selecting based on labeled tags; you’re able to find relevant information in a learned capacity. And then you’re augmenting the generation process, which is this sequence-to-sequence model. We’re familiar with transformers, which already know how to spit out a reasonable-sounding answer and might, in fact, have a little bit of the knowledge of this problem encoded in their parameters. The model might know what a car is, and it might know what a car stereo is, but it doesn’t know the specifics of my model of Honda or whatever.

You augment the question (“How do I change my car stereo time?”) with these documents and feed that in. So going left to right, the architecture looks like this: a query encoder, which takes your query and turns it into the vectors that our model is going to understand, and then a maximum inner product search, which is the semantic lookup on our vector database. This vector database holds encoded strings of text. In the paper it’s Wikipedia, but one of the interesting things we’ve seen about this architecture is that it’s very generalizable, and that index structure can contain facts about anything you want.

So I’ve been using this Honda car manual example, but you do a maximum inner product search for that, and you get your top-K documents, as in the paper. Then you generate possible sequences given those documents; they have different approaches for token versus sequence generation. From there, it works the same way parametric generators always work, which is to guess at the next probable token given the input, effectively trying to pick the most likely string that you would expect. We’ve just augmented the input for this retrieval task.
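The left-to-right flow described above (encode the query, run a maximum inner product search over the index, then attach the top documents to the question) can be sketched in a few lines. This is only an illustration under loud assumptions: the passages are made up, `encode` is a toy bag-of-words stand-in for a learned neural encoder, and the names `retrieve` and `augment` are hypothetical rather than anything from the paper's code.

```python
import numpy as np

# Hypothetical toy corpus standing in for the document index.
PASSAGES = [
    "To set the clock on the car stereo, hold the CLOCK button until the display blinks.",
    "Check the tire pressure monthly and before long trips.",
    "The stereo supports Bluetooth pairing from the settings menu.",
]

def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split()]

VOCAB = sorted({w for p in PASSAGES for w in tokenize(p)})
WORD_TO_ID = {w: i for i, w in enumerate(VOCAB)}

def encode(text):
    """Toy encoder (assumption): bag-of-words counts instead of a learned model."""
    vec = np.zeros(len(VOCAB))
    for w in tokenize(text):
        if w in WORD_TO_ID:
            vec[WORD_TO_ID[w]] += 1.0
    return vec

# The "index": one vector per passage, computed once up front.
DOC_VECTORS = np.stack([encode(p) for p in PASSAGES])

def retrieve(query, k=2):
    """Exact maximum inner product search: score every passage, keep the top k."""
    scores = DOC_VECTORS @ encode(query)
    top = np.argsort(-scores)[:k]
    return [PASSAGES[i] for i in top]

def augment(query, k=2):
    """Concatenate the retrieved passages with the question for the generator."""
    return "\n".join(retrieve(query, k)) + "\n\nQuestion: " + query

prompt = augment("How do I set the clock on my car stereo?")
print(prompt)
```

A real system would replace the exhaustive scoring with an approximate nearest-neighbor index and pass `prompt` to a sequence-to-sequence generator; both are left out here.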

Aman Khan: Yeah, I think that’s going to be really interesting when we get into the trade-offs between those approaches of token-based versus sequence-based generation. But I wanted to pause a little bit and talk about the query encoder and the retriever. From the paper, it looks like the documents are really just chunked up based on roughly how many words they have. It’s Wikipedia articles. Is it a hundred? I think it’s a hundred words, maybe it’s a hundred characters, but it’s some preset document chunk size that is flexible.

They basically took a fairly arbitrary method of chunking up the documents. And then they use something off the shelf: what they’re calling the retriever here is a Dense Passage Retriever (DPR). It’s really just a model that is purpose-built to retrieve documents from the index, given the output of the query encoder. I think that’s pretty interesting, because it means there’s some level of flexibility from going off the shelf. I think they just use BERT here as well. So they basically took an off-the-shelf model and said: okay, here’s your corpus of information, now get really good at pulling out information from here. And that’s basically all DPR is, which is the meat of retrieval. Any thoughts on that, Michael? It’s kind of interesting to have an LLM using a retriever, or DPR, to pull documents. It’s almost like a reasoning engine about the index, right?
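For reference, the paper describes splitting each Wikipedia article into disjoint 100-word chunks. A minimal sketch of that kind of fixed-size chunking follows; the function name `chunk_words` and the synthetic article are assumptions for illustration, and whitespace splitting is a simplification of real tokenization.

```python
def chunk_words(text, chunk_size=100):
    """Split text into disjoint chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Synthetic 250-word "article" (hypothetical data).
article = " ".join(f"word{i}" for i in range(250))
chunks = chunk_words(article)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [100, 100, 50]
```

Each chunk then becomes one entry in the document index, so the chunk size directly controls how much context a single retrieved document can carry.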

Michael Schiff: To be honest, this is an area where I had a little bit of uncertainty about their methods, because if you look lower down, they describe the actual (I hate to use an overloaded word) retrieval from the document index as done with FAISS, with a Hierarchical Navigable Small World index. I need to do a little bit more reading about what the Dense Passage Retriever is, and whether that’s the name for this approach.

The other thing that I thought was really interesting was in their fine-tuning process: they choose not to fine-tune the document encoder. They have several encoders, by which they seem to refer specifically to the vector encoding of a passage or whatever it’s for. So they do fine-tune, but the one they don’t fine-tune is the document encoder, because that would require rebuilding the index. I think it speaks to some of the intentions of the authors, if I can be presumptuous and assume some of that: they were trying not to build special-purpose models for retrieval tasks. Such things existed already. The goal here seems to be to take fairly off-the-shelf components through a general fine-tuning process, and then not need to do a lot more work on top of that. They, in fact, give out their index of Wikipedia, and if the problem required fine-tuning the document encoder, you’d have to throw away that index and start over. So it speaks to the expense of some of these steps: use pre-trained stuff where possible, and do a minimal fine-tuning step on just the components that are cheap to fine-tune.

Aman Khan: Yeah, we’ll get a little bit deeper into the index too, and what swapping the index looks like. But for the benefit of folks who might be learning about this for the first time, can you talk a little bit about what parametric memory is and what non-parametric memory is? What’s the trade-off? What’s the difference between them, since we’re using those terms?

Michael Schiff: To me they feel a little bit like fancy words for concepts that I would guess a lot of the people present here are actually already familiar with. In the paper, “parametric memory” really just refers to the stored factual information about the world which is present in our sequence-to-sequence model parameters. They denote those with theta. So you have some sequence-to-sequence model that knows how to generate probable output text from input text. And this is actually one of the interesting things (I don’t want to get too philosophical too early) about sequence-to-sequence models in general and this notion of hallucinations: producing probable output text feels like a continuum, not a line at which, you know, it passes the Turing test. For something to be plausibly human, it first needs to be something you read that doesn’t sound like gibberish. That was like Markov chain generators: you would read the output token by token, like “I am,” “and you,” and it makes sense on any n-gram chunk, but you read the whole thing and it doesn’t make any sense. Then Transformers added attention to that, so you would read a paragraph, and the third sentence in the paragraph was contextually relevant to the first. But that’s not the end state; that’s a step along the path. And now, if you’re talking to a person, you have this BS sensor, right, where it’s like: you’re a human, but you’re lying to me, or you don’t know what you’re talking about. It’s interesting what we consider a hallucination versus, I don’t know, the transformer just doing its job. These generators are doing what they were always meant to do, which is predict probable output text.

So, coming back to your question, parametric memory means the model doesn’t just know how to generate language that is syntactically correct and semantically well formed; what it generates is about the universe that we actually live in. It’s not, you know, an English-language sequence about a made-up fantasy land. That is parametric memory. Non-parametric memory is what you’ve added to your model to be able to extend those encoded parameters. You spent a lot of money encoding structures of language and facts about the world into your sequence model parameters, and you don’t want to keep doing that over and over again. It’s quite an expense. So this technique adds an additional source of information in the form of non-parametric memory. I don’t know that it would always need to take the form of a vector database; in this paper it does. But really, I think it would be anything that your model can leverage which is not encoded into its parameters. I would even make the argument that if you’ve trained your model to make calls to Google, Google is acting as a form of non-parametric memory for your model. So it’s really your model’s ability to reference some external index structure that’s not its trained parameters.

Aman Khan: Yeah, I also find it interesting that they call out specifically that DPR follows a bi-encoder architecture. That’s the relationship between the query and the document that’s being referenced. They’ve specifically trained this DPR with a bi-encoder architecture to keep the architecture simple. I mean, you could use things like cross-encoders; you could use much more complex encoders. But the idea is: let’s just throw something really simple at a large corpus of data, so that we can build up these relationships in the form of this non-parametric memory. And I think that means there’s some notion of scaling here. Basically, you can swap out the index while the encoder architecture remains the same, and you will get roughly the same performance, which I think is pretty interesting.
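For readers who want the bi-encoder written down: as we read the paper, the retriever scores a document $z$ against a query $x$ by the inner product of two independently computed BERT encodings, which is what makes a precomputed document index possible. Here $\mathbf{d}(z)$ is the document encoding, $\mathbf{q}(x)$ the query encoding, and $\eta$ the retriever parameters.

```latex
p_\eta(z \mid x) \propto \exp\!\left( \mathbf{d}(z)^\top \mathbf{q}(x) \right),
\qquad \mathbf{d}(z) = \mathrm{BERT}_d(z),
\qquad \mathbf{q}(x) = \mathrm{BERT}_q(x)
```

Because $\mathbf{d}(z)$ does not depend on the query, every document vector can be computed once and stored, and finding the top-K documents reduces to a maximum inner product search.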

Michael Schiff: Parametric memory is also what’s allowing some of the fine-tuning. We said earlier that they don’t fine-tune the document encoder, this d(z), but they do fine-tune the query encoder. As you go through the fine-tuning process, it will produce better encodings of your query, with “better” defined as: when used to retrieve documents, it will pull out more relevant documents. They haven’t changed the document encoding, just produced a better encoding of the query, which I found really interesting.

Aman Khan: Yeah, exactly. And they talk about that in the training method here: “Updating the document encoder during training is costly, it requires a document index to be periodically updated,” and they reference the canonical work before that, REALM, which does pre-training with retrieval augmentation. They do not find this step necessary for strong performance, so they keep the document encoder fixed, only fine-tuning the BERT query encoder and the BART generator. So it’s exactly what you’re saying, Michael: you could also train the document encoder and basically use retrieval augmentation during the pre-training step, but that makes it so much more expensive from an architectural perspective, versus just keeping that encoder fixed and swapping out your query encoder and your knowledge base on top of that.

Michael Schiff: I think this also hits on the online implementations of this that we’ll start to see, where your document index is not static but is growing over time. It’s this expensive, incrementally built thing where you’re slowly indexing more and more information, and likely you then need to fine-tune your query encoding to better access this growing corpus of data.

Aman Khan: We’ll take a slight aside into RAG-Token versus RAG-Sequence as well, and I think this one was pretty interesting. They experimented with: can you predict the next token, or the next sequence, based on the previous token or sequence and the context that you retrieve? And then they basically threw these approaches at a bunch of experiments to compare. There are trade-offs between them. The interesting one is that RAG-Token can be plugged into a standard beam decoder, whereas when it comes to RAG-Sequence, you need more efficient decoding. So there are trade-offs between token and sequence, and we’ll get into some of the experiments here. What the authors of this paper did is run through open-domain question answering, which is answering an open-ended question, and in this case, again, you are referencing Wikipedia. There is closed-book QA and open-domain QA. Closed-book might be like: what’s the tallest mountain in North America? Something that has a specific answer. And then, what’s a good open-domain QA question, Michael, that you can come up with?

Michael Schiff: Just an aside on what we were just talking about: the thing I thought was interesting about the discussion of closed-book here is that it’s hard to argue that there is any form of RAG that is truly closed-book. It’s also worth distinguishing between closed-book versus open-book and closed-domain versus open-domain. Something that has a super concrete right answer, that’s closed-domain. Open-book versus closed-book is: do I have my reference manual, or am I relying only on parametric memory? So I think it’s interesting that they talk about closed-book retrieval, because there’s no closed-book RAG.

Aman Khan: It’s like a look up at that point right? You’re just like looking at the reference at that point?

Michael Schiff: Correct, which is not cheating or anything. They bring that up at the very beginning, that knowledge intensive tasks are those that cannot be expected to be answered without this external reference. I’m struggling to think of an open domain.

Aman Khan: Yeah, I’m trying to remember. It’s like something that you know…

Michael Schiff: Oh, what’s the weather in Volcano, California?

Aman Khan: Right, where there is a specific answer, but it’s dependent on real-time data and on how up to date your data is. And then they have abstractive question answering, where there are no supplied passages, only the questions and answers. They use MSMARCO, which is basically Bing questions. So some of those will require a lookup that Wikipedia won’t even have information on. I thought this dataset was interesting, by the way: with abstractive question answering using MSMARCO, it’s things that you would expect to see from Bing, which are very short questions, like “book a flight” or whatever. You’re not going to get a very long-form question like what you might expect to see on a bar exam or a medical exam. So it’s really pretty practical as a dataset.

And then Jeopardy question generation, which is pretty interesting. This is all just setup for the evaluation. But they talk about how, with Jeopardy questions, you have to have the context: some of the questions are phrased in a certain way, and you have to infer what the question is asking and know what context it’s asking about. So this is a pretty complex...

Michael Schiff: That one was particularly interesting in highlighting some of the differences between RAG-Sequence and RAG-Token.

Aman Khan: Exactly. So now we’re going to jump into that, and I’m actually going to hop ahead a little bit to open-domain question answering and open-book, retrieval-based approaches. The main takeaway here, comparing open-book and closed-book, is that RAG actually performs better than these large, state-of-the-art language models, even if you’re comparing to billion-parameter models. As you were mentioning before, this is at a time when a larger language model meant it was that much better, that much closer to the state of the art. So, billions of parameters. This is pre-GPT days, so you’re really taking an underpowered model, hooking it up to this knowledge base, and it performs better than a very powerful language model relying on parametric memory alone. I thought that was the main takeaway. What I wanted to jump into, actually, is this example, which I’m guessing you’re going to talk about, Michael: the Hemingway example of asking Jeopardy questions. What is actually happening underneath the hood? What’s firing in the memory to answer this type of question? Is this something that is interesting to you as well? This example, the Jeopardy question here, shows how parametric and non-parametric memories work together.

Michael Schiff: Yeah, there’s a bunch of interesting stuff in this section. I think we should talk for a minute just about the difference between RAG-Sequence and RAG-Token, without going into all of the math. RAG-Sequence, after selecting the top-K documents for the question, will use a single one of those documents to generate the entire response sequence, and RAG-Token will generate token by token, potentially using a different one of those top-K documents per token. We talked a little bit about the decoding and actually generating from those models. One of the things I thought was kind of surprising was that RAG-Sequence generation seemed to be more expensive, and that it required this approximation. That surprised me at first.
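For anyone who does want the math, the two models differ only in where the sum over retrieved documents sits. As we understand the paper's formulation, with input $x$, output tokens $y_1, \dots, y_N$, retrieved document $z$, retriever $p_\eta$ and generator $p_\theta$:

```latex
\begin{align}
p_{\text{RAG-Sequence}}(y \mid x) &\approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \\
p_{\text{RAG-Token}}(y \mid x) &\approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\end{align}
```

In RAG-Sequence the product over tokens sits inside the sum, so one document has to explain the whole output; in RAG-Token the sum sits inside the product, so each token can draw on a different document. The RAG-Token form is a standard per-token distribution, which is why it fits an ordinary beam decoder, while the RAG-Sequence likelihood does not factor token by token.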

From their results, RAG-Sequence tended to outperform RAG-Token except in the Jeopardy case. One of the things I thought was interesting there is that they attribute this to the fact that a good Jeopardy answer often needs to rely on information across multiple documents. So I wonder how much of the increased performance of RAG-Sequence over RAG-Token was due to the types of questions you’re asking, and similarly the selection of, you know, top documents. How much of this is due to the corpus they selected? If you split Wikipedia up into hundred-word chunks, then for the types of questions they ask, how often are you going to need to reference multiple chunks to get a good answer, and how often is it going to come from a single hundred-word chunk? If you change your problem, will that still be true? So this visualization is very cool. It shows a graphical representation of the posterior probabilities of each next token conditioned on the latent documents that were selected. In this case they pulled out five documents, and each column can be thought of as the posterior probability for generating the next token given each of those documents.

So you can see from the heat map the way different documents contribute differently to the final output. Ultimately it’s just going to choose one, but “The Sun” coming from Doc 2 or coming from Doc 1 is one of the interesting takeaways.
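A toy numerical sketch may help make the heat map concrete. All numbers below are invented for illustration: two retrieved documents, a three-token target, a made-up retriever distribution `doc_prior`, and made-up generator probabilities `token_probs`. The per-token posterior computed at the end is the kind of quantity such a heat map displays for RAG-Token.

```python
import numpy as np

# Hypothetical retriever output: p(z | x) over the top-2 documents.
doc_prior = np.array([0.6, 0.4])

# Hypothetical generator output: token_probs[z, i] = p(y_i | x, z, y_<i),
# the probability of the i-th target token when conditioned on document z.
token_probs = np.array([
    [0.9, 0.2, 0.5],  # document 0 explains token 0 well
    [0.1, 0.8, 0.5],  # document 1 explains token 1 well
])

# RAG-Sequence: each document scores the whole sequence, then marginalize.
rag_sequence = float(np.sum(doc_prior * np.prod(token_probs, axis=1)))

# RAG-Token: marginalize over documents separately at every token.
rag_token = float(np.prod(doc_prior @ token_probs))

# Per-token document posterior: one column per token, columns sum to 1.
joint = doc_prior[:, None] * token_probs
posterior = joint / joint.sum(axis=0, keepdims=True)

print(round(rag_sequence, 4), round(rag_token, 4))  # 0.07 0.1276
print(posterior.round(3))
```

In this made-up case the target needs document 0 for the first token and document 1 for the second, so the per-token mixture assigns it a higher likelihood than the per-sequence mixture, which mirrors the explanation given for the Jeopardy results.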

Aman Khan: And for context, document one and document two are chunks from Wikipedia. What this is saying is, when I provide this Jeopardy question, “The Sun Also Rises is a novel by this author of A Farewell to Arms,” you’re framing it in a way where you need to have the context, not even to generate the answer, but to actually know what to look up, what’s relevant to look up.

And I thought the explanation of this was really fascinating. By the way, I highly recommend it, and maybe we’ll send out a link along with the paper: there’s a video by the first author of the paper, Patrick Lewis, which also talks about this. We were both scratching our heads over this yesterday: why is it firing on “A Farewell to Arms”? Why is document one, which says his works are considered classics, the relevant piece of information for “A Farewell to Arms”? And it’s actually because most book titles don’t start with the letter “A,” so it’s a really strong signal for looking up A Farewell to Arms, which is pretty unique in the index and relevant to the question being asked of the non-parametric memory. Here it was A Farewell to Arms, and then The Sun Also Rises. Once you have that information, you can say: okay, this is the relevant document for that author, based on that lookup. So I thought that was pretty interesting as well, based on this visualization of the top documents that are retrieved.

This one fires on A Farewell to Arms, on the letter “A,” so that was pretty interesting given the entire sequence. So that’s a visualization for the token-based model here.

Do you want to talk a little bit about what’s going on with the index hot swapping, or I guess any thoughts on this overall sort of evaluation that they used?

Michael Schiff: The index hot swapping is interesting. I think it speaks to the generality of the approach and a little bit to what we were discussing a minute ago, which is that you could build that index in an online fashion. You can swap it out. They also discuss the degree to which that index is human-readable and human-modifiable. So I think it makes practical a way to augment models whose parametric memory is otherwise out of the reach of an individual to modify. I think this approach is going to be (and it’s pretty easy to say in hindsight, because it’s already happening) a way of democratizing the use of these models on tasks for which they were not specifically trained, and for which maybe you hold the data privately.
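The hot-swapping idea can be illustrated with a small sketch: the retrieval code stays the same and only the index contents change. Everything here is hypothetical, including the `Retriever` class, the word-count `embed` function standing in for a frozen learned encoder, and the two invented "snapshots" of a knowledge base.

```python
from collections import Counter

def embed(text):
    """Toy sparse embedding (assumption): word counts instead of a learned encoder."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def inner_product(a, b):
    """Inner product of two sparse count vectors."""
    return sum(count * b[word] for word, count in a.items())

class Retriever:
    """Keeps the encoder fixed; the index can be replaced at any time."""

    def __init__(self, passages):
        self.swap_index(passages)

    def swap_index(self, passages):
        # Re-embed the new passages; the encoder itself is untouched.
        self.index = [(p, embed(p)) for p in passages]

    def top1(self, query):
        q = embed(query)
        return max(self.index, key=lambda item: inner_product(q, item[1]))[0]

# Two invented snapshots of the same knowledge base at different times.
old_index = ["The president of Examplestan is Alice.",
             "The capital of Examplestan is Foo City."]
new_index = ["The president of Examplestan is Bob.",
             "The capital of Examplestan is Foo City."]

retriever = Retriever(old_index)
before = retriever.top1("Who is the president of Examplestan?")
retriever.swap_index(new_index)
after = retriever.top1("Who is the president of Examplestan?")
print(before)
print(after)
```

The same question returns different evidence after the swap, with no retraining of anything, which is the property being discussed: the facts live in an index that a person can read and edit.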

Aman Khan: Yeah, exactly. And they talk about the results of swapping out the document index and doing the same comparison between an off-the-shelf parametric model and the non-parametric one: the latter is going to perform better, even if you trained that larger model on some of the same data.

I realized we actually jumped over one pretty interesting insight from this explanation as well, which was the interaction between the parametric and non-parametric memory. Do you want to chat about how those two work hand in hand to generate the response here?

Michael Schiff: Yeah, so ultimately the output is still the output of the sequence model. The end thing that you are reading is entirely the product of parametric memory. I think they throw it in there in one sentence: when all is said and done, they grab a document and concatenate it onto the question. So this architecture, which they call retrieval-augmented generation, is really question augmentation: intelligent question augmentation through an index structure of relevant data.

But the fact that the output comes through a generator, and so is the product of parametric memory, is pretty different from other retrieval systems. In those, the finding of relevant documents may be the same, and then they intelligently crop out portions or mask out parts of the document and extract a verbatim quote. RAG architectures are able to answer questions correctly in cases where the documents don’t contain the answer verbatim. You might be able to pull out documents from which, taken together, you would be able to understand the answer, but there’s no one sentence that is exactly the answer. These models can produce that answer, whereas an explicitly extractive retrieval model wouldn’t be able to.

This again, I think, comes back to that philosophical question of the gray area slash continuum from what is plausibly human text to what is factually accurate text to this is a hallucination. And you look at a case where it’s produced a correct answer that was nowhere in the original text, and arguing that that’s different from a prediction which turned out to align with our reality versus a hallucination which doesn’t align with our reality–it’s tough for me to know where those two things are actually different from each other.

Aman Khan: If I can reframe, Michael, basically what you’re saying is: RAG is great for pulling out the context that’s then used to generate some piece of text. You search the index and use what you find to generate something. That’s the non-parametric memory, which then flows into parametric memory. But in many ways the parametric memory component is a hallucination. Here the index is Wikipedia documents, but if you swap that out for your own knowledge base, the non-parametric memory, the retrieval, is specific to your knowledge base. That’s the index. But the parametric memory is just whatever the LLM is producing. It could literally be a hallucination that sounds correct, because it sounds like what the retrieved context should produce as the next token or the next sequence. So how do you even gauge the correctness of that? You could be pulling the right document, the right piece of information, and the parametric memory can still produce a hallucination that sounds correct.

And I thought that was so interesting from this paper as well. It’s glossed over a little bit, I think; it’s not really called out super explicitly. But basically they find that because you’ve trained a model on data that you aren’t referencing, and you’re using non-parametric memory to augment that generation, the hallucinations that appear are even harder to spot.
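
To make the flow described above concrete, here is a minimal sketch of retrieve-then-generate, under loud assumptions: the three-sentence `DOCS` index, the bag-of-words retriever, and the helper names (`retrieve`, `build_prompt`, `is_verbatim`) are all hypothetical and are not the paper’s implementation, which uses a learned dense retriever and a sequence-to-sequence generator. No model is called; the sketch stops at the prompt a generator would receive, and uses a crude substring check to show the difference between an answer an extractive system could quote and one that only a generator could compose.

```python
import math
import re
from collections import Counter

# Hypothetical mini-index standing in for the non-parametric memory.
DOCS = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Paris is the capital of France.",
    "Apples are a fruit commonly grown in temperate climates.",
]

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Concatenate retrieved passages with the question for a generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def is_verbatim(answer, passages):
    """Crude check: does the answer appear word for word in any passage?"""
    return any(answer.lower() in p.lower() for p in passages)

query = "In which country is the Eiffel Tower?"
passages = retrieve(query, DOCS)
prompt = build_prompt(query, passages)

# An extractive system can only return text that is already in a passage.
quotable = is_verbatim("Paris is the capital of France", passages)
# A generated answer may combine passages and appear in none of them.
composed = is_verbatim("The Eiffel Tower is in France", passages)
```

In this toy setup the composed answer happens to be correct, but the substring check cannot tell a correct composition from a fluent wrong one, which is the difficulty with gauging correctness raised in the discussion.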

Michael Schiff: I don’t know if it’s in the paper, or if it was elsewhere, but there was the example of, like, how many calories are in an average apple, and the answer is that an average serving of an apple has a thousand calories. And if you don’t know anything about nutrition, you’d be like, okay, cool, that sounds good. You look back at it, a thousand, and it sounds so accurate and right. I think that’s where it comes back to this: the sequence models are generating a probable output sequence given the input sequence. And if trained well, the most probable output sequence is going to align with reality. But the degradation of that is going to be something that doesn’t align with reality, but does look plausibly human. So I think it’s interesting, the way we as an industry talk about hallucinations. It’s like this black and white thing, when really it’s sort of the continuum of what these things are expected to do.

Aman Khan: Exactly. Yeah. I feel like that’s kind of getting to some of the wrap-up, actually. So it says it makes it hallucinate less, with generations that are more factual, and offers more control. It generates hallucinations that sound less incorrect, is basically what they’re saying here, and I think that’s a really interesting nuance. It’s something to consider for the folks and researchers here as you’re looking at RAG models. I mean, we’re looking at the canonical paper here. I think another interesting insight is that they’re talking about GPT. These are older models. Parametric memory has gotten even better. So you could imagine that, okay, even if your retrieval is good, the parametric memory is solid. It knows enough about the world to reason about the context that you’re providing it.

The hallucination is going to sound even better, even if the top document retrieved is relevant. It’s something to keep in mind. I thought it was pretty fascinating that basically what they’re saying is that non-parametric memory just augments how good the hallucination might sound. I think that one came out of the video discussion with the first author, when someone asked him about that. So, any more thoughts as we round this out and open it up to any questions from folks in the community?

Michael Schiff: I’m happy to pass it off to the community if people have questions.

Aman Khan: I guess, while we wait, what’s kind of interesting as well is, we talked a little bit about this canonical paper. What are some other interesting applications you’re seeing of RAG today? I think we talked a little bit about index, yeah, offline. What are your thoughts on index and more modern approaches to RAG?

Michael Schiff: We’re seeing a lot of people take product documentation and use that as the index. I mean, that was even the example I pulled up, in terms of a car usage manual. But I think you’re seeing a lot of approaches where it comes back to this idea of hallucinations again, and what these models are really good at. We’ve seen that their ability to summarize works really, really well. And so there’s the degree to which you can turn this into a summarization problem where, rather than hallucinating an answer based on documents, it’s summarizing the documents retrieved and providing an interface onto a pre-existing problem. Your experience with the product documentation before might have been a keyword search, and then being presented with a list of answers, and you are probably pretty good at skimming through the words in that summary. That’s truly just an extractive summary. But a layer on top of that provides humanized summaries and guides you through that process, which is really a process you’ve already been doing, but more easily, and without having that kind of...

I don’t know if you’ve ever watched somebody else run a search engine query, someone who’s not necessarily doing it all the time. And it’s like they’re clicking the wrong links. And I can imagine, you know, assisting that process. And...

Aman Khan: Yeah, it changes your interactions. It changes what the queries even look like, right? Because you’re no longer optimizing for how Google returns the top index result, you’re actually framing a question and expecting an answer. And so I think that has some implications for design language, designing for these systems where it feels more like you’re interacting with an expert or a human on the topic. And it’s just something to consider if you’re building one of those systems: understanding what’s being retrieved, and how the token or sequence is being generated, is pretty important. So I think it was pretty interesting to dive deeper here, because you assume, like, yeah, I hooked up my agents to my vector database, and it’s gonna pull a top document and give me relevant results. But even then, there’s so much weight on the parametric memory that’s been augmented. Understanding how it works was, I think, a pretty big takeaway for me. Actually, that it hallucinates correctly is basically the takeaway: it answers questions correctly even when the correct answer is not in the retrieved documents.

Michael Schiff: Yeah.

Aman Khan: A lot of fun to read this one. I think we’ll probably do some more on retrieval systems. And you know, context augmented generation in the future. We got to talk about the foundational paper on the subject. Thanks for the time, Michael. And thanks, everyone.
