HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels

Sarah Welsh



In this paper reading, we explore HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels. HyDE is a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages. This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. 

Join us every Wednesday as we discuss the latest technical papers, covering a range of topics including large language models (LLMs), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.


Dive in:



Aman Khan, Group Product Manager, Arize AI: Good morning, everyone. Let’s give it a few minutes for folks to trickle in here. While we’re waiting for some folks to join, maybe we can do just a quick round of introductions, just so folks know who’s reading the paper and what our backgrounds are. I can go first. My name is Aman. I’m a Product Manager here at Arize, working very closely with all the folks on this call on our product roadmap and then working with the engineering team to make things happen. My background is technical, working at machine learning platforms at companies like Cruise and Spotify. So yeah, super excited to dive into this pretty awesome paper. I’m joined by Adam, Michael and Jason. Jason, maybe you’d like to go next?

Jason Lopatecki, Co-Founder and CEO, Arize AI: I’m the co-founder here at Arize. I’m Jason. Great to see everyone. Go ahead, Adam.

Adam May, Engineering Manager, Fullstack, Arize AI: My name is Adam. I’m managing the full stack team here at Arize. So we maintain the platform and build out new features. And I started out as a journalist actually but for the last 10 years I’ve been working in tech.

Michael Schiff, Chief Technology Officer, Arize AI: I’m Michael. I’m the CTO here at Arize. I helped start the company with Jason and then have been working on distributed systems and machine learning systems. 

Aman Khan: Hopefully, you’re a repeat attendee, but in case this is your first time, what we’re going to be doing is basically diving into a paper: dissecting it and our understanding of what the authors are intending to convey, taking apart some of the architectures and the data that go into the models we’re going to be talking about, and then looking at some real-world applications. Where would some of these papers be applied? Where might they fall apart? The intention is for all of us to learn as we go, so it’s definitely meant to be conversational. Think of it as all of us just reading this paper together. I believe the paper link will be shared out so you can follow along on your own screens. I’ll go ahead and share my screen, and if you have any questions, feel free to pop them in the chat and we can take them live as well.

So today, what we’re gonna be reading about is kind of a fancy title: “Precise Zero-Shot Dense Retrieval without Relevance Labels,” that has been colloquially known as HyDE. So this is a pretty recent paper. Adam, do you want to give us just a quick overview of what this means? For someone who may not have much context on what this is.

Adam May: Sure. So just to start with dense retrieval: the idea here is that the LLM may not be educated on the actual question you’re asking it, but given a corpus of documents that you’ve embedded, you can do some kind of search to retrieve the right document, include it in the context, and get a much more accurate answer to the question you posed.

So zero-shot here means without any kind of fine-tuning or training. You just give the LLM a question, it does its best to embed that question, find the most similar document for it, and come back with its best guess for the answer. The challenge is often that, without this kind of fine-tuning, it doesn’t always do a great job of retrieving the right document with the right context. There are a lot of different strategies that people have used to make sure it retrieves the right context, whether that’s adding another step to rank the documents after they’ve been retrieved, or fine-tuning the model or the embedding tool used to create the space, so that queries and documents are all sitting in the correct space together. But this paper poses a new option. Basically, the idea is that instead of just encoding the query itself, you can add a different step, and that step would be using an LLM, in this case an InstructGPT LLM, to generate a hypothetical document that answers the question.

It’s not expecting the LLM to generate a document that actually answers the question or has correct information in it. It’s more about creating this kind of structure. Then, instead of embedding the query, you embed this document, which gives you a higher probability of retrieving a relevant document that looks like it would answer the question.

And so the paper goes into how effective this strategy is. The authors believe it is more effective than plain zero-shot retrieval without any kind of fine-tuning or relevance labels. But there are definitely some caveats, and they are pretty good about explaining them along the way.

Aman Khan: Awesome, thanks for that overview. Just at a high level, there’s some interesting value here. We talked about RAG a couple of weeks back, retrieval-augmented generation, which is sort of the seminal piece on context retrieval for augmenting an LLM’s response. What’s interesting is there’s a ton of development happening in the retrieval space, and this one kind of makes sense: what if we threw an LLM at the retrieval component? Michael, could you talk about, if you’re building one of these systems, why do you need relevance labels? What is the value of zero-shot dense retrieval? Can you break down what those terms are in the architecture and where HyDE can help?

Michael Schiff: Well, so dense retrieval is that process of search via semantic embedding. Relevance labels help you fit the encoders that define that problem. A little bit later on, they have Equation 1, which basically defines the dense retrieval process as a maximum inner product search over two embeddings: the embedding of the query and the embeddings of the various documents. So in order to fit good encoders for that maximum inner product search, that is, to train a system where the inner product of those vectors represents relevance, you need relevance labels to fit those encoders. And so they state THE problem of dense retrieval in the zero-shot case as learning those two encoders without any relevance labels. Basically, this technique is a way of sidestepping that problem by saying: well, we’re not actually going to learn these two encoders simultaneously. We’re going to have a single encoder that’s trained on document-to-document similarity, which is a well-known technique for doing this without supervision, and just keep the whole problem in the document-to-document search realm. It’s an interesting way of sidestepping the problem, which I find is often really successful: if you can figure out a way to take a hard problem and make it a different problem, that tends to work out well if you know an answer to that other problem.
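
To make the inner product search Michael describes concrete, here is a minimal sketch in plain Python. The vectors are hand-written toy values, not the output of any real encoder, and the names `inner_product`, `retrieve`, and `doc_vecs` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Tuple


def inner_product(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: List[float],
             doc_vecs: Dict[str, List[float]],
             k: int = 1) -> List[Tuple[str, float]]:
    """Return the k documents with the largest inner product against the query."""
    scored = [(doc_id, inner_product(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values; a real system would get these from an encoder).
doc_vecs = {
    "wisdom_tooth_removal": [0.9, 0.1, 0.0],
    "lemon_cake_recipe": [0.0, 0.2, 0.9],
    "dental_insurance": [0.6, 0.6, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

top = retrieve(query_vec, doc_vecs, k=2)
print(top)
```

The search step itself is simple. The hard part discussed above is producing query and document vectors whose inner product actually tracks relevance, which is what relevance labels are normally used for.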

Aman Khan: It feels like the paper isn’t so much about the lack of relevance labels as it is about synthetic generation: here’s something that could stand in place of a relevance label to do this.

Michael Schiff: We don’t expect it to be factually correct. They actually expect that it will contain factual inaccuracies, but that it will capture the relevance. We can talk more about this later, but that’s where I have some thoughts about the edge cases and the places where, in their comparison, it starts to perform less well. The output is capturing some relevant structure that you would expect to find in an answer, maybe populated with incorrect details. And then they have the metaphor of the embedding of that document as a kind of compression, where you lose those inaccuracies. I wonder how much that is a good way of thinking about it.

Jason Lopatecki: And I wanted to add… It’s funny, I was talking to someone recently, and it’s almost like all the LLM app people have discovered search and retrieval, while all these other people have been working on search and retrieval for decades. What’s interesting is that when we say dense retrieval, there’s also a bunch of sparse techniques that have been around a while, in lots of different variations: TF-IDF, and indexes for word-specific lookups over millions and millions of documents. Contrast that with dense, where you’re compressing the information you’re looking for into an embedding, a vector representing all or the important parts of the information in the document, versus specific words. This dense versus sparse distinction is worth noting as we dive in. This paper is trying to improve the dense approach, which is embedding-based.
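
As a rough illustration of the word-based, sparse side of that contrast, here is a small TF-IDF scorer in plain Python. It is a simplified sketch (whitespace tokenization, an unsmoothed IDF), not how any production search index is implemented, and the three documents are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_scores(query: str, docs: Dict[str, str]) -> Dict[str, float]:
    """Score each document by summing TF-IDF weights of the query terms it contains."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return scores


# Made-up documents: "a" and "b" are about the same topic in different words.
docs = {
    "a": "wisdom tooth removal usually takes under an hour",
    "b": "extraction of third molars is a short procedure",
    "c": "bake the lemon cake for an hour",
}
scores = tfidf_scores("wisdom tooth removal", docs)
print(scores)
```

Document "b" covers the same topic as the query but shares no words with it, so this word-based score gives it nothing. That gap is the kind of thing an embedding-based approach is meant to close.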

Aman Khan: Jason, what’s the difference between dense and sparse retrieval? I think that’s an interesting note to touch on. They reference it here as well: dense retrieval has been extensively studied after the emergence of pre-trained transformer language models. So what’s the difference, I guess, between the dense and sparse approaches to retrieval?

Jason Lopatecki: Once you train a large model on a lot of data, you have the potential to reuse it. In the old form you’d call it transfer learning. You then potentially don’t need to build embeddings from scratch. You can use the model in new areas to create embeddings that represent a lot of data. So I think transformers empower you to create embeddings on lots and lots of data. I think that’s the big takeaway.

Aman Khan: And then in RAG that was the parametric versus non-parametric memory, where the parametric part was whatever is encoded in the transformer’s parameters, basically all those embeddings that are part of the architecture.

Jason Lopatecki: Correct, so you’re pulling out an embedding at the last layer that represents structure and the data in some way that’s captured into a vector in the space. Compare that to a more hand-drawn, sparse approach for searching stuff. It’s word based. You have a very different type of approach versus understanding the doc itself. 

Aman Khan: So interesting. So let’s take that. Instead of having to understand the doc, with HyDE, InstructGPT takes in some question, like “How long does it take to remove a wisdom tooth?” You wrap that query in a prompt and ask the LLM to generate some piece of text that can then be used to do the retrieval, to do the lookup. What’s interesting, as Michael touched on earlier, is that this doesn’t have to be factually correct, but it can give you a better lookup into your knowledge base than the raw query. Say you’re doing context retrieval at a dental company, and you have a lot of documentation about teeth and wisdom teeth, but this question is asking specifically about removal. Maybe the generated text adds more information, like the time it takes, or the type of tooth, or the process, that can then be used as a “relevance label” when it comes time for that retrieval. I thought that was a pretty good real example, but you had one yesterday that was about baking a lemon cake or something like that.
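
The “wrap that query in a prompt” step can be sketched as follows. The instruction wording and the `build_hyde_prompt` helper are assumptions for illustration rather than the authors' exact template, and no LLM is actually called here.

```python
def build_hyde_prompt(
    question: str,
    instruction: str = "Please write a passage to answer the question.",
) -> str:
    """Wrap a user question in an instruction asking for a hypothetical answer passage."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    # The trailing "Passage:" cue invites the model to continue with the document text.
    return f"{instruction}\nQuestion: {question}\nPassage:"


prompt = build_hyde_prompt("How long does it take to remove a wisdom tooth?")
print(prompt)
```

The resulting string would be sent to whatever instruction-following model you use. The text that comes back is what gets embedded for the lookup, in place of (or alongside) the original question.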

Michael Schiff: Yeah, it was on my mind. I’ve been baking over the weekend. You bring up the case of searching within your documentation. I think it’s interesting the way they cite some performance degradation as the corpus gets more narrow. I wonder how much of that is because the relevance you’re capable of capturing in the parametric memory of InstructGPT is high-level and structural. As you get more into the specifics of a search, you’re not looking for a document in the corpus of everything. You’re looking for a specific document in a corpus of documents that are very similar to it. Then that approach is going to begin to fail, because the things that make a document relevant in that long tail are going to be present in the factual details, not in the structural makeup of the document. In the corpus of everything, if I give you an incorrect recipe that is still structurally a recipe, that is probably going to help you more than when you’re looking for a good recipe in a corpus of recipes, where the structure is no longer indicative of high relevance.

Aman Khan: Basically, the alternative they’re comparing to is a fine-tuned dense encoder, which means you had your model trained on all of the data you have that’s structured a certain way: HTML markup, recipes that follow a certain format, a certain language. This is a way to work around that, I guess. Jason or Adam, what are your thoughts on the importance of the structure of the text that you’re feeding in, like how the docs are formatted? Is that basically where this starts to become useful, in capturing signal in the formatting and the structure of the text?

Adam May: Yeah, it’s funny, because if you go back to the first page, you do have a hidden prompt engineering problem here, which is: how can I correctly ask for a scientific paper passage, or for something in Korean? Or, if you were doing docs, can you ask “write me a technical document in Markdown” and get something that correctly matches the structure of the documents? You do get around one of the issues, which is using a question to pull out an answer when the two are structurally very different, so you might not get that kind of relevance between the two of them. So it does sidestep one problem. But I think Michael is correct in saying that there’s probably a limit to this. At the end of the day, all technical documents may not look exactly alike, but they have a pretty high structural similarity. For instance, there’s a bot that can look through the Arize docs. If you’re asking a very specific question, a lot of those docs might be structured very similarly, with the same Markdown structure and high-level sections. So you might not have as high a success rate when you generate that hypothetical document and try to make sure you’re getting the right one out of all of the different options.

Jason Lopatecki: The magic of this paper is that it uses the Contriever encoder, which I think of as an embedder, for those of you in the LLM app space. Basically it takes a document and maps it to an embedding in a latent space. In the example they gave in the beginning, they have a comparison from another paper that fine-tunes Contriever, or supervises it, for a search and retrieval task. In this paper they don’t do that. It’s unsupervised, it’s just a pure embedding model. You could fine-tune it for the task here if you wanted, but what they’re saying is that it’s a real pain to have to fine-tune for every use case. I was reading through wondering about that myself, but that feels like the big idea: by doing a larger training run on a large set of data and creating a large, or semi-large, unsupervised pre-trained model, you get around the issues with a small model. I felt like Contriever was one area of interest. The other one was this section here, which is the generation of the documents. I wasn’t sure at first what it looks like, but they’re generating multiple documents here. They sample InstructGPT to get several hypothetical documents, they embed them with Contriever, and then they take the average. And I thought it was interesting that they also add the embedding of the query into that average calculation.
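
The averaging step Jason describes can be sketched like this. The `toy_encoder` below is a hypothetical stand-in (a hashed bag of words), not Contriever, and the two sample passages are hand-written placeholders for what a generator might return. Only the averaging logic in `hyde_vector` mirrors the description above.

```python
from typing import Callable, List

Vector = List[float]


def toy_encoder(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a document encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket: sum of character codes modulo dim.
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def hyde_vector(query: str,
                hypothetical_docs: List[str],
                encode: Callable[[str], Vector]) -> Vector:
    """Straight (unweighted) average of the hypothetical-document embeddings and the query embedding."""
    vectors = [encode(doc) for doc in hypothetical_docs] + [encode(query)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


query = "how long does it take to remove a wisdom tooth"
samples = [  # placeholders for passages sampled from a generator
    "Removing a wisdom tooth usually takes about 45 minutes.",
    "It takes between 30 minutes and two hours to remove a wisdom tooth.",
]
search_vec = hyde_vector(query, samples, toy_encoder)
print(search_vec)
```

The resulting `search_vec` would then be used in the same inner product search as any other query vector, against document embeddings produced by the same encoder.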

Michael Schiff: So I think that’s where you end up wanting to fit both a document encoder and a query encoder. You don’t have to, do you? But you’ve assumed away the relevance labels you’d need to fit that. So you embed the query directly, and that query is going to look nothing like the document that answers it.

Aman Khan: That’s a good point, it’s interesting. There’s a question that someone asked: how does offloading relevance modeling from a representation learning model to an NLG model impact the effectiveness of this approach?

Jason Lopatecki: So the question there is, with this generative approach, which is essentially what they’ve taken, how does this make the final results better or worse relative to a representation learning model? It is basically just offloading the relevance.

Michael Schiff: And just using this kind of proxy for relevance. You no longer have to learn two representations. You don’t have to learn a representation of the query whose inner product with a representation of the document encodes relevance. We’re on Equation 5 now, but if you go up to Equation 1, they say that the hard thing is learning these two representations simultaneously: learning an encoding of the query and an encoding of the documents such that the inner product represents relevance. And so they say, we’re just not going to do it. But I imagine if you have relevance labels and a wide corpus of labeled relevance data with question and answer pairs, you will do better if you fit two representations. I just think their point is that this is more real-world: in general, you are not going to have access to such a labeled set of relevance data, and that’s the first step of the problem.

Adam May: And so many of the practical examples that we’re seeing, of people trying to use context retrieval to create a product that people will actually use, are using fairly small corpuses. So you may not be able to generate the kind of relevance labels you’d want in order to have the fine-tuned encoding that would actually create similarity between the question and the documents. And so this is a very practical approach. That’s actually how I first saw this paper: it was referenced by someone who is building LLM apps for retrieving over a small document corpus. And there were several people who agreed that this is a fairly effective way to do it. Given that you don’t have the ability to train, or you don’t have the corpus of information to train on, you can get pretty close to fine-tuned performance by just doing this process, which is in a lot of ways maybe less resource-intensive than some of the reranking that I’ve seen on the other end of the process, where you retrieve a bunch of documents and then rank them by relevance to the question.

Aman Khan: I’ll jump ahead a bit to the conclusion. They do actually talk about the trade-off here, which is the concept of relevance and how it is captured by the NLG model. They demonstrate that in many cases HyDE can be as effective as retrievers that learn to model relevance. But they’re not actually claiming that a weak retriever will suffice as the NLG models rapidly become stronger. So this actually works for now, if you’re just getting started, to jump past that step of having to train that retriever, train that embedding generator. But it’s most likely not going to solve your problem the whole way.

Jason Lopatecki: Do they have just the pure question embedded like the corpus of documents that they included for search?

Michael Schiff: I’m sure they make it available. But I haven’t had a chance to check

Jason Lopatecki: The thing I would like to see is the thing people are doing today, which is: embed your query with no additional fine-tuning, using the same document-to-document encoder. So you assume the same document encoder as the query encoder, encode the query directly, and do the inner product search on that. Because here they’re mixing in the query embedding, and maybe it’s not even a weighted average.

Michael Schiff: It’s a straight average. You had that highlighted at the end, relevance being a statistical artifact. I think it’s sort of an imperfect analogy to try to map these techniques onto human processes, but it’s been interesting to do, in the Minecraft Voyager paper and here. It’s reminiscent to me of being given a set of documents from a Google search. You have an instinctive sense of what is relevant before even really digging into the details. You know what a relevant document is going to look like, and you feel that when you see it, if you’ve done a lot of searches. This feels like a way of modeling that intuition in search. You have an idea of the shape of what you’re looking for. You don’t know the answers it will contain, but you can tell the document when you see it.

Aman Khan: We’re running some experiments and taking a look. What’s interesting is that the queries they have here are actual questions: how long does it take, how long is it. But how people interact with search now is different, like how you type into Google. If you’re talking to an LLM versus expecting to search over an entire corpus, how you phrase that initial query, and what InstructGPT is supposed to do with it, is so interesting to me as an interaction. Because you could say: why don’t you summarize this first? Why don’t you generate some intermediate step, and then create a generated document off of that? Interpret what the user is asking, break that down, provide more structure, and then feed that in. Maybe that gives an even higher-quality generated document, one that starts to look even more realistic. So it’s interesting that we’re moving further and further up in the zero-shot realm by trying to decompose the task of retrieval into outlined steps.

Adam May: In some ways it makes a lot of sense as well. You’re generating the document using the parametric memory the model has of what this document should look like, so its own intuition of what would be relevant is somewhat encoded in the hypothetical document. And it’s not perfect, right? If you ask how long it takes to remove a wisdom tooth, the real document can say it takes between 30 minutes and two hours, and when the hypothetical one says it takes 20 minutes, that relevant search result might still come through. But I am curious to see how it does when there’s less of a straight answer, where it’s more like, okay, this is actually a very large step-by-step process to construct this API query. Maybe the model wouldn’t have that type of relevance encoded in it to be able to generate a relevant hypothetical document. And that did bear out a little bit in some of the results.

Jason Lopatecki: It is worth noting, by the way, that LangChain does have a HyDE version here. The Contriever piece is swapped for embeddings from OpenAI. So you can actually compare approaches pretty easily using LangChain, which would be interesting to see. There’s a HyDE retriever, so if you have LangChain in production and you’re doing query retrieval, there’s an option to swap this in and test it.

Adam May: Does it say what it uses? Or can you swap out which model it uses to generate the hypothetical document?

Jason Lopatecki: I think the embedding model, the one you choose for embeddings, is what plays the role of Contriever. And I think you can swap the default, right? Same for your generator.

Adam May: Yeah, because one of the more interesting pieces from the results table is that the better the LLM you use to generate the hypothetical document, the better the performance, and the impact is pretty heavy. So this could also be one of those things where, as LLMs continue to get better, this method might continue to improve as well.

Jason Lopatecki: Yep. It’s also worth noting that, from the traditional space, there’s the MS MARCO dataset, there’s the BEIR benchmark, there’s a lot of stuff where the traditional side has done a lot of testing, which I feel like is not brought up much in the current space. I think a lot of people are just trying stuff out, seeing if they feel like it works, and feeling out the breakage before trying to go do a rigorous test on things. But there’s a question in my head: should you even use HyDE? How good is it relative to the basic query embeddings people are doing?

Michael Schiff: I would be interested in a test across a more diverse set of document corpuses. My intuition is that the success of this technique is going to depend significantly on what you’re looking for, and the corpus you’re looking for within.

Aman Khan: We got a question that is somewhat related, actually: “I’m curious if this approach is worth experimenting with: instead of giving supervision as pairs of query and documents, maybe we are able to just give a set of documents and influence the generated document, given a query, to be closer to the set of documents in the search space.” So yeah, that’s basically about running experiments in the space and seeing which approach works for your application, depending on what your query embeddings already look like and how your embedding spaces are set up.

Jason Lopatecki: I did want to give a plug: we do have a Pinecone, LangChain, and Arize workshop tomorrow. So maybe we can ask the LangChain team if they’ve tested HyDE. It’s for anyone who is interested in a little bit more on search and retrieval debugging and troubleshooting.

Aman Khan: Yeah, we’ll actually be building a system from scratch. So if you don’t know where to start, we’ll have a notebook that will be shared out that you can run through. We’ll build the system, test it out, and then query over it. Any more takeaways we want to touch on here? One interesting point, Michael, that you brought up yesterday when you were talking about this was that you view this as kind of like initialization in algorithms. If we’re running an experiment, you want to put bounds on this. Say, alternatively, that when you feed in a query, InstructGPT doesn’t use the query at all and instead just generates a random document. Why is this better than just generating a random document?

Michael Schiff: I mean, my connection to an initialization process is, I think, sort of related to Jason’s thought. If you’re going to assume I don’t have the information to train two encoders to represent relevance in the inner product calculation, and I just have my pre-trained Contriever that can represent document-to-document similarity, how much better is this than just taking the encoding of my query? In my mind, just encoding your query is like random initialization. And then, if we’re going to assume I can’t fit a query encoder here, and just encoding my query is not performing so well, can I initialize the process by which I eventually hope to have relevance labels and fit a better representational model? Can I, in the meantime, initialize my problem and start my search from a guess of at least what a document might be shaped like, if not what it contains?

Aman Khan: Are there any other retrievers that seem interesting to folks, any other retriever mechanisms that stand out? We kind of talked about it, but given supervised pairs of queries and documents, basically having those relevance labels, why would you use this approach over that? And then, what are some other retrieval approaches?

Jason Lopatecki: Yeah, I was going to mention that too. There are systems people already have built, like Elasticsearch, where you possibly could pull some relevant documents first, and then generate, and get yourself closer to the answer. So there might be ways of using systems that are already in place versus the purely generative approach, or using them in conjunction with it, which might have been the hint there.

Adam May: Yeah, they do talk towards the end of the paper about this being a good intermediate step. Once you have this in production for a while, and you collect a large corpus of data around questions, the correct relevant document, and the correct answer, you can continue to fine-tune this process at every step. You can fine-tune an encoder for queries and an encoder for documents separately, or perhaps, something that would be interesting, one unified encoder for both of them, so that you get more relevant answers. So I don’t think they’re seeing this as the one process that’s going to work for every stage of maturity. But as a bootstrapping mechanism to get you to the point where you can fine-tune things, it might be a very effective strategy. And a comparison I would like to see: if the performance gap between different levels of LLM for generating the document is so great, what would the performance be if we’re not fine-tuning the encoders but instead fine-tuning InstructGPT on the existing corpus of documents? Would it get better at generating hypothetical documents, which would in turn provide a stronger relevance mechanism for retrieving the right document? So I think there’s a lot of options and mixing and matching that we’re probably going to see.

Jason Lopatecki: Yeah. I can’t help but think about the number of LLM calls. We were doing search and retrieval with LlamaIndex the other week, and we were like, oh, maybe you should have an LLM call after your retrieval. And now we’re going to call out and do this generation before the retrieval. I hope these companies are making these fast and cheap, and I know they are, but that’s a lot of calls just to get an answer.

Michael Schiff: You do wonder at what point the cost-benefit tips, and you just train it to work correctly, or even apply a sparse approach that is not as sophisticated.

Aman Khan: Just building off that, Jason, because it’s relevant to the LlamaIndex example you gave: with LlamaIndex, you can rerank the retrieved context, or a few documents, before feeding them into the context for the LLM to generate the response. The authors basically say, just compare it to that. And if you don’t have relevance labels, this is a good alternative to having those labels or generating them. So yeah, I think it would be interesting to do some experiments and see how reranking works versus generating this hypothetical document.

Jason Lopatecki: Yeah. If anyone’s listening and in academia, there’s probably some research to be done on which retrieval methods we should be using.

Aman Khan: Yeah, this was a good one. Next week I think we’re doing one on another model architecture. I think it’s GLoRA, so building off of LoRA, which should be pretty interesting. It feels like the architectures that come out every week or every few weeks are real leaps forward, so it’s a pretty exciting time to be in the space. Definitely join us next week for that.