Embeddings: Discover the Key To Building AI Applications That Scale with Zilliz, Creator of Milvus

Frank Liu, ML Architect at Zilliz, discusses how embeddings can help you build generative AI applications that scale really well into production. This talk was originally given at Arize:Observe, in April 2023.

Frank Liu: Hey everybody, thanks for coming to my session today. And today I'll be talking about how embeddings can help you build generative AI applications that scale really well into production.

And first, a quick introduction to myself. My name is Frank, Director of Operations and ML Architect here at Zilliz. My socials are down there as well, so feel free to connect with me in any way. Happy to take any questions, comments, or concerns that you might have. Here's a little bit about our company as well, and again, our socials are there if any of you want to connect or join our Slack channel.

All right. So first things first, I'll give a quick overview of what I'll be talking about during this session. The first is a very, very brief introduction to autoregressive language models. I know there's been a lot of hype, and there are a lot of folks who really know a lot of detail about this, but this is more of a gentle introduction, or a gentle reminder, of how a lot of these autoregressive models work, in case folks need it. Then we'll be talking about the CVP framework, which is really something to help mitigate a lot of the hallucinations and other problems that autoregressive language models have. I'll talk a little bit more about hallucinations and why they come to be in the first section as well. Then we'll do a deep dive into Milvus.

Now, Milvus is the world's most popular open source vector database. During section two, you'll see how Milvus fits into the CVP framework and what it can really be used for. And finally, I'll end with a quick demo. Hopefully that'll be pretty interesting for you as well. So yeah, autoregressive LLMs, right? Or large language models, or foundation models, as a lot of people like to call them. What are they, and what are they good for? I won't really talk too much about the applications or the layers above these large language models, but we definitely know that we're in a huge hype cycle around ChatGPT and AI models, right? And there are a lot of other ones coming out as well; Claude and Bard are also currently in experimental mode, and there are many, many more that will be coming out this year.

And it really, really helps to see how we got to where we are today. If you look at where deep learning started, a lot of it was really centered around computer vision. The idea is that, if I take what you see here, sort of a context window, convolutions really attend to nearby positions within your input tokens. That's a pretty simplified view of how convolution works. And again, convolutions are typically applied to computer vision applications, but hopefully this gives you insight into where we came from and where we're going, right? So convolutions have this very, very fixed length; they have a very, very small context window. With attention inside a neural network, one that is trained with self-attention, or that has self-attention blocks, the context is fully global.

So you can see here, if I go back to this example: "Milvus is the world's most popular open source vector..." The word "database" is missing; you'll see why in a second. You can see that for portions towards the end of the sentence, it's very, very difficult to have context relative to the earlier portions of the sentence. And self-attention is really something that helps fix that, right?

So again, we have this global context, and you might be wondering why the rest of the sentence isn't here. It's because I got tired of drawing all these arrows, so I apologize for that in advance. But moving on from self-attention, if we look at how a lot of these generative models work, or how they're able to generate text so well, they use something called causal attention. Causal attention is essentially where words that are later on in your sentence can attend to positions that are a little bit earlier. So in this case, "world" can attend to "Milvus," but not the other way around. Otherwise, it wouldn't really be a generative, or cause-and-effect, type of model. And a key takeaway from all of this is that a lot of these generative models are stochastic. They're probabilistic. They're used to predict future tokens.
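To make the causal-attention idea concrete, here's a tiny NumPy sketch of a causal mask. The six-token sentence and the random attention scores are purely illustrative, not taken from any real model:

```python
import numpy as np

tokens = ["Milvus", "is", "the", "world's", "most", "popular"]
n = len(tokens)

# Causal mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

# "world's" (index 3) can attend to "Milvus" (index 0)...
assert mask[3, 0]
# ...but not the other way around.
assert not mask[0, 3]

# Applied to attention scores: disallowed positions get -inf before softmax,
# so they receive exactly zero attention weight.
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

This is exactly what makes the model "cause and effect": each row of `weights` distributes attention only over earlier tokens, so generation can proceed left to right.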

Now, if we go here, for example, we take "Milvus is the world's..." and predict what might come after that. In this particular example that you see here, "Milvus is the world's most popular vector database." It could also be "vector search engine" or "vector embedding engine." These are some of the potential tokens that could be used to fill in the sentence that you see here. And a major downside to all of this is that it introduces hallucinations: plausible-sounding but actually factually incorrect responses generated by LLMs, by these autoregressive models.

And these hallucinations really are a pretty big problem, I would say, for putting a lot of generative AI applications into production. I know a lot of folks enjoy having a mathematical foundation for how some of these things work. Essentially, the goal here is that we have some tokens and we want to predict the next token after that. What generative AI models do is simply model the probability that a particular token comes after your input tokens.

So in this case here, the probability that my next token is "database," given "Milvus is the world's most popular vector" something, would be the highest. This is the probability that a well-trained generative model could potentially output, right? And again, going back to the topic of hallucinations, I sort of stole this particular diagram from Yann LeCun's slides. If you look at these models from a stochastic perspective, or from a probabilistic perspective, I think you can see why they hallucinate, right? You have these input tokens, and perhaps you want to generate a thousand words or two thousand words of text.

And all of a sudden, if you have one word that is wrong, your entire result can go a little bit off track. That combined reduction in the probability of the output being correct is really one of the reasons that hallucinations occur, right? One way, and I would say the main tool, that we can use to solve the problem of hallucination is something that we call the CVP framework. I'll talk a little bit about what C, V, and P mean in the next couple of slides. But first, I want to give an example of what hallucination is, right?
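The two points above, next-token probabilities and compounding error, can be sketched in a few lines. The toy probability table and the per-token accuracy figure are made-up numbers for illustration only:

```python
# Toy next-token distribution: P(next | "Milvus is the world's most popular vector")
# These probabilities are invented for illustration.
next_token_probs = {
    "database": 0.72,          # the factually expected continuation
    "search engine": 0.15,
    "embedding engine": 0.08,
    "toaster": 0.05,           # low-probability nonsense is still possible
}

# Greedy decoding picks the argmax; sampling can pick anything,
# which is one door through which hallucinations enter.
greedy = max(next_token_probs, key=next_token_probs.get)

# Over a long generation, per-token error compounds: if each token is
# independently correct with probability p, a 1000-token output is fully
# correct with probability p**1000, which decays fast even for p near 1.
p = 0.999
prob_all_correct = p ** 1000  # roughly 0.37
```

That exponential decay is the probabilistic intuition behind the compounding-error argument: one bad token early on drags every subsequent token's distribution off track.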

So if I query ChatGPT, or any other autoregressive language model, with "perform a query using Milvus," Milvus being a vector database, which again we'll get into, you'll actually see that the answer looks right for the most part. But it actually isn't. And the reason is that in Milvus 2.0, there is no notion of a Milvus client. Instead, connections are made via the connections module, right?

The solution to hallucination, and again, this ties in very much with the CVP framework, is injecting domain knowledge into these LLMs. What I mean by domain knowledge is that if you are, let's say, a financial institution, or you're in the security sector, you may want to create an internal chatbot, or something that is really targeted towards the application that you're trying to build. By finding documents that are relevant to your query input and having those as a prompt in, let's say, ChatGPT or anything else, you can significantly reduce hallucinations and get better results overall. So let's go back to the previous example here, where there's clearly a Milvus client being hallucinated. If we instead put it in this demo that I'll show you a little bit later on, we can see that it actually fixes the problem, and now we use connections. That's one example of how we can use domain-specific knowledge to really fix a lot of these hallucination problems that ChatGPT and some of these other autoregressive models present.
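Here's a minimal sketch of the "inject domain knowledge as a prompt" idea. The `build_prompt` helper, the prompt template, and the document snippet are all illustrative stand-ins, not OSS Chat's actual implementation:

```python
def build_prompt(question, retrieved_docs):
    """Prepend retrieved domain knowledge to the user's question."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# A retrieved snippet containing the correct, up-to-date API fact.
docs = [
    "In Milvus 2.x, clients connect via connections.connect(), "
    "not a MilvusClient object."
]
prompt = build_prompt("How do I query Milvus from Python?", docs)
# The LLM now sees the correct API inside its prompt, which steers it
# away from hallucinating a non-existent client class.
```

The final prompt would then be sent to the model in place of the bare question; the model answers from the supplied context rather than from stale or missing training knowledge.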

OK, so now I want to talk about this CVP framework. This is really, I would say, the bread and butter of what I think a lot of future applications that leverage LLMs are going to look like. The key idea here is that we can think of these apps built atop large language models, or autoregressive language models, as a general-purpose computer. Hopefully this will be a little bit clearer in the next couple of slides. But the first thing that you need is a processor. On top of the processor, you'll need some persistent storage, and you'll need code that takes what's in that storage and does some useful work, some useful computation, over it. And this is really where C, V, and P come into play. C is ChatGPT or any other large autoregressive language model, and you can really interpret this model as a large processor block for the CVP framework. V is any vector database, Milvus being an example, and this can be interpreted as the storage block for CVP. And you may want some other cache layers as well; we can get into that a little bit later.

P, in this case, is prompt-as-code. This essentially just means that instead of the machine code of a traditional computer, the zeros and ones that represent your instruction set architecture, the code here is simply natural language. We use natural language to make the processor do what we want it to do. So that is C, V, and P, right? And how is this typically implemented in practice? Hopefully this will be much clearer as we progress through this presentation. But the first thing that you would do is you would have a knowledge base, a particular set of documents. Those would go through some deep learning models to get turned into vectors, and you'd use a vector database such as Milvus to perform nearest neighbor search. Vectors that are closer to each other are more similar, or more relevant, to each other.
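The nearest-neighbor step can be sketched with cosine similarity over a tiny toy knowledge base. The three-dimensional vectors and text snippets here are made up; a real system would use high-dimensional embeddings from a model and an ANN index inside a vector database like Milvus:

```python
import numpy as np

# Toy "knowledge base": three snippets and their (fabricated) embeddings.
kb_texts = [
    "how to connect to Milvus",
    "how to bake bread",
    "what is a vector index",
]
kb_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.5, 0.8, 0.1],
])

# Fabricated embedding of the query "connecting to Milvus from Python".
query_vec = np.array([0.85, 0.2, 0.05])

# Cosine similarity: vectors pointing in closer directions are more relevant.
sims = kb_vecs @ query_vec / (
    np.linalg.norm(kb_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = kb_texts[int(np.argmax(sims))]
```

The snippet with the highest similarity is what gets pulled into the prompt; Milvus does the same thing at scale, over millions or billions of vectors, using approximate nearest-neighbor indexes instead of this brute-force loop.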

In this case, you can imagine if I'm building an application on top of ChatGPT, I'll have queries and responses, and you can really map those queries to each other much better using embedding models, right? Now, this is a very, very simplified view of things. For example, ironically, for an application that utilizes CVP, you wouldn't want to use autoregressive models to generate the embeddings. You would want to use something that's trained either using MLM, masked language modeling, or some other bi-directional, regular self-attention, right?

So this is a very high-level overview, and also a sneak peek of the demo that I'll do a little bit later. This is something that we built internally here at Zilliz called OSS Chat, and OSS Chat, I think, is essentially a manifestation of the CVP framework. Essentially, what happens here is that, first, we have the vector database, Zilliz Cloud. It is storing a lot of information about open source projects. And where do we fetch this information? From GitHub, from the documentation, from a variety of publicly available sources about that particular open source project. These docs are then chunked, turned into embeddings, and stored inside of Zilliz Cloud, which is built on Milvus. Then these go into a pipeline where a user will ask a question. That question will get matched to the relevant documents inside of Zilliz Cloud. Zilliz Cloud will find, more or less, the documents that pertain to the answer, and then ChatGPT, using those relevant documents as a prompt, will figure out how to best respond to that question, beyond whatever knowledge it was trained on. Right? So that's really OSS Chat as an application in a nutshell. And as I mentioned just a tiny bit earlier, OSS Chat is a manifestation, a great example, of the CVP framework that we were talking about back here.
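The "docs are then chunked" step can be sketched as simple fixed-size chunking with overlap. OSS Chat's real chunker is more sophisticated; the chunk size, overlap, and helper name here are illustrative assumptions:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping character chunks for embedding."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "Milvus is an open source vector database. " * 10
chunks = chunk_text(doc)
# Consecutive chunks share `overlap` characters, so no passage is cut
# off without context on at least one side.
```

Each chunk would then be run through an embedding model and inserted into the vector database; overlapping chunks keep retrieval robust when an answer spans a chunk boundary.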

Okay, awesome. So now that we know how vector databases fit into the CVP framework, and how CVP can be used to mitigate a lot of the hallucinations that you see with large autoregressive language models today, I want to do a bit of a deep dive into Milvus. Now, I won't spend too much time here, but hopefully by the end of this section, you'll have an idea of what Milvus is and how you can use it to build a production-ready LLM application, right? And production-ready is something that we put a very, very strong emphasis on here at Zilliz and within the Milvus community as well.

So, a quick overview of what Milvus is. Milvus is an open source vector database that is purpose-built to store, index, and query large quantities of embeddings. There's a link to Milvus down there. I highly encourage everybody to go take a look at some of our docs and download Milvus. It's free to use, so go ahead and play with it. A key distinction between Milvus and Zilliz is that we are key backers of the Milvus project, and Zilliz Cloud, which is the managed service, is built on top of Milvus. So it's backed by many, many months and years of engineering work that have been put into building a solid, production-ready vector database. And I want to talk very, very briefly about the Milvus architecture.

Again, I won't go too much into detail here, but essentially, you have four different layers, and I'll go over each of these one by one. Hopefully, by the end of these couple of slides, you'll understand why Milvus is production-ready and why we designed it the way that we did. So this is the Milvus architecture. The first thing that you'll notice is the access layer, which is really there to take user requests and route them to the right place. I won't go too much into DDL, DML, and DCL here; just know that the access layer is used to route your request to the right place inside of Milvus. We have a coordinator layer here as well: there is one coordinator service for each of the different types of worker nodes. So we have query nodes, data nodes, and index nodes. And depending on the type of application that you build atop your LLMs, you might want to increase the number of queries that you do, or you might want to increase the number of inserts that you do. You have a very, very flexible way to do that inside of Milvus as well.

The worker layer, as I mentioned, is composed of these different types of nodes, which are stateless and do individual things. As you can imagine, query nodes are there to handle queries, and index nodes are there to do the indexing. And there's a storage layer that is built on top of cloud-native platforms as well. For example, we have object storage based on AWS S3, and there's also message passing inside of the Milvus vector database. Again, I won't go too deep into this, but that is one of the main components there. So what are the key takeaways from this section? The first is that there is a single coordinator instance per service type. This is a big reason why Milvus is so flexible, and why we've seen over 1,000 enterprise users across a variety of industries using Milvus today; it's because of that flexibility. Data in Milvus is stored in collections, and a collection can further be divided into partitions. We have very, very explicit disaggregation not just of data and compute, but of the different types of compute as well: querying, indexing, and data. Message streams are also really core to Milvus, and that's what allows us to do both real-time and batch processing. It's one of the reasons why Milvus has such a powerful architecture. All right, so now that we've gotten all of that out of the way, I do want to spend a little bit of time on a quick demo. This demo will leverage OSS Chat, and as I mentioned a little bit earlier, the architecture for OSS Chat looks a little bit like this.

And again, to give everybody a bit of a refresher: essentially, what OSS Chat has done is index a lot of documentation and a lot of code about these open source projects. When a user asks OSS Chat a question, the data on those open source projects will get queried from a Milvus instance, in this case from Zilliz Cloud. It'll get passed into ChatGPT as a prompt, and ChatGPT will then answer questions about that. So I'm going to stop sharing this window right here and share my other screen instead. All right.

So this is the interface for OSS Chat. What I want to do here in this demo is just a simple side-by-side comparison of OSS Chat relative to GPT-3.5, or ChatGPT, in this case, right? With the particular architecture that we've set up, OSS Chat is available for everybody to use; as I mentioned, I highly encourage you to come online and try it out. What I'm going to do is ask OSS Chat and ChatGPT two different questions about open source projects, and we'll really see how the CVP framework can be used to fix a lot of the existing issues around hallucination and lack of domain knowledge that we see in these generative AI models. So the first thing that I'm going to do is ask OSS Chat about LangChain.

I'm sure a lot of the folks listening to this presentation have heard about LangChain, but if you're unfamiliar with it, LangChain is essentially an open source project that you can use to build a variety of different LLM applications. So the first thing I'm going to do is actually ask ChatGPT: what can I build with LangChain? Whoops, let me refresh here. And as you'll see here, it doesn't quite get LangChain 100% right.

So when I ask it, "What can I build with LangChain?", it says that LangChain is a new technology that combines the power of blockchain and natural language processing. A big reason why ChatGPT doesn't quite get it right here is simply that it doesn't have the right knowledge; it wasn't trained on knowledge from the point that LangChain was created. If I take the same model, which in this case, as I mentioned before, is GPT-3.5, and find relevant documents about open source projects relevant to LangChain, I can then get GPT-3.5 to answer the question much better. So let me do the same thing: I'm going to type the same question, except in OSS Chat. It'll think for a while here as it retrieves the relevant documents, and in this case, it gives a much, much better, albeit pretty concise, response.

So you can use LangChain to build a variety of applications, as I mentioned, such as Q&A, different chatbots, and a variety of other apps as well, right? It is a bit concise in this case, but depending on how you prompt it, you can get a lot more information as well. From here, the second example that I want to give is related to Milvus. It's very, very similar to the example that I gave earlier, but I want to show it to everybody here in action. Now, ChatGPT, or GPT-3.5 Turbo, whatever you want to call it, has a lot of information about Milvus internally, and it has relevant documentation too, but in this case, it just doesn't quite get it right. So let me type the prompt here, where I'm going to ask GPT to tell me how to query an instance of Milvus, a Milvus vector database. Very, very similar to the example that I showed above; I'm asking it to show me some Python code as well. And as you can see, unfortunately, it doesn't quite get it 100% right. The interface with Milvus is actually done through connections, and not a Milvus client, as you see here.

If you were new to Milvus, or you're figuring out how to use it, this sounds very, very plausible. It sounds very, very reasonable, and you might say, hey, I'm going to use this code, I'm going to try it, and hopefully it works. But it won't quite work, unfortunately. If I inject GPT-3.5, ChatGPT, with more domain knowledge about Milvus, which I'm going to do in OSS Chat here, typing the exact same query, you'll see that it actually gives a much better response. So wait for it to finish thinking there. And again, as I mentioned a little bit earlier, comparing these two side by side, you'll see that while ChatGPT gave me a response that involves a non-existent Milvus client, even though all of it sounds very, very reasonable, OSS Chat, which essentially leverages vector databases, leverages Zilliz Cloud and Milvus plus ChatGPT, gives me a much, much better response. It actually gives me, in this case, a correct response as well, using the connections object from PyMilvus.

Now, these are just two examples of a wide variety that you can try on OSS Chat, and we're continuing to add projects every single week. So if there's any favorite project of yours that you want to see on OSS Chat, let me know. Shoot me a message, and we'll see what we can do. One last thing that I want to mention: as you use OSS Chat, you may find that some responses are much, much quicker than others. The reason for that is simply that we have integrated a new Zilliz project called GPTCache, one word, inside of OSS Chat, which allows it to cache some of the most commonly asked questions. So if you see OSS Chat responding very quickly in some instances, that's why.

If anybody's interested in GPTCache, which is essentially a semantic cache for large language models, feel free to Google it, or feel free to get in touch with me as well. Happy to send you a link to the open source code that we have on our Zilliz repo.
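The idea behind a semantic cache can be sketched in a few lines: reuse a stored answer when a new question's embedding is close enough to a cached one. The `SemanticCache` class, the similarity threshold, and the two-dimensional vectors are stand-ins for illustration, not GPTCache's actual internals:

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: hits on embedding similarity, not exact match."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def _cosine(self, a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query_vec):
        for vec, answer in self.entries:
            if self._cosine(query_vec, vec) >= self.threshold:
                return answer  # cache hit: skip the expensive LLM call
        return None  # cache miss: fall through to the LLM

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))

cache = SemanticCache()
cache.put(np.array([1.0, 0.0]), "Use connections.connect() in PyMilvus.")
hit = cache.get(np.array([0.99, 0.05]))  # near-duplicate question: served from cache
miss = cache.get(np.array([0.0, 1.0]))   # unrelated question: goes to the LLM
```

Because the hit test is on embedding distance rather than exact string match, differently-worded versions of the same common question can all be answered instantly from the cache.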

So that is the end of the quick demo that I wanted to do there. I'm going to stop sharing and go back to my slides. And that is really the end, right? Again, as I mentioned earlier, OSS Chat is available for everybody to use. I highly encourage you to go try it out and play with it a little bit; we would love your feedback, and let us know if there are any other open source projects that you want to see on the website as well. So thank you, everybody, for bearing with me through this session. Feel free to visit zilliz.com to learn a bit more, or check out Zilliz Cloud if you're interested.

And yeah, happy to take any questions or any other sort of comments or concerns that folks might have as well. Thanks for listening.
