Arize:Observe 2023

How LlamaIndex Brings your Data to LLMs

One of the key questions that everyone has when they start playing around with LLMs is: how do we best augment large language models with our own private data? Jerry, Co-founder and CEO of LlamaIndex, discusses how LlamaIndex brings the power of LLMs to your data.

Jerry Liu: Hi, everybody. My name is Jerry, Co-founder and CEO of LlamaIndex. Today's talk is about how LlamaIndex can bring the power of LLMs to your data. I'll share these slides in the community Slack channel afterwards. And if you have any questions, feel free to drop them in the chat, and I'll hop on for a live Q&A session afterwards as well. Great.

Let's get started.

So a lot of you might know that, you know, language models are great. They're a phenomenal piece of technology for knowledge generation and reasoning. Everybody's using ChatGPT these days, and people are using it for a ton of different use cases. And the other thing about these language models is that they're pre-trained on just massive volumes of publicly available data. So everything from Wikipedia articles, to everything you can find on the web, to a bunch of information that's more or less in the public domain. As a result, you can use them for a ton of different types of use cases, for instance, question answering, text generation, summarization, and planning.

But I think one of the key questions that everyone has when they start playing around with LLMs, or large language models, is: how do we best augment large language models with our own private data? Think about whether you're an individual or an enterprise. As an individual, you might have a collection of different private notes and files on your computer. And if you scale this up to an enterprise setting, you have a ton of different workplace apps that you're using, whether it's Notion, Slack, or Salesforce, and a ton of data lying around in your data lake, and it's very heterogeneous. So whether your data is structured in, for instance, a SQL database, whether it's in object storage like AWS S3, or even if you're using a vector DB or document store, you have a ton of data lying around that's private to you. And one of the questions that many people are thinking about is: how can we get a technology like ChatGPT or any language model to understand our own private data, for use with all these downstream use cases like question answering, text generation, summarization, and planning?

So these days, there are a few paradigms for trying to insert knowledge into a language model. One is fine-tuning, which is really about baking knowledge into the weights of a given network. So for instance, if you have this knowledge corpus of text data, you could initiate some sort of training process, whether that's gradient descent, reinforcement learning like RLHF, or any other sort of optimization process, to make sure that this knowledge is actually encoded in the weights of the network itself. What this means is that you're basically modifying the weights of the network to incorporate any new information. These days, this notion of fine-tuning, or any training process like distillation, has certain downsides. There is a belief that it will get better in the near future, in which case that would be quite an interesting separate topic to discuss. But at least these days, there exist a few downsides that make it a little bit harder to adopt.

One is the fact that you need to spend some effort actually preparing the data in the right format in order to fine-tune. Another is that there is a certain lack of transparency when you initiate this optimization process to incorporate knowledge into the weights, because it's really hard for users to actually peek into the weights to see whether or not the knowledge has been included. Another general downside is that, especially if you're a casual user without a ton of ML experience who is just trying to fine-tune on some initial datasets, you don't necessarily have the tools to understand whether or not it's working well. And the other part is that it can be quite expensive if your data volumes are pretty large.

Another paradigm for inserting knowledge is this notion called in-context learning, which is basically about putting context into the prompt. This is less about learning and more about how I find the best input and do the prompt engineering to make sure that when I send this input prompt to the language model, it has all the context information that I need, and I can get back the output that I want. This is starting to become pretty common for builders in this space, and it's one of the main focal points of LlamaIndex at the moment. Let's say you have a large knowledge corpus.

Say in your Notion database, you have just a lot of text documents, and in this case, let's say it's a biography of an author, specifically Paul Graham. So this is an example dump of one of his essays, and let's say we want to ask a question over one of these essays. What we would do is have a retrieval model that's able to retrieve information from the knowledge corpus and put it into the input prompt, into this section right here, and then you would structure the entire input prompt template as follows. It would look something like "here's the context," and then the context would be inserted here, you know, "before college the two main things are...," and then "given the context, answer the following question." And let's say the question is, you know, "What did the author do growing up?" You would pass the entire input prompt into a language model and get back an answer.

As a paradigm for inserting knowledge, it turns out this is actually quite easy and straightforward for a lot of people to use, because you just put stuff into the input prompt space. But there are a few challenges, and one of the key tool sets that LlamaIndex offers is a way to resolve these challenges and help you scale up to larger corpuses of your data.

So for instance, a few of these challenges are as follows. One is, how do you actually retrieve the right context for the prompt? Given the question that you have or the input task that you want to solve, you have a large knowledge corpus, sometimes gigabytes or even terabytes in size. How do you actually retrieve the relevant context for the task at hand? And how many tokens are you using in the process? Next, how do you actually deal with long context? Let's say you want to perform a summarization task over a long document. If this document is too long to actually fit into the prompt window, what are strategies for dealing with summarization as a whole? Next is, how do you deal with unstructured, semi-structured, and structured data? Data can be very heterogeneous. It could take on a very structured format. It could be JSON files. It could just be unstructured text as well. It could also be multimodal, like images, audio, and more. So how do you incorporate all this data and somehow index it and structure it in a way that you can actually feed into your language model? And how do you handle data that's potentially very large, like gigabytes or terabytes big, and how do you trade off between performance, latency, and cost? So that's one of the core goals of LlamaIndex, which is an interface between your data and your language model.
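As a rough illustration of the "context, then question" prompt template described above, here is a minimal Python sketch. The template wording, the `build_prompt` helper, and the character budget are all assumptions made for illustration; they are not LlamaIndex's actual prompt or API.

```python
# Hypothetical template: the wording is illustrative, not LlamaIndex's real prompt.
QA_TEMPLATE = (
    "Here is the context:\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context, answer the following question: {question}\n"
)


def build_prompt(context_chunks, question, max_chars=2000):
    """Join retrieved chunks into the context slot, cut to a crude size budget."""
    # A character budget is a rough stand-in for a real token limit.
    context = "\n\n".join(context_chunks)[:max_chars]
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    ["Before college the two main things ..."],  # placeholder for retrieved text
    "What did the author do growing up?",
)
print(prompt)
```

The resulting string is what would be sent to the language model; everything about retrieval is hidden behind whatever produces `context_chunks`.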

Our goal is to make this interface fast, cheap, efficient, performant, and easy to use. And we want to satisfy all these different dimensions so that we're the central toolkit in thinking about how you augment a language model with your own data. So we consist of three different components right now. You start with the data connectors, where you have some data sources and you ingest them in a format that you can then use with LlamaIndex and, as a result, your downstream language model. We offer this through a site called LlamaHub, where you can actually connect your existing data sources and data formats, and we offer over 80 different data connectors of all different formats and styles, so that you can hook up any service or file format that you want into your language model application. The next step that we offer after that is data indexes.

So how do you actually structure your data for different types of use cases? Given that you've ingested these different types of documents from your data sources, how do you actually split them up to find relationships between nodes and have a way of organizing this information such that regardless of the task that you want to solve, for instance, question answering, summarization, planning, you're able to use this index to retrieve the relevant information. And so going hand in hand with the data indexes is the query interface, where given the fact that you have a set of data structures over your data, how do you actually feed in some sort of input prompt and get back both a response that is knowledge augmented as well as the retrieved documents themselves?

So we can kind of see ourselves within this overall stack of language model tools and services as being the central interface and data management system, specifically in the service of LLMs. On the top, you have this application layer, for instance, whether you're using an outer agent or chatbot abstraction, like a ChatGPT plugin or a LangChain agent, or just in general any application framework that you choose to use. Below that, we are the management system over your data. And so to be clear, you know, we don't replace your existing data stores or your vector DBs or your structured DBs. We orchestrate the transfer of information within your data stores and structure it and index it in a way such that it's really, really easy to use with your language model. And so we handle a few different components. We handle the ingestion through our data connectors and through our service called LlamaHub. We offer the ability for you to structure your data and basically use an existing storage system as the store for this data. And then we also offer advanced query functionalities, and we'll get into that in just a bit. In general, you can see us as a black box where you put in an input as a natural language query, and the output you get back is a synthesized response as well as the retrieved documents.

So let's go through each section a little bit more deeply. Our data connectors are powered by LlamaHub, and they allow you to easily ingest any kind of data from anywhere into unified document containers. They're powered by a community-driven hub, and it's rapidly growing. And this number at the moment actually has been updated. It's over 80. And we have growing support for multimodal documents as well.

So in addition to text, how do we actually ingest image data as well? And these data connectors are simple to use. You can do from llama_index import download_loader. And for instance, if you want to use our Notion loader, simply download the Notion loader, add a few lines to specify the authentication tokens, and then finally just do documents = reader.load_data. And now you have this central document format that you can then use with LlamaIndex.

The next step is our data indexes and query interface. And we'll go into a bit more detail here in just a little bit. But fundamentally, our data indexes help to abstract away some common boilerplate and pain points for in-context learning. They structure your data, chunk it up, and store relationships in an easy-to-access format, so that you can insert this context into the prompt for in-context learning. It's also a way for you to deal with prompt limitations. So instead of having to stress and figure out how you actually cram all this context into, say, the 4,000-token window for Davinci or the 8,000- or 32,000-token window for GPT-4, how do you actually store this data in some external format, so that even if the prompt window is limited, we can retrieve and sequentially call the language model to retrieve data from our index? And we take care of getting your data into the right formats and into the right chunk sizes.

The next part is a query interface on top of these indexes to retrieve and synthesize information. As a very basic example here in this diagram, you can see that the usage is pretty simple. The idea is that given an index of your data, you could send in a query, which is basically an input prompt that you would typically send to a language model. It would look something like this: "wiki, write a one-page onboarding document for new hires." This would typically be a prompt that you would send to something like ChatGPT, but with the addition of the fact that here we explicitly want it to synthesize information from your existing data. And you would get back a response that contains both the raw output and the retrieved source documents.

So let's just go through some basic examples of index data structures. We put two here for now, but a lot of this information is found in our documentation. And we'll cover a bit more as to the use cases of what each index and functionality in our toolkit is good for in just a little bit.

So probably one of the more common paradigms these days is using a framework similar to our vector store index. The idea here is that you first ingest the data from your source documents, for instance, your Notion database, your PDF documents, your PowerPoint files, images. And then what we would do is split up the text into chunks and store each chunk as a node. Each node would be associated with an embedding for that node. The embedding could be generated from OpenAI's API, or it could be generated from another embedding model as well. But the idea is that we would generate an embedding that's associated with each node and store this in a vector database. And we integrate with a lot of downstream vector databases, for instance, Chroma, Pinecone, Weaviate, Qdrant, and more.

And we would use this as the underlying storage to store the node along with the embedding. So this is our vector store index. During query time, what people typically do here is take in a natural language query. You would first generate an embedding for that query, and then you would use that query embedding to retrieve the top-k nodes from your vector database. It could be a top-k value, or it could be through some similarity score or equivalent hyperparameter. And the idea is that you would retrieve the nodes representing your context from the vector store.
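To make the chunk, embed, and top-k flow above concrete, here is a small self-contained Python sketch. It is not LlamaIndex code: the `ToyVectorIndex` class is hypothetical, the "embedding" is a bag-of-words count standing in for a real embedding model, and an in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: each node is (chunk text, embedding)."""

    def __init__(self, chunks):
        self.nodes = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        """Embed the query and return the top_k most similar chunk texts."""
        q = embed(query)
        ranked = sorted(self.nodes, key=lambda n: cosine(q, n[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


# Illustrative chunks only.
index = ToyVectorIndex([
    "The author wrote short stories before college.",
    "The airport in Boston is called Logan.",
    "The author started programming on an IBM 1401.",
])
top_nodes = index.retrieve("Which airport is in Boston?", top_k=1)
print(top_nodes)
```

The retrieved chunks would then be passed, along with the query, to a response synthesis step.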

So now you've retrieved the relevant nodes, and then, along with the query, you would feed them to a response synthesis module to give you back the final answer.

Another example that we have here is a list index. This is also a very simple data structure. And it, in fact, is even simpler than the vector store index. The core idea here is that you would have a document or set of documents. You would ingest it in a format such that it's just a flat list of nodes. And you could have embeddings associated with these nodes, but you don't have to. And the idea is that by default, it's just a linked list of nodes.

During query time, what this means is that instead of doing top-k retrieval over your nodes by embedding similarity, we would just retrieve all the nodes in the list index and put them into the response synthesis module. So what's the use case for this? I mean, you could think about, for instance, certain types of queries where you don't actually want to fetch a specific piece of context. You actually want to go through all the context. This can have higher latency, but the advantage is that you're actually able to go through and synthesize all the information across all the nodes. And for instance, if you want to summarize the entire document, or you want it to kind of write you a biography, this is an example of what this index would be used for.

Going really quickly through some of the response synthesis modules, there are a few paradigms for how you actually want to synthesize an answer once you have a set of candidate nodes. So for instance, one example is this idea of create and refine, where given a query, you can actually go through each node sequentially to generate an intermediate response, and then combine all the intermediate responses in sequence until you finally get a final response over here. So the idea is that you first run the query over node one and get back an initial answer, and then feed that initial answer plus node two's context plus the query into a new prompt, and then get back another answer. And you just do this sequentially until you get back a final answer. Another example is this notion of what we call tree summarization, where you would ask the query over each node independently, get back an initial answer, and then basically build a tree of answers.

So for instance, take the answer of every two or five or 10 neighboring nodes, get a parent answer, and then do the same. And then hierarchically do this until you get one final answer at the end. Some of the more advanced ways of defining structures over our data include composing a graph of index structures over your data.
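The two response synthesis strategies described above, create and refine and tree summarization, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, prompt wording, and `fan_in` parameter are hypothetical, and `counting_llm` is a stub that only counts calls in place of a real language model.

```python
def create_and_refine(llm, query, nodes):
    """Visit nodes in order, carrying the running answer into each new prompt."""
    if not nodes:
        raise ValueError("need at least one node")
    answer = llm(f"Context: {nodes[0]}\nQuestion: {query}")
    for node in nodes[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {node}\n"
            f"Refine the existing answer to the question: {query}"
        )
    return answer


def tree_summarize(llm, query, nodes, fan_in=2):
    """Answer over each node independently, then merge answers level by level."""
    if not nodes or fan_in < 2:
        raise ValueError("need at least one node and fan_in >= 2")
    answers = [llm(f"Context: {node}\nQuestion: {query}") for node in nodes]
    while len(answers) > 1:
        groups = [answers[i:i + fan_in] for i in range(0, len(answers), fan_in)]
        answers = [
            llm("Context: " + "\n".join(group) + f"\nQuestion: {query}")
            for group in groups
        ]
    return answers[0]


calls = []


def counting_llm(prompt):
    """Stub model: records each prompt and returns a numbered placeholder answer."""
    calls.append(prompt)
    return f"answer {len(calls)}"


print(create_and_refine(counting_llm, "Summarize.", ["n1", "n2", "n3"]))
print(len(calls), "sequential calls")
```

One design difference worth noting: the refine loop is inherently sequential, while the calls within each level of the tree are independent of one another, so they could in principle be issued in parallel.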

And this starts getting really interesting.

And we can talk about some of the use cases below, but the idea here is that you can actually compose indexes on top of other indexes. So for instance, let's say you have, say, 10 total documents or a hundred total documents. You could have an index on top of the documents themselves, and each document could itself have an index corresponding to that document. So in this example illustration here, you could have a tree index corresponding to document one and another tree index corresponding to document two. These are all subindexes, and you could link them all together through a higher-level parent index. An example of how the query call would work through this graph is as follows. The query would first hit the first node of this list index. And then because this first node corresponds to a subindex, which is a tree index, the query would get routed through that subindex until you got a final answer at the end.

So let's talk about a few of these use cases. A lot of these slides and examples will contain links. And so you can feel free to play around with some of the tutorials, notebooks, and documentation yourselves as well to discover some of the capabilities that LlamaIndex offers.

Probably the most basic use case that you can do with LlamaIndex is semantic search. This is basically the idea of, given an input prompt, which is basically a question, just do top-k retrieval and use that to give you the right answer. So for instance, we do this through our GPTSimpleVectorIndex or any index backed by a vector DB. And you can ask a question, for instance, over a corpus of data that's about the author's biography. You can ask something like, "What did the author do growing up?" And then it would do top-k retrieval over this data and then synthesize a final answer for you, like "the author grew up writing short stories, programming on an IBM 1401." Again, this is over one of Paul Graham's essays, and so it's just one of the data sets that we use for fun.

The next use case is summarization. Sometimes semantic search does not always work, depending on the task that you want to do. And for summarization tasks, typically, you want to go through just explicitly an entire document or an entire set of documents. And that's where you can use the list index as an example.

And so for instance, if your task is something like, "Could you give me a summary of this article in newline-separated bullet points?", then you can explicitly go through every single piece of context within this index and use every single piece of context within the index to synthesize an answer for you. And so here is an example answer. This doesn't do top-k retrieval. This just goes through everything.

We also have pretty comprehensive text-to-SQL capabilities, where you can use our classes to actually convert your natural language query into a SQL query that you can execute against a SQL database.

And so this demonstrates our support for not just unstructured data, but structured data as well. And this in itself is actually a pretty basic example. For instance, if you ask what city has the highest population, it can give you this generated SQL query from your data schema. We offer advanced functionality on top of this as well. You can take in your unstructured documents and convert them to structured data, you can add context to your table schema, and you can store the table schema itself in another index. All this stuff is found in the SQL guide here and also linked in the documentation.
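The text-to-SQL flow described above has two halves: a model turns the question plus table schema into SQL, and the SQL is then executed against the database. The Python sketch below shows only the plumbing, using the standard-library `sqlite3` module. The `fake_text_to_sql` function is a hard-coded stand-in for the language model step, and the table name, columns, and rows are made-up toy data.

```python
import sqlite3


def fake_text_to_sql(question, schema):
    """Stand-in for the LLM step; a real system would prompt a model with the schema."""
    # Hard-coded for the single question used in this sketch.
    return "SELECT city FROM city_stats ORDER BY population DESC LIMIT 1"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?)",
    [("Springfield", 120), ("Shelbyville", 80), ("Capital City", 300)],  # toy rows
)

schema = "city_stats(city TEXT, population INTEGER)"
sql = fake_text_to_sql("What city has the highest population?", schema)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

In a real deployment the generated SQL is untrusted model output, so running it with a read-only connection or restricted permissions is a sensible precaution.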

Some of these more advanced use cases actually make use of our graph capabilities. So for instance, one use case is synthesis over heterogeneous data. Let's say you have some diverse data sources, each with its own index. For instance, all your Notion documents could be indexed in some format. All your Slack documents could be indexed in some format. And you want to basically have a single query, and you want that query to explicitly ask both the Notion corpus as well as the Slack corpus and then combine the answers together in some form.

So one primitive that we offer that makes this super easy is the ability to define a graph of your data. So for instance, if you have a vector index for your Notion documents and a vector index for your Slack documents, you can just define a list index on top of these two indexes. And a list index inherently combines information across different nodes. So when you ask a question over this graph, like "give me a summary of these two articles," it'll route the query down to your Notion index as well as your Slack index, combine the information across both your Notion and Slack documents, and then give you back a final answer. Here's just a quick diagram showing this. If you ask about airports in three different cities, and let's say you have a vector index for every city, it'll route this query down to every vector index, and then combine the answers at the end to give you a final answer.
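The routing behaviour described above, a parent list index that sends the query to every child index and combines the results, can be sketched in plain Python. Both classes here are hypothetical toys: `KeywordIndex` uses word overlap in place of vector retrieval, and `ListOverIndexes` simply concatenates sub-answers where a real system would ask a language model to synthesize them. The documents are made-up examples.

```python
class KeywordIndex:
    """Toy sub-index: returns chunks sharing at least one word with the query."""

    def __init__(self, name, chunks):
        self.name = name
        self.chunks = chunks

    def query(self, question):
        words = set(question.lower().split())
        hits = [c for c in self.chunks if words & set(c.lower().split())]
        return f"[{self.name}] " + (" ".join(hits) if hits else "no match")


class ListOverIndexes:
    """Toy parent index: routes the query to every child and combines the answers."""

    def __init__(self, children):
        self.children = children

    def query(self, question):
        sub_answers = [child.query(question) for child in self.children]
        # A real system would pass sub_answers to an LLM for synthesis.
        return "\n".join(sub_answers)


notion = KeywordIndex("notion", ["launch plan is due friday"])
slack = KeywordIndex("slack", ["launch party at noon", "lunch order"])
graph = ListOverIndexes([notion, slack])
print(graph.query("launch"))
```

Because the parent only relies on each child exposing a `query` method, the children could be any mix of index types, which is the property that makes composition possible.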

We can even take one step beyond that and start doing compare-and-contrast queries. It's basically like synthesizing over heterogeneous data, but explicitly in the form where you want the question to compare and contrast two different documents against each other. And we offer these advanced modules called query transformations, which you can plug into your graph index structures to give you the answer that you would want. For instance, if you're comparing and contrasting the sports environment of Houston and Boston, you can decompose this query into subqueries, like "what sports teams are based in Houston?" and "what sports teams are based in Boston?" Then you can feed each individual query over to its respective data source and then synthesize the answers and combine them at the end.
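The decompose, route, and combine pattern above can be sketched as follows. This is only an illustration: in the approach described in the talk, a language model performs the decomposition, whereas here a fixed string template stands in for it. The function names are hypothetical, and each data source is a stub that echoes the sub-question it received.

```python
def decompose(subquestion_template, sources):
    """Stand-in for an LLM-driven query transformation: one sub-question per source."""
    return {name: subquestion_template.format(name=name) for name in sources}


def compare_and_contrast(subquestion_template, sources, synthesize):
    """Ask each source its own sub-question, then combine the sub-answers."""
    sub_questions = decompose(subquestion_template, sources)
    sub_answers = {name: sources[name](q) for name, q in sub_questions.items()}
    return synthesize(sub_answers)


# Stub sources: each echoes the sub-question it was asked.
sources = {
    "Houston": lambda q: f"<answer from Houston docs to: {q}>",
    "Boston": lambda q: f"<answer from Boston docs to: {q}>",
}

result = compare_and_contrast(
    "What sports teams are based in {name}?",
    sources,
    synthesize=lambda answers: " | ".join(answers[n] for n in sorted(answers)),
)
print(result)
```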

We also offer another step beyond this, which is multi-step queries, where you can actually just break a complex query into multiple simpler ones over the same data source. This is effectively similar to chain-of-thought prompting, but with the exception that, you know, this is specifically over your data source. And so you want to ask all the questions possible over the same data source to get back the answer. So for instance, "Who was in the first batch of the accelerator program the author started?" That's an example of a complex question. Given a data source that contains information about the author, you can break this overall question down into simpler sub-questions that you can answer sequentially until you feel like you've got enough information to generate a final answer.

So for instance, the first question you might ask is, "What accelerator program did the author start?" You get back the answer: it's YC. Now, "Who was in the first batch of YC's accelerator program?" You ask that question again over your data source, get back an answer, and then you're done, because you've ended up answering the question.

There are a lot of demos of how to integrate this into downstream apps. As implied by the diagram, you could build a chatbot using LlamaIndex itself as a tool and wrapping it with a LangChain outer abstraction, or use LlamaIndex as a retriever module as well. There are examples, we have a Hugging Face space, and you could build a Streamlit app really easily with LlamaIndex.

We have a really cool LlamaIndex starter pack that gives you some of the initial toolkits you need to build Streamlit demos. If you follow this link right here, it'll actually take you to our Hugging Face space, which demonstrates our text-to-SQL capabilities. And the remaining slides link to additional demos that show some of the capabilities of our different components. So for instance, this example goes through various data loaders on LlamaHub, and you can see how you can easily ingest data from different files as well as websites. And if you go on LlamaHub itself, you'll be able to find the collection of different loaders that we have. Another demo walkthrough basically shows one of our classic examples of just asking questions over Paul Graham's essay.

And finally, we have another pretty cool demo, which goes through some SEC 10-K filings by a publicly traded company, specifically Uber. And then we can actually ask questions and get answers both for each individual document, as well as comparing and contrasting across documents. And those capabilities are showcased in this notebook. Awesome.

That concludes the talk itself. Thank you so much for your time. And again, if you have any questions, I'll be online to answer them. And thank you again. And you can follow me on Twitter. You can find the GitHub for the project as well. And feel free to shoot me an email.
