Arize:Observe 2023

Using LLMs with Hugging Face: the Necessity of In-Context Learning

Large Language Models have dramatically changed our expectations for AI. While a few innovators are building proof of concept projects using APIs, most enterprise teams haven't figured out how to incorporate LLMs into their analytical toolbox. Rajiv Shah, ML Engineer at HuggingFace, shows the necessity of understanding the growth of "in-context learning" and agents. With these insights, he explains how LLMs will shape enterprise teams. Along the way, he covers many practical factors, such as the different providers of LLMs, resource costs, and ethical issues.

Rajiv Shah: All right. It's an exciting time right now, and I know you've all watched other presentations around all the capabilities, all the great things that are going on with large language models. I think it's amazing what we can do, whether it's asking questions to these models and getting answers, or code completion, using it as a code assistant.

Even novel applications like protein folding, or a very simple thing where I can write down and sketch out what I want my website to look like on a napkin and then it will generate the code for me. So lots of cool things happening, but what I wanna do is take a little bit of a different tack today.

So my background, and what I do at Hugging Face, is I work with lots of data scientists, data science leaders, and C-level executives that are in enterprises, and they're wrestling with a lot of questions about, you know, how are we gonna work with these large language models? And so what I wanna do today is less about how to tune these particular models, and more about thinking through how these are gonna fit into our AI infrastructure, our roadmaps, and our plans over the next year or two years.

How is it gonna reshape our enterprises? And when it comes to this, I think there's two fundamental questions that I get asked on a routine basis that I really wanna spend time today talking about. One is, you know, when should we use large language models? Like what are the use cases, what are the times that it makes sense to use that?

And the second is, which large language model, right? We're all familiar with ChatGPT, GPT-3, GPT-4. But really there's a ton of different language models out there. So what I want to do today is navigate this discussion for people thinking about how they're using large language models inside their enterprise.

I'm hoping this will shed a little bit of light. I know this is a little bit of a different track than most kinds of videos or posts that you see on language models, but it's something that I feel is important and it's come up quite a bit for me. Now when I go through this, I think there's two important states of mind that we should be in.

One that I think is useful is a very analytical perspective, where we step back a little bit from the hype, you know, the never-ending retweets, and we take a little bit more of a dispassionate look at these technologies, and I'll be using that frame here. But I think it's also important to recognize just the radical nature of these technologies and how transformative they are.

You know, how we do AI today is gonna be quite a bit different three years from now as these technologies start settling in and getting used. Now, a question that I hear all the time that executives get asked is, you know, what is our ChatGPT strategy? What are we doing? We see this technology, you look at the headlines, you see, hey, you know, these companies have signed up and become partners.

You know, what are we doing inside our company? And most executives at this point, right, they feel a little bit deer-in-the-headlights about all of this. All right. Looking at the comments coming in too. So what I wanna do first is just let people know that when we're thinking about large language models, most of them, at this point, are really text-based, and that is just a subset of probably all the different use cases that you have planned out over the next six months or the next year.

So in the near term, yes, you know, you're gonna have to keep working on your churn projects, your forecasting models, your dashboards, but it's not gonna radically change everything right away. So take a deep breath as you're going through this, with all the things that we see on a day-by-day, week-by-week basis, in terms of planning out and thinking about your enterprise.

We have time. We can relax. Now, as we go through this, I'm gonna keep an eye on the comments, and I think if there are questions that come up, they're also gonna get starred and I'll be able to look at those. I'll try to address some of those on the way. If I don't get to 'em all, I know there's the Slack afterwards, and I'll do that.

And then some of you that know me know that I like to make videos as well. And so maybe for some of the most interesting questions of all, I'll make some short videos and post those on Instagram or TikTok or whatever. All right. So I think one thing that's important with large language models is to understand how they work, in terms of some of the knobs and gears.

And it's a little bit different than our traditional, for example, XGBoost tabular models. So I wanna spend a few minutes just going over some central concepts for these models. And if we start, you know, GPT-2 was one of the first models where we saw this generative ability, where we could see how it could, for example, help us write a story.

And for a lot of us, this was interesting, right? Like, I wrote up funny insurance claims when I worked at an insurance company with these technologies, but there weren't a ton of use cases for this kind of thing, cuz, I mean, there's only so much need for creative writing. But then one of the advances was something called instruction tuning, where we fine-tune the model by asking it particular questions and then showing it the appropriate response to each.

And in this way, the model learned that when I ask it a particular question like, hey, will you convert this from English to French, it now knows what that next word should be after cheese in this sentence. And this is an example here of zero-shot learning. Now, as we've been able to play around with these, one of the interesting things that's really happened is something called few-shot learning.

And with few-shot learning, what we do is we give it a few examples, and then the model is able to give us the correct output. So here you can see I have examples of movie reviews, and then I give the language model a new one and ask it, you know, what is the output? Now the important thing with this few-shot learning, what I'm doing here, is that this is purely prediction.

I'm not changing any of the weights. With this few shot learning, the weights are all frozen to do this. And this was really kind of very novel. And I think this was one of the exciting things about this that opens up lots of possibilities of use cases. Because all we have to do is show it some examples, ask it the kind of output we want, and we'll get that.
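
The few-shot setup described here can be sketched as plain prompt construction. This is a minimal sketch, not the slide from the talk: the example reviews, the labels, and the helper name `build_few_shot_prompt` are invented for illustration, and the call that would send the prompt to a model is left out, since any provider could be used. Nothing in it touches model weights; the examples only live in the prompt text.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot sentiment prompt from labeled examples.

    The examples live only in the prompt text. The model's weights stay
    frozen; the model just predicts the text that follows the final label.
    """
    lines = ["Classify the sentiment of each movie review."]
    for review, label in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
    # The query is left unlabeled so the model completes the pattern.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical labeled reviews, invented for this sketch.
examples = [
    ("A moving story with wonderful acting.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
query = "The plot dragged, but the ending was great."

few_shot_prompt = build_few_shot_prompt(examples, query)
# An empty example list gives the zero-shot version of the same prompt.
zero_shot_prompt = build_few_shot_prompt([], query)
print(few_shot_prompt)
```

Moving from zero-shot to few-shot is then only a matter of how many labeled pairs are passed in, which is why the effort involved is so much lower than fine-tuning.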

And you'll hear this ability called in-context learning. This is something that we didn't see when we had smaller models, but as these language models got bigger, this is something that was called an emergent capability that we only saw with the larger models. And this is why, if you asked me two or three years ago, I was like, yeah, of course they're creative, they're helping.

But I never would've guessed that we would've got to this point. And partly that's because this was a new phenomenon that emerged. Now besides that, another kind of fine-tuning that often comes up is reinforcement learning with human feedback, RLHF, which is a very hot topic now. What this does is help us as we get lots of these outputs from the model.

They don't always make the best sense to humans, right? They're all good, readable, but they differ a little bit on style and on factual accuracy. And this is where reinforcement learning with human feedback helps us tune these models just a little bit better.

And you see, this is a widely used technique that's coming in. The other thing that you should be aware of is something called parameter-efficient fine-tuning, or PEFT. The idea with PEFT is that we don't have to train the entire model when we're doing the fine-tuning process, and, right, these models can be very large, spanning multiple GPUs.

What we can do is only train a subset, right? Freeze most of the model, and only have to efficiently train a little bit of the model. Now, this helps us a ton with the compute for the models. And there's a lot of research and a lot of work going on in this, because it becomes a lot easier to use these language models if we can efficiently use our compute to help train or fine-tune them.
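
To make the compute argument concrete, here is a back-of-the-envelope sketch. It assumes a low-rank adapter scheme (LoRA-style), which is one common form of parameter-efficient fine-tuning and is not named in the talk; the layer width of 4096 and the rank of 8 are illustrative numbers, not taken from any specific model.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a low-rank adapter on the same layer.

    The original d_in x d_out matrix stays frozen. Only two small matrices
    are trained: one of shape d_in x rank and one of shape rank x d_out.
    """
    return d_in * rank + rank * d_out


# Illustrative sizes (assumptions, not taken from any specific model).
d_in, d_out, rank = 4096, 4096, 8

full = full_finetune_params(d_in, d_out)
adapter = low_rank_adapter_params(d_in, d_out, rank)
fraction = adapter / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"low-rank adapter: {adapter:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under one percent of the layer's weights, which is the sense in which most of the model can stay frozen.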

So you'll hear these words come up, but what I just want you to have in mind is this picture, where the degree of effort goes up from zero-shot to few-shot to fine-tuning. These are all steps that we can take to build and get better predictions. Okay, so now that we have this, I'm gonna jump into a little bit more

and the comments so far seem kind of empty, but we'll see. I'm hoping that I've got the right comments here in front of me. So let's ask this question: when should you use a large language model, like GPT-3 or FLAN-T5, versus maybe using a BERT-based model? Now, if we look at large language models, I can do something that we used to do with a BERT-based model, like text classification.

I can give it a few examples with few-shot learning, right? So in this case I've given it two different examples and asked it, hey, will you label the third example? It's very easy to do this. And when I use large language models for something we've all seen, summarization, the performance of these large language models is right up there with human-level performance.

So in this graph you'll see that the freelance writer and the zero-shot Instruct Davinci, kind of a GPT-3, are right there together. And I think we've probably all experienced this if you use Claude or GPT: how close they are to the human level.

Now when we move to few-shot learning, and this is on one particular model with one particular dataset, you can see that as the models get larger, and as we also go up from zero-shot to one-shot to few-shot, we can improve performance, even in this case being able to beat out a fine-tuned state-of-the-art model.

So there are a lot of capabilities in these large language models. Now, these large language models, we can fine-tune them. So here's an example of fine-tuning, where I've given it lots of examples of sentiment classification. Now, on fine-tuning, one of the questions is asking about PEFT, and PEFT is still either changing or adding some weights; it's just doing it in an efficient way.

This is an example where Stitch Fix did some fine-tuning on their products. They tuned the model with several hundred descriptions that their writers had already written. Then, once they had trained the model, they could give it new products and get new written descriptions, and they actually found that the descriptions written by GPT were actually much better, or rated higher, than the human-written examples.

So this is why, I heard, Cohere had a fireside chat with Armand a week or two ago on their podcast. And he was talking about the current state of these LLMs, and he talked about, you know, how they're affecting the different domains of AI, from semantic parsing to dialogue, right?

All these different subsets inside NLP, and, you know, his position is that with what we've seen from large language models, these are all dead. Large language models have subsumed and broken the benchmarks in these areas. So you might think, oh boy, do we even need traditional models at this point?

And this is where I say, let's not get caught up. Let's bring our cold, dispassionate analysis to this. Let's think about what problem we're trying to solve. And in the case of pre-trained models, we have lots of existing pre-trained models. So this is the Hugging Face Hub. You can go there, look at it for particular tasks, and find lots of pre-trained models.

And when we apply these pre-trained models, especially if it's domain knowledge that might not be the best in GPT, or in one of the open-source or other large language models, you have an edge here. So in this case, a finance use case, you can still get a pretty dramatic performance increase by using a model that was trained just in that financial domain.

So this is an example with just traditional sentiment. Here the improvement isn't as large. But again, the fine-tuning is doing better than zero-shot, and by zero-shot, essentially, you're not giving it any examples.

You're just asking it right away in the prompt: give me the sentiment for this sentence. Versus few-shot, where you might say, hey, here are some examples of how to do this. Another paper looked at this across different NLP tasks and found that, you know, for pretty much all the NLP tasks, if you take the time to fine-tune a model, you're gonna get better performance.

And so hopefully you're now starting to see the trade-offs. One way I like to visualize and think about this, and it's not a perfect analogy, is that an LLM is a very general tool. It's kind of like the phone in your pocket. It's great for lots of different things, while the existing language models that we have are much more specialized.

They're single purpose. Just like sometimes you have a professional photographer and sometimes you take a picture with your own camera, right? You have to pick and decide when it's appropriate to use each tool. Now, when we're thinking about this, some quick rules of thumb I give to teams: you know, for the high-value use cases, is this a hundred-thousand- or million-dollar use case?

Well, you're probably gonna have the data science team working on it. They'll probably want to build a dedicated model. You know, the same if this is something that's very domain-specific, and by domain-specific, I mean you have information in there that would not be learned from the general Wikipedia and books type of information that's inside most of these large language models, like very specific finance information.

The other reason you might want a dedicated model is if you're doing high scale, like you're doing millions or billions of predictions; you know, a dedicated model could be very preferable. Or if you need low latency. Or explainability: maybe you need to be able to explain to somebody how the model works.

Model risk management. Lots of, lots of people in regulated institutions have to talk to model risk management teams before they can deploy models. It might be easier at this point to do that with a dedicated model. These are just a handful of the criteria. There's really lots of other criteria that you have to think about.
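
The rules of thumb above can be written down as a simple checklist. The sketch below is hypothetical: the `UseCase` fields and the function names are invented for illustration, and a real decision would weigh these criteria rather than treat any single one as decisive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseCase:
    """Hypothetical description of a use case; field names are illustrative."""
    high_value: bool = False            # large business value at stake
    domain_specific: bool = False       # knowledge not found in general web/book text
    high_scale: bool = False            # millions or billions of predictions
    needs_low_latency: bool = False
    needs_explainability: bool = False
    needs_risk_review: bool = False     # must pass a model risk management team


def reasons_for_dedicated_model(use_case: UseCase) -> List[str]:
    """Return the criteria from the talk that point toward a dedicated model."""
    checks = [
        (use_case.high_value, "high-value use case"),
        (use_case.domain_specific, "domain-specific knowledge"),
        (use_case.high_scale, "high scale"),
        (use_case.needs_low_latency, "low latency"),
        (use_case.needs_explainability, "explainability"),
        (use_case.needs_risk_review, "model risk management"),
    ]
    return [reason for flag, reason in checks if flag]


def suggest(use_case: UseCase) -> str:
    """Suggest a starting point; no matching criteria favors trying an LLM first."""
    reasons = reasons_for_dedicated_model(use_case)
    if reasons:
        return "consider a dedicated model: " + ", ".join(reasons)
    return "a general LLM is a reasonable starting point"


print(suggest(UseCase(high_scale=True, needs_low_latency=True)))
print(suggest(UseCase()))
```

A small, low-risk prototype matches none of the flags and falls through to the general LLM, which mirrors the "quick and easy to get something up and running" case.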

You know, sometimes it's quick and easy to get something up and running with a language model. And if you're doing something at a small scale that's not very risky, you can do that. It could also be: what is your enterprise MLOps strategy? You know, is it easy to take an existing BERT model and deploy that? Has anybody built the infrastructure yet to be able to scale out, you know, a LangChain flow that you make with an LLM?

So these are things that are all developing, they're all changing. I often sit down with enterprises and make little flow charts with all of this. So I think the story at this point is, you know, if you're a data science leader and you're worried about this stuff, large language models are gonna have a very low impact on your current roadmap.

Okay? The existing use cases you have. Now, that doesn't mean we're all done, right? Because, as some of you might know, there's a little bit of something magical with these things, and so let's talk about that and how it might change, right? Because we know lots of people came into ChatGPT, right?

There's something about this that has it, and if you take a look at the examples, right, there's a ton of enthusiasm for using large language models and you see all the different workflows that people have put together. Like this is an example showing how to use it with kind of GPT-3, but I think it's still very cool, like how, in this case, Right.

You're using GPT-3, you're taking some messy address data and, you know, figuring out the zip code or the state abbreviation. Like, I think this ability, and if you watched the recent TED Talk by Brockman, he had another example of using GPT in a spreadsheet.

Like, there are a lot of very cool productivity things that people can do that are outside of what we think about in terms of existing use cases, the traditional use cases in a data science concern. And in some of the early research, there's an early paper showing that this kind of raises productivity, I think, for professional workers.

We're already seeing that. If you're anywhere close to Twitter or something, you see people endlessly talking about this piece. But besides that, I think the other thing that we're seeing is that people really like to use this as a search engine, right? This has become a new way. Instead of having 10 tabs open, I like to just be able to type and get my search in there.

Now, of course, we are all aware that that's a very dangerous thing. Hopefully most of us are aware that these things aren't factually grounded. There are hallucinations in there. Now one of the things I like to show, I don't know if you've all seen it, I know we had the Toolformer demo earlier, but Wolfram has an excellent demo out here.

And why I like to show it is that this is where we're gonna see the next generation of this. Like, we're used to ChatGPT, we ask it a question, but one of the things we can do is start connecting it to external tools, right? So if I ask it math questions, it figures out, hey, let me go look up the math tool and pull that in, right?

If I ask it the weather, right? That historical dataset that it was trained on, you know, that they got from 2020 and back, is not gonna know what the weather in San Francisco is today. But if I can connect up that question to an API that has that information, well, then the user can get that back. And so for me, this is like, boom, there are so many possibilities when we start thinking about this, right?
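
The idea of routing a question to an external tool can be sketched in a few lines. This is a toy illustration under loud assumptions: in a real system the language model itself chooses the tool and its argument, whereas here a keyword rule stands in for that decision so the sketch runs offline. The tool names, the `route` function, and the weather stub are all hypothetical, and no real weather API is called.

```python
import ast
import operator
from typing import Callable, Dict

# Arithmetic operators the toy math tool is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def math_tool(expression: str) -> str:
    """Evaluate a plain arithmetic expression without using eval()."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return str(_eval(ast.parse(expression, mode="eval").body))


def weather_tool(city: str) -> str:
    """Stub standing in for a live weather API call (hypothetical)."""
    return f"[live weather for {city} would be fetched from an API here]"


TOOLS: Dict[str, Callable[[str], str]] = {"math": math_tool, "weather": weather_tool}


def route(question: str) -> str:
    """Pick a tool for the question.

    A keyword rule replaces the model's own tool choice in this sketch.
    """
    lowered = question.lower()
    if lowered.startswith("what is "):
        return TOOLS["math"](question[len("what is "):].rstrip("?"))
    if lowered.startswith("weather in "):
        return TOOLS["weather"](question[len("weather in "):].rstrip("?"))
    # No tool matched: fall back to whatever the model learned in training.
    return "[answer from the model's own training data]"


print(route("What is 2 + 3 * 4?"))
print(route("Weather in San Francisco?"))
```

The point of the pattern is the fallback line: anything the frozen training data cannot answer, such as today's weather, is handed to a tool that can.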

You see OpenAI is doing this. We're gonna see this with other large language models: the ability to add new pieces here, like this retrieval plugin to save information, browsing, code interpreters. Some of the things I want people to be aware of are the trends we're gonna see over, let's say, the next year to 18 months with large language models.

So we'll see more of these code-based large language models where we're interacting with both text and code. I think there are some great things you can do, like, I need to visualize some data, and it'll give me a plot for it. Multimodal LLMs: we saw the first preview with GPT-4. You can get the open-source MiniGPT, which takes a vision transformer and combines it with Vicuna, and gives you a taste of exactly what a multimodal language model is.

Being able to refresh and update language models. This is coming. Some of the other newer language models are trying to refresh, finding ways to be able to kind of give you the latest information. In doing that, we're gonna see a lot of plugins and services. I think lots of companies are already thinking about how they can make their tools available to people inside these.

There's a large area that's gonna go into managing the security and risk of large language models, and especially this is gonna be, as we talk about the difference between going with a proprietary or open source, this is something that's gonna have to come into play, is thinking about all of this. And then, as I mentioned earlier, Right.

If you're gonna put these into workflows, into practice, if you're gonna let people use prompting to build things, well, we need to think about how we're gonna operationalize them as well. Now beyond that, I think there's a whole new generation of use cases. Like, we're doing very simple things now, like, hey, let me take the customer call, summarize it, and find the problems.

I think this is just the early piece. I think there's gonna be one stage, two stages of these use cases where we're gonna do incredible things, where we'll have essentially agents running, doing things in an autonomous way. I have no idea exactly what it is. Nobody does at this point, but I think organizations should be aware of this.

You want to be listening to your customers, listening to your stakeholders, your workers inside, trying to keep your ear to the ground to figure out, you know, as we start better understanding these generative tools, how we're gonna use 'em. Because the reality is most data scientists don't really know very much about generative tools.

They've only really been active, right, maybe the last year, last two years. We haven't really thought through all the different use cases we can do with them, versus traditional classification and regression. And so this is where I think that transformational impact's gonna come into play. All right.

I know I'm running outta time. I've got a bunch of stuff I want to give ya, so I'm gonna keep going, but please, if you have comments or questions in there, I'll try to get to 'em as we go through this. The other thing I want people to be aware of, and it's very easy to forget in this media world, is that there are lots of choices in large language models.

We don't have just one. There are many different ones available out there. And nat.dev is a great tool that I like, cuz you can type in something and see how it looks and compares on different ones. Now, if we look at the world of large language models, there's a few places doing it.

There's a whole range, from proprietary models that are closed off to fully open-source models out there. Now these models all differ in terms of what data they were trained on, right? Some are entirely web, some are entirely code, some are a mixture of data types. It's important to know, especially if you're gonna start running your business on it, because that's gonna affect the type of use cases you can run on it.

Similar thing with when the dataset was last updated. Is this a language model that was trained on really historical data, or does it also have some up-to-date refreshing as well? A lot of the training costs differ as well, but at the end of the day, for most organizations, it's gonna come down to three boxes here.

You know, are you gonna train your own large language model? Should you go with something open source? Should you use a commercial large language model, at this point largely through APIs? Now we have a recent example. The Bloomberg team has written a paper and shared exactly what they did. Oh, I should have put the citation for the paper.

If you type it in, you'll find it. Right, they built their own. They had a large amount of public data. They had a large amount of proprietary data. They went over to SageMaker and took out a ton of compute there. They built a model. It ended up being better than the other open-source models.

Is this replicable by teams? Yes, but, I mean, this probably took six months. This was a pretty sophisticated team to do this. They spent hundreds of thousands of dollars in compute to do this, and again, they had proprietary data for this. There are a lot of open-source LLMs. Hugging Face, which I'm part of, helped work with BigScience.

BigScience built one of the largest open-source models. There's BigCode, and we'll have some updates coming out for that, which is a code completion and code generation one. We're working on an open-source version of Flamingo. Stay tuned for that. Besides that, there's a ton more of these open-source language models, and to tell you the truth, I can't even keep up with them.

Like, I need to update this to include the latest ones from Stability AI and H2O. So, I mean, it's a great place to be, in that we're getting a flood of these different models. Now when we look at commercial large language models like GPT-4, one of the things we see is that they publish technical reports on it.

We don't really understand or know what's in it. Literally, it could be three raccoons in a trench coat, and they'll admit it. Like, they're not gonna tell you all the details about this, which, right, as a data scientist, somebody that wants control, that wants to know what all of this stuff is running on, gets me a little bit worried.

Now you can look, and there are places we can go to find benchmarks to help compare these. One of the most popular is the Stanford HELM report that's out there. But we should be aware when we're looking at the benchmarks that a lot of the existing benchmarks rely on multiple-choice answer prompts.

And this is where you can do really well on multiple choice, but then when you write something long-form, it doesn't look as good. I think this is where OpenAI's and Anthropic's models have done well: they've been very well optimized for those long-form questions that we want as well. Now I'm gonna leave you with the last two points here.

So, one thing I think we know traditionally, over the trend, is that when we work with open source, open source often lags early on behind a closed-source model, where you have a devoted team that's working hard at it. But the collective open source, if you get lots of people involved, that crowdsourcing often will go ahead and pass that.

So what we see right now is a lot of releases on the open-source side. We see the gap is slowly closing with the release of things like Open Assistant and Vicuna. But at this point, right, it's clear that models from Anthropic and OpenAI have a clear lead. I think one of the things we should keep in mind when we're judging this is what's good enough, and I think this is also gonna be hard for the proprietary models, cuz they're gonna have to keep bumping up what's good enough.

Because there might be some use cases where we're just like, you know what? The open-source model, it's good enough. We don't need to go to the latest kind of model. And this is where, again, I use the analogy of the iPhone, right? There were times with the iPhone where every year it felt like you had to upgrade, right?

There's times where, eh, sometimes it feels like you don't need to upgrade, and we're gonna have the same kind of experience. And so this is when, as a leader, you need to think about your use cases. What are your evaluation benchmarks? So you can start thinking about that. And here's a long list of considerations to think about when we're going through this, in terms of, right, the training dataset.

Do you need something that's run locally, right? Time to market, things like that. I don't have enough time, but Weights and Biases recently wrote a technical report on training LLMs, which has a great section on these considerations for picking large language models. I'm with Vicky, though: I think most enterprises are gonna want to own the means of production.

They're gonna want to be able to have the model themselves, and I think we'll see a lot of open-source models in there. And so I'm gonna end here and point out that, you know, enterprises are gonna need several different types of large language models in the future. You're gonna see ones that are general purpose, domain-specific ones, and ones that, for example, might be built for fast inference.

So when I ask this question, right, when we talked about these different pieces, you know, the analytic and the transformative, and when should we use LLMs and which LLM, this is what I wanna leave you with: right now, you should go ahead and start simple. Get an internal chatbot running.

Build a search app using an LLM. Start with a vendor API. Get an open-source model as well, right? Figure out how, inside your enterprise, you're gonna start, you know, allocating GPUs, getting all of that infrastructure set up, because we are gonna get to a world where we're gonna have crazy use cases and you're gonna have multiple large language models.

So getting your teams up to speed on that is gonna be really important. Also, don't get scared. I know there's a lot of papers out there that are building out this stuff. But I promise you in this case, with large language models, we have a huge army of people that are working on building and making this easier to do.

It's not gonna be super hard to be able to take advantage of these things. So don't get caught up in the weekly Twitter stuff on all of this. And feel free to reach out to me. This is a topic I love talking about. I took up the entire time with this. I'll try to answer all the questions in Slack, but otherwise feel free to reach out to me individually.
