Arize:Observe 2023

Scale AI On How to Build and Deploy Custom, Enterprise Grade Large Language Model Apps

David Rokeach, Bihan Jiang, and Osvald Nitski of Scale AI discuss how to build and deploy custom, enterprise-grade LLM apps. This talk was originally given at Arize:Observe in April 2023.

David Rokeach: All right, hello everybody. My name is David Rokeach. Very happy to see you all. Thank you all for joining. We're gonna talk to you today about how to build and deploy custom, enterprise-grade large language model applications.

Obviously a lot of news and excitement around all of the possibilities of generative AI and large language models. And we've been doing a lot of that work here at Scale. And so we're excited to talk to you a little bit about what that looks like and how to do it. So we'll start with just some quick introductions and then I'll give a Scale overview, just a summary of what we do and some of our offerings. And then we'll also cover just a quick demo about Spellbook, which is our platform that enables developers and teams to quickly deploy, test, experiment, and launch large language model applications, and then share a demo in more detail about our product called Rapid and our work in reinforcement learning from human feedback to get into some detail on that side of the coin as well. If you do have questions, feel free to enter them into the chat box here or send us questions via Slack; we're happy to answer them.

But we'll go through some of the content here and then be able to cover questions later. So just very quickly as an introduction, again, my name is David Rokeach. I lead Enterprise AI for Scale, which is our business unit focused on end-to-end deployments of AI solutions for enterprise customers. And before joining Scale, I was a partner at Boston Consulting Group where I focused on AI strategy and building and deploying AI use cases.

Bihan Jiang: Hey everyone, I'm Bihan. I'm a product manager on the Spellbook team here at Scale. Spellbook is, like David mentioned, a platform to build, deploy, and experiment with large language model apps, and I have been at Scale for two years now.

Osvald Nitski: Hey, I'm Osvald. I'm a strategic product manager here at Scale. I work on our generative AI projects, products, offerings. Specifically, today we're going to talk about our RLHF offering. And before Scale, I was an ML researcher. Then I started a company and ran out of money. So yeah.

David Rokeach: Great, so just very quickly, I'll give a quick background to Scale for those of you who aren't familiar with us. We're a technology company based in San Francisco and founded in 2016. Since that time, we've raised over $600 million in funding and our last valuation was over $7 billion. And we have over 600 employees globally with offices in San Francisco, New York, Washington DC, St. Louis and elsewhere.

Our mission is to accelerate the deployment of AI applications. On the right-hand side are some examples that tell the story of where Scale came from and some of the work that we do. A lot of our start and foundation was really in supporting autonomous vehicle companies and OEMs in the data labeling and annotation that they needed to power self-driving vehicle technology. We've grown pretty significantly since the early days. We support customers now across industries, including the federal government, Department of Defense, and private sector industries of different types. And so a lot of this work in generative AI has certainly taken the world by storm in the last few months, but it's something that we've been working on for quite some time. We're a leader in large language models and generative AI, and we've been a long-term partner of companies like OpenAI and Adept and Cohere and Anthropic.

And so what we've been doing for a couple of years is basically serving as the foundation for the development of these models, particularly in supporting the human feedback required to build and deploy these models in a way that creates the outcomes and possibilities that we're seeing today for large enterprises.

So what we've built in recent months is a view of what it takes to deploy large language models and generative AI for the enterprise. And many companies out there have been experimenting and leveraging OpenAI and some of the APIs directly. But we believe that there's a really comprehensive platform that's required in order to build and deploy these generative AI solutions at scale and with speed. Here you can see just a high-level framework of what we think is really necessary to create this capability and make it useful on a long-term basis for companies. And it starts with the application layer. So of course, there are a lot of applications out there, chat applications and others. We've built our own chat user interface for dialogue, as well as additional industry-focused applications that support some of these language model use cases. But we believe where it really happens is in these inner layers. So the second gray layer that you'll see is what we call Spellbook.

And Bihan will share a little bit more detail around the demo and what that looks like. But really, this is the API layer that enables teams to really quickly experiment, test, evaluate, and deploy large language models for multiple use cases across multiple teams. And then this is connected to our data engine, which is the next layer in.

And this is where Oz will talk a little bit around what's required to actually get models to perform at the level that's required for enterprise customers that have really high performance requirements, and the fine-tuning and what that can do in improving that performance. And that all comes together by pulling in and making accessible all of these base foundation models, whether it's from OpenAI, Anthropic, Cohere, Adept, or open source models. And we do this all either within our own hosted environment or within a customer's virtual private cloud to address any of their data security and privacy concerns. So I'll pause there. I'll hand it over to Bihan to talk in a little more detail about Spellbook and how we can enable teams to build and deploy these quickly.

Bihan Jiang: Cool, thanks, David. Yeah, so I'll first give a little bit of a high-level overview of Spellbook and why we built this platform.

And then I'll actually jump into kind of a hands-on demo to walk through what the product looks like. So like David mentioned, Scale started as a data annotation company. We've worked for many years with a lot of different customers doing NLP and other data labeling. And especially in the past year and a half, large language models have really taken off. And what that means for a lot of these companies and individuals who were originally building these NLP solutions is that they're replacing a lot of the bespoke model building with LLMs.

So there's a lot of things that go into making an LLM application ready to be launched into production. The first is you want to choose different base models and do experiments to compare how different models are for your own particular use cases. With Spellbook, we integrate with a bunch of base models across OpenAI, Cohere, AI21, and we also host our own open source models. It's also pretty challenging for large organizations to evaluate how exactly those LLM apps are doing.

So you want to make sure in production, your LLMs are not hallucinating or generating content that could be harmful. You want to do systematic evaluation, especially for generative tasks. And you can do that with Spellbook with our data engine. And then finally, when you're ready to actually deploy your LLM apps in production, you want kind of like ease of deployment monitoring and ease of switching out the LLMs at will.

So I'll now jump into kind of a hands-on demo of the Spellbook product. So in Spellbook, we have the concept of apps. And you can think of each app as a particular use case that you want to deploy in production. So today we'll be walking through an example of creating a marketing generation app. So we can just first title it marketing copy generation. We can create this app. And then you're dropped into a prompt page where you can talk to the LLM and give it instructions on what exactly you want it to do. In Spellbook, we build out templates that you can use to get started. And you can also upload data into Spellbook to test multiple examples of your particular use case.

So I'll select one that I prepared with marketing copy. You can see the example columns that you have in your dataset. We'll go ahead and generate an app that will write emails for us. So we can do this by instructing the LLM: generate me an email about the event with the demographic and size. And we can then select different base models. So like I mentioned, we integrate with a bunch of different base models across OpenAI, AI21, Cohere, and we also host open source models like FLAN. For this particular example, I'll just use GPT-3 DaVinci, and then we can go ahead and set our max tokens. Max tokens controls the output length of your model. We can just set this to 400 tokens, and then we'll click Generate Outputs. This will immediately run the LLM based on the prompt that you provided over the data that you provided. For each of these rows, what it's doing is writing an email about the event that's listed with the demographic and size. You can see it's generated a bunch of example emails that are pretty much ready to go, and you could use these directly.
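As an illustration of the step described above (not code shown in the talk), running one prompt over every row of a dataset can be sketched as filling a template per row and attaching the generation settings. The column names, the template wording, and the model placeholder are all assumptions made for this sketch; only the 400-token limit comes from the demo.

```python
# Hypothetical rows standing in for the uploaded dataset; the column names
# (event, demographic, size) are assumed from the demo's description.
rows = [
    {"event": "Spring product launch", "demographic": "small-business owners", "size": "200"},
    {"event": "Developer meetup", "demographic": "ML engineers", "size": "50"},
]

# Prompt template with one placeholder per dataset column (wording is illustrative).
TEMPLATE = (
    "Generate an email about {event} for an audience of {demographic} "
    "with an expected attendance of {size}."
)


def build_requests(rows, template, model="base-model-placeholder", max_tokens=400):
    """Fill the template once per row and attach the generation settings."""
    return [
        {"model": model, "max_tokens": max_tokens, "prompt": template.format(**row)}
        for row in rows
    ]


requests = build_requests(rows, TEMPLATE)
for request in requests:
    print(request["prompt"])
```

Each resulting dictionary is what would be handed to whichever base model is selected; swapping models only changes the `model` field, which is why comparing variants on the same data is cheap.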

But how do you actually make this production ready? You want to have a systematic way of comparing how your LLM apps are doing, either based on some certain set of criteria, or you want to maybe compare how different base models are doing against each other. So we can save these as new variants in Spellbook, and I'll save this as marketing copy Number 03. We'll save this as a new variant. And then in Spellbook, you're able to run evaluations. So especially for generative apps, it's unclear how exactly you want to grade the models or generate a report card for the models. So the gold standard right now is still having humans in the loop to evaluate these models based on certain criteria. So we can go ahead and click View Evaluations, and we can create an evaluation.

In Spellbook, you can choose from either human evaluations or programmatic evaluations, depending on your app type. In this case, we'll choose a binary human ranking evaluation, where for each of these generations, a human will go look at the model output. And based on criteria that you define, they can either say this is a good model output or a bad model output. We'll select the particular variant we've created, and then we can define examples of good criteria, bad criteria, and the main tasks of the app.

So in this case, maybe we want to evaluate whether the model output generates a good email. And we'll define this as the email is polite, grammatically correct, and has the relevant event information. And then for bad output, we can define this however we want. We could say something like the email is rude or grammatically incorrect. This is very flexible for you to define. Different people will care about different evaluation criteria before launching their LLM apps into production. Once you've defined the criteria, you can go ahead and start the evaluation. And what this will do is it will send these generations from your model to either your own workforce or Scale’s global workforce of nearly half a million human reviewers. And this will result in what is like an accuracy score for your model based on these generations.

So we can start this human evaluation job and it'll end up resulting in something like this, where you'll see a hit rate: people who've given a thumbs up or thumbs down to your particular application. And then you also will be able to download the full results as a CSV. So this is just one example of running an evaluation, but you probably want to do this multiple times. And so what you'd be able to do in Spellbook is go and save different versions of these, maybe with different base models. We can try one with DaVinci 002, save it basically as the same thing as marketing generation with DaVinci 2. And then you'd be able to run more evaluations, and then once you're ready to deploy your model into production, you can deploy with one click through Spellbook by selecting the particular variant that you want to deploy, and then clicking Deploy New API Endpoint.
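To make the hit rate concrete, here is a minimal sketch (not taken from the talk) that treats each reviewer verdict as a thumbs up or thumbs down, computes the fraction of thumbs up per variant, and writes the summary as CSV. The variant names and verdicts are made-up illustrative data, and the CSV columns are an assumption rather than the format Spellbook exports.

```python
import csv
import io

# Hypothetical reviewer verdicts: True = thumbs up, False = thumbs down.
verdicts = {
    "variant_a": [True, True, False, True],
    "variant_b": [True, False, False, True],
}


def hit_rate(votes):
    """Fraction of outputs that reviewers marked as good (0.0 if no votes)."""
    return sum(votes) / len(votes) if votes else 0.0


def to_csv(verdicts):
    """Serialize per-variant review counts and hit rates as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["variant", "num_reviews", "hit_rate"])
    for name, votes in verdicts.items():
        writer.writerow([name, len(votes), f"{hit_rate(votes):.2f}"])
    return buffer.getvalue()


print(to_csv(verdicts))
```

Comparing variants then comes down to comparing these rates on the same set of inputs, ideally with enough reviews per variant that the difference is not just noise.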

So once you're ready to deploy a particular variant to production, you're able to click Deploy, which will immediately deploy an API endpoint for you to integrate into your production application. So here, you get code snippets that you can integrate. And we also have a Google Sheets integration that you can use if you don't want to write any code. We also have built-in monitoring. You can see latency. And you can see a list of all your recent requests. So this is a pretty high-level overview of Spellbook. Again, Spellbook is kind of the API layer and the web app UI layer for builders to go in and hack around and build these LLM apps themselves with guaranteed quality, evaluation, and deployment.

Okay, alright. Cool, so now I'll turn it over to Oz to talk more about RLHF and what we offer here at Scale.

Osvald Nitski: Awesome. Thanks, Bihan.

Every time I see the Spellbook demo, it's better than the last time. There's some new features. So super exciting progress. And I'm sure by the time everyone in the audience checks it out, there's going to be even way more cool stuff there.

So yeah, besides this kind of API layer, if we peel back another layer of the onion that David showed at the start, the center is foundation models and RLHF, which is just one way of training foundation models. We really think about how we can advance AI at every level of the stack. And what my team works on specifically is training data to make sure that these base foundation models are better, more robust, safer, more capable, more aligned with human intent. The way we think about it is a continuous process. So the first language model wasn't perfect, but because it's a generative model, you don't even know what you want to evaluate. So even if we were trying to find out how good this base model is, how good GPT-3 is, FLAN or whatever, it's a cycle. It's iterative. So these usually start off by being trained on crawled data from the internet, and they just complete sentences. So the first wave of generative models, all they would do is kind of work in completion mode.

The next kind of step where humans came in and made the data a lot better and made the models a lot more interpretable, a lot more usable, a lot more user-friendly, was this thing called instruction fine-tuning. That's what we're looking at here, which is called paired prompt-response gathering. And this is just having humans write instructions and then write responses to them. This is things like, summarize this text or write me this email, and then the appropriate response to that. But of course, you want to have a lot of diversity in this data. You want to make sure that you're covering all the use cases that users might give you, but you don't know what they're going to be ahead of time. So we consistently evaluate our models as well. And we'll show you the collection of this data in later slides, but we continuously evaluate this data. We also have a red teaming group. It's led by a famous prompt engineer, Riley Goodside, and now he's starting to train a bunch of other people. So we've got this whole huge red teaming operation as well. And we iteratively evaluate these models and then collect training data that fills in the gaps, the weak spots, like areas where models aren't good. This could be something like aspect-based summarization. It could be something like abstractive question answering. But the point is we get this loop and iteratively collect more and more data so that these foundation models have more capability.
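As a rough illustration of what paired prompt-response data can look like (the record fields and category names here are assumptions, not Scale's schema), each example pairs a human-written instruction with a human-written response, and a simple per-category count helps check the diversity the speaker mentions.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class PromptResponsePair:
    category: str     # hypothetical task label, e.g. "summarization"
    instruction: str  # instruction written by a human
    response: str     # response a human wrote for that instruction


pairs = [
    PromptResponsePair("summarization", "Summarize this text: ...", "The text argues that ..."),
    PromptResponsePair("email_generation", "Write me an email inviting customers to a launch.", "Hi there, ..."),
    PromptResponsePair("email_generation", "Write a follow-up email after a demo.", "Hello, ..."),
]


def to_jsonl(pairs):
    """One JSON object per line, a common layout for fine-tuning datasets."""
    return "\n".join(json.dumps(asdict(pair)) for pair in pairs)


def coverage(pairs):
    """Count examples per category to spot under-represented task types."""
    return Counter(pair.category for pair in pairs)


print(to_jsonl(pairs))
print(coverage(pairs))
```

Categories with few or no examples are candidates for the next round of collection, which mirrors the evaluate-then-fill-the-gaps loop described above.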

But then the next stage after that, and the one that's really exciting right now, is RLHF. So after you've collected a lot of these prompt-response pairs, your model actually responds to commands and instructions the way that humans would present them. The next question is, how do we make sure that the model gives the best response, like the safest response, the one that's most aligned with what people expect to receive? And RLHF is a method that's been worked on for a while, but it's really popular right now. What that entails is training a reward model to mimic human preferences, and then training the base model, the language model that you'll be calling, to maximize that reward function. So there's a loop in there as well.
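The talk does not say which objective is used to train the reward model. One common formulation, assumed here purely for illustration, is a pairwise preference loss: for a preferred and a rejected response to the same prompt, the loss is the negative log of the sigmoid of the difference between their reward scores. The sketch below covers only that loss on scalar scores; the second stage, optimizing the language model against the learned reward, is not shown.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    order wrong. Both branches avoid overflow for large margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        # -log(sigmoid(m)) == log(1 + exp(-m))
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten as -m + log(1 + exp(m)) for negative margins.
    return -margin + math.log1p(math.exp(margin))


# Reward scores here are made-up numbers standing in for a reward model's outputs.
print(pairwise_preference_loss(2.0, 0.0))  # model agrees with the human ranking
print(pairwise_preference_loss(0.0, 2.0))  # model disagrees, so the loss is larger
```

Averaging this loss over many human comparisons and minimizing it is what pushes the reward model to mimic the preferences that raters expressed.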

This isn't like a one and done. We have to continuously find out new ways to evaluate, new metrics to align against, make sure that the model inputs that we're aligning cover a wide range of topics, make sure that our raters come from a range of backgrounds, and even still, this takes a lot of iteration, and we're still at the start of a very exciting journey. So what this looks like is a model is sent to us. Someone will give us a model, like all of the ones that you saw in that dropdown earlier, and we'll have a diverse set of workers prompt the model with a question, and then receive multiple responses and rank them according to some guidelines.
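A ranking over several responses is often turned into pairwise comparisons before it is used to train a reward model. The sketch below shows that conversion under the assumption that the ranking is a list ordered best first; the field names are hypothetical and the talk does not describe how Scale stores rankings.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (chosen, rejected) comparison records.

    combinations() preserves input order, so the first element of every
    pair is the response that the rater ranked higher.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


# Illustrative data: three responses, already ranked by a rater, best first.
pairs = ranking_to_pairs(
    "How do I reset my password?",
    ["response A", "response B", "response C"],
)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

A ranking of n responses yields n(n-1)/2 comparisons, so ranking several outputs at once produces more training signal per rater task than a single two-way comparison.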

And these guidelines are usually things like safety. Was it harmless? Was it helpful? Was the response exactly what was asked for? Is it informative enough? Sometimes they're too terse, sometimes they're too verbose. But these guidelines, they're really an open problem. So the first set of RLHF that was done was done using a pretty basic set of criteria. Was the response helpful? Was it harmless? And were there no hallucinations? So now people are starting to realize that you might want to have other things that you want your model to be aligned to. So this could be something like brand voice. It could be something like a specific type of information, a specific way of helping someone, a specific personality. And we're starting to see an explosion in requests for aligning a model with a specific voice or for a specific use case.

So I think the next slide here shows an example of what this looks like in our platform. So this is the first step here, actually. So this is that instruction-response pair data collection. We have a wide variety of categories that you can choose from, ranging from summarization to generation to whatever. We have canned instructions that our taskers are really, really familiar with that are then used as a starting point. So something like spinning up an email generation project is as simple as just a few button clicks. You modify some of the instructions, make it so that the taskers are making emails that fit your use case, and then you can launch a project that will collect tens of thousands of data points, and you'll get them back faster than basically any other method.

The next demo that we have is for both RLHF and for eval. Typically for eval, what we'll do is a model comparison. So you'll either compare your model outputs to a competitor's model or to some kind of human ground truth. In the case of RLHF, you're usually comparing multiple outputs from the same model. So pick the one that's most aligned with whatever metrics you have. And this is also a super quick setup. Our taxonomy is super flexible. We can do Likert scales. We have checkboxes for things like toxic outputs, hallucinations, any sort of bias. And this is really kind of the frontier here, is this evaluation. Because we're seeing that when these models are deployed, they seem like they do really well, but then someone will ask it some wacky question that was never in the training set. And then we find a breaking point. So our evaluation loop is meant to systematically find out these breaking points and then collect data to correct them.
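As a sketch of how such ratings might be aggregated to surface breaking points (the record layout, categories, and thresholds are all assumptions made for illustration), each rating below carries a Likert score and any checkbox flags, and categories with a low average score or a high flag rate are marked as weak spots to collect more data for.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rater output: a 1-5 Likert score plus checkbox flags per response.
ratings = [
    {"category": "summarization", "likert": 5, "flags": []},
    {"category": "summarization", "likert": 4, "flags": []},
    {"category": "question_answering", "likert": 2, "flags": ["hallucination"]},
    {"category": "question_answering", "likert": 3, "flags": []},
    {"category": "email_generation", "likert": 4, "flags": ["toxic"]},
    {"category": "email_generation", "likert": 5, "flags": []},
]


def summarize(ratings):
    """Mean Likert score and share of flagged responses for each category."""
    by_category = defaultdict(list)
    for rating in ratings:
        by_category[rating["category"]].append(rating)
    return {
        category: {
            "mean_likert": mean(r["likert"] for r in items),
            "flag_rate": sum(bool(r["flags"]) for r in items) / len(items),
        }
        for category, items in by_category.items()
    }


def weak_categories(summary, min_mean=3.5, max_flag_rate=0.25):
    """Categories below the score threshold or above the flag-rate threshold."""
    return sorted(
        category
        for category, stats in summary.items()
        if stats["mean_likert"] < min_mean or stats["flag_rate"] > max_flag_rate
    )


summary = summarize(ratings)
print(weak_categories(summary))
```

The thresholds are arbitrary here; in practice they would depend on the customer's guidelines and on how many ratings each category has.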

As the alignment gets more and more, people start to poke holes in it, and we start to learn more about how alignment works, how RLHF works, what the weaknesses are there. We're continuously advancing our guidelines that taskers are rating against or ranking against. So this includes more than just, was the response better than the other response? This can be, was it safe? And then what safe means is constantly evolving. We're starting to see more and more work on IP compliance. We're starting to see more and more work on having some kind of backup. If someone asks a question that's a bit sensitive, we want to still be able to talk about it objectively. So not to produce a toxic output, but also not just to reject any question related to religion or politics or anything. We need to define that boundary of what's appropriate and what's not. And it's different for every customer. So this whole loop or set of loops, it's very iterative. It's constantly evolving. It allows us to make better and better foundation models that are then iterated on by customers at the API level to make better and better apps. And we're really just at the start of a huge generative AI explosion.

There's so much more work to do in this area. So if you're making a model, if you're fine tuning a model, please reach out. We're very collaborative. We love to co-develop these guidelines, help build up best practices, and we're all just really amped to see where the field goes. I think that wraps up our kind of end-to-end model, API layer, app layer talk. So if you have any questions, feel free to drop them in the chat or reach out to any of us. Our emails are here. Follow us on Twitter, add us on LinkedIn. Whatever.

Bihan Jiang: Cool, thank you.
