Arize:Observe 2023

Training Billions of Parameter LLMs with MosaicML

Hanlin Tang, CTO and Co-Founder of MosaicML, talks through what it takes to train billion-parameter large language models. This talk was originally recorded at Arize:Observe in April 2023.

Hanlin Tang: It's great to be here at Observe. My name's Hanlin Tang. I'm the CTO and co-founder of MosaicML.

And today I'll be talking through what it takes to train billion-parameter large language models, which have become all the rage in the last few months.

As we've seen in the news, generative AI in the form of large language models and chatbots has become extremely popular in many different applications throughout the ecosystem. And I think one dominant narrative that came out very early on is that these models are so difficult to train yourself that only a few foundational models will really exist out there.

That's the case where we see one general AGI model being used for every use case, owned by one or a few companies. The philosophy of us and other folks in the ecosystem is that we actually see a future where there'll be many different specialized models for very specific use cases, owned by many different companies.

And it's not just us. If you look at some of the recent thought pieces coming out of Databricks, from Eric Schmidt, formerly of Google, and from various venture capital firms, they really see a world where there will be a decentralization of model-building capabilities across the ecosystem. We'll see very distinct AI models emerge, and companies with large and unique data sources will see very clear advantages to training their own models as moats. And so I think we collectively are now arriving at the understanding that in the future, companies will first use external APIs, which are fantastic for building their products, but will also want to build some of their own custom large language models.

And that's the second piece of what I'll be talking through today: how to do it, the tools that exist out in the ecosystem, and what our startup, MosaicML, specializes in building. But why would you want to build your own models? We are seeing concerns around data privacy for the deployment of these sorts of models across the entire community. But we've seen it's a lot more than that.

So for a lot of the enterprises that we talk to, there are three main reasons why you may want to train your own models.

First is data ownership. A lot of the large pre-trained models that are out there are trained on data whose provenance you may not actually know. And so we actually spend a lot of effort downstream trying to correct the outputs of these large language models, partly because we don't know what data went into them. For example, if you're building and deploying financial large language models, you wouldn't want them to start regurgitating content from, say, Reddit's Wall Street Bets. So understanding and controlling where the data is coming from is really important from a data ownership standpoint.

The second piece is that every business has its own content filters that actually make sense, and so you do want to be able to control what content to filter for your particular applications. More importantly, if AI and ML is truly an integral part of your business model, then you want to retain that capability in house, because that's really where your core IP is. And the third bit that we've heard is very important for a lot of enterprises is model ownership. You want to be able to own your weights. This allows you to better introspect and explain what kinds of decisions these models are making. It's also more portable: you are not marrying yourself to a third-party provider for all of your applications, and you can actually move that model around to different services and different providers as the ecosystem continues to evolve. For many of these reasons, we see a lot of enterprises that need to train models on their own data and own the resulting models to build their competitive advantages. Business data is your moat. You want to be able to retain ownership there.

There's also a big part here, for large language models, around understanding inference economics. For a lot of the business applications we've seen, you don't actually need the big AGI GPT-4 model that can do everything under the sun. You need to solve a very specific business problem. And for that, training smaller, smarter models actually improves the unit economics of inference to a point where the model can be deployed at scale across your enterprise. Domain specificity is another reason, and of course, data and model ownership, both for privacy and for regulatory standards.

A good example of this is that we worked with Stanford last year to train BioMedLM, which is a 3 billion parameter language model trained on biomedical literature. And 3 billion parameters is actually fairly small. But because it was very domain specific, we were able to get very good performance for a very small model. At the time, we had state-of-the-art accuracy on the US medical licensing exam. This was last year, and of course the field has evolved tremendously from there. But more importantly, that 3 billion parameter model, specialized and domain specific for the biomedical literature, hit approximately the same performance on the USMLE as Galactica, a model that was 40 times larger. And this really speaks to the point that smaller, more specialized models in the 3 billion to 7 billion parameter range can actually have very strong business value, and you don't need the very, very large LLMs for your specific use cases. Another great example of training your own model is BloombergGPT, where Bloomberg trained a 50 billion parameter large language model on a combination of open web data and their internal Bloomberg data. And it outperformed existing open source models on a wide variety of financial tasks.

Now, OK, so if you do want to train your own models, there are a few myths we've heard out there. First: hey, it's just too darn expensive. If you read the news, GPT-3 took anywhere from $10 to $20 million to train. Certainly not affordable for many enterprises to start prototyping their applications. And the second myth is that it's just too difficult. You have to deal with all the scaling challenges and infrastructure problems: find the GPUs, pick the software stack, figure out how to train these models and what the right hyperparameters are.

This all sounds too, too difficult. So for the rest of today's talk, I want to spend a little bit of time myth-busting these two myths about large language model training and show you how you can easily and efficiently train a large language model. We actually put out a Twitter poll late last year asking people: how much does it cost to train a GPT-3 quality model from scratch? And from the polls, at least on Twitter, very unscientific, it seems like a large majority believe that it takes $5 million or more to train some of these models. But the reality is that training large language models is actually very accessible.

In a blog post that we put out, we show the times and costs for training large language models across different model sizes. And the most important bit to index on is that training a 7 billion parameter large language model only costs around $30,000, at least on the MosaicML platform, and it takes about two to two and a half days on 128 GPUs. This is actually accessible, and it's actually useful for many business use cases. And getting up to GPT-3 quality in terms of the scaling laws costs $450,000 to train. So these models are not as expensive as folks think they are. And the field, including us, is spending a lot of time continuing to bring more efficiency and innovation into the model training space to reduce these costs down even further.
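As a sanity check, the quoted figure is consistent with a simple back-of-envelope calculation. The $4.25/GPU-hour A100 rate below is my own illustrative assumption, not a MosaicML price:

```python
# Back-of-envelope check on the ~$30k figure for a 7B model.
# The $4.25/GPU-hour rate is an assumed illustrative price.
def training_cost(num_gpus, days, dollars_per_gpu_hour):
    gpu_hours = num_gpus * days * 24
    return gpu_hours * dollars_per_gpu_hour

# 128 GPUs for ~2.3 days lands in the ~$30k ballpark from the talk
cost = training_cost(num_gpus=128, days=2.3, dollars_per_gpu_hour=4.25)
assert 25_000 < cost < 35_000
```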

So how are we able to do that?

It's a combination of really good scaling tools that we've built, a good number of efficiency and stability optimizations, and compute-optimal recipes. To train our models, we used Composer, which is our PyTorch library for efficient ML training, plus FSDP, PyTorch's Fully Sharded Data Parallel library, for distributing the model across multiple GPUs. What's nice about Composer is that we take care of all of the system-related infrastructure details, and we really specialized it for training large language models. It also has ways to add in different algorithms to improve the efficiency of the training process. And under the hood, one of the challenges with these large language models is that they are too large to fit in the memory of a single GPU, right?

And so you need a way to distribute the model across all the different GPUs. For that, we use PyTorch Fully Sharded Data Parallel. It's one of these execution strategies, and what's really nice about it is that it's very flexible and not as complex: there's no fancy model pipeline or tensor parallelism needed in order to train these models. Briefly, the way it works is that we shard the model and the optimizer weights across all the different GPUs, and then during each training step, we fetch those weights across the GPUs just in time as needed. This saves a ton of memory and is also very straightforward and flexible to use. And with Composer plus FSDP, we think this is one of the most flexible ways to train large language models.
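The sharding idea just described can be illustrated with a toy sketch, in plain Python with lists standing in for tensors: each worker permanently stores only its shard of the parameters, and the full set is "all-gathered" just in time for a layer's compute, which is what FSDP does for real tensors.

```python
# Toy illustration of the FSDP idea (no GPUs involved).
def shard(params, num_workers):
    """Split a flat parameter list evenly across workers."""
    per = -(-len(params) // num_workers)  # ceiling division
    return [params[i * per:(i + 1) * per] for i in range(num_workers)]

def all_gather(shards):
    """Reassemble the full parameter list from every worker's shard."""
    return [p for s in shards for p in s]

params = list(range(8))                    # pretend these are layer weights
shards = shard(params, num_workers=4)
assert all(len(s) == 2 for s in shards)    # each GPU holds 1/4 of the memory
assert all_gather(shards) == params        # full weights available on demand
```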

And by the way, all of this is open source in our GitHub repo, mosaicml/examples, under llm. You'll see all of our code and our optimal configurations for training these large language models. Even customizing your own model is really easy: you just have to implement a few functions that define which model layers you want to wrap with FSDP. And oftentimes you don't even have to worry about this, because we've spent the time to optimize these model implementations and provide optimized configurations across many different model scales for you to get started. The other thing I'll say is that there are actually many open datasets out there, for example C4, a cleaned version of Common Crawl, that are fantastic jumping-off points for training your own models, which you can then continue training on your own user, customer, or enterprise-specific data.
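As a rough sketch of the kind of hook described above, you supply a predicate that tells the trainer which submodules become FSDP shard units. The function and class names here are illustrative stand-ins, not MosaicML's exact API:

```python
# Hypothetical sketch of an FSDP wrap-policy hook; names are illustrative.
class TransformerBlock:
    """Stand-in for a transformer block nn.Module."""

class Embedding:
    """Stand-in for a small non-block layer."""

def fsdp_wrap_fn(module):
    # Wrap each transformer block as its own FSDP unit; leave small
    # layers like embeddings in the root wrapped module.
    return isinstance(module, TransformerBlock)

assert fsdp_wrap_fn(TransformerBlock()) is True
assert fsdp_wrap_fn(Embedding()) is False
```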

The other innovation that has happened over the last couple of months is the realization that, as I mentioned before, for many use cases 1 billion to 7 billion parameter models are sufficient, but also that you can continue training these models for longer than we originally thought and still get really high quality. You'll see this in the LLaMA models that Meta released earlier this year. On the y-axis here is training loss, and on the x-axis is billions of tokens, so how many tokens the model was trained on. And you can see that for the LLaMA-style models, the 7 billion parameter model, the loss continues to go down as you continue to train. So it actually makes sense to overtrain some of these smaller models, to train longer and not larger, and that'll make them a lot more efficient when it comes to inference. A lot of that was the genesis behind the LLaMA models, which have now been instruction fine-tuned by various folks, such as Stanford, for the Alpaca-style chatbots. The other myth that we've heard out there, especially from the platform and infrastructure side, is that it is just too hard to train these models.
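The "train longer, not larger" point can be made concrete with a Chinchilla-style parametric loss (Hoffmann et al., 2022). The fitted constants below are approximate published values, used only to illustrate the shape of the curve, not to predict any real model's loss:

```python
# Chinchilla-style scaling law: loss keeps falling as a fixed-size
# model sees more tokens. Constants are approximate fitted values.
def chinchilla_loss(params, tokens, E=1.69, A=406.4, B=410.7, a=0.34, b=0.28):
    return E + A / params**a + B / tokens**b

seven_b = 7e9
losses = [chinchilla_loss(seven_b, t) for t in (3e11, 1e12, 2e12)]
# Loss still decreases well past the "compute-optimal" token count
assert losses[0] > losses[1] > losses[2]
```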

As I mentioned before, with large numbers of GPUs, nodes can fail every so often. You have to deal with orchestrating these runs: how do you stream the data in, how do you get access to the infrastructure, how do you launch these runs? And if you go open up Meta's OPT training logs, for example, you'll see many different incidents where nodes failed, resuming from checkpoint was a lot of work, and debugging these issues was very challenging.

Fortunately, that's where we come in.

We built the MosaicML platform, which is a full LLM stack, all the way from the training code down to the infrastructure layer, that quote-unquote just works, where we can easily and quickly scale up the number of GPUs that you need access to. We have all the optimized configurations. Really, for training some of these models, you just need to provide your data in an S3 bucket or an object store somewhere, and you're ready to go, whether that is pre-training these models from scratch, continuing to train an existing model on your own data, fine-tuning, doing instruction fine-tuning or RLHF, and then finally deploying these large language models.

So we have a lot of customers that use us as they go along their LLM journey. Maybe they started off training BERT-large or BERT-small models, they're deploying them in production, and now they're ready to level up to larger language models in order to extract more accuracy and really hit that business ROI. For us, scaling is as easy as just providing the number of GPUs that you want. Under the hood, we have built a lot of sophisticated orchestration and scheduling to make sure that these nodes are provisioned properly, that they're all talking to each other, and that there aren't any hardware issues between them right before you deploy.

And this makes it really easy to scale up, take the same exact configuration, and start kicking off some of these large models. We've also spent a lot of time solving some of the unique infrastructure challenges that come with training these large models, and we and the community are coming together to solve a number of them. Today I'll talk through four of them that I think have already been solved by our platform, which makes it really easy to pre-train some of these large language models. The first one is out-of-memory errors. These are the bane of existence for large language models, because the models are too large to fit on a single GPU, and you often spend time tweaking all the different settings, like gradient accumulation, how many GPUs you need, and what size model you want, which makes it very tedious. We've implemented automatic OOM protection.

What this means is that with our full-stack code base, we'll actually dynamically adjust the gradient accumulation on the fly to prevent out-of-memory errors. So you can just specify the science and modeling problem you want to solve: how many GPUs you want to use, what size of model you want to train, and what your learning hyperparameters are, and we'll fit everything into the memory of your desired hardware. The other kind of unforeseen cost of these large language models is that if you do need to resume training from a checkpoint, it can often take a long time for that training to restart again. This is from the Meta OPT logs, where it took about an hour to resume training.
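Here's a minimal sketch of that retry loop, with a MemoryError standing in for CUDA's out-of-memory error. The actual Composer mechanism is more involved; this only shows the core idea of trading micro-batch size for gradient accumulation:

```python
# Sketch of automatic OOM protection: on an out-of-memory error, double
# gradient accumulation (halving the per-device micro-batch) and retry.
def train_step(micro_batch_size, memory_limit=8):
    # Simulated step: "OOMs" whenever the micro-batch exceeds the limit
    if micro_batch_size > memory_limit:
        raise MemoryError("CUDA out of memory (simulated)")
    return "ok"

def step_with_oom_protection(global_batch, grad_accum=1, max_retries=6):
    for _ in range(max_retries):
        try:
            train_step(global_batch // grad_accum)
            return grad_accum            # settings that fit in memory
        except MemoryError:
            grad_accum *= 2              # smaller micro-batches, same math
    raise RuntimeError("could not fit even the smallest micro-batch")

assert step_with_oom_protection(global_batch=64) == 8  # 64/8 samples fit
```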

This is time where all the GPUs you're paying for are idle. Some of this time is spent downloading the checkpoint, and about 30 minutes here is spent just fast-forwarding the data loader. Instead, we have implemented the MosaicML streaming dataset, which will resume instantly from a checkpoint to reduce that downtime when your expensive GPUs are not being used. We can resume gracefully from node failures and loss spikes.
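The difference between replaying the data loader to fast-forward and restoring its position from saved state can be sketched in a few lines. The class and method names here are illustrative, loosely modeled on a state_dict-style interface, not the exact streaming-dataset API:

```python
# Toy resumable dataset: its position is saved with the checkpoint and
# restored in O(1), instead of replaying already-consumed samples.
class ResumableDataset:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def __next__(self):
        sample = self.data[self.position % len(self.data)]
        self.position += 1
        return sample

    def state_dict(self):
        # Tiny bit of state saved alongside the model checkpoint
        return {"position": self.position}

    def load_state_dict(self, state):
        # Instant resume: no replaying of already-consumed samples
        self.position = state["position"]

ds = ResumableDataset(list(range(10)))
for _ in range(7):               # consume 7 samples, then "checkpoint"
    next(ds)
saved = ds.state_dict()

restored = ResumableDataset(list(range(10)))
restored.load_state_dict(saved)  # instant, however far in training was
assert next(restored) == 7       # picks up exactly where it stopped
```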

In fact, we've implemented watchdog and node-doctor type features, where you don't even have to babysit these runs. If a node fails, we'll swap in a new one and restart training automatically without you even being aware. And the last thing that we do, which I'm sure is very important for many of you, is efficient large language model training: doing all the system infrastructure work, all the machine learning systems and optimization work, to find the right configurations for you. We constantly search the literature for the latest and greatest in LLM training, prove them out by doing a lot of rigorous benchmarking, and then implement them into our code base for our customers to use.

And so while these four problems sound very daunting, the great news is that, at least with our platform and our open source code base, we've solved many of these issues, so you can very easily get started training your large language models. The other thing I'll note here is that the open source community is also continuing to train and release better and better open source language models. And this provides an even better starting point for you to cut down on the time and expenditure required. You can take an existing open source model that's out there, continue pre-training on your domain-specific data, whether that's in finance, biomedical data, tax law, or other legal domains, and then come out with a model that is both smaller and much more efficient to serve, but also potentially more accurate for your specific use cases.

And so the tooling world, including us, is spending a lot of time now building and open sourcing this LLM training stack so that you can easily use these tools to train large language models and eventually deploy them as well. Of course, we have our own flavors, with the MosaicML streaming dataset, the MosaicML Composer, and our examples large language model repository, with our platform sitting underneath as well. So just to recap and close out, today I've talked through the many different reasons you may want to build your own models: from a model ownership perspective, from a data privacy perspective, or from a regulatory perspective; to better be able to explain what these models are doing; or to introspect into the model weights, or into the log probabilities coming out of the model, to understand why and where it's making its decisions.

And of course, more importantly, while API services are easy to get started on and develop your POC with, they can be very expensive to scale if you have tens of millions of users or queries interacting with your product. So training and serving your own maybe smaller, more specialized models can actually make the entire endeavor economical at scale.

And we've busted two myths, right? The first myth: hey, it's just too darn expensive to train these models. Not actually true. The second one: it's too darn hard. With a lot of the open source tooling that's out there and also the MosaicML platform, we've made it so that training 7 billion parameter large language models on your own data is now very cost effective and very straightforward. So we really invite you all to not be too scared of the large language model training space, whether that's pre-training from scratch, fine-tuning, instruction fine-tuning, or deploying at the end, and to start your large language model training today.

In closing, I think we've seen two different markets start to emerge here. The first one is companies that are deploying BERT models day in and day out. They have real business ROI for deploying these models, whether that's better click-through rates for downstream recommendation systems or better accuracy for a particular classification task. And now they're looking to eventually upgrade to larger scale models or other types of applications like that, but they're bottlenecked by the system and infrastructure problems of breaking that multi-node barrier. For a lot of those companies, tooling like ours and others' can be a great help in unblocking you as you progress up to larger models, and in making sure that they're still economical for inference.

The other market that we see emerging is net new applications, where with generative AI, everybody wants a natural language interface. For that, you want the generative AI capabilities, whether that's summarizing paragraphs, generating new writing, or things like that. And for that, training the large language models that I mentioned today is actually not that hard, and it is definitely doable within your budget, your constraints, and your talent. So I encourage everyone to start your large language model journey today. For more information, you can check out our website and also a lot of the open source libraries that I mentioned before.

So thanks again for tuning in. I hope you're having a great time at the excellent Observe conference. And I look forward to hearing your questions and interacting with you on the community Slack.

Thank you very much.
