One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning

Sarah Welsh

Contributor

Introduction

In this week’s paper reading, we discuss “One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning.”
GLoRA is a universal, parameter-efficient fine-tuning approach for diverse tasks. It enhances LoRA with a generalized prompt module, optimizing pre-trained model weights and activations. Its scalable, layer-wise structure search enables efficient parameter adaptation. GLoRA excels in transfer learning, few-shot learning, and domain generalization, outperforming previous methods on various datasets. With fewer parameters and no extra inference cost, GLoRA is a practical solution for resource-limited applications.

Join us every Wednesday as we discuss the latest technical papers, covering a range of topics including large language models (LLM), generative models, ChatGPT, and more. This recurring event offers an opportunity to collectively analyze and exchange insights on cutting-edge research in these areas and their broader implications.


Transcript

Dat Ngo, ML Solutions Engineer, Arize AI: Awesome. Let me give it a minute before we start but I wanted to say hello, everyone, and welcome to this week’s Arize community paper reading. 

My name is Dat, and this is Sally-Ann, and today we will be talking about GLoRA, also known as Generalized Low-Rank Adaptation, which is used in fine-tuning large language models.

But before we do that, we’re going to quickly introduce ourselves. So my name is Dat Ngo, I am an ML Solutions Architect here at Arize, and I’ve worked closely with technical teams and ML teams at other companies. I’ve been a data scientist for a long time, built a lot of models, and built a lot of ML infra as well. Really excited to have you here today at our community paper reading, and I’ll let Sally-Ann introduce herself.

Sally-Ann DeLucia, ML Solutions Engineer, Arize AI: Hey, I’m Sally-Ann, I’m also a Machine Learning Solutions Engineer here at Arize. My background is in deep learning, so I’ve spent a lot of time with deep learning models and embeddings. That’s right up my alley, so I’m super excited to be presenting our paper today.

Dat Ngo: Awesome. Thank you so much, Sally-Ann. All right, so what are we going to talk about today? Really quickly, I’ll go over today’s agenda. For today, we’re excited to walk you through a brief overview of fine-tuning: What does it mean? What does it mean in the context of LLMs? And generally, how are LLMs trained? Then we’ll do a review of all the work that was done before GLoRA and the foundations it was built on, things like parameter-efficient fine-tuning methods. And then we’ll go deeper into the GLoRA paper, the methodologies, and the findings that came out of it. And while the paper is extremely mathematically dense, we broke it down in a way that we hope even the average person who maybe doesn’t work in the LLM world can understand. And we’ll leave you with some main takeaways and some nuggets to take home with you.

So before we get into the paper, I think it’s important to give folks some background and some context into how LLMs are trained and how we get them to do what they do. So how exactly are LLMs trained, and what is the current recipe? This is subject to change, as the field is constantly and rapidly evolving, but so far the recipe looks something like this. Roughly speaking, we have four major stages that take us from a raw large language model all the way to the most mature version, which has been through reinforcement learning.

So we have these four stages: we have pre-training, and we have supervised fine-tuning, which is where we’ll spend the bulk of our time. But I wanted to set the context for the larger picture here. We will be focusing on that stage, but I just want to give folks a higher-level view of what it actually takes. We also have reward modeling, which leads into reinforcement learning. So if you ever hear the acronym RLHF, it’s referring to these two stages. In each stage, we have a data set that powers that stage, and we also have an algorithm or an objective, the thing that we’re optimizing. From each stage we also get a resulting model: you hear things like base model, fine-tuned model, RLHF model, things like that. And then we also have some notes about how long each stage takes to train and what the scale looks like across all four stages. The first stage that happens is the pre-training stage.

This is the first step, and it’s really where most of the computational work happens. This stage is 99% of the compute or training time across all four stages. So really quickly, why does this take so much time? Well, this is where we deal with internet-scale data, with thousands of GPUs in a supercomputer and months of training. That’s what it really takes to get what’s called a base model out of this pre-training stage. We learned in another paper reading that this is where most of the skills and capabilities of an LLM really come from: how to read, understand semantics, context, things like that. So from this pre-training stage we get what’s called a base model. When you hear somebody say “base model,” it’s generally from the pre-training stage. The other three stages by comparison are smaller in terms of computational time: they take a few GPUs and hours or days to perform, rather than the months it takes to train in the pre-training stage.

And so when we look at the next stage, which is supervised fine-tuning, the question you should have is: okay, why do we need to fine-tune our base models? What’s the whole purpose of fine-tuning, and why do we do it? Well, base models are not assistants. Base models just try to guess the most likely next set of tokens. They’re not super helpful in their own right. And while you can trick these base models into tasks by rewording or prompt engineering the input text, it’s not the ideal solution, and it doesn’t always work out super well. Just as an example, if you took a base model and you said: write a poem about bread and cheese, it actually wouldn’t write a poem about bread and cheese. It would try to predict the next likely tokens, which might just be more instructions: write a poem about this, write a poem about that. So just realize that base models need some sort of fine-tuning.

So this is where this stage really comes into play. The first thing that we need to do for our SFT stage is to collect small but high-quality data sets. This is usually done by human contractors: they’re given prompts, and they produce an ideal response. The order of magnitude for these types of examples ranges in the tens of thousands, so not quite internet scale, but definitely hard data to come by. The data set that we’re fine-tuning on may look like this. Step one is a human being given a prompt, which might be: hey, write a short introduction about the relevance of the term “monopsony,” etc. The human on the other end is also given, in step two, very extensive and intensive labeling instructions, where they’re tasked to be truthful, helpful, and harmless, for example. So they write this response, and this is really the data that the fine-tuning is done on.
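To make the shape of that data concrete, here is a minimal, hypothetical example of what one supervised fine-tuning record might look like. The field names and text are illustrative, not from any specific dataset:

```python
# One illustrative SFT training example: a human-written prompt paired with
# an ideal, contractor-written response. Real datasets contain tens of
# thousands of records shaped roughly like this.
sft_example = {
    "prompt": "Write a short introduction about the relevance of the term "
              "'monopsony' in economics.",
    "response": "A monopsony is a market with a single dominant buyer. The "
                "term matters in economics because, much like a monopoly on "
                "the selling side, a monopsony can push prices (often wages) "
                "below competitive levels...",
}
```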

And so, we’re not going to go through the rest of the stages, but for the most part, I just wanted people to have context on what we’re covering. And what’s the data that it looks like, and just a larger context of fine-tuning.

I’m going to let Sally-Ann share her screen next.

Sally-Ann DeLucia: Great, thanks, Dat. Cool, let’s get into the paper. I think Dat did a great job of laying the groundwork for what fine-tuning is, and that’s really what GLoRA ties into. One thing I do want to mention is that this paper is brand new. It was just released a few weeks ago, I think no more than two. So we’re in the very early stages; I’m sure we’ll see this paper be improved upon, and probably morph into even more fine-tuning methods. The first thing that this paper actually does is cover the other parameter-efficient fine-tuning methods that existed prior to GLoRA coming out, and GLoRA actually draws from each of these in its implementation. So we need to cover each of these at a high level to understand what the paper itself does.

So the first of these is visual prompt tuning, which is inspired by prompting in NLP. VPT introduces a small number of task-specific learnable parameters into the input space while freezing the entire backbone of the transformer. And so you might be thinking: prompts? Is that the prompt engineering Dat was just telling us about? It’s a little bit different. VPT uses what are known as soft prompts, which are generated by a small set of learnable parameters, in contrast to prompt engineering, which uses hard prompts that are manually provided. So that’s the main difference between the two.

And so what these are is continuous embedding prompts that are introduced into the input space. You can see here the P0 tokens; those are going to be the injected prompts. They’re going to be updated during fine-tuning, but the main body of the transformer, as you can see in these visualizations, is kept frozen. This method has two variants. The one on the right-hand side is VPT-shallow, where the prompts are only introduced at the first transformer layer, and VPT-deep is when we have those prompts at every transformer layer’s input. So that’s VPT.
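To make that concrete, here is a minimal sketch of the shallow variant (not the authors’ code; the class and argument names are illustrative). A small bank of learnable prompt embeddings is prepended to the token sequence before it enters a frozen backbone:

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """VPT-shallow-style wrapper: learnable soft prompts are prepended to the
    token embeddings, and the frozen backbone sees [prompts; tokens]."""
    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze the entire backbone
            p.requires_grad = False
        # The only trainable parameters: a small bank of prompt embeddings.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, token_embeds], dim=1))
```

VPT-deep would repeat this idea at every transformer layer’s input rather than only at the first.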

The next one it introduces is AdaptFormer, which introduces a parallel learnable branch alongside the MLP block’s two linear layers. It has this kind of plug-and-play bottleneck module called AdaptMLP, which is this piece here, and if we explode that out a little more, it has two branches. The first branch is exactly the same as the original MLP block, the multi-layer perceptron, which can also be called the feed-forward network. The other piece is a lightweight module that’s introduced for the fine-tuning. This bottleneck structure limits the number of parameters that are added: we have a down-projection layer, a ReLU activation, and an up-projection layer. During fine-tuning with AdaptFormer, only the newly added parameters are optimized; the rest of the model parameters are kept fixed. So again, we’re decreasing the number of trained parameters. You’re going to hear me say that a lot as we go through these, but that is pretty much the overall goal of parameter-efficient fine-tuning: to reduce the number of parameters that we need to fine-tune.
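Here is a hedged sketch of that parallel bottleneck idea (illustrative names, initialization, and scale factor; not the official AdaptFormer code):

```python
import torch
import torch.nn as nn

class AdaptMLP(nn.Module):
    """Sketch of an AdaptFormer-style module: the frozen MLP block is
    augmented with a trainable down-project -> ReLU -> up-project branch
    whose scaled output is added back to the MLP output."""
    def __init__(self, mlp: nn.Module, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.mlp = mlp
        for p in self.mlp.parameters():         # original MLP stays frozen
            p.requires_grad = False
        self.down = nn.Linear(dim, bottleneck)  # trainable bottleneck branch
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)          # start as an identity add
        nn.init.zeros_(self.up.bias)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))
```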

Okay, LoRA. You might have come to our talk a few weeks ago where we did a whole paper reading on LoRA; if you haven’t checked it out, I highly recommend it. LoRA basically borrows from the linear algebra concepts of matrix rank and decomposition. The key hypothesis is that while these models are very large, with millions or even billions of parameters, they have intrinsic low dimension, and the weight updates have intrinsic low rank. What this means is we only need a small subset of parameters. We take our weight update here, and we actually break it into smaller, low-rank matrices A and B. During the fine-tuning process, the model adds these low-rank matrices to the pre-trained weights, and we keep the rest of the weights frozen. We only train on these, and this selective updating essentially makes the fine-tuning process more efficient, as we’re only focusing on the parameters that are the most impactful.
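A minimal sketch of that idea in PyTorch (illustrative, not the reference implementation) looks something like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA layer: y = x @ (W0 + s * B @ A).T, with the
    pre-trained W0 frozen and only the low-rank factors A and B trained."""
    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.scaling * (self.B @ self.A)).T

    @torch.no_grad()
    def merge(self):
        """Fold B @ A into W0 after training, so inference adds no latency."""
        self.weight += self.scaling * (self.B @ self.A)
```

Because B @ A can be folded back into the frozen weight after training, a merged LoRA layer adds no inference latency, which Dat touches on below.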

Dat Ngo: So one quick question I have for you, Sally-Ann. Why is PEFT, or parameter-efficient fine-tuning, important? And what would happen if we didn’t use PEFT or these methods?

Sally-Ann DeLucia: Yeah. As I mentioned a bit earlier with the concept of LoRA, these models that we’re looking to fine-tune have millions or billions of parameters, and it takes a lot of time, a lot of resources, and a lot of memory to carry out that process. So data scientists and MLEs are trying to figure out a way to fine-tune these models to their specific tasks without having to take on the burden of fully fine-tuning, and that’s really why they’re leaning on these parameter-efficient fine-tuning methods. I think we’re going to see a bunch of these start coming out, because people are just going to get more and more creative in figuring out how we can get these models trained for our task as fast as possible using as few resources as possible. Does that make sense?

Dat Ngo: Yeah, definitely. And then, why are we focusing on this visualization here? Just for context’s sake: compared to full fine-tuning with an optimizer like Adam, for instance with GPT-3’s 175 billion parameters, I think LoRA can reduce the number of trainable parameters by 10,000 times. So when we think about why this is important: when companies or folks need to build their own LLM, reducing costs here is super important. So thanks for going over this, I really like this visualization. And when we review the methods pre-LoRA, you can look at the pre-trained weights here, which I think you’ll get into. But if you look at how the technology has evolved, LoRA is awesome because it doesn’t add latency, like I’m sure you’ll go over, and it just makes things cheaper. So I really like this visualization, because you leave the blue part alone, that’s the pre-training phase that we talked about, and really the only thing you’re adjusting is A and B, right?

Sally-Ann DeLucia: Yeah, we can get a little bit more into what these pre-trained weights are. Usually, as you mentioned earlier, Dat, when we’re doing full fine-tuning, we need to update all of these weights constantly. But with LoRA, we’re going to pass the data into both sides, and then just adapt A and B over time to get those optimized parameters. So we’re keeping those pre-trained weights frozen and only updating the right side of this visualization. And we’re defining a rank. It’s not really well visualized here, but rank is a really key component of LoRA. It goes back to that principle of linear algebra: these weight updates intrinsically have low rank, meaning you can reduce them. So if we choose rank one, which is pretty common to do, then A would be d by r and B would be r by d, so they’re going to be much, much smaller, and that’s much quicker to train. Makes sense?
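As a quick back-of-the-envelope example of why the rank matters (the hidden size here is hypothetical):

```python
d = 4096                      # hypothetical hidden dimension of one layer
r = 1                         # LoRA rank
full_update = d * d           # dense delta-W: 16,777,216 parameters
lora_update = d * r + r * d   # A (d x r) plus B (r x d): 8,192 parameters
print(full_update // lora_update)  # -> 2048x fewer trainable parameters
```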

Dat Ngo: Yeah, that makes a lot of sense.

Sally-Ann DeLucia: Cool. LoRA is a good one. It’s quite cool to play around with, too. If anybody hasn’t played around with LoRA, I definitely recommend doing that.

Dat Ngo: Yep, and it’s free. We also posted the LoRA talk with our other paper readings, so if you did miss LoRA there’s always a chance to go back and watch it.

Sally-Ann DeLucia: Also, foreshadowing a bit: there is a blog post coming about LoRA, so keep your eyes peeled for that one as well. It’s a really cool fine-tuning technique.

Let’s get to another one here. Next we have scaling and shifting features (SSF). As you might guess, we’re going to scale and shift features after each MLP (the multi-layer perceptron layer), after each multi-head attention layer, and after the layer norm module during training. And then we’re also going to perform re-parameterization during inference.

So unlike the other approaches, where the set of parameters we fine-tune varies, with scale and shift features the number of parameters remains the same, which makes it easier to fine-tune for things like multi-task learning. I think that’s really important to know about this particular fine-tuning method; it’s pretty different in that respect versus the other methods. For this process, you’ll notice that there is this SSF-ADA module that’s inserted, and what it does is perform a linear transformation for parameter-efficient fine-tuning. After each operation in the network, we inject these modules, and each one performs a dot product with the scale factor and then a sum with the shift factor. So that’s what’s happening down here.
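A minimal sketch of that scale-and-shift module (illustrative; the real SSF-ADA also folds gamma and beta back into the preceding weights at inference):

```python
import torch
import torch.nn as nn

class SSF(nn.Module):
    """Sketch of an SSF-style module: a frozen operation's output is
    element-wise scaled by gamma and shifted by beta, the only trainable
    parameters. Being linear, both can later be re-parameterized into the
    preceding linear layer so inference is unchanged."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale factor
        self.beta = nn.Parameter(torch.zeros(dim))   # shift factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma + self.beta
```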

The design has independent scale and shift factors, gamma and beta down here, so they can represent the distribution of the whole downstream data set. That’s super important, and we’re going to come back to it when we get into GLoRA. The next one is FacT, or factor tuning. The fundamental idea is to represent the change to the weights of a pre-trained model using a compact factorized representation, rather than needing to store the full set of weights for each downstream task. It uses tensorization-decomposition, which basically means the weights of the model are tensorized into a single 3D tensor, and their updates are then decomposed into lightweight factors. During fine-tuning, only these factors are updated and stored.
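Here is a loose sketch of that idea (illustrative only; the paper uses specific tensor-train and Tucker decompositions, while this uses a simpler shared-factor form to show the shape of the savings):

```python
import torch
import torch.nn as nn

class FacTUpdates(nn.Module):
    """Sketch of the FacT idea: the weight updates for all L layers are viewed
    as one 3D tensor of shape (L, d, d) and stored as shared factors U, V plus
    a tiny per-layer core, instead of L full d x d matrices."""
    def __init__(self, num_layers: int, d: int, r: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d, r) * 0.01)          # shared factor
        self.V = nn.Parameter(torch.randn(d, r) * 0.01)          # shared factor
        self.core = nn.Parameter(torch.zeros(num_layers, r, r))  # per-layer core

    def delta_w(self, layer: int) -> torch.Tensor:
        # Reconstruct this layer's weight update: U @ core_l @ V.T -> (d, d)
        return self.U @ self.core[layer] @ self.V.T
```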

Last, but not least, RepAdapter. This simplifies the structure of the adapter into a linear projection layer, which gets incorporated into the nearby projection weights via matrix multiplication. There are two common placements: you can see the RepAdapter here, and there are two places it can go. It can either go into the multi-head attention layer, or it can go into the feed-forward network (also called the MLP, depending on which paper you’re reading). So it inserts these lightweight networks into the pre-trained model, and then the additional parameters are re-parameterized into the nearby projection weights after training. That’s what the reparameterized layer is here.
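A hedged sketch of that merge trick (illustrative names and a simplified, purely linear adapter; not the official RepAdapter code):

```python
import torch
import torch.nn as nn

class RepAdapterLinear(nn.Module):
    """Sketch of the RepAdapter idea: a linear adapter runs before a frozen
    projection during training, then is folded into the projection by matrix
    multiplication so inference is unchanged."""
    def __init__(self, proj: nn.Linear, bottleneck: int = 8):
        super().__init__()
        self.proj = proj
        for p in self.proj.parameters():
            p.requires_grad = False
        d = proj.in_features
        self.down = nn.Linear(d, bottleneck)   # trainable adapter
        self.up = nn.Linear(bottleneck, d)
        nn.init.zeros_(self.up.weight)         # start as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x + self.up(self.down(x)))   # training-time path

    @torch.no_grad()
    def reparameterize(self) -> nn.Linear:
        """Merge: W' = W (I + Wu Wd), b' = b + W (Wu bd + bu)."""
        W, b = self.proj.weight, self.proj.bias
        Wd, bd = self.down.weight, self.down.bias
        Wu, bu = self.up.weight, self.up.bias
        merged = nn.Linear(self.proj.in_features, self.proj.out_features)
        eye = torch.eye(self.proj.in_features)
        merged.weight.copy_(W @ (eye + Wu @ Wd))
        merged.bias.copy_(b + W @ (Wu @ bd + bu))
        return merged
```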

I’ll pause here. Are there any questions? We just went over quite a bit of material. We can link the papers for these as well; each of these comes from its own paper, but now GLoRA is going to draw upon them.

Dat, any questions on these ones?

Dat Ngo: Yeah. So let’s say I’m not a math person. Could you TLDR a lot of the mechanics here? How would you give just a really short summary of what these are? It’s a lot of matrix math, but what does it allow us to do? If you could go through all of them, just a TLDR: why is this useful? And for folks who don’t understand the math, maybe put it in a way that they could understand.

Sally-Ann DeLucia: Absolutely. What I’m going to do is just give a high level. For visual prompt tuning, the main takeaway is that we’re introducing those small amounts of task-specific parameters. That’s really all it comes down to: these little extra parameters interjected into these different layers. AdaptFormer is a little bit more complex, but basically we create a parallel branch for the adapter. We keep the original branch, and then we add in this new small set of parameters that we’re actually going to adjust. We can mostly skip LoRA since we covered it, but the matrix decomposition is what’s really powerful there, along with the fact that it’s playing on intrinsic rank.

And then for scaling and shifting features, I think the two main things to account for are the scaling and shifting of the features, but also the re-parameterization during inference; that’s super important there. FacT is probably one of the most math-heavy. There are a lot of technical details, but the real thing to keep in mind is that we tensorize the updates into that single 3D tensor, which is more efficient, and then we can decompose it later on. And RepAdapter is kind of similar to what we just covered; the main thing is that it simplifies the structure of the adapter to just be this linear projection. I think that’s probably the most important takeaway from these.

Dat Ngo: Gotcha. Awesome. Thanks so much. It’s super useful.

Sally-Ann DeLucia: Of course. All right. So, GLoRA, we’ve arrived. We’ve just covered six other parameter-efficient fine-tuning methods, and we’re now getting to the main bread and butter of this presentation. This approach really serves as a superset of all those prior solutions that I discussed: it can be reduced at any point to any of the previous methods by simply setting the specific tensors to zero. And it’s designed to fine-tune both the weight space and the feature space, so it addresses some of the limitations that might occur with the previous methods.

So another really big part of GLoRA is the fact that it can be expressed as a unified mathematical equation. Here, tying the visualization together, is the equation on top: f(x) = (W0 + W0·A + B)x + C·W0 + D·b0 + E + b0, where x is our input. Quite a mouthful, but let’s break it down a little further. What do A, B, C, D, and E even mean? These are our trainable support tensors for the downstream tasks, and they can take on different forms: they can be scalars, vectors, low-rank decompositions like in LoRA, or none at all, so that we can revert back to one of the previous methods.

And so during this fine-tuning phase, W0 and b0 are going to stay frozen. We’re only going to be fine-tuning these orange blocks here: the A, B, C, D, and E. Each tensor serves a specific function: A typically scales the weight, B scales the input and shifts the weight, C functions very similarly to VPT-deep (deep, not shallow, so we’re injecting those prompts at every layer), and D and E are used to scale and shift the bias, respectively.
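To ground the equation, here’s a minimal sketch of one such layer (illustrative shapes and names; in the paper, the layer-wise search chooses the form of each support tensor, which can also be a scalar or dropped entirely):

```python
import torch
import torch.nn as nn

class GLoRALinear(nn.Module):
    """Hedged sketch of one GLoRA layer following the unified equation
    f(x) = (W0 + W0*A + B)x + C*W0 + D*b0 + E + b0.
    W0 and b0 stay frozen; only the support tensors A..E train. Here A and B
    are rank-r and C, D, E are vectors, one of several forms the paper's
    layer-wise search can pick."""
    def __init__(self, d_in: int, d_out: int, r: int = 4):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        self.b0 = nn.Parameter(torch.zeros(d_out), requires_grad=False)
        # A scales the weight, B shifts it; both stored as rank-r factors.
        self.A_down = nn.Parameter(torch.zeros(d_in, r))
        self.A_up = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B_down = nn.Parameter(torch.zeros(d_out, r))
        self.B_up = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.C = nn.Parameter(torch.zeros(d_in))   # prompt-like term via W0
        self.D = nn.Parameter(torch.zeros(d_out))  # scales the frozen bias
        self.E = nn.Parameter(torch.zeros(d_out))  # shifts the bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        A = self.A_down @ self.A_up            # (d_in, d_in), low rank
        B = self.B_down @ self.B_up            # (d_out, d_in), low rank
        W = self.W0 + self.W0 @ A + B          # adapted weight
        return x @ W.T + self.W0 @ self.C + self.D * self.b0 + self.E + self.b0
```

Because every term is linear in x or constant, the trained tensors can be folded back into W0 and b0 after training, which is why GLoRA adds no extra inference cost.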

So what this allows you to do is it allows for an expanded search space without increasing the number of parameters significantly. This leads to faster convergence due to weight sharing across these different subnets. That’s really the technical side of GLoRA: that’s the math, that’s what it looks like. And you’re probably thinking: what does this all mean for me? Why does it matter? Dat is going to get to that, but there’s just one thing I want to cover before we do, and that’s experiments and results, which I think are pretty important.

So if you read the original paper, it focuses on vision transformers, but recently they’ve been doing work with large language models. This table here is actually from the GitHub repo that they just recently updated, and it shows performance on language tasks with pre-trained LLaMA, using the Alpaca data set for fine-tuning. You can see here that GLoRA has way better performance compared not only to the base pre-trained model but also to LoRA; on average it comes in at about 50.7, which is quite a bit higher than the other methods here. In the paper there are a lot more experimental results. They did their job benchmarking and testing this, and it really does show significant superiority in parameter-efficient transfer learning; they test few-shot learning as well as domain generalization, and they saw superior results throughout. Even the smallest GLoRA model outperforms all existing methods by a substantial margin. I think that’s really important, because there’s been a lot of talk about these models being so powerful because they’re so big, and when we’re using these parameter-efficient fine-tuning methods, we’re seeing that these small sets of fine-tuned parameters are actually outperforming fine-tuning every single parameter. So not only are you saving money and saving time, but you’re actually getting better performance, which I think is pretty cool.

Dat Ngo: Great, great overview. Definitely super awesome that Sally-Ann knows the math, if you ever want to talk to her about it. So really quickly, I’m going to share my screen again. Like I was saying, GLoRA does perform super well, and we want to understand why. If I had to TLDR this: we have a more complex architecture when it comes to fine-tuning, and that allows us certain advantages. So really quickly, why does this matter, and what should you be thinking about? Let’s say you work for a company, and you need to fine-tune a model. The things you should really consider are: How much is training going to cost, and how do I optimize that? Which fine-tuning method is going to afford me the best performance? Because at the end of the day, that’s super important. Is this fine-tuning method going to increase latency for my generation? I think that’s really important. Sally-Ann, is there anything else you would add to that, as far as when we zoom out to actually deploying this?

Sally-Ann DeLucia: I think those are really the main ones that come to mind. Training data, which you kind of touched on, is something to take into consideration, and maybe the complexity of the specific task you’re trying to fine-tune for. I think those are other things you might want to take into consideration.

Dat Ngo: Awesome. Yeah, totally agree. All right, we have about 12 minutes left, so quickly, we’re going to do a high-level overview of the progression from LoRA to GLoRA. The first one is really model flexibility. This is kind of a TLDR on the important parts of going from LoRA to GLoRA: model flexibility and efficiency. On the left-hand side you can see LoRA; on the right-hand side you can see GLoRA, a much more complex architecture with a lot more moving pieces. How LoRA worked is we freeze the pre-trained model weights and inject low-rank matrices that we can change; GLoRA, if you look at the architecture, is just much more complex. So why is this important? It allows us more flexibility to adapt to a variety of tasks and data sets. Meaning, if I’m fine-tuning on a specific task or data set, having many more levers allows you more adaptability.

So this is achieved using that generalized prompt module and, of course, the sharing of weights and intermediate activations. The other TLDR that Sally-Ann did cover was parameter adaptation. Why is this important? Well, the layer-wise approach really allows us to make more fine-grained adjustments. A really good metaphor: imagine you have a radio where you can turn one knob and it affects both the treble and the bass. Well, with GLoRA, now you have individual knobs, so you can really determine what kind of sound you want to hear, and that becomes extremely useful.

Like Sally-Ann said, there’s the unified mathematical formula. The way I describe this one is: with LoRA you have a fixed blueprint for your house. You can change some of the interior decorations, but the main structure is not changeable. With GLoRA, you have more of a modular home design, where you can rearrange the rooms, add a second story, or convert a space. It transitions from a fixed blueprint to a more flexible design, and you’re starting to see the theme: GLoRA is obviously more flexible, but also more complex. You can build a better home that suits your specific needs. The same goes for the re-parameterization strategy; a good metaphor I like is that LoRA is similar to painting a picture using only primary colors, while with GLoRA it’s like having your full palette. This allows you to have more nuanced work, or more nuanced parameters.

So this shift is important in the fact that you have a larger solution space that you can search in, if that makes sense. And then, really, the last thing is some considerations. The paper is not perfect, meaning it’s not the ultimate method. There’s an evolutionary search aspect to this: instead of having to manually do hyperparameter tuning, there’s a smart search, but it takes a lot of time to run. It was a little confusing at first when I was reading the paper: there’s an increased duration for this search process, yet the overall fine-tuning with GLoRA is actually less computation and more efficient. This one particular step, though, did increase training time.

Like Sally-Ann said, this is a pretty new paper and a new methodology, and there are unexplored domains. This method is still in its infancy, and there’s a lot more exploration to be done. There’s also adaptability to convolutional layers, so yes, this method can be applied to other models, like in computer vision. And of course, there’s more complexity, meaning if you’re doing this on your own and wondering how, chances are you might get stuck somewhere, because it’s brand new and more complex. So just realize that. And so these are the key takeaways. I hope you got a lot out of this: the high-level overview, and why fine-tuning is essential to any useful LLM. We built on LoRA and the methodologies there to get us to GLoRA, and then we talked a little bit about the math and why these things are important. But the real takeaway is that there’s a lot of work to be done, and much more research to come in this area. I don’t think the final architecture of this is ever going to be complete; I think this will get more complex and smarter over time, as folks figure out better ways to fine-tune their models. So that’s what we have for GLoRA. I’m going to pause here really quickly. Sally-Ann, is there anything else you wanted to add?

Sally-Ann DeLucia: Yeah, I actually have a question that we can discuss that I think is interesting. LoRA was initially published back in 2021, and it only recently became really popular for these LLM use cases. It really works for any model that has weights, but with LLMs it’s been especially popular. Do you think GLoRA is going to catch up, and that it’ll be switched out, with data scientists reaching for GLoRA before they reach for LoRA?

Dat Ngo: Yeah, that’s a good question. I mean, if you follow the state of LLMs, you’ll realize it’s not monthly that a new piece of news happens, it’s weekly or daily, and maybe it’ll get to hourly. But I would guess that today there’s somebody doing research on better fine-tuning methods than GLoRA. I think having all those layers was kind of a game changer inside of GLoRA. So as folks start to innovate and think of new ways to shift levers so that we can train more intelligently, rather than just bulk train, it’ll change. But I’m curious to know your thoughts on that as well.

Sally-Ann DeLucia: Yeah, I think it’s so new it’s hard to say. If you look, the paper is available, but you don’t really see many other papers taking it and building with it yet. So I think it’s very early. Once more people start experimenting with it, I think we’ll have a better indication of how things are going to go. Right now I’m seeing people use LoRA more commonly than GLoRA. But since the research group that put this paper out has now shifted into the language space, I think maybe that’s going to change things too.

Dat Ngo: Yeah, totally. And I think it’s risk/reward too. If you have a bunch of people working on the same thing, you can always go ask for help, but with newer methodologies you’re kind of on your own and have to figure it out yourself. But I’m definitely excited to see more developments happen in the community.

Sally-Ann DeLucia: The good news is they both have GitHub repos, so if you want to try them out, they’ve got a place for you to start. I know Dat linked the LoRA one, and I can throw in the GLoRA one. We did just get a question, if you want to take that while I’m pulling up this link, or I can take it in a second.

Dat Ngo: Yeah. So Chris Murphy asked: Are LoRA and GLoRA both multimodal, fine-tuning images, etc., or are they both just text? What I read is that they can be applied to both modalities, but I think specifically LoRA was mostly tested on text, though I could be wrong there. Sally-Ann, any thoughts?

Sally-Ann DeLucia: Yeah. So they’re both multimodal. With LoRA, the benefit is it can actually go beyond that; like I mentioned earlier, it can be used for any model that has weights, essentially, whereas GLoRA has a little bit more to it and is going to work best with transformer-based architectures. And yes, GLoRA was originally done with images, and I think LoRA was originally done with text, and then they both kind of crossed over and realized they could do both. But yeah, I totally agree with you, Dat.

Dat Ngo: Yeah, in the GLoRA paper, they said they haven’t done any work with convolutional neural networks yet. So like Sally-Ann said, I think we will see developments in the computer vision space with GLoRA. Awesome. Well, thank you so much to everyone who attended, and thank you for the questions. We’re really excited to have had you at our paper reading. This will be on our website, and we look forward to the next one. Thanks so much.

Sally-Ann DeLucia: Thanks, everybody.