Arize:Observe 2023

Train, Serve, Fine Tune, Repeat – The Future of the ML Lifecycle with Anyscale, Tecton and Others

Aparna Dhinakaran, CPO and Co-Founder at Arize, talks with four ML leaders about adapting to the world of generative AI.

Aparna Dhinakaran: Uh, well, hey everyone. Welcome to another amazing session here at Observe. Today we have amazing leaders from across the MLOps and ML Lifecycle joining us today. Um, why don't, I mean this, this is actually an amazing panel of incredible founders speaking today. Would love to maybe just kick off with some intros and, and then I think there's a lot of questions around how does this world adapt to the generative that we'll be talking about today?

Uh, my name's Aparna, founder of Arize and Manu, you wanna kick us off with intros?

Manu Sharma: Yes. Hi everyone. My name is Manu Sharma, I'm founder and CEO at Labelbox. Labelbox is a data engine for developing a wide range of AI applications, from computer vision to natural language processing systems, and we're excited to be here. Thank you for having me.

Aparna: Awesome. Uh, Robert, you wanna go next?

Robert Nishihara: Hey, I'm Robert. I'm one of the co-founders and CEO of Anyscale. We are providing infrastructure for scaling and productionizing AI applications, and in particular, generative AI applications. So think about image generation, large language models, um, and the kind of workloads you can imagine, uh, people scaling and productionizing on top of our stuff are, uh, training models, fine tuning models, um, deploying the models and serving, as well as doing batch inference, both online and offline. So, uh, we see a lot of computer vision, natural language processing, and, uh, really excited to be here.

Aparna Dhinakaran: Go for it, Mike.

Mike Del Balso: Okay. I was waiting for you to say. Hey, everybody, I'm Mike. I'm one of the co-founders and the CEO of Tecton. Uh, at Tecton we make feature infrastructure. So if you go to, uh, like a Google or a Facebook, there's literally hundreds of people building all of the infrastructure to support the data pipelines, to generate fresh features, keep them up to date, to serve them to your models in real time, to help, you know, keep track of them, monitor, share them. Uh, and these are all important things to build, uh, into an ML platform. Uh, we think of what we're doing at Tecton as building the best, uh, feature infrastructure in the world, um, especially for real-time, uh, production ML applications. We make Feast, which is an open source feature store, and we make Tecton, which is an enterprise feature platform. Before doing Tecton, I actually worked with Aparna at Uber. Uh, and we built, uh, we worked on the Michelangelo system, helped start the ML team and built out the Michelangelo ML platform there, which is pretty interesting.

Aparna: Yep. Those were great days. Um, awesome. Chris, last one. Let's go.

Chris Van Pelt: Cool. Hey everyone. I'm Chris Van Pelt, one of the co-founders of Weights & Biases. So at Weights & Biases, our mission is to build the best developer tools we can for machine learning engineers. We're probably, uh, most well known for our experiment tracking platform and we've since built a number of additional product offerings to help with data lineage, uh, collaboration with reports, um, model registry. And we just released, uh, a new set of functionality around CICD and automation. So, we love ML engineers, we love building tools, and, really excited to be on the panel today.

Also one last thing, in a prior life I was working on a very similar problem to Manu at CrowdFlower/Figure Eight. So I've been, um, in this space for a long time and it couldn't be a more exciting time to be in this space. I'm really looking forward to chatting about this new era.

Aparna: Awesome. Well, I've actually been really looking forward to this session. So thank you all so much for being here. Uh, it's really hard to get leaders across different stages of ML all in a room. So my one ask, the reason we're doing this live, is a lot of great hot takes. They will be very appreciated. I'm sure the audience will love it.

So you guys all started your companies before the wave of LLMs really became hyped. And I think there's a theme that I'd love to talk about, which is where does MLOps or the ML life cycle adapt in the age of generative? Um, so first, maybe a little bit of just a general question, are you all already seeing LLMs in production or is it mostly still experimental? That's for anyone here.

Robert: So we see a lot, at least from what I've seen, um, we do see both, but a lot of the projects are new initiatives. A lot of the projects that we see are people, um, businesses trying to figure out how to, they know they need to succeed with AI, right? To succeed with LLMs and, and, uh, use LLMs to add value to their product or their tools.

But there are a lot of different ways that could look and they're still exploring and prototyping, trying to iterate quickly, uh, to see what the right product, UX, and, um, value they add to their product is. So a lot of these are just brand new initiatives that are, you know, kicking off now.

Mike: For us, I think we may, we may live in a slightly different kind of stage in the life cycle than a lot of other kind of ML tools might fit, because, uh, the whole value of the feature platform is, we help you do something in production really, you know, at scale, in real time. If you have an ML, uh, application where there's not this concept of like being in production and supporting your, uh, you know, your users' real-time interactions or transactions or something like that, uh, you know, you don't have this productionization problem that we solve.

So we see, I suspect, less of the LLM stuff because we don't really work as much in the kind of like experimental phase, trying to figure things out in the very kind of earlier stage of sophistication or maturity of use cases. However, we see, uh, you know, the use cases that we work with are people building, trying to detect fraud, people trying to do real-time pricing, real-time recommendations, personalization, stuff like that. And we're getting those people starting to ask us a lot about LLMs, like: Hey, can I also use everything that I've built for my fraud model or for my recommendation system and kind of use it to do this extra little thing? This, this other thing? Reuse the same stuff I've already built for an additional use case? But we're not seeing a lot of people having figured that out yet, but they're starting to poke around.

Aparna: Got it, got it, got it.

Chris: Yeah, I'd echo the same, like if you, if you look at your Twitter feed or like Hacker News, it seems like everyone's making really cool stuff, but in the enterprise, everyone's talking about it. But, um, I think there's a ton of, of big open questions still around how to, how to bring this into, um, production and, and which path to take, whether it's using a third party API or, or, um, kind of building your, your own model or fine tuning.

So it's, it's early days, but, uh, yeah, it's, it's wild. It's like four months ago ChatGPT was released. We were like really excited by, by GPT-2 or just the, um, GPT-3 APIs before, but it seems like, you know, just in the last four months now it's, it's the topic. So I, I think, uh, we'll, we'll see a lot more, um, in the coming months.

Manu: And I see two, uh, kind of categories of applications. Uh, and, and, uh, so the number one is of course, um, uh, consuming these LLM APIs, right? So you've got a model and you're essentially using that to power your, um, existing application that is, you know, transcribing, uh, a chat, like a conversation, or summarizing a conversation. And these kinds of use cases are definitely in production. I mean, we are actually using so many products, and it's likely that the products that we are using or interacting with, uh, have these features now powered by LLMs. It might be one of the fastest, uh, technology adoptions that I have personally seen across a wide range of applications.

The other category is the businesses or ML teams customizing these LLMs, whether it's in prompting or in-context learning or fine tuning LLMs. And those are still very early days. I mean, remember, for most businesses or for most machine learning teams who were just entering 2023, their yearly plan went out of the window after ChatGPT's moment. It probably will take quite some time for, you know, all these teams to adapt, um, align and, uh, execute on the new plans.

Robert: Actually Aparna, I want to add a little nuance to my original answer there. I think there are two main areas where we see, um, LLMs already in production in a meaningful way. So one is we work with a lot of AI startups and so these, you know, many of these companies were started, um, specifically, you know, like, because of LLMs. And so when they're shipping their product, you know, they are already in production. Uh, and then the second is we work with a bunch of companies that have NLP-based products. Like their whole product is, or, you know, an important aspect of their product is, you know, processing documents or, uh, you know, improving search or things like that.

And in those cases, um, you know, where you're already doing natural language processing that, you know, productionizing, LLMs in that context, uh, is a natural step and, and clear how to do that.

I think the cases where we see a lot more exploratory, um, prototyping about how to productionize LLMs are the cases where you don't have a, you know, your product is not naturally, uh, a language based or NLP product. And so, but you still believe you can use LLMs to make the user experience better, make it easier to use your product, lower the barrier to entry, create language as a way to interface with your product. Those are the kinds of ways, um, areas where we see more exploration.

Mike: Aparna, we're not gonna let you get past the first question. Let me add one more thing. I think that even the question, what are we seeing go to production? I think it's kind of interesting because there's, there's kind of like a parallel question, which is like, what are the sets of problems that enterprises are thinking of taking on with machine learning? Yeah. And has the introduction of all this LLM stuff affected what that scope of problems is that makes sense for someone to build?

And, you know, what the LLMs have done is kind of moved the competitive frontier on a variety of use cases, meaning like, what's actually cool, what's actually the, the like best possible experience? It's kind of different now. It's, you know, way better, let's say for certain types of things. But that also comes with the added complexity.

Like for me as a company, am I gonna be able to build that now? Well, that's way farther away. So maybe now I'll choose to buy from a marketing vendor who will automatically, you know, optimize my text instead of me building that internally from some internal team.

And I think that's actually probably a meaningful impact that we should consider along the way. You know, so you're, you're, you may have an LLM in production because you're using a vendor that, you know, is powered by LLMs, but, but maybe you're not actually building that, that system internally as much anymore.

Aparna: Just a tangent on the first question at least. What use cases do you think we will see less of, powered by typical or traditional ML? Like, you know, OpenAI in the last session was saying sentiment classifications are dead, you're gonna use an LLM to do that. And you know, but there's gonna be all these new emergent skills and new ML use cases that are now opened up because of these, these models.

What dies, what stays? Um, what do you guys think?

Chris: I think the most impressive thing about LLMs is their ability to be so flexible, right? The zero-shot or few-shot learning where you just give it a little bit of indication of what you want it to do and suddenly you can have a sentiment classifier, or a variety of different use cases. So I think that's what is the scariest thing about this, right? Because you're asking, all right, well what, what traditional machine learning modeling does there need to be? I think, you know, in the near term, there needs to be a lot still. I mean, there's very specialized use cases, especially in the, in the computer vision world. Although the things happening in the generative vision world are also really exciting. But I guess my answer is, I don't know. It's definitely changing, I think. I agree. Sentiment analysis: why spend a bunch of time, uh, if you can get the same accuracy out of few-shot prompting of these large, powerful models?
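To make the few-shot idea concrete, here is a minimal sketch of how a handful of labeled examples can be turned into a sentiment-classification prompt. It only assembles the prompt text and does not call any model. The example reviews, the label set, and the function name `build_sentiment_prompt` are all illustrative assumptions, not something described by the panel.

```python
# Hypothetical labeled examples; in practice these would come from your own data.
FEW_SHOT_EXAMPLES = [
    ("The checkout flow was fast and painless.", "positive"),
    ("My order arrived broken and support never replied.", "negative"),
    ("The package came on Tuesday.", "neutral"),
]


def build_sentiment_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot classification prompt as a plain string."""
    lines = [
        "Classify the sentiment of each review as positive, negative, or neutral.",
        "",
    ]
    # Each example shows the model the input/output pattern we want it to continue.
    for review, label in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final entry leaves the label blank for the model to fill in.
    lines.append(f"Review: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


prompt = build_sentiment_prompt("Battery life is terrible.")
print(prompt)
```

The resulting string would be sent to whichever LLM you use; swapping the examples and the instruction line is all it takes to retarget the same pattern at a different classification task.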

Robert: I do think people will still want to know the sentiments in a lot of cases, right? They just, they won't have, um, but they'll be able to tackle more ambitious problems and it'll be kind of a, a simple, you know, subroutine or, or thing that just gets taken into account by a larger overall model. I think one thing I don't think will show up so much is just like a lot of the NLP subroutines, like part-of-speech tagging probably won't be as important because that was always just an input to, you know, other downstream tasks that you're doing and, and, uh, you know, you won't need to go through that intermediate step anymore.

Manu: Yeah, I agree. I agree with that. And I would actually, um, expand the scope from, uh, sentiment analysis to just like data categorization. I think data categorization by and large is gonna be solved or is already being solved and it will be solved. And classifiers, you won't need to build those models yourself, likely. And a variety of NLP tasks traditionally like that, like part-of-speech tagging, entity recognition, things like that are also, we are looking at basically at the frontier, like at, at the edge, where it's gonna be mostly solved by LLMs. If not one, you could use multiple LLMs, uh, in conjunction to do the task.

What's really fascinating to me is that, you know, there's this whole rule-based approach that emerged, uh, to dealing with, um, these, um, NLP tasks. Like you can programmatically label data, um, uh, you know, with regex queries and things like that. I actually think that that is all being disrupted with LLMs because these rules, um, are in many ways, um, some things that LLMs can actually do already, like LLMs can do very great regex queries themselves, uh, things like that. So I think we are in a very interesting time, uh, this next few months, this year where, um, we might see a major disruption of most of the ways people went about building the classical NLP systems.
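For readers unfamiliar with the rule-based approach mentioned here, the following is a minimal sketch of programmatic labeling with regular expressions. The patterns, label names, and the function `label_with_rules` are hypothetical and chosen only for illustration; real rule sets are usually larger and combine votes from many rules.

```python
import re

# Hypothetical rules: each pairs a compiled pattern with the label it votes for.
RULES = [
    (re.compile(r"\b(refund|charged twice|overcharged)\b", re.IGNORECASE), "billing"),
    (re.compile(r"\b(password|log ?in|locked out)\b", re.IGNORECASE), "account_access"),
]


def label_with_rules(text, rules=RULES):
    """Return the label of the first matching rule, or None to abstain."""
    for pattern, label in rules:
        if pattern.search(text):
            return label
    # No rule fired: leave the example unlabeled rather than guess.
    return None


print(label_with_rules("I was charged twice this month"))
print(label_with_rules("I love the app"))
```

The brittleness is visible even in this toy version: any phrasing the patterns do not anticipate falls through to `None`, which is the gap the speaker expects LLMs to close.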

Aparna: Got it, got it, got it. Okay. I'm gonna take a little turn unless…

Robert: You wanted, uh, hot takes, right? So I have a question for, for Manu. Um, you know, are we gonna, in the future, are we going to, um, label all of our data or are we just gonna generate all of our data?

Manu: Yes, so I think, you know, I've always thought of the problem as like there's some level of supervision required at the moment. And it was really the focus of where the supervision is. Uh, a few years ago, the supervision was primarily on labeling the data, uh, and doing it very meticulously. Um, now the supervision is more, uh, moving towards more on the feedback. Like the models are already kind of out of the box, like, great, and you're kind of like prompting and saying like: Hey, I want you to do X or I want you to not do X or Y or Z. And so the, the supervision is, uh, the, the focus is moving. Um, and I think, um, it seems to be that in the near term, there will be some level of supervision still required, and, uh, even if you, let's say, completely get rid of supervision, the very act of just monitoring your ML systems and, uh, fine-tuning the knobs to ensure the AI system is doing what you want it to do would probably become a new, uh, definition of supervision. Kind of like you're in a control room and you have all the knobs and dials and you know, the nuclear reactor is sort of working and you're just mostly monitoring the system, and that becomes perhaps a new definition of supervision.

So, yeah, I think labeling in many ways should be seen as converting unstructured data to some structured information, um, to, um, to power downstream applications. And there are a lot of use cases where you do need, like, you know, a sentiment analysis, like a tag, uh, you know, one of the enum options, to power logical systems downstream. And so I think that will largely be automated. I hope it gets automated. It'll be super fun and easy for everyone and make machine learning accessible.

Chris: Yeah, I'll echo like when we don't supervise these things, that makes me really nervous. Right. I wanna make sure we're, we're, we're watching and measuring.

Manu: Yeah, exactly.

Aparna: Robert, you're stealing the next set of questions straight outta my mouth. There's a bunch of questions we've had from our users about how do LLMs change the MLOps or ML workflows today. One of the questions was what Robert asked you, so how does this impact labeling, will labeling happen? I guess the question for, for you all is, um, what does experiment tracking mean in this era? Is that tracking prompts now? Like what is, what does that look like in the era of LLMs?

Chris: Yeah. Uh, well, when we look at LLMs, we see three primary use cases, right? There's, there's one, the LLM creators, and we're, um, fortunate enough to be helping LLM creators like OpenAI and, and Cohere actually build their foundation models.

But there's only gonna be so many of those, right? Now, how many will create from scratch is, is still an, an open question, but we don't think it's enough to just focus all of our business on the LLM creators and stay with kind of traditional, uh, experiment tracking. Then we have LLM fine tuners, right, where you kind of take a foundation model and then fine tune the learned weights for your specific dataset.

I think this is, this is still emerging, but it's not clear how, how common this is gonna be, either. With the, the latest iterations of these models, um, it seems like maybe solely with prompting, or as the contexts get bigger, you can achieve a lot of the benefits you could get from, from fine tuning. So then there's the last area, which I think is the most exciting area right now as a company that's building developer tools, which is the, the prompt engineers.

Often, you know, my background is more application development. I love making things. Now we've enabled all of these creators to use AI and machine learning without having this traditional background. But they still need a lot of new tools and, and ways to ensure that their applications are doing the right thing. It's tempting to just trust that it's gonna work out. But you need new tools now to actually measure how these models are functioning, um, within your application to, to have some measure of quality. And it's a pretty big paradigm shift to go from application development where you could define logic and then write a test to say: if this, then it's okay; if that, then it's not okay.

So now we have this probabilistic outcome that is somewhere in between and you, you know, you need a way to, to actually measure it and watch that. So, uh, you know, this makes our jobs as a company that's building developer tools harder, right? Before we, we could focus on this one, um, persona that we knew really well, and now we have a much broader, um, set of personas, but, uh, that's, that's an exciting opportunity for us. So, you know, in London we just released our new, um, Weights & Biases Prompts offering, which is really about tools to help with debugging chains. So we have integrations with LangChain and OpenAI tooling. We expect there to be a number of other frameworks and systems that, that, uh, get released to help with chaining these things together and creating kind of tools and agents. Uh, so yeah, we're, we're holding on for the, the ride, but it's definitely, um, an area we're investing a lot in and, and think it is, it is gonna really change the way applications are built.
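One way to picture the shift from "if this, then it's okay" tests to measuring a probabilistic outcome is to sample the model many times and track a pass rate against a threshold. The sketch below assumes a stand-in function `fake_model` in place of a real LLM call, and a hypothetical 0.8 quality bar; it is not tied to any particular tool mentioned on the panel.

```python
import random


def fake_model(prompt, rng):
    """Stand-in for a real LLM call (hypothetical): right about 90% of the time."""
    return "positive" if rng.random() < 0.9 else "negative"


def pass_rate(model, prompt, check, n_samples, rng):
    """Sample the model n_samples times and return the fraction that pass check."""
    passed = sum(1 for _ in range(n_samples) if check(model(prompt, rng)))
    return passed / n_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
rate = pass_rate(
    fake_model,
    "Sentiment of: 'I love this product'",
    lambda output: output == "positive",
    200,
    rng,
)

# Instead of asserting one exact output, compare the rate to a chosen threshold.
THRESHOLD = 0.8  # assumed quality bar, to be tuned per application
print(f"pass rate: {rate:.2f}, meets threshold: {rate >= THRESHOLD}")
```

Tracking this rate over time, per prompt version, is roughly what "measure it and watch that" could look like in a test suite.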

Aparna: Got it. That's, that's super exciting. Um, Mike, I guess I'm jumping over to you. What, how does this impact, uh, feature stores and, and I guess bigger meta question for you, is there still tabular data in this world?

Mike: Yeah. This LLM stuff affects almost all MLOps tools in a big way because there's a lot of changes to the core workflows and there's also a lot of, there remains a lot of uncertainty about the workflows. For example, Chris was just saying, we don't know how much people are gonna be fine tuning things in the future. It could be the case where it's just like foundation models are what everybody depends on. Or it could be the case that everyone's gonna be fine tuning, you know, for their use case internally in their organization. And then it's a question of does that happen in your organization or through some external API, some vendor? Those, those different outcomes pretty dramatically affect, like what is the workflow and what are the related tools, like what should everybody here be doing to help the users, right? In our world, uh, so you asked, hey, is there still tabular data? And, uh, the answer is, yeah, like tabular data is gonna exist anyway, and uh, I think the, the question, I think the question that you're implying though is like, hey, is it still like very important to have and be working with tabular data in a machine learning context, or can we just dump all the raw data into these LLMs and just hope they kind of figure things out?

Yeah. And um, uh, what we are finding, so, you know, what is, what is a feature, right? Like a feature is just, uh, some data that you pass to your ML model to help your ML model understand the world in some way, right? So we kind of think of like, the features are kind of the way that ML models understand the world. In LLMs, they kind of have two ways that they can kind of get information into, into them. One, they can memorize stuff. If I go on ChatGPT right now and I say, what's the capital of, you know, England, I'm sure it knows that. It doesn't have to look that up somewhere. But you can imagine, there's a lot of questions I would ask it where I have to pass in information. Like, what is the sentiment of this user ticket? Well, I gotta pass in the details of the user ticket, right? And you know, right now that task, let's say that task is done, or respond to this user ticket, that's done by a human today. And what is the information that a human needs to do that task, right?

We don't just give them the ticket, but in reality we also give them a whole bunch more information about the customer, the customer's history. And we tend to do a lot of summarizing, organizing that data internally: what platform is this customer on? How many times did they submit a ticket in the past 30 days?

What is the historical sentiment of this customer? A whole bunch of different summaries, refining all the different data to make it, uh, something that you can kind of like easily reason about. And you know, I think of LLMs as kind of like, uh, I didn't make this up, but somebody said this and I thought it was really nice, like, calculators on words. They're like a reasoning engine.

And so to the extent that we can provide information to them that is summarized and easier to reason with, uh, that is, uh, going to make the LLM's job much easier. For example, today, I don't know how I could pass in all of the history of every possible thing that has happened related to a customer in my organization into an LLM to make like a really fast, uh, inference. I'm sure stuff like that will become possible in the future, but a lot of the use cases we deal with are super real-time things that have like a crazy amount of context. And the data scientists put a lot of work into, uh, into summarizing things to distill these features into very high information potency signals.
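The support-ticket example can be sketched in a few lines: rather than passing a customer's full history to a model, the history is reduced to a handful of summary features (platform, ticket count in the past 30 days, historical sentiment) that are then formatted as context. The record layout, the field names, and both function names below are assumptions made for illustration, not Tecton or Feast APIs.

```python
from datetime import datetime, timedelta


def summarize_customer(tickets, now, window_days=30):
    """Reduce raw ticket history to a few compact, hypothetical features."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in tickets if t["created_at"] >= cutoff]
    scores = [t["sentiment"] for t in tickets]
    return {
        # Platform of the most recent ticket (tickets assumed sorted by time).
        "platform": tickets[-1]["platform"] if tickets else "unknown",
        "recent_ticket_count": len(recent),
        "avg_historical_sentiment": round(sum(scores) / len(scores), 2) if scores else None,
    }


def build_context(features, ticket_text):
    """Format the summarized features plus the new ticket as prompt context."""
    lines = [f"{name}: {value}" for name, value in features.items()]
    return "Customer summary:\n" + "\n".join(lines) + f"\n\nNew ticket:\n{ticket_text}"


# Made-up history purely for demonstration.
now = datetime(2023, 4, 26)
history = [
    {"created_at": datetime(2023, 1, 10), "platform": "ios", "sentiment": -0.5},
    {"created_at": datetime(2023, 4, 5), "platform": "ios", "sentiment": 0.1},
    {"created_at": datetime(2023, 4, 20), "platform": "android", "sentiment": -0.2},
]
features = summarize_customer(history, now)
print(build_context(features, "The app crashes when I open my cart."))
```

In a production setting the hard part is what this sketch skips: keeping those summaries fresh and serving them at low latency, which is the problem the speaker describes next.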

Um, so tabular data is still around, uh, and productionizing this data, and productionizing the systems that keep all of the signals that we know about the world, our customers, our items, our everything, uh, up to date and fresh, to be able to give this important view of the world, the data, uh, into the models, the reasoning engines such that they can make high quality decisions. That problem still exists no matter what kind of like reasoning engine you have, whether it's a traditional ML model or if it's an LLM, um, that will always be around.

One of the biggest, uh, one of the biggest kind of changes though, and I, and, and I think we should like, it's kind of an interesting thing for us to all reflect on, is that you probably don't have to train a model anymore. And so a lot of this, the tooling that has gone into like model training, uh, I, you know, that that's the kind of offline online consistency stuff. Well, the offline training workflow is not that important for, uh, a lot of people who are not dealing with model training, obviously. Yeah. So that's a thing that affects us directly, and so we focus a lot more on the kind of like the production side of things that will always be relevant for every use case. Does that make sense?

Aparna: Yep. Yep, yep, yep. And I think one of the points you made, and I think Carlos is actually asking in the chat, let me ask you: In the context of low latency, high throughput, ML-based solutions, do LLMs actually scale?

And I think at least what I've seen, would love your take, Mike, is that it doesn't replace things like ranking models and recommendation models and, you know, ads, uh, click-through rate models today. In the world of today, with that high throughput, it doesn't. Do you think it might?

Mike: It totally does not today. Who knows? Yeah, it could in the future. Um, a lot of these, so a lot of these systems, like I used to work on the ads system, the ads click-through rate model at Google. Super optimized, crazy scale, driving a lot of revenue, highly productionized kind of stuff. And, uh, you know, if I replace that with an LLM, that would be, uh, like I'm imagining millions of times higher, uh, compute and more expensive to run, uh, at that type of scale. Not to say that it can't be done in the future, we just haven't gotten there technologically yet. I'd be really excited for us to get to that point. It's gonna be an interesting, uh, position to be in, but a lot of, uh, ML systems today, particularly like ranking, fraud detection, personalization... A lot of these things have a lot of, um, kind of optimization built into them. That just means that they're not the first things that LLMs are gonna, are gonna take over. A lot of the text stuff that we've been talking about that has historically, uh, had poor performance from traditional ML systems is, is likely to be, uh, first on the chopping block, I think. And, and just one more thing, clarification on the training thing. I was just saying, it's not to mean that there's no training, it's just, it just decouples who's doing training and serving. There's someone else training the model now, so that you don't need to have a workflow that works for both of those at the same time, uh, as commonly anymore.

Aparna: Got it. Got it, got it. Okay. Well this, this is actually a good segue into a debate I wanted to ask you all about. So there's closed source models, you know, the ones from OpenAI, Anthropic. There's open source, you're all seeing the Hugging Face, you know, that. And then there's these in-house LLMs where people are building them to be hyper personal, training them on their own data. Where do you think we go in, in a year from now? Do people build their own in-house? Does the open source stuff kind of, kind of pop off? Love to hear where each of your takes are on this. You wanna kick us off, Manu?

Manu: I'll start with the caveat that I think this is one of the, um, strongest exponential technology trends. And so humans are gonna be extremely erroneous in making predictions on what will happen next. Uh, I've been so wrong so many times, uh, in the last few months and years. So, uh, with that said, I think there's room for both. Um, my hunch is that, um, in the near term the closed source LLMs are gonna be able to provide a lot more, like a product that offers all kinds of bells and whistles, like content safety, moderation, some enterprise features, scaling the APIs and so forth. I think open source, uh, is likely going to catch up really, really quickly, uh, just because there's so much tremendous interest in the world. And, um, and, and so, uh, it is going to, you know, um, the world is gonna invent all these ways to do it, uh, themselves, um, as well.

Uh, I think what's really interesting is to think a little bit more broadly. The compute, uh, cost, uh, or model training cost has been declining, I think effectively at 78% or something per year. I'm quoting research that ARK Invest, uh, has conducted recently. And if that trend continues, which I believe it has been continuing for the last, you know, 10 years or so, then, um, the cost of training GPT-3 or 4 might be, uh, like $300 to $400,000 by 2030. And uh, you know, so if I look at that, um, and uh, you know, maybe there's some error margins, it just feels very natural to me that the world is just gonna create all kinds of models. Um, and they would wanna have freedom to do so, um, uh, across the board. So I think we are definitely on a journey where we're gonna see a lot of LLMs across the board or a lot of like these foundation models and, um, perhaps the, the nuance or differentiation would be the corpus of data and the, uh, the instruction set that is fine tuned or reinforced or things like that, techniques and so forth.
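The compounding behind a claim like this is easy to check for yourself. The sketch below projects a cost forward under a constant annual decline; the roughly 78% rate is the figure quoted in the discussion, while the starting cost is a placeholder assumption, since no baseline is given here. The function names are hypothetical.

```python
import math


def projected_cost(start_cost, annual_decline, years):
    """Cost after `years` of compounding decline at rate `annual_decline`."""
    return start_cost * (1.0 - annual_decline) ** years


def years_to_reach(start_cost, target_cost, annual_decline):
    """Years of compounding decline needed to go from start_cost to target_cost."""
    return math.log(target_cost / start_cost) / math.log(1.0 - annual_decline)


# Placeholder baseline (an assumption, not a figure from the panel).
ASSUMED_START_COST = 10_000_000.0
QUOTED_DECLINE = 0.78  # "78% or something per year", as quoted

for year_offset in range(0, 8):
    cost = projected_cost(ASSUMED_START_COST, QUOTED_DECLINE, year_offset)
    print(f"year +{year_offset}: ${cost:,.0f}")
```

Because each year keeps only 22% of the previous year's cost, the result is very sensitive to both the baseline and the exact rate, which is worth keeping in mind when reading any single projected dollar figure.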

Aparna: Got it. Okay. So Manu's kind of on the closed source for the beginning, market for both. Robert, what, what's your take on this?

Robert: Yeah. Well first of all, you know, it's important to remember that this is really right at the start, right? This is very nascent. Um, now what we see, um, and I think, over time, there's going to be just significant R&D effort going into making these models, um, cheaper, you know, faster, smaller, and sort of expanding into supporting some of these, um, kinds of applications that Mike mentioned are not necessarily practical or cost effective today.

Right? So that's gonna be, that's gonna grow. Um, what we see today, uh, is a lot of people actually combining, you know, using both open source models as well as, um, APIs, right? And sometimes stitching these together in the same application using tools like LangChain. And so, you know, you may featurize your data using something like, uh, or, you know, create embeddings using Hugging Face and open source models there, you may query an API, um, and these things all kind of integrate together. Now, I think what we'll see is, uh, significant progress on open source models. I think for open source models to be pervasive, they have to, they'll have to be good. Quality will be the number one, um, factor there, determining if they're competitive or not.

I think based on the rate of progress that we've seen so far, of course, you know, these things are hard to predict, I think that'll, uh, that'll continue to grow. And I think you're going to see, you know, as the stuff gets easier to do, as it gets cheaper to do, you're going to see a lot more of it, uh, happening as opposed to less of it happening: people training models.

Um, I do think it's important to distinguish between the foundation model creators and then the consumers of these, you know, these sort of people building products using these models. For the foundation model creators, and there are companies like, you know, OpenAI and Cohere, um, that use, you know, that use Ray, the, uh, open source project that we're building, to train these models.

As well as open source versions like GPT-J, you know, which are trained using Ray. Um, there the focus is on training, right? Creating the underlying model. And there is a tremendously difficult and expensive compute problem there. For everyone else, you know, for everyone who is trying to build a product using these models, we see a shift away from training the model and creating the underlying foundation model to fine tuning the model and serving the model. So that's where we see a huge growth in need, basically in infrastructure needs. This is people who want to take existing models, not necessarily spend a hundred million dollars to create the underlying foundation model, but actually take a pre-trained model, fine tune it using their own data, right, specific to their problem, use that to create a better experience for their product, and then actually ship that and deploy that as part of their product. And so the infrastructure challenge gets much more challenging there as well, right? Because you are now deploying models that might require a bunch of GPUs and, you know, the way you were doing it before didn't require that, right?

You might need to ingest a ton of data to feed all of the GPUs to do fine tuning, and to do that in a performant way. And I think one of the changes here is that we see a lot of increased, you know, interest in cost efficiency: basically, how do I serve these models in a performant way, using a bunch of GPUs, but not spend all of my money doing that? And I think that's an area where certainly a lot of the work on infrastructure, a lot of the work that we are doing, is about, uh, making it cost effective to do that.

Manu: Hmm. On the cost effective piece, uh, I was recently speaking with an AI leader at one of the biggest, um, might be Fortune 50 companies. It was really fascinating to see. Like, certainly GPT-4 or an equivalent system can do all those tasks for conversational AI, but their scale is so big, where like 70, 80% of Americans are using their products. And if they were to bring that system in at that scale, it would cost them a fortune versus, um, the alternative, a custom, uh, natural language system that they have purpose built for it, because of the sheer amount of usage that there is, latency requirements and so forth. And so I think, um, you know, to your point, Robert, the cost of inferencing is gonna likely be a very big factor in the short term in deciding which direction people go. Um, ultimately it has to solve a problem for the enterprise or for the business.
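
To make the cost argument above concrete, here is a rough back-of-the-envelope sketch in Python. It is not from the panel: the `Workload` type, both cost functions, and every number in it are hypothetical placeholders, not real prices or measurements. It only shows the shape of the comparison between a metered per-token API and a purpose-built model served on GPUs paid for by the hour.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_day: float    # hypothetical traffic volume
    tokens_per_request: float  # prompt + completion tokens, hypothetical


def api_cost_per_day(w: Workload, price_per_1k_tokens: float) -> float:
    """Daily cost when every request goes to a metered, per-token API."""
    return w.requests_per_day * w.tokens_per_request / 1000 * price_per_1k_tokens


def self_hosted_cost_per_day(
    w: Workload, gpu_hourly_cost: float, requests_per_gpu_hour: float
) -> float:
    """Daily cost when a purpose-built model runs on GPUs billed by the hour."""
    gpu_hours = w.requests_per_day / requests_per_gpu_hour
    return gpu_hours * gpu_hourly_cost


# All numbers below are placeholders chosen only to illustrate the arithmetic.
workload = Workload(requests_per_day=50_000_000, tokens_per_request=500)
api = api_cost_per_day(workload, price_per_1k_tokens=0.03)
hosted = self_hosted_cost_per_day(
    workload, gpu_hourly_cost=2.0, requests_per_gpu_hour=20_000
)
print(f"API: ${api:,.0f}/day, self-hosted: ${hosted:,.0f}/day")
```

A real comparison would also need to account for latency targets, engineering time, and the quality gap between the two systems, none of which this sketch models.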

Mike: I think it's also interesting to just think about, like, there's a lot of uncertainty about what happens in the short term, but where are we converging to? So I think one big question is: is the fine tuning thing gonna be a big deal, uh, longer term? And either it is or it isn't. Let's say it isn't. Let's say foundation models just become so powerful you never need to fine tune anything and you can just put everything in a prompt.

Um, then who's gonna make the foundation models? Are the best foundation models gonna be like, uh, you know, Stability AI, open source things, or are you just gonna get it from an API, uh, from OpenAI? And my guess, this is kind of like my hot take, is the closed source foundation model providers, just because of, like, economic incentives, are always just gonna have more budget to build a better one.

They're gonna be a step ahead. They're just gonna always be more powerful. But open source is gonna be coming after them and is gonna be like a close second. What does that mean in terms of, like, performance? Well, let's say, uh, you know, the OpenAI foundation model is up here and the best open source one can do 90% of the things that it can do. It's just a 10% performance decrease. Well, what does that mean for the average person? Is it gonna make sense for them to use the open source one or the closed source one? Uh, that's a very interesting question. That 10% difference in performance, how does that translate to the average ML use case? Which one should you choose? And I suspect there's gonna be some people, for some use cases, where the performance of this model is the competitive frontier for what they're building. Like, you know, if it's fraud, I was talking about fraud a lot, that's gonna save you directly more money if your model is slightly more accurate, right?

But there's sometimes where, you know, say sentiment classification, it works. You know, if you just have the 90% one, it's good enough. And so that 10%, uh, performance boost, like, what fraction of ML use cases are really going to benefit from that? And then for those folks, say it's also 10% of ML use cases that are like, hey, I can't use the open source one, I gotta use the highest performance one, the closed source foundation model, how valuable is it for them to use that? Is that so much value that, uh, it continues to fund an OpenAI on an ongoing basis? Uh, that's a really interesting question to me. I suspect that it actually is a lot of value for those people who really need that extra little bit of performance, because when they need that extra little bit of performance, it's because their business demands it. They're making more money or saving a lot of costs based on that. And so that will drive significant revenues. And I think we see the same kind of parallels with a lot of other kinds of tools in the MLOps space too. Uh, you just see, you know, more advanced organizations paying for the best tools to get the best workflows, even going from batch to real time, which is the thing that we talk about all the time. And Aparna, the thing we did at Uber: Uber's ETA models used to be all batch.

And, uh, and then we were like, you know what? This is probably just gonna be so much better if we get real-time traffic signals and we do everything in real time. And we did that. And it just, it just completely changed the accuracy of everything and it was a really big deal. And I think we're gonna see a lot of, a lot of those kind of step function changes in performance that will translate to value as you continue to go from open source to the additional level of performance from the closed source model.

So that's my take on it. But that's all under the assumption that, like, fine tuning goes away. And fine tuning sticking around longer term, that's a lot more uncertain for me.

Aparna: So I got a perspective from a company, you all probably know the name, um, but they're using LLMs to generate content. One of the takes they had was that use cases where they need to be hyper personal is where fine tuning's gonna matter. It goes back to your point, Mike, that it's our competitive differentiation. We're an AI first company, and if we're just using the same off the shelf models as everyone else, what is the, you know, is it just our UI? Like, is that our competitive differentiation? Well, it's gotta be the content we produce. We have to be really, really good for the users that we provide for.

And this person kind of really believed that, you know, their reason for fine tuning was to make the content that they generate hyper personal. Robert, I guess a question for you, what are the use cases you're seeing where people are fine tuning today or anyone here really?

Like where do you guys think fine tuning will matter?

Robert: I want to add on to that point. I think if you are in this world where everyone is using the pre-trained foundation model, right, and you're building your product on top of that, if you're building a business, you have to be adding value somewhere. And you know, similar to what this person is saying, what is that value? Now, one possible answer is that you have data that other people don't have, right? And you're bringing that data and injecting it into the machine learning in some way, right?

That could be through fine tuning, that could be through retrieval. That could be, you know, done in a number of ways. But fine tuning is the big one that we see right now. Another option is that you're adding a lot of value, um, in the application or the product that you're building on top of the model, and the model is not the key part. And in that case, that means you're building a lot of logic around the model in the application layer, say at the serving side. And there we see, uh, more complexity shifting to model serving. And there it's typically not just, um, you know, a model behind an endpoint and that's the whole story, but rather multiple models combined together, you know, composed with other application logic. And there we see a growing need for just more expressivity in the, uh, serving logic.

So I agree with this other take that you're referring to. Now, you just asked, can you say the question again? You asked where do we see fine tuning being used?

Aparna: Yeah. Like, can anyone here give me examples where they've seen, okay, it's totally worth it for us to fine tune?

Mike: And also I'll add on to the question. I'm really curious what you guys would say. Do you think it's gonna stay like that in the future? Uh, or do you think that use case for fine-tuning is gonna kind of like vanish over time?

Robert: Yeah. Um. So right now, you know, we see fine tuning being used to improve performance. Uh, a lot of the time you have data that is not, um, specifically what, you know, ChatGPT was trained on, right? It's your user data, or, you know, you're doing code generation, you're trying to generate, um, you know, things where, in your workflow, you have a lot of data that you collect from your users about what good looks like, uh, and the out-of-the-box performance is just not quite there.

Aparna: Yeah, I've seen language be an issue. You know, just some examples to give you guys. I mean, Harvey's an example, and legal's one where maybe LLMs weren't trained on that type of data. Uh, I think especially in, like, biology, chemistry, kind of these deep science applications, there's gonna be applications there where you'll need to fine tune. Um, and I think...

Robert: Uh, exactly. The medical domain is an important one. Just anything that's really domain specific where, you know, the stuff that you are doing, or that, uh, you know, you need the language model to do, is not, um, stuff that is already out there all throughout the internet.

Manu: I think fine tuning will be important and it will continue to concentrate towards the most lucrative areas, uh, for businesses to invest in. And the way I think about it is, um, what makes a product very differentiated, right? So let's say you buy insurance from, like, Progressive, Geico, you know, there's all these insurance providers and they kind of generally have similar access to data. I think the differentiation is really, like, how they underwrite a critical case. And they all have biases, because they have found those biases to be, uh, advantages to them. And I would think that they would want to incorporate the data they have, but also the biases and the ontology of the business and the way they make decisions, to, uh, infuse it into an AI model that is uniquely theirs.

And that's what I think is going to make these companies interesting and prevalent. Otherwise, we all have, you know, a single API that does everything. Uh, there's a non-zero chance that is possible, but it's unlikely, because humans adapt and they find kind of the nuances. Like, I'll take coding as an example. One of the most interesting characteristics in software teams is, um, that some of the high performance software teams tend to have a style guide in terms of how they write software, uh, what the patterns are and so forth, and, you know, what's allowed, what's not allowed. And, um, I think we have yet to see a product for that. Like, hey, you know, this is how LabelBox writes, uh, software, or this is how Ray writes software, and the examples or suggestions are hyper-specific to the organization.

And so, I think we are in very early innings, and I would not bet against human creativity, which is there will always be these nuances that ultimately make interesting products and services in the market, which will probably require fine tuning.

Aparna: Well, these were really great takes on fine tuning. Let's see what happens a year from now. Um, okay. I'm gonna change gears a little bit before I hop into some audience questions. First off, to even know if you need to fine tune or change the LLM, you need to be able to measure or troubleshoot LLMs and their responses. Um, Chris, I guess this is a great one for you too, especially with your new launch. What do you think troubleshooting LLMs will look like? Um, and anything you've seen people do out there already?

Chris: Yeah, so, uh, last week at our event we released an integration with OpenAI Evals. So OpenAI created this, you know, central repository to define a number of different ways to evaluate LLM output. So on the one end, if you're doing, like, sentiment analysis or multiple choice question answering, you have the truth. Maybe you worked with Manu to get it at LabelBox, and then you can evaluate how well the model's doing straight outta the box.

If you're doing something like summarization or, um, something that's a little more open-ended, the state of the art is, like, to have GPT grade it. Which is wild to me, but I can imagine, like, if I'm using GPT-3 to generate responses, then I can have GPT-4 grade it and have, you know, some meaningful signal there.

You probably also want to work with LabelBox or a labeling provider to do, you know, additional, um, labeling of those results. But, uh, I think this is like a big muscle that companies are gonna need to learn how to flex if they really wanna, uh, put this into production and then know with certainty, as they modify their prompts or the underlying models themselves change, that the accuracy of the output is good.

I think, you know, OpenAI Evals is one of many frameworks that are emerging to actually do this. But I think that is, like, uh, kind of table stakes to really deploy this in production. So it's definitely something we're gonna be watching closely, as I imagine Arize will as well.
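
As an illustration of the two evaluation modes Chris describes (not code shown in the session, and not the OpenAI Evals API), here is a minimal Python sketch. `exact_match_accuracy` covers the case where ground-truth labels exist; `model_graded_score` covers open-ended output, where a grader scores each response. All function names are hypothetical, and `toy_grader` is a simple word-overlap stand-in for what would in practice be a call to a stronger model.

```python
from typing import Callable, Sequence


def exact_match_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions equal to the ground-truth label (case-insensitive)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(p.strip().lower() == y.strip().lower() for p, y in zip(predictions, labels))
    return hits / len(labels)


def model_graded_score(
    sources: Sequence[str],
    outputs: Sequence[str],
    grader: Callable[[str, str], float],
) -> float:
    """Average of grader(source, output), where the grader returns a value in [0, 1]."""
    if len(sources) != len(outputs):
        raise ValueError("sources and outputs must have the same length")
    if not outputs:
        return 0.0
    grades = [grader(s, o) for s, o in zip(sources, outputs)]
    return sum(grades) / len(grades)


def toy_grader(source: str, summary: str) -> float:
    """Stand-in grader: fraction of summary words that also appear in the source."""
    source_words = set(source.lower().split())
    words = summary.lower().split()
    return sum(w in source_words for w in words) / len(words) if words else 0.0


# Classification-style task with ground truth.
accuracy = exact_match_accuracy(
    ["positive", "Negative", "neutral"],
    ["positive", "negative", "positive"],
)

# Open-ended task scored by a grader instead of labels.
graded = model_graded_score(
    ["the cat sat on the mat", "the cat sat on the mat"],
    ["cat on mat", "dog on mat"],
    toy_grader,
)
print(f"exact match: {accuracy:.2f}, graded: {graded:.2f}")
```

Running the same scoring over a fixed set of examples before and after a prompt or model change is one simple way to get the kind of regression signal described above.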

Aparna: Absolutely. It's a topic we're deeply interested in. All right, last question from me and then I'll hop into audience questions. So, you guys are leaders behind some amazing ML companies today, and data scientists and ML engineers who are watching this really wanna know how should they be preparing for this new era?

Um, what should they be doing or reading up on or kind of evangelizing inside of their companies to bring forth the era of generative? Any advice you can share for the kind of young data scientists, ML engineers in the audience?

Manu: I'm not a machine learning practitioner. I actually mostly learn machine learning by trying, uh, new things, uh, and learning online, uh, going through all the courses and so forth. And I continue to do so. Frankly, this group here produces amazing content. Um, you know, so I like, uh, the Weights & Biases blog, and LabelBox has really interesting content as well for data centric workflows and how to automate them. And so, yeah, I think it's really just, you know, finding the sources, um, and continuing to learn.

Chris: I would say make a demo. Just, like, play. It's so easy now and it's so fun. I wish I had more time to just go tinker with these things and make really cool demos. Um, hundred percent. That's really how we approach a lot of our product development at Weights & Biases: we'll have an idea and we'll think, oh, that's cool, and then we'll make a demo that isn't production ready yet, but it gets the message across, and then people start to play with it and they talk about it, and suddenly there's enough momentum to get it deployed within the organization. And it's so fun, so fun.

Mike: I think that's really true actually. It feels so much easier to make, like, compelling demos, and it's easier to get to, like, the point of magic with LLMs than pretty much any other technology that I've used before. And the thing that I think I would just, like, caution everybody on, or maybe just encourage everybody to keep in the back of their mind as they're going through, like, the process that Chris was just describing, is, you know, then there's a phase where you want to, like, productionize an application.

Eventually you're like, hey, I've got this cool demo. We want to actually, you know, put this in production and serve all our customers with it. And I think we've learned a lot of, like, lessons through the MLOps journey over the past couple of years, things that have worked and things that haven't worked, that are interesting to keep in mind so we can, like, try to, you know, uh, avoid some of the mistakes that have been made.

And I think one of the bigger things that I would try to avoid is having an organization where you set up a completely different technology stack, completely divorced from the existing business, divorced from the existing data stack, the existing data processing and production systems. Because you don't want your LLM team suddenly responsible for all this stuff in production that uses completely different tools than everything else, and they get no support from all of the rest of the platform teams in the organization. That's a pattern that we saw a lot of the ML platform teams in bigger enterprises go through. And I think that was kind of painful, and we're seeing a lot of people undo that over time, and a lot of these ML platforms now being kind of built more on top of the existing platforms, the data platforms, compute platforms, stuff like that in the company. And maybe just, as you're going from demo to production, it's worthwhile keeping that in the back of your mind, so that can be a less painful process for you down the road.

Robert: Yeah. Um, I think the demo point is extremely good. The part that I would add to what people have said is to, you know, take a step back, and I think there's gonna be a lot of value to understanding some of the fundamentals here. So things like, you know, understand what linear regression is and how that works, right? Understand, like, you know, why you wouldn't train on your test data, or what overfitting is, or some of the basic statistics or probability concepts. I think there's gonna be a lot of foundational material like that that is gonna enhance your understanding and ability to use these tools as you build your demos and, uh, you know, help avoid some common pitfalls.

Aparna: Awesome. Awesome. All right. I got one or two questions from the audience. And then seriously, this session's been amazing, so thank, thank you all for, for being here. Uh, Lamia has asked, from what you've seen, what are the biggest hurdles, challenges to productionization of LLMs, technical and otherwise?

Manu: You know, one of the biggest challenges that I think the industry's gonna solve, I think it's a really interesting problem, is, um, within an enterprise. Let's take an example of an LLM app like question and answer. You know, if you're in a really large organization, you could, uh, in theory train an LLM on the corpus of data for the knowledge base.

You don't need to go through like five or 10 different tools and kind of look for things to get questions answered. Let's say you make one LLM. The thing is that, well, not every single person in the enterprise should be able to ask, uh, or get access to financial questions or financial insights or some secret projects that are being worked on, et cetera.

So, uh, let's say once you assume this is an LLM, how does it actually work inside the large organization, um, in a way that maintains the, uh, access controls, um, that has an audit trail, um, and that has kind of the right access patterns to the corpus of data for it being refreshed?

And so I think, uh, this is an unsolved problem, but it's, uh, something that, um, the industry probably needs to look at. Uh, and it should look at it very quickly.
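
One way to picture the access-control and audit-trail requirement Manu raises is to enforce permissions at retrieval time, before any text reaches the model, instead of training everything into one LLM. That is a swapped-in approach for illustration, not something proposed in the session. In the Python sketch below, `Document`, `AuditedRetriever`, the group names, and the naive keyword matching are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document


@dataclass
class AuditedRetriever:
    documents: list
    audit_log: list = field(default_factory=list)

    def retrieve(self, user: str, user_groups: set, query: str) -> list:
        """Return matching documents the user may read, and record the access."""
        terms = set(query.lower().split())
        # Permission filter first: documents the user cannot read are never searched.
        visible = [d for d in self.documents if d.allowed_groups & user_groups]
        # Naive keyword overlap stands in for a real search or embedding lookup.
        hits = [d for d in visible if terms & set(d.text.lower().split())]
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "query": query,
            "returned": [d.doc_id for d in hits],
        })
        return hits


retriever = AuditedRetriever(documents=[
    Document("fin-1", "Q3 revenue forecast and margins", frozenset({"finance"})),
    Document("kb-1", "How to request a laptop", frozenset({"everyone"})),
    Document("kb-2", "Revenue recognition FAQ for all staff", frozenset({"everyone"})),
])

engineer_hits = retriever.retrieve("eng_user", {"everyone", "engineering"}, "revenue forecast")
finance_hits = retriever.retrieve("fin_user", {"everyone", "finance"}, "revenue forecast")
print([d.doc_id for d in engineer_hits], [d.doc_id for d in finance_hits])
```

Only the documents returned by `retrieve` would be placed in the prompt, so the same question can yield different context for different users, and refreshing the corpus means updating `documents` without retraining a model.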

Aparna: Got it. That's a really good point. Another quick hot one. Oh, go ahead. Did you wanna answer that one?

Robert: I was gonna add that the infrastructure challenge, you know, in addition to what Manu was saying, is far more challenging here because of the scale that things are happening at and the need for cost efficiency. And I think, you know, this is an area where, if you can provide infrastructure that enables you to do this in a performant and, you know, cost effective way, and at the same time enhances, you know, the ML team's velocity, their ability to ship products quickly, that's an area where you can add a lot of value. And this is things like making sure developers don't have to think about, you know, scalability, so they can just focus on their machine learning logic, right? Once they develop the model, they can easily, um, start serving it, right, they can start deploying it and not have to hand it off to another team or have another tech stack built, you know, that's different for training and for serving. So all of these challenges around scalability, infrastructure, uh, cost efficiency are where we see teams spending a lot of time.

Aparna: Got it, got it, got it. And with that, that is time. Uh, thank you everyone for attending this amazing session, and thank you so much to the speakers. You guys gave a lot of hot takes and hypotheses in a space that is still kind of exponentially growing.

Mike: Let's check back next year and see who's right. I think if there's even one prediction from this, you know, webinar that's accurate, I'm gonna send you a cake.

Aparna: Well, thank you all so much for being here. If you have any further questions for these amazing, uh, founders, they'll be in our Slack community. You can ask them for any other hot takes. Uh, but appreciate you all being here, and hope you enjoy the rest of the day at Observe.

Thanks for your time, everyone.
