Evaluate a Model: Traditional Machine Learning and LLMs with Sumble’s Anthony Goldbloom

Fireside chat with Jason Lopatecki, CEO at Arize, and Anthony Goldbloom, CEO and Co-Founder at Sumble, on evaluating models, traditional ML, and LLMs.

Jason Lopatecki: Welcome to day two of Arize:Observe. Today we have Anthony Goldbloom, who was one of the founders of Kaggle. Feel free to introduce yourself, Anthony. I feel like there's a whole generation that would love to hear about Kaggle. Tell us a little bit about what it was, what it is today, and maybe your initial journey too.

Anthony Goldbloom: Well, to start off, my background is econometrics and statistics. In the very early days of data science, my definition of a data scientist, at least, was somebody who knew more statistics than the average programmer and more programming than the average statistician. Before Kaggle I had worked for the Australian government doing econometric forecasting, so forecasting GDP, inflation, and unemployment, and I very much met that definition. I was a hobbyist programmer; I used to use things like R and Python, as opposed to MATLAB and some of the other more statistically focused tools. I used to read this newsletter, which I think is probably still around, called KDnuggets. It was this very small community of people doing what was sometimes called data mining more than machine learning. Those newsletters used to advertise a conference held every year called KDD, and the KDD conference would have a competition attached to it called the KDD Cup. Like most academic conferences, you could get a spot to speak by submitting a paper, but they had this other track where they would put out a problem, and the top performers on that problem would get a slot at the conference. I always thought that was very elegant. Measuring the quality of somebody's contribution by having humans review their papers is somewhat subjective; the idea that you could run a challenge and evaluate whose algorithms were most accurate always struck me as a very elegant way to divvy out spots at the conference. Now conferences like NeurIPS and CVPR have become, I guess, more prestigious, but at the time KDD was a very sought-after conference to get the opportunity to speak at.

So really, where Kaggle came from was bringing the idea of the KDD Cup into industry. We started running machine learning challenges. We'd have companies like Allstate, the insurance company, and a vehicle manufacturer run challenges with us. Allstate wanted to predict insurance claims: who's going to crash their car, a very bread-and-butter problem for an insurance company. They would put their problem up on the Kaggle website, and data scientists and machine learners would compete to build the best algorithm. It worked very well because we had a holdout set where we knew the answers, so we could evaluate people's performance against real data. The winners would win prize money, and in exchange they would give the intellectual property to Allstate, or whoever the sponsor was. I think Kaggle ended up being a very effective way for companies to discover new solutions. In Allstate's case, they had been using techniques that were very common in insurance circles, actuarial techniques and so forth.
And they learned about gradient boosting machines, which I think is what won their challenge. In the early days of Kaggle, data scientists came from lots of different backgrounds: backgrounds like mine in econometrics, but also computer science and bioinformatics, and each of those disciplines had its own pet techniques. In econometrics we were mostly using things like generalized linear modeling, so logistic regression and so forth. The computer scientists were often using more decision-tree-based techniques like random forests, a little bit of neural networks, and some support vector machines; each group had its own set of techniques. The magic of Kaggle in the early days was that it brought all these different groups together to compete on the same challenges, and you could see objectively what did well. One thing we can talk about a little later is the evolution of techniques from random forests through the transformer-based models we have today, and what inflection points we had over time. But that was how Kaggle got its start.

As the company grew, we added more services. For those familiar with Jupyter Notebooks, we added a hosted Jupyter Notebook. Initially we saw people trying to share code in our forums, and so we thought, why don't we give them an environment with a standard set of Python packages where they can easily reproduce each other's code? That's where the hosted notebook came from. It was really meant to be a support to competitions, but then we saw people starting to use it on non-competition datasets, hacking around our systems to bring in other data sets, so we launched a public data platform that allowed anyone to upload their own data sets. The three main components of Kaggle today are the machine learning competitions, the hosted notebook, and the public data sets.

Jason: I feel like a lot of us started on Kaggle, started building models and testing ourselves there. One of the things I noticed in my earliest days was the difference in results between when you build and when you submit. Can you talk a little about why you want to test blindly, why it's important to keep testing and comparing people in a blind fashion, and how you grew into the importance of that early on?

Anthony: Yeah, totally. One really large problem in machine learning is something called overfitting. What will very often happen is you train a model and you might get really good performance, and then you apply that model to future data that hasn't come in yet, or some other sample out of the same raw data set, and it doesn't perform as well. What Kaggle used to do, which was quite clever (what they do, sorry, I'm not there anymore, so I shouldn't say "we") is have a live leaderboard that shows how participants are performing in real time, so you can get a sense of your performance. But then they actually throw away the test cases on that live leaderboard and rescore people on a second test data set where nobody has had any feedback. And not only that: if you put in a hundred submissions to the competition, you can only pick two that are evaluated on that second test data set. Something we would see very, very often was somebody in, call it, ninth place on the public leaderboard, and then when we switched over to the private leaderboard they were in 100th place. What had happened was that they had overfitted to the public leaderboard, but they hadn't built an algorithm that generalized. So a very powerful thing about Kaggle is that the test data set is under lock and key, and you physically cannot overfit to it. I've always had this suspicion: a very high fraction of people who do their first Kaggle competition have that dynamic where their public leaderboard performance is much higher than their private leaderboard score. It makes me think that when people build models inside their companies, or write research papers, where the test data set is not under lock and key, how many of those are models that are actually overfit and don't generalize well in real life? Just based on what we saw at Kaggle and how often it happens, it makes me think a very high fraction of real-world algorithms in deployment are actually overfit.
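The public/private leaderboard mechanic Anthony describes is, in essence, a blind holdout evaluation. Here is a minimal sketch of the idea in Python, with hypothetical data and model choices rather than Kaggle's actual pipeline:

```python
# Tune against a visible "public" holdout, then rescore once on a hidden
# "private" holdout that never provided feedback during model selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Training data plus two holdouts: a "public" set with live feedback and a
# "private" set kept under lock and key until the end.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_public, X_private, y_public, y_private = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

best_public, best_model = -1.0, None
for depth in [2, 4, 8, 16, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    public_score = accuracy_score(y_public, model.predict(X_public))
    if public_score > best_public:  # selecting on the "public leaderboard"
        best_public, best_model = public_score, model

private_score = accuracy_score(y_private, best_model.predict(X_private))
print(f"public: {best_public:.3f}  private: {private_score:.3f}")
# A public score noticeably above the private score is the
# overfitting-to-the-leaderboard effect Anthony describes.
```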
Jason: I think the interesting point there, and I think it's true independent of the new world of LLMs or anything else we're doing, is that you need to test yourself, and have solid tests behind it that tell you whether what you're trying to do works in your deployed scenarios the way it did when you built it. And that kind of hints at the new-world stuff. Can you talk about the evolution? You've seen so much in the last decade. What has that evolution looked like for you, and what have you seen recently? I'd love to hear that story.

Anthony: Yeah. I'll go through the full history, at least as I've seen it. As I said, in the very early days of Kaggle you had all these different disciplines folding under this one data scientist job title, and you went from people using a whole lot of random techniques to random forests very clearly becoming the dominant technique early on in Kaggle. It won everything. Then in 2012 there were two developments. There was an algorithm people were excited about called gradient boosting machines, which had nicer properties than random forests but was harder to implement, and Tianqi Chen from the University of Washington built a really beautiful implementation of gradient boosting machines called XGBoost, which, by the way, is still the dominant way to train models on small structured data sets. It still dominates small structured data sets, surprisingly. The other thing that happened in 2012, which is often called the annus mirabilis for deep learning, is that Geoffrey Hinton and some of his grad students got a spectacular result on ImageNet. ImageNet was a computer vision benchmark, or challenge, not run on Kaggle but by a Stanford researcher named Fei-Fei Li.
What had happened was that for the longest time there was excitement about neural networks, but they never quite lived up to the promise. They had some successes. There was the famous digit recognition work, which I think was used by the postal service to read ZIP codes, and people at the time thought, oh, they can read numbers, look at all the amazing things neural networks are going to be able to do. But they never quite got there. There was a set of academics, now very well known, Geoffrey Hinton and Yann LeCun among them, who never lost faith and continued to iterate on those techniques, and in 2012 they had that breakthrough result on ImageNet, a result of being able to train deeper neural networks on GPUs and so forth. And really, since 2012, what has happened in machine learning has just been breathtaking. The first set of use cases were computer vision use cases: classifying images (this image has a dog, this one has a cat), object detection (being able to count Coke cans in an image, things like that), and segmentation (these pixels are sky, these pixels are road). A lot of that has led to things like automated radiology and autonomous vehicles, all unlocked by that first wave of advances with deep neural networks. From 2012 to 2015 it was very hard to train a deep neural network, but then you started having frameworks. The first one that really made deep learning accessible was Keras, and from there it became possible for not just PhDs at the University of Toronto but anybody to train deep neural networks. Then came the next set of big breakthroughs. I think we had diffusion models maybe around 2017, which is ultimately what has led to a lot of the generative image work. And then in 2018, of course, we had the transformer, "Attention Is All You Need" and BERT. Since then, progress has just been breathtaking: bigger and bigger models leading to more and more spectacular results, with GPT-3 being the big milestone and then ChatGPT being the aha moment, the consumable version that opened the world's eyes to what's possible here.

Jason: And I guess this is maybe philosophical, but do you think we've moved from the era of the data scientist to an era of AI?

Anthony: If you classify data scientists as some statisticians, some bioinformaticians, and some computer scientists and machine learners, it's definitely the case that the computer scientists have won, if that makes sense. Neural networks come out of computer science; random forests and gradient boosting machines do too. All the dominant techniques really have come out of that, call it, computer science discipline. I will say that transformer models do extremely well on a lot of problems. They're working on computer vision problems, they're obviously working for the NLP use cases, and as I understand it they're also really good for problems like proteomics and other problems like that. It still is the case that for smaller structured problems, things like forecasting, gradient boosting machines are dominant. And this concept of transfer learning, which is so important for unstructured use cases, does not work so well for structured data. More generally, I think the breakthroughs on unstructured data are breathtaking; the breakthroughs on structured data have not yet materialized.
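As a concrete illustration of the small-structured-data regime Anthony is describing, here is a minimal sketch of training a gradient boosting model with XGBoost's scikit-learn API on hypothetical tabular data; the parameters are illustrative only:

```python
# Gradient boosting on a small structured (tabular) dataset with XGBoost.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=300,    # number of boosted trees
    max_depth=4,         # shallow trees, added sequentially to correct earlier errors
    learning_rate=0.1,
    eval_metric="logloss",
)
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```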
Jason: You recently started Sumble, and you've gone from Kaggle as a platform to now being a user of different tools. What was your journey there? I think one of the big problems today is which tool to use for which problem; people jump to the coolest tool but maybe jump over something that would work and be simpler. How do you decide what to use, and for what problem?

Anthony: Sure. Maybe just on Sumble briefly: we're a very early stage company. It's just me and Ben Hamner, who was my co-founder and CTO at Kaggle, although we have a couple of new starters beginning soon, and we're aiming to build a large repository for external data sets. One of the things we've been playing with is this idea of whether you can extract structured information out of unstructured data. It's been interesting. As I said, we're in very early experiment mode, but we've been trying everything from regular expressions through to GPT-4, and it's been interesting to see the boundaries of which techniques work well on different types of NLP problems. Our use cases have been things like entity recognition. If you have a sentence like "Jeff Bezos, the CEO of Amazon, stood down last week," the entities would be "Jeff Bezos" (a person) and "Amazon" (a company). For extracting entities out of text, we found that GPT-3 (we haven't rerun this on GPT-4) got an F1 score of around 0.35, whereas fine-tuning a Hugging Face model, and not a particularly big one (we're working with long-form text, so we used Longformer, which is a long-context version of BERT), with about 500 labels got a meaningfully higher F1 score. So at least for that use case, GPT-3's performance wasn't strong. We haven't rerun this on GPT-4, but of course another issue is cost: it would be much too expensive to do inference with GPT-4 across our corpus of data. What could be really useful, though, is labeling. I was our data labeler; it was very tedious and took me a heck of a long time to label our data. If GPT-4 is meaningfully more performant than GPT-3 and is actually good enough, then while it is as expensive and slow as it is, labeling is probably what we would end up using it for.

Jason: I feel like that's interesting, and there's a whole interesting labeling thread there, but just to confirm: Longformer is transformer-based, so maybe it would fall in the LLM category, but fine-tuning and building it yourself on your own data set versus the general GPT-4, that was the A/B comparison you were doing?

Anthony: Yeah, exactly. And I don't think Longformer would qualify as large. It's probably a BERT-sized model, actually; I don't know how many parameters, but it's not large. We fine-tuned it on one T4 GPU, and it took about an hour, maybe less, to fine-tune on 500 labels, on not a very powerful GPU. So it really was not a very large model. But what we would do is take the model, and Longformer is a language model, trained to predict the next word, and fine-tune it to extract entities out of our specific type of text.
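For readers who want to see what that kind of fine-tune looks like, here is a compressed sketch of fine-tuning a Longformer checkpoint for token classification (entity extraction) with the Hugging Face Trainer. The inline toy example, label scheme, and hyperparameters are placeholders, not Sumble's actual setup:

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PER", "I-PER", "B-ORG"]        # hypothetical label scheme
checkpoint = "allenai/longformer-base-4096"      # long-context BERT-style model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# One toy example; in practice this would be the ~500 hand-labeled documents.
examples = [(
    ["Jeff", "Bezos", ",", "the", "CEO", "of", "Amazon", ",", "stood", "down", "last", "week", "."],
    ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O", "O", "O", "O", "O", "O"],
)]

def encode(words, word_tags):
    # Align word-level tags to subword tokens; special tokens get the ignored label -100.
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    enc["labels"] = [
        -100 if wid is None else labels.index(word_tags[wid]) for wid in enc.word_ids()
    ]
    return dict(enc)

train_dataset = Dataset.from_list([encode(w, t) for w, t in examples])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ner-longformer",
        per_device_train_batch_size=2,   # long sequences, small batches on a single T4
        num_train_epochs=3,
        learning_rate=3e-5,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```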
Jason: Got it. One question as we're going through this. There was a question from the community we thought we'd answer, which is: how do you account for interpretability and explainability when evaluating LLMs? Any thoughts there? I'm happy to also share what we see.

Anthony: Yeah. Actually, you probably have more information, more data, here than we do. It is something I worry about. One place where LLMs are doing unbelievably well is summarization: taking a large chunk of text and doing what we call abstractive summarization. Not extractive summarization, where you're pulling a key sentence out, but abstractive, where you're synthesizing what is being read. Certainly the smaller BERT-sized models cannot get close to matching the performance we get from, call it, a GPT-4. That being said, what I really do worry about is: is it actually summarizing what is in the text, or is it drawing on knowledge it has from elsewhere, or hallucinating? Jason, you and I were talking a little before we got on about ways we might evaluate whether it really is a good summary, and one of the ideas you shared, which I really liked, was that you could use another LLM to evaluate the performance of the previous LLM. That's something we will definitely try in order to get more confidence that the summarizations are actually accurate. But I'd be curious to hear your answer to this.

Jason: Yeah, I think that's one area that's growing quite a bit: using one LLM to evaluate the response of a second one. OpenAI put something out called OpenAI Evals, which is a way to try to standardize this kind of evaluation approach for LLMs; really what they're standardizing is the prompts you use to do that evaluation. So there's that one angle, which is using an LLM to help you understand your LLM's responses. There's one other approach we're seeing, and it's very different from traditional explainability. Traditional structured explainability works by fiddling with or flipping inputs and seeing how the outputs change, and for LLMs that just isn't what you do. What we've been seeing is actually extracting latent structure from those responses. We have some examples we've done with large language models. Databricks just trained one called Dolly (kind of confusing, it's not DALL-E), and you can extract latent structure and embeddings either by sentence or by paragraph of those responses. You can follow the latent structure through, you can compare it to baselines, and with some of the stuff we'll be putting out, you can actually follow the conversation through latent space. What are the subjects? How does the conversation progress? A lot of the way these models learn, or understand things, is in that latent space, and building tools to help understand that latent space is, I think, a powerful way of understanding how their thought processes progress, what the different ideas are, and how they move between ideas. I think there will be a lot more there; it's early stages.

Anthony: It's certainly one of the big open areas, right?

Jason: A huge, huge open area.
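Here is a minimal sketch of the "use one LLM to evaluate another" idea discussed above, applied to summary faithfulness: ask a judge model to grade whether a summary is supported by its source text. The prompt wording, model name, and grading scale are illustrative and are not the OpenAI Evals templates:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_summary(source: str, summary: str, model: str = "gpt-4o") -> str:
    # Ask a second model to grade the first model's summary for faithfulness.
    prompt = (
        "You are evaluating a summary for faithfulness to its source.\n\n"
        f"Source text:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer 'faithful' or 'unfaithful', then give a one-sentence reason. "
        "Mark it unfaithful if it adds facts not supported by the source."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the grading as deterministic as possible
    )
    return response.choices[0].message.content

print(judge_summary(
    "The meeting was moved from Tuesday to Thursday due to a scheduling conflict.",
    "The meeting was cancelled.",
))
```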
Jason: To keep going on some of the other threads: on Sumble, I think that was really interesting. One of the points I wanted to highlight, which you touched on and which is a big debate across a lot of teams, is: do I use GPT-3 or GPT-4, or do I fine-tune or train my own internal model? In your case you landed slightly on the other side: you have your own data, you're getting better results with a very simple, smaller model, still transformer-based, still a language model, and training it could beat the general model.

Anthony: That was against GPT-3, and my strong suspicion is that if we reran it on GPT-4, GPT-4 probably matches or outperforms us. That wasn't even 3.5, it was 3. But we would still need to train or fine-tune our own model, because GPT-4 is too expensive and too slow. So the next step for us, assuming GPT-4 performs as I expect and produces an F1 score better than what we get from our human raters or our own fine-tuning, is that we would use it for labeling, and then we would train a model to match those labels. And eventually, I could totally imagine that if GPT-5 were fast enough and cheap enough, we wouldn't bother training models at all. It's hard to predict the arc of progress. There's certainly a direction a lot of people are working on, which is exciting, around distillation. You take a large language model and ask it consistently the same question, like "help extract the names of people out of this text," as an example. You don't really need the full power of a large language model to do that fairly repetitive task, but it's nice not to have to train a model from scratch. So an area a lot of people are working on is: okay, send the LLM 50 examples of how to extract a person out of your specific type of text, and then distill the large language model down to just the portions of it that are relevant to your task. I think that's an exciting area of investigation, because hopefully you get the performance of an LLM and the simplicity of avoiding training a model yourself, but with the cost and inference speed of a smaller model. That's a direction I'm keeping an eye on, and it could give us the best of both worlds.
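The labeling workflow Anthony sketches, using an expensive model to produce labels and then training a cheap model to match them, might look something like the following. The prompt, model name, and output format are illustrative, and a production version would parse and validate the responses more defensively:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_label(text: str, model: str = "gpt-4o") -> list[dict]:
    # Ask the large model to act as a labeler for the entity-extraction task.
    prompt = (
        "Extract every person and organization mentioned in the text below. "
        'Reply with only a JSON list of objects like {"text": "...", "type": "PERSON" or "ORG"}.\n\n'
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

corpus = ["Jeff Bezos, the CEO of Amazon, stood down last week."]
weak_labels = [{"text": doc, "entities": llm_label(doc)} for doc in corpus]

# `weak_labels` then becomes the training set for a smaller, cheaper model
# (for example, the Longformer fine-tune sketched earlier) that can run over
# the whole corpus at a fraction of the cost.
print(weak_labels)
```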
Jason: I thought I'd do one hot take at the end here. You probably have 60 seconds to answer, and it's okay to punt on this too. Kaggle was acquired by Google, and you've been watching Google for a while. I'm kind of surprised that OpenAI feels like they're ahead; I mean, Google should be ahead. What's your take on OpenAI versus Google? There's an amazing team at Google, but why does it feel like OpenAI has leaped ahead, and how is that possible? You have 60 seconds.

Anthony: Google and OpenAI are both chock-full of incredible talent. One challenge that Google and Microsoft and other large companies have is that they have legacy businesses to protect. I also think a unit of time an engineer at Google spends doing something is less efficient than a unit of time an equivalently talented engineer at a place like OpenAI spends. So I think Google has a strategy tax, and the bureaucracy of a large company slows them down. Given equivalent talent, it's not surprising to me.

Jason: I'm still surprised; Google's amazing, and it'll be fun to watch what comes out of that team soon. Well, I think we're probably coming up on time here. Thank you all for joining us, and thank you, Anthony, for joining us here. Awesome conversation. Enjoy the rest of Arize:Observe, and I appreciate the time.

Anthony: Thanks for having me.
