As Machine Learning (ML) teams within Etsy develop increasingly complex models to help connect buyers and sellers, the underlying platform needs to adapt and grow accordingly. From investing in observability within distributed systems to new tooling for ML practitioners, Kyle Gallatin and Rob Miles talk about the significant changes their team made to the platform and strategy in response to the growing investment in deep learning. In this presentation, Gallatin and Miles discuss how they evolved their ML platform to handle deep learning models for search ranking at scale.
Rob Miles: We're here to talk about how we improved support for deep learning models in Etsy's ML platform. So, quick introductions first. My name is Rob Miles, I'm a senior engineering manager at Etsy, managing the ML platform team. Kyle, want to do a quick intro?
Kyle Gallatin: Hey y'all. I'm Kyle Gallatin. I'm a machine learning engineer here at Etsy, also on the machine learning platform team, on Rob's team, focusing specifically on ML model serving. I'm excited to talk about this today.
Rob: Yeah, so the model serving team is one of many teams within a larger organization called ML enablement within Etsy. The ML enablement team has as its customers any ML practitioners at Etsy, who are generally applied scientists and ML engineers working in different product orgs. The uses for ML at Etsy are search, recommendations, advertising, and trust and safety. In this case we're gonna be talking mainly about the ranking use cases, which cover search, recommendations, and advertising primarily. Those are the biggest users of ML at Etsy.
Okay, let's get started. So, this wonderful graph, which will probably not be winning any presentation prizes: what we're trying to convey here is just how explosive the growth of deep learning has been at Etsy. If we go back a couple of years, to the beginning of 2021, there was maybe a handful, a couple of TensorFlow models that people had started to play around with. Now the vast majority of models are being developed on deep learning frameworks.
What we're gonna cover in this short talk is some of the challenges that this threw up for the ML platform team during this two-year period as we dealt with this rapid change. As more teams started launching deep learning models, we discovered that tuning these workloads for latency and cost became much more difficult.
There were several reasons for this. There was a learning curve for the modeling teams themselves in terms of the TensorFlow framework, with things like the TensorFlow Transform layer. It's very easy to write badly performing code in that layer. Maybe your model's performance seems fine from an evaluation perspective, but when you come to look at its latency, you realize that it's not meeting its SLOs, or that it's very expensive to serve.
We on the ML platform team were also very surprised at the extent to which these heavier models started to stretch our knowledge of the underlying compute platform, which for the model serving system is Kubernetes. We really needed to get to grips with a lot of the fundamentals of how Kubernetes was working in order to scale these models, and also to figure out how to use tools like TensorBoard to dig deeper and discover the cause of some of the expensive computation in the models.
The other problem was that while we had needed to tune some models for latency before, it had never been that big a deal. So the teams had become used to testing their models and seeing what the real-world latency looked like very late in the process.
That became more of a problem when the tuning period was taking days or weeks for us to figure out the cause, and that ended up with product and experimentation timelines being pushed back. Just noting that if you check out Etsy's Code as Craft blog, you'll find the blog post that this talk is based on.
You'll also find one, referenced on the right-hand side of this slide, called Deep learning for search ranking. That covers the modeler's perspective on releasing deep learning models at Etsy, so that's away from the infrastructure side. Okay. So, ranking in real time can be difficult and resource intensive.
So specifically at Etsy, just to talk through what's going on here: when you type a search term into Etsy's search box, the first thing we do is go off and do a simple text-based search in Solr, or other retrieval systems that we're running, which gives us some number of candidate listings, usually around a thousand.
Then for each of those listings, we need to get hundreds of features together. We batch those up, and they get sent in batches of between five and 25 listings per batch to the scoring model. Once all the batches have come back, we can rank the listings and return the first page of search results back to the customer.
So you can see those numbers multiply quite quickly. For one search request coming in, it's a thousand listings divided by between five and 25, and that's the number of inference requests coming into the backend system.
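To make that multiplication concrete, here is a minimal sketch of the arithmetic. It assumes exactly 1,000 candidates per search and one fixed batch size per search; the function name is illustrative and not part of Etsy's system.

```python
import math


def inference_requests_per_search(candidates: int, batch_size: int) -> int:
    """Number of batched inference requests needed to score every candidate."""
    return math.ceil(candidates / batch_size)


CANDIDATES = 1000  # "usually around a thousand" candidate listings per search

for batch_size in (5, 25):
    n = inference_requests_per_search(CANDIDATES, batch_size)
    print(f"batch size {batch_size:>2}: {n} inference requests per search")
```

Under these assumptions a single search fans out into somewhere between 40 and 200 inference requests, before multiplying by search traffic.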
So to recap, these were some of the issues that we were facing launching new deep learning models. The feature transformations in things like TensorFlow Transform were adding additional model latency and leading to big spikes in the cost of serving these models as well. We discovered that deep learning models have different ideal infrastructure workload settings from the previous models, which were mainly regression or tree-based models, and required a lot more tuning. And we found that the existing feedback loops came too late in the process, and that was causing delays in experimentation and in the release of value to Etsy's customers.
Kyle: Awesome. I'll take it from here, thanks so much, Rob. So after we saw all of those issues, not only with latency itself but with troubleshooting latency, we chose to invest in a tool that we named Caliper for early latency feedback. For some historical context, we've long had ways of latency testing our systems at Etsy. We do have a tool that allows us to test all of our search latency end to end, from the candidate fetching that Rob showed through to the models themselves. But what we lacked was a way to get quick, iterative feedback specific to machine learning. Our old load testing approach required us to make a lot of different commits in a lot of different places: update configs in production search, add features to production, all sorts of stuff. What we wanted was a way to just test the model as an individual unit. So we created a tool called Caliper, with a UI or a CLI, that just takes in a trained model and does a bunch of automatic load testing for us using historical data against our production ML system. It creates a deployment, all of that stuff. And then we get a ton of different outputs that are specific to the machine learning domain. We get stuff back from TensorBoard for TensorFlow models, we get fine-grained latency and latency buckets, we get error rates, and, very importantly, we get a cost estimate of how costly it's going to be to serve that model.
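Caliper's actual report format is not shown in the talk. The sketch below only illustrates what summarizing raw load-test results into latency buckets and an error rate might look like; the bucket edges, function names, and input shape are all assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds (not Caliper's real ones).
BUCKET_EDGES_MS = [10, 25, 50, 100, 250]


def bucket_label(latency_ms):
    """Return the label of the first bucket whose upper bound covers the latency."""
    i = bisect_left(BUCKET_EDGES_MS, latency_ms)
    if i == len(BUCKET_EDGES_MS):
        return f">{BUCKET_EDGES_MS[-1]}ms"
    return f"<={BUCKET_EDGES_MS[i]}ms"


def summarize(results):
    """Summarize (latency_ms, ok) pairs into bucket counts and an error rate.

    Failed requests count toward the error rate but not the latency buckets.
    """
    buckets = Counter(bucket_label(lat) for lat, ok in results if ok)
    errors = sum(1 for _, ok in results if not ok)
    return {"buckets": dict(buckets), "error_rate": errors / len(results)}


# Made-up results: four successful requests and one failure.
sample = [(8.0, True), (10.0, True), (30.0, True), (400.0, True), (0.0, False)]
print(summarize(sample))
```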
So here is a short demo video of the CLI for Caliper. I'm gonna play this and expand it on the screen and hope that it is visible to everyone. It's also annotated. So right now, this is the Caliper tool. Nothing crazy, just running it with a Docker command. We provide a number of different parameters here, including the name of the model we wanna deploy, a bunch of TensorFlow-specific stuff, and also the duration of the load test, along with some proto stuff for TensorFlow. What this is going to do is spin up a load test against a model in our cluster.
In a minute…
Cool. So you can see under the hood it's using ghz, which is a CLI tool for running gRPC-based load tests. It's a very easy thing to build around, and we add a lot of our own machine learning specific stuff into it. On the server side, our deployments not only host the model in a Kubernetes cluster, but also allow for optionally spinning up TensorBoard in the same container. Once the load test finishes, scrapes all the information from TensorBoard, and gets the latency, we get a nice HTML output that includes not only the latency distribution and the errors and all that stuff but also TensorBoard-specific information. We can then export this information as JSON or CSV to do analysis with it, compare it against other models, and all sorts of stuff.
Very useful tool to have. Just like all agile things, the most important thing is failing fast and failing early, which is one of the big wins behind using this tool.
The other thing that we needed to troubleshoot was actually solving the problem itself. So for anyone who has worked in a distributed system, uh, identifying a problem can be as much as, I mean, 99% of the battle for me, honestly. Um, so here is a simple rendition of a distributed system.
In our distributed system, we're fetching features for inference, going to our orchestration layer to fetch candidates, and sending a gRPC request (remember those 1000 candidates, 300 features each, in batches of five to 25) to our Kubernetes cluster, which then has a number of network hops towards the actual TensorFlow model itself.
And this–which should have been an animation–is just a kind of question mark around: what is the latency of this step? And what is the latency of this step? And this step, and this step and this step. And the critical thing here is being able to, um, identify the specific points in your large distributed system that are having the largest effect on, on latency.
And what I'm really talking about here is of course, observability. So the big thing that we did was invest in a bunch of different observability tooling for our system. We added distributed tracing for system-wide observability, which was great cuz then we could actually get specific network hop latency.
We added new metrics: we moved from the NGINX ingress controller to Contour and Envoy proxy, which gave us a bunch of new gRPC metrics and allowed us to create new dashboards that were really, really helpful for debugging. And then of course, with tooling like Caliper, we added new ways to test our latency directly against the model itself and get an isolated, machine learning specific view of, say, computationally expensive TensorFlow operations in production, the gRPC request itself, all that sort of stuff.
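As a rough illustration of why per-step timing matters, here is a minimal in-process sketch. It is not real distributed tracing, which propagates trace context across services; it only shows the idea of attaching a latency to each named step so the slowest one stands out. The step names and sleep durations are placeholders.

```python
import time
from contextlib import contextmanager

spans = {}  # step name -> elapsed seconds


@contextmanager
def span(name):
    """Record how long the wrapped block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] = time.perf_counter() - start


def handle_request():
    # The sleeps are placeholders standing in for real work at each hop.
    with span("fetch_features"):
        time.sleep(0.001)
    with span("network_hop"):
        time.sleep(0.001)
    with span("model_inference"):
        time.sleep(0.05)


handle_request()
slowest = max(spans, key=spans.get)
for name, seconds in spans.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print("slowest step:", slowest)
```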
However, even with all of that, we still haven't actually talked about troubleshooting the latency. Once we had the means to identify latency, we were still seeing super, super high latency with gRPC requests. For context: on the left we have the network request size, a little Grafana panel from one of our old models, which used REST and JSON payloads. This is with boosted trees. On the right side, we have the average request size for our deep learning model, which uses gRPC and protobufs.
You can see on the REST side, we have an average of probably four or five megabytes per request. Keep in mind, we're sending thousands and thousands of requests per second, so it's quite a lot of data going into the system. But even with the much smaller request size on the gRPC side, we saw higher tail latencies (this is P99), and that was really, really interesting to us.
So the thing that we did was try to make that payload even smaller. We worked with our partners within ML enablement and storage orchestration to use payload compression for gRPC. Reducing the payload size by 25% actually resulted in a P99 latency drop of 50 milliseconds for these models, which is a huge win for users of search and for the infrastructure that we have to spin up in order to support low-latency operations like search.
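The talk does not say which codec or settings were used, so the following is only a sketch of the general idea: lossless compression of a feature payload before it goes over the wire. It uses Python's standard gzip module on a synthetic payload rather than gRPC itself (in gRPC, compression is normally switched on through the client or server library rather than done by hand). The payload shape and sparsity are assumptions.

```python
import gzip
import random
import struct

random.seed(0)  # fixed seed so the synthetic payload is reproducible

# Hypothetical batch: 25 listings x 300 float features, with many zeros,
# packed as raw 32-bit floats as a stand-in for a serialized request.
values = [
    0.0 if random.random() < 0.7 else random.random()
    for _ in range(25 * 300)
]
raw = struct.pack(f"{len(values)}f", *values)

compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# Compression is lossless: the receiver recovers the exact bytes.
assert gzip.decompress(compressed) == raw
```

How much a real payload shrinks depends entirely on its contents, so the ratio printed here says nothing about the 25% figure from the talk.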
In summary, test latency and cost as early as possible. This little graph here shows all the steps you previously had to do in order to actually run a load test against our system, which was quite a lot, especially for ML practitioners.
And there's just the importance of getting that latency and cost signal as early as possible in the model development cycle. If you find out late that you need to do feature selection or prune expensive TensorFlow transforms, all of that could lead to experiment launch delays and rushed troubleshooting.
And observability for the absolute win in distributed systems.
Investing in granular observability like distributed tracing was huge for troubleshooting difficult latency issues, especially weird tail latency issues at the request level where we're just looking at, you know, a subset of requests that have very, very high latency and we don't kind of get that granularity with the average metrics that are bucketed from Prometheus and Grafana.
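A small numeric sketch shows why averaged metrics can hide this kind of problem. The latency sample is made up: 98 fast requests and 2 very slow ones. The percentile helper uses the simple nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20.0] * 98 + [900.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.1f} ms  p99={p99_ms:.1f} ms")
```

In this sample the mean is 37.6 ms while the P99 is 900 ms, so a dashboard built on averages would look healthy even though some requests are very slow.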
And we'll end it there. Um, so to close out, thank you so much. Um, we are hiring, so check out Rob's post here on LinkedIn. It's really, really fun, we're the best team ever. If you see the post it's awesome. Anything else to say, Rob?
Rob: No, I think, um, we can take any questions that anyone might have.
There was a question in the chat, which was: which tool do we use for load testing? As I replied in the chat, we used ghz for load testing gRPC, and that's what we have integrated into that tool, Caliper. We have also used hey for load testing REST services. I think, Kyle, is that right?
Kyle: Yes. We have, and Locust in Python.
Rob: So we have a question from the community. Um, can you discuss some of the challenges faced, especially around stakeholder buy-in when incorporating deep learning into Etsy's ML platform?
I'll start answering this and let Kyle chime in as well. I think this was really stakeholder driven. From our perspective, this was the modelers at Etsy wanting to use deep learning frameworks and so asking for this support. We had built some early support for TensorFlow before TensorFlow really started to take off. We built some of that support into a legacy ML framework that existed. And then as the teams started to get interested in TensorFlow, there was a push to have full support for TensorFlow Serving and first-class support for TensorFlow in the platform.
Um, and so I don't think we really faced any particular challenges around stakeholder buy-in. That's my sense. Um, any, any comments, Kyle?
Kyle: Yeah, I don't have anything additional to add there. It really was, you know, like us trying to keep up with their demands as stakeholders use more and more deep learning, our systems need to adapt accordingly.
Rob: Another question here. What steps has Etsy taken to make the deep learning components of its ML platform user-friendly for data scientists and other non-eng members?
Yeah, this is a great question. Honestly, it's a challenge that probably faces every ML platform team, right? The pace of evolution in ML is always outpacing the amount of support you can build for these things, and the support that you build is often not very user friendly at first. Developer experience is something that we've invested quite a lot in over the past year or so. Having got the initial support out for deep learning models, we then started to look at what frustrates or slows down our customers across the whole value stream of delivering ML at Etsy. Some of that can be solved locally within the model serving team: we've been making improvements to the general usability of our model serving platform and trying to abstract away some of the complexity that we were previously exposing to customers. But there have also been larger initiatives across ML enablement at Etsy to look at that whole user journey and build tooling and user interfaces that make this more accessible to people who are not software engineers by trade and who perhaps aren't so comfortable working in CLIs.
So actually the tool that Kyle was showing there, Caliper, was originally CLI-driven and still has a CLI, but it is now also exposed through a user interface. We have a system called Model Hub, which is in development, that allows practitioners to interact primarily through web-based interfaces to use some of these tools.
Kyle: Yeah. And just to add on to that, we also released a recent post about the interface for our modeler platform, which I will share in the chat. That talks a lot about the evolution of our platform over time, specific to model serving, and about how our ML practitioners interface with, essentially, model deployments within our platform.
And there was one more question up top: will you use a ChatGPT plugin for search? What do you think, Rob?
Rob: (Laughs) I mean, I think LLMs are a hot topic, right? Probably most companies are looking into how those can be used. I don't have anything concrete I can share right now, but how these technologies can be leveraged for things like product search is something that everybody's looking into.
Kyle: What an amazingly safe answer. Thank you, Rob.
(Another question) Can you please expand on the different types of testing that you do on your pipelines, models, et cetera? Yeah, I think it comes down to what you would call the umbrella of MLOps. We have, of course, software testing in the form of unit tests and integration tests, not only on ML-specific things like our modeling code, but on our products themselves. Then we are trying to integrate more around data tests and ML-specific tests that check the inputs to our models and the outputs of our models, making sure that those things are valid. And of course we have a ton of things around latency and other stuff like that, as we just shared here.
But I guess the core of this rambling is just that ML has that additional aspect of a model artifact and data, which are additional things you need to test in addition to the software that you have powering them. Thoughts, Rob?
Rob: Yeah, just to add, the big area of investment has been in observability across both infrastructure and the models themselves. That's another aspect of what gives you confidence around the overall delivery of quality: whether you can detect issues early and resolve them quickly. Observability is a huge part of that.
Thanks so much everyone.
Kyle: Yeah. Thank you so much.