Ilya Reznik has seen a lot in his career. With a background in physics and a stint at the Occupational Safety and Health Administration (OSHA), Reznik brings an interesting perspective to industry changes in the wake of the COVID-19 pandemic. “The world can change in an instant, and models are not static,” he reminds us.
We caught up with Reznik during a well-earned vacation between his old role as a Senior Data Science Engineer at Adobe — where he spent nearly six years — and a new role at Twitter.
Arize: You are about to start a new role as a Senior Machine Learning Engineer at Twitter. Why are you excited to make the move?
Ilya: Twitter Cortex is a pretty well-known organization in the industry. They’ve been doing ML engineering since before it was cool, in part because so much of their business is built around ML recommendations and other ML applications. As a result, they’ve developed a strong culture of engineering and of using tools to commercialize ML.
I’m going to be on the model development and evaluation team. The evaluation part is very interesting to me because there’s a lot that still needs to be automated and there needs to be better tooling around it. I’m excited to work at Twitter to help continue to scale their ML infrastructure.
Arize: From your background in physics to OSHA and later Adobe, you’ve had an interesting career journey up until this point. Can you walk us through your background and why you became an ML engineer?
Ilya: I went into physics because I love solving problems. In physics, I joke that about 90% of the time you hate doing physics — but when you get to the end and your answer matches the real world observation, there’s no better feeling.
Machine learning is similar. A lot of the work in ML is tedious when you’re in the weeds, but when you get to the final prediction and can say, “wow, this algorithm really performs on par with what I expected,” it’s a really good feeling and hard to beat.
I’ve always been really passionate about commercializing technology. And since I was a physics and chemistry person, I first went into semiconductors because that’s where there was a lot of progress being made.
After that, I took a position at OSHA, which has a tremendous mission of protecting U.S. workers. I was there at an exciting time. We were passing some important new regulations, and I got hired as a software engineer. I didn’t get hired to do ML. A few years into the job, Labor Secretary Thomas E. Perez became really interested in data science. I got involved and started looking into what data science could do at OSHA.
I soon got the opportunity to work with Adobe Analytics Cloud, and the big attraction there was really the scale. The biggest data set I worked with at OSHA was about 80 gigabytes. The biggest data set I worked with at Adobe was over 15 petabytes. So it’s just a very different scale of things. And so while I was at Adobe, my first charge was to take stuff from research and put it into production — a typical ML engineer function.
We were a lean team at Adobe Analytics, especially compared to the broader Adobe research team. That’s where my project with Adobe Labs came in; the idea was really: how can we start automating some of these things? That led me in the direction of MLOps and ML infrastructure.
Arize: How do you view the ML engineer role in general? Do you think it’s a good time to get into the industry?
Ilya: It’s a good time to get into the industry; it’s also a hard time to get into the industry. People often ask me about my experience and how they can replicate it. It’s different now because when I went into data science, there were fewer people in the field. It was easier to get into a junior role because most people just weren’t going into the field yet. Lately, however, there has been a lot of really relevant, high-quality college education, and there are a lot of really good boot camps. And so the competition has gotten pretty fierce.
But don’t let that discourage you. I think it’s a great time to be an ML engineer, but getting that initial foothold is more difficult than it was a few years ago.
I really see the ML engineer role going the way of MLOps more and more. Andrew Ng talks a lot about the data-centric approach, and I really think there is a huge benefit to it. For example, vision is a solved problem in the sense that convolutional neural nets are basically the solution, and so now the question is what data you use to train things, how you train them, and what classes you look at.
You don’t need a PhD in math or physics in order to do that; you just have to understand the data really well. And I think that will open the door for people to cross-specialize. For example, doctors can now get in and feed in the right x-ray images as opposed to trying to figure out which algorithm to use.
I’m excited about this. I’m excited about the tooling, and about machine learning engineers whose job is going to be to measure model drift in production and understand how things change as real life changes.
One of the skills that’s going to be more popular for machine learning engineers is subject matter expertise. I think up to this point, we’ve been mostly on the math and software side, but even on the software side (and I’ve seen this at Adobe and across the industry) there is a move toward companies using ML on their own data and leveraging their expertise in that context.
So, what are the anomalies in the way that their servers operate? What are the anomalies in the way their own infrastructure works? Analyzing JIRA tickets, things like that. We’re on the cusp of realizing some of the potential of ML that we’ve been talking about for a decade now.
Arize: What type of models were you running before your current role?
Ilya: At Adobe, my team owned a lot of time series forecasters and anomaly detection, and we were getting into causality over the past few years. We also ran some recommender systems and did some ML on our internal data, where we used natural language processing (NLP) to analyze JIRA tickets.
Arize: We’d love to hear about the Adobe Analytics Labs project that you wrote about recently. Can you walk us through how and why you’re letting users in on the model development process early?
Ilya: The Adobe Analytics Lab is still in its infancy, but there will likely be a lot of interesting innovations coming from it.
About three and a half years ago, we noticed a persistent problem: it takes too long to test things. In ML, part of the problem is that you have to iterate through things. Especially in the recommender space, I can’t iterate without the user. If I were to show someone else my Netflix queue, for example, they wouldn’t know how good the recommendations are without me telling them, such as whether I like Squid Game. What seems like a reasonable recommendation may not actually be one to the end user.
And so the question was how can we now start putting recommendations in front of the end user in order to understand how well our model performs — and also know it’s adding value. A lot of ML suffers from obvious insights — when you first train a model, your insights are obvious and it’s not super interesting to anybody yet, and we wanted to skip that step.
John Bates, a product manager at Adobe Analytics, came up with the idea for this Lab. He said: “Let’s show early prototypes to our users and collect their feedback.” And the first thing he ran into from engineering was, “Absolutely no way; you’re going to crash the whole system.”
And so we needed to find a way to package a model separately from everything else that we were doing. At the time, several years ago, Docker was on the rise and Adobe was moving to Kubernetes (we were the first project at Adobe Analytics fully on Kubernetes), and so there was space to go there. That, combined with a lot of input from colleagues, led to an aha moment: “Oh, we can do it this way.”
The architecture is in the paper for those that want to see it, but the idea is that you can have a totally self-contained model with certain safeguards to prevent it from crashing in production or writing over data in production. Adobe takes customer data very seriously, so there are many guardrails there. The only thing we track with Labs is the usage of Labs, not what data went into it or came out.
Labs allows you to use a new feature that we’re working on in production the way that you normally would, and we can record which features you used, which parts of the model you used, how it performed, and what kind of compute it required. Then we can go to our leadership and say, for example, “20% of the users who use this think that it’s too slow, but everybody agrees that it’s an important feature and it would cost X amount of dollars,” because now we have real figures to extrapolate from.
So it kind of solves multiple problems. The users figure out that we are working on something and they’re delighted to see the latest and greatest. The product managers get to understand what the users actually want. And engineers get the scope because usually whenever you come up with a new ML model, the first question from leadership is: “How much is that going to cost?” And my first answer is: “No idea. This is how much it costs me in development, but I don’t know what it costs in production.” And so we’re able to answer those three questions a lot better now.
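The kind of extrapolation Reznik describes can be sketched in a few lines. This is a hypothetical example only: the field names, thresholds, and cost figures below are invented for illustration and are not Adobe’s actual Labs telemetry.

```python
# Hypothetical usage records of the kind a Labs-style prototype might emit.
# All names and numbers here are invented for illustration.
usage_log = [
    {"user": "a", "latency_ms": 900,  "compute_units": 4},
    {"user": "b", "latency_ms": 3200, "compute_units": 5},
    {"user": "c", "latency_ms": 450,  "compute_units": 3},
    {"user": "d", "latency_ms": 700,  "compute_units": 2},
    {"user": "e", "latency_ms": 1100, "compute_units": 4},
]

def summarize(log, slow_threshold_ms=2000, cost_per_unit=0.02, total_users=10_000):
    """Turn prototype telemetry into the two figures leadership asks for:
    what share of users saw it run slow, and what it might cost at scale."""
    users = {r["user"] for r in log}
    slow_users = {r["user"] for r in log if r["latency_ms"] > slow_threshold_ms}
    pct_slow = 100 * len(slow_users) / len(users)
    avg_units = sum(r["compute_units"] for r in log) / len(log)
    # Naive extrapolation: average compute per request times the full user base
    est_cost = avg_units * cost_per_unit * total_users
    return pct_slow, est_cost

pct_slow, est_cost = summarize(usage_log)
print(f"{pct_slow:.0f}% of users saw slow responses; est. cost ${est_cost:,.0f}")
```

The point is not the arithmetic but that real prototype telemetry replaces a guess when answering “how much will this cost in production?”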
Arize: Why do you think monitoring and observability are important in machine learning?
Ilya: Well, it’s important because we’re engineers, and engineers are supposed to build systems that work over time. There’s obviously the cost answer: cost can run away from you if you don’t monitor things, and any business cares about that a lot. There are also issues like racial bias in facial recognition, an issue that persisted for way too long and should have been fixed much earlier. As a machine learning engineer, you never want to see your company’s name make national headlines for the wrong reasons.
But I think there’s another reason here, and I’ve experienced this firsthand. My team owned an anomaly detection service at Adobe Analytics in 2020. In January and February of 2020, that service was working flawlessly. Then in March and April of 2020, we were challenged and put in 14-hour days fixing problems because the world changed overnight.
We were fortunate enough to have recognized this early and were able to address things to make it seamless for customers, but the world changes quickly and models are not a static thing. The pandemic was clearly very different for all ecommerce websites. And we saw that. Adobe Analytics saw traffic like we’ve never seen in the first month of the pandemic. Adobe is big, processing one in three clicks on the internet, but even we were just completely slammed in that first month.
I think this is what you get paid for as an ML engineer. Anybody can Google how to throw together a convolutional neural network and feed it the right data. But an ML engineer is the person who understands that the model breaks at some point, and that they need to be alerted not at the point where it broke but a little bit before, so they can do something about it. And so I think observability metrics and all of those things really help you keep your job and rise to the challenge. Arize has been doing great things in ML observability and advancing the field.
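The before-it-breaks alerting Reznik describes is often built on distribution-drift metrics. As a minimal sketch, here is a population stability index (PSI) comparison between a baseline window and live traffic; the synthetic data, bin count, and the common 0.2 alert threshold are illustrative choices, not a description of Adobe’s or Arize’s actual implementation.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a live sample of a model input or score against a baseline.

    A common rule of thumb flags PSI > 0.2 as significant drift;
    the threshold is a modeling choice, not a standard.
    """
    # Bin edges come from the baseline (e.g., pre-pandemic traffic)
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # Convert to proportions, with a small floor to avoid log(0)
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # e.g., January traffic
shifted = rng.normal(1.5, 1, 10_000)  # e.g., March traffic, world changed
print(population_stability_index(baseline, shifted))  # large PSI: time to alert
```

Wired to an alerting threshold, a check like this is what lets an engineer hear about drift a little before the model visibly breaks rather than after.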
Arize: Let’s end on an inspiring note. I see that you worked with a refugee center for three years, volunteering to teach machine learning and computer vision. Are any of those people working in the industry now?
Ilya: Well, hopefully not yet. The cohort I was working with were eighth and ninth graders, so they still have a few years. The idea there was not so much to inspire people to go into machine learning engineering (though I hope some of them do) but more to teach kids who are growing up in a world where ML drives so much to understand the basics of it. So we didn’t talk about things like “this is the nitty gritty of Python and how not to use a generator right here.” We talked more about the data-centric approach to machine learning and understanding what the algorithms can do and where they can fail, arriving at a deeper understanding of why representation matters.
We had an example — this wasn’t contrived — of facial recognition in the class. Most of the kids were Bhutanese, and there was one kid from Sudan. And the facial recognition algorithm that I threw on there, which was an Intel model, recognized all the Bhutanese kids really well but totally missed the guy from Sudan. And I asked them: “Why do you think that is?”
We talked about how if the guy from Sudan is not represented anywhere in the ML pipeline, then chances are we haven’t tested on faces like his. When you write an ML algorithm, especially for facial detection, the first face you look at is your own. And if your own face looks like mine, it will work really well on my face, but it probably won’t work on anybody else’s. So you have to be careful about that.
And so maybe some of them will get into the industry, but hopefully not for another five or six years as they continue to enjoy their childhoods.