To paraphrase a common bit of wisdom, if a machine learning model runs in production and no one is complaining, does it mean the model is perfect? The unfortunate truth is that today production models are usually left alone unless they lead to negative business impacts.
Let’s look at an example of what may happen today:
As a machine learning engineer (MLE) for a fintech company, you maintain a fraud-detection model. It has been in production for a week, and you are enjoying your morning coffee when a product manager (PM) urgently complains that the customer support team has seen a significant increase in calls complaining about fraudulent transactions.
Gulp. Is it your model? Software engineers tell you that the problem is not on their end.
This is costing the company a fortune in chargebacks. It is losing tens of thousands of dollars every hour, and you have to fix it now.
You write a custom query to pull the last million predictions your model made over the past three days from the logs. The query takes some time to run. You export the data, do some minimal preprocessing, import it into a Jupyter notebook, and eventually start calculating relevant metrics for the sample you pulled.
There doesn’t seem to be a problem in the overall data. Your PM and customers are still complaining, but all you see is maybe a tiny increase in fraudulent activity.
More metrics, more analysis, more conversations with others. You slice your data this way and that and eventually see something odd. When you slice by geography, California seems to be performing somewhat worse than it did a few days ago. You filter to California and realize some of the merchant IDs belong to scam merchants that your model did not pick up. You retrain your model and save the day.
<mark>This example helps us see what it takes to troubleshoot a machine learning model today. It is many times more complex than troubleshooting traditional software. If you look at it closely, it does seem like we are shipping AI blind.</mark>
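The slicing step in the story above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow of any particular tool: the record fields (`state`, `predicted_fraud`, `actual_fraud`), the helper names, and every number in the data are made up for the example. It shows how an overall metric can look tolerable while one slice is doing much worse.

```python
from collections import defaultdict


def make(state, predicted, actual, n):
    """Build n identical hypothetical log records (illustrative data only)."""
    return [
        {"state": state, "predicted_fraud": predicted, "actual_fraud": actual}
        for _ in range(n)
    ]


# Made-up prediction log joined with what actually happened.
records = (
    make("CA", True, True, 1) + make("CA", False, True, 3) + make("CA", False, False, 6)
    + make("NY", True, True, 3) + make("NY", False, False, 7)
    + make("TX", True, True, 3) + make("TX", False, False, 7)
)


def fraud_recall(rows):
    """Share of actual fraud cases the model flagged; None if there was no fraud."""
    fraud_rows = [r for r in rows if r["actual_fraud"]]
    if not fraud_rows:
        return None
    caught = sum(1 for r in fraud_rows if r["predicted_fraud"])
    return caught / len(fraud_rows)


def recall_by_slice(rows, key):
    """Group records by one field and compute fraud recall for each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {value: fraud_recall(group) for value, group in groups.items()}


overall = fraud_recall(records)
by_state = recall_by_slice(records, "state")

print(f"overall recall: {overall:.2f}")
for state, recall in sorted(by_state.items()):
    print(f"{state}: {recall:.2f}")
```

In this toy data the overall recall hides the fact that the California slice misses most of its fraud, which is the kind of gap that only shows up once you group by the right dimension.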
Now let’s talk about where we want to go, starting with fundamentals.
First, let’s make sure we have a definition of what monitoring is: Monitoring, at the most basic level, is data about how your systems are performing; it requires that data are made storable, accessible, and displayable in some reasonable way.
So what does monitoring mean to the world of machine learning models?
A model has to make some prediction. This can be predicting the ETA of when the ride is going to arrive in a ride-sharing app. It can also be what loan amount to give a certain person. A model can predict if it will rain on Thursday. At a fundamental level, this is what machine learning systems do: they use data to make a prediction.
Since what you want is to predict the real world, and you want that prediction to be accurate, it is also useful to look at actuals. An actual is the right answer—it is what actually happened in the real world. Your ride arrived in five minutes, or it did rain on Thursday. Without comparison to the actuals, it is very difficult to quantify how the model is performing until your customers complain.
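To make the comparison concrete, here is a small sketch using the ride-sharing ETA example. It assumes predictions and actuals are logged under a shared ID so they can be matched later; the IDs, values, and function names are hypothetical. Mean absolute error is just one possible metric, chosen here because it is easy to read in minutes.

```python
# Hypothetical logs keyed by a shared prediction ID (values in minutes).
predicted_eta_min = {"ride-1": 5.0, "ride-2": 8.0, "ride-3": 12.0}
# Actuals arrive later; ride-3 has not finished yet, so it has no actual.
actual_eta_min = {"ride-1": 5.0, "ride-2": 11.0}


def mean_absolute_error(predictions, actuals):
    """Average absolute gap between prediction and actual, over matched IDs only."""
    matched = [pid for pid in predictions if pid in actuals]
    if not matched:
        return None  # nothing to compare against yet
    total = sum(abs(predictions[pid] - actuals[pid]) for pid in matched)
    return total / len(matched)


def actuals_coverage(predictions, actuals):
    """Fraction of predictions that already have a matching actual."""
    if not predictions:
        return None
    matched = sum(1 for pid in predictions if pid in actuals)
    return matched / len(predictions)


mae = mean_absolute_error(predicted_eta_min, actual_eta_min)
coverage = actuals_coverage(predicted_eta_min, actual_eta_min)

print(f"MAE over matched rides: {mae} minutes")
print(f"share of predictions with actuals: {coverage:.2f}")
```

Tracking coverage alongside the error matters because a metric computed over only a handful of matched predictions can be misleading.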