What is ML Observability?
As more and more teams turn to machine learning to streamline their businesses or turn previously impractical technologies into reality, there has been rising interest in tools and people who can help bring a model from the research lab into the customer's hands. Google built TFX, Facebook built FBLearner, Uber built Michelangelo, and Airbnb built Bighead; these systems have allowed those teams to scale their MLOps.
Outside of these large tech companies, the truth is that building machine learning proofs of concept in the lab is drastically different from making models that work in the real world. Let's start by taking a quick look at some things that can go wrong when applying a model to a real-world problem.
What Can Go Wrong?
1. Training/Serving Skew
When deploying a model, there is a good chance that it will not perform as well as it did when you validated it offline. This handoff to production doesn't always go smoothly, and the resulting performance gap is commonly referred to as training/serving skew.
One potential culprit is that the data your model was trained on is statistically different from the data it sees in production. Another possibility is that the feature transformation code is not consistent between your training environment and your production environment. This is more common than one might think. Oftentimes, notebooks containing feature transformation code are passed around and changed without much version control, which can lead to confusion about exactly what transformations were used to create the model's features. If features are not created consistently between the training and production environments, your model's performance can take a big hit right out of the gate.
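To make the first culprit concrete, here is a minimal sketch of how you might check whether a single numeric feature's production distribution has drifted away from its training distribution. The use of a two-sample Kolmogorov-Smirnov test, the 0.05 significance threshold, and the synthetic data are illustrative assumptions, not any particular tool's API; in practice you would run a check like this per feature on a rolling window of production traffic.

```python
# Illustrative sketch: flag training/serving skew on one numeric feature
# by comparing its training distribution to recent production values.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_skew(train_values, prod_values, alpha=0.05):
    """Compare training vs. production values with a two-sample KS test.

    A small p-value suggests the production data is statistically
    different from the data the model was trained on.
    (alpha=0.05 is an assumed, illustrative threshold.)
    """
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"ks_statistic": statistic, "p_value": p_value, "skewed": p_value < alpha}

# Hypothetical example: training data centered at 0, production data drifted to 0.5.
rng = np.random.default_rng(seed=7)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.5, scale=1.0, size=1_000)

print(detect_feature_skew(train, prod))
# -> flags the feature as skewed, since the production mean has shifted
```

A check like this only catches statistical drift; the second culprit, inconsistent feature transformation code, is better addressed by versioning and sharing the same transformation logic between training and serving.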
What’s Next
ML observability achieved through the application of an evaluation store can help your team throughout the whole process of validating, monitoring, troubleshooting, and improving your models. Through introspection into your models’ performance over time, ML observability can help your teams identify gaps in training data, surface slices of examples where your model is underperforming, compare model performances side by side, validate models, and identify issues in production. Stop flying blind, and take your ML efforts to the next level.