Data is quickly becoming the lifeblood of our current technologies, enabling companies to build, measure, and improve new experiences for their customers. This is no longer limited to technologies on the cutting edge; collecting and using data is becoming increasingly common across many sectors of business.
Now with the rise of machine learning making new customer experiences possible, a renewed reliance on data is emerging. In this new context of ML-powered systems, building and maintaining high-quality data sources has never been more important. Today’s ML systems require copious amounts of data to perform well, and handling this volume of data is causing real problems in the companies that have adopted these technologies.
In practice today, a model is often only as good as the data it is trained on. Data quality doesn’t stop being important after the model is trained; it remains important once the model is deployed in production. The quality of the model’s predictions is highly dependent on the quality of the data sources powering the model’s features. In this piece, I’ll dive into why your team should be paying close attention to the quality of your data and its impact on your model’s end performance.
What do we Mean by Data Quality?
Data quality is a broad term and can cover a wide variety of issues in your data. To start, let’s define what we aren’t going to talk about. In this piece, we are not going to concern ourselves with “slow bleed” failures such as gradual drift in your data over time. If you are interested in learning about this extremely important topic, you can take a look at some of our earlier pieces where we cover it in more depth.
What does that leave? Well, this broadly leaves the concept of hard failures in your data pipelines. To dig a bit deeper, let’s break out the concept of a categorical data stream vs. a numerical data stream.
Categorical data is just what it sounds like: a stream of categories, like the type of pet someone owns (dog, cat, bird, pig, etc.).
To start, something that can go wrong with a categorical data stream is a sudden shift in the distribution of categories. To take it to an extreme, let’s say your hypothetical model predicting which pet food to buy for your pet supply store starts seeing data saying that people only own cats now. This might cause your model to only purchase cat food, and all your potential customers with dogs will have to go to the pet supply store down the street instead.
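To make this concrete, here is a minimal sketch of one way to flag a sudden shift in category frequencies. It compares a production window against historical data using total variation distance, which is one reasonable choice among several. The pet data, the function names, and the 0.3 alert threshold are all assumptions for illustration, and the helper expects a non-empty stream.

```python
from collections import Counter


def category_frequencies(values):
    """Return each category's share of the (non-empty) stream as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(baseline, current):
    """Half the summed absolute frequency differences: 0 = identical, 1 = disjoint."""
    categories = set(baseline) | set(current)
    return 0.5 * sum(
        abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in categories
    )


# Hypothetical data: historical pet ownership vs. a suspicious production window.
historical = ["dog"] * 45 + ["cat"] * 40 + ["bird"] * 10 + ["pig"] * 5
production = ["cat"] * 100

distance = total_variation_distance(
    category_frequencies(historical), category_frequencies(production)
)

ALERT_THRESHOLD = 0.3  # assumed value; tune it against your own history
shift_detected = distance > ALERT_THRESHOLD
```

In the "everyone owns cats now" scenario the distance is large, so the check fires long before the store runs out of dog food.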
Data Type Mismatch
In addition to a sudden cardinality shift in your categorical data, your data stream might start returning values that are not valid for the category. This is, quite simply, a bug in your data stream, and a violation of the contract you have set up between the data and the model. This could happen for a variety of reasons: your data source being unreliable, your data processing code going awry, an unexpected schema change, etc. At this point, whatever comes out of your model is undefined behavior, and you need to make sure to protect yourself against type mismatches like this in categorical data streams.
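One simple way to guard that contract is to check each incoming value against the set of categories the model was built for. The sketch below assumes a hypothetical `ALLOWED_PETS` set and treats any non-string or unknown value as a violation; a real system would decide separately whether a missing value counts as invalid.

```python
# Hypothetical contract between the data stream and the model.
ALLOWED_PETS = frozenset({"dog", "cat", "bird", "pig"})


def find_invalid_categories(values, allowed=ALLOWED_PETS):
    """Return (index, value) pairs for values outside the agreed category set."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if not isinstance(v, str) or v not in allowed
    ]


# A batch with a casing bug, a stray number, and a missing value.
batch = ["dog", "cat", "Dog", 7, "bird", None]
violations = find_invalid_categories(batch)
```

Returning the offending positions, rather than just a pass or fail flag, makes it easier to trace the bad values back to the stream that produced them.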
One incredibly common scenario that practitioners run into is the problem of missing data. With the rising number of data streams used to compute large feature vectors for modern ML models, the likelihood that some of these values will be missing is higher than ever. So what can you do about it?
One thing you certainly can do is throw your hands up in the air and discard the row in a training context, or throw an error in your application in a production context. While this will help you avoid this problem, it’s possibly not the most practical. If you have hundreds, thousands, or tens of thousands of data streams used to compute one feature vector for your model, the chance that one of these streams is missing can be very high!
This brings us next to how you might fill this missing value, commonly referred to as imputation. For categorical data, you could choose the most common category that you have historically seen in your data, or you could use the values that are present to predict what this missing value likely is.
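The first option, filling with the most common historical category, can be sketched in a few lines. This assumes missing values arrive as `None`; the function names and sample data are illustrative only.

```python
from collections import Counter


def most_common_category(history):
    """Return the most frequent non-missing category seen historically."""
    observed = [v for v in history if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    return Counter(observed).most_common(1)[0][0]


def impute_missing(values, fill_value):
    """Replace missing (None) entries with the chosen fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical history and an incoming batch with one missing value.
history = ["dog", "cat", "dog", None, "bird", "dog"]
fill = most_common_category(history)
imputed = impute_missing(["cat", None, "bird"], fill)
```

The second option, predicting the missing value from the values that are present, follows the same shape but replaces `most_common_category` with a small model of its own.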
A numerical data stream is also pretty self-explanatory. Numerical data is data that is represented by numbers, such as the amount of money in your bank account, or the temperature outside in Fahrenheit or Celsius.
Out of Range Violations
To start things off, something that can go wrong with numerical data streams is out of range violations. For example, if age was an input to the model and you are expecting the age to be between 0–120, but suddenly receive a value in the 300s, this would be considered out of range.
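A range check like the age example can be sketched as follows. The bounds come from the example above; the function name and sample values are assumptions. Written this way, a NaN also fails the comparison and gets flagged.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the inclusive range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical batch of ages; valid ages are expected to be between 0 and 120.
ages = [34, 0, 120, 312, -1]
age_violations = out_of_range(ages, low=0, high=120)
```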
Type mismatch can also affect numerical data. For a data stream where you are expecting a temperature reading, it’s entirely possible to be handed a categorical data point instead, and you have to handle this appropriately. The default behavior may be to cast this categorical value to a number that, although now valid, has entirely lost its semantic meaning and is now an error in your data that is incredibly hard to track down.
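To illustrate the casting hazard, the sketch below contrasts a silent cast with a strict parser that refuses anything that is not a finite number. `parse_temperature` is a hypothetical helper, not part of any particular library.

```python
import math

# A silent cast "succeeds" but produces a meaningless reading:
silently_cast = float(True)  # a flag has become a temperature of 1.0


def parse_temperature(raw):
    """Accept only finite real numbers; reject bools, strings, None, NaN, and inf."""
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        raise TypeError(f"expected a number, got {type(raw).__name__}: {raw!r}")
    if not math.isfinite(raw):
        raise ValueError(f"expected a finite number, got {raw!r}")
    return float(raw)


reading = parse_temperature(72)
```

Failing loudly at the boundary is usually cheaper than hunting for a nonsensical feature value after it has already influenced predictions.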
For numerical data, you have a few more options for imputation, such as taking the mean, median, or some other summary statistic of the values you have observed for this stream. The complexity of your solution to this problem is entirely up to your application scenario, but it’s important to know that no solution is perfect here.
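Here is a minimal sketch of mean and median imputation using only the standard library. It assumes missing values are represented as `None`, and the readings are made up to show how a single extreme value pulls the mean far more than the median.

```python
import statistics


def impute_numeric(values, strategy="median"):
    """Fill missing (None) entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical readings with one missing value and one extreme value.
readings = [10.0, None, 12.0, 100.0]
filled_with_median = impute_numeric(readings, "median")
filled_with_mean = impute_numeric(readings, "mean")
```

Neither result is "right"; which one hurts the model less depends on the feature and is something to measure rather than assume.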
Challenges with Monitoring Data Quality Today
Now that we have a better idea of the data quality issues you may run into, let’s briefly dive into some common challenges that practitioners face when attempting to keep tabs on the quality of their data.
Before we start here, it’s important to note that this is different from the broad product space of data observability. Data observability tools are mostly focused on monitoring the quality of tables and data warehouses, while ML Observability is focused on monitoring the inputs and outputs of models. These models are constantly evolving, with features being added and changed, so the data quality monitoring of models must be able to evolve with the schema of the model.
Too Much Data to Keep Tabs on
It won’t surprise many ML practitioners that models these days rely on huge numbers of features to perform their tasks. One rule of thumb, guided by recent advances in statistical learning theory, suggests that a model can effectively learn roughly one feature for every 100 examples in its training set. With training set sizes exploding into the hundreds of millions and even billions, models with feature vector lengths in the tens and hundreds of thousands are not uncommon.
This leads us to a major challenge that practitioners face today. To support these incredibly large feature vectors, teams have poured larger and larger data streams into feature generation. Writing code to monitor the quality of each of these data streams is fundamentally untenable, and the reality is that this data schema will inevitably change often as the team experiments to improve the model.
At the end of the day, no one wants to sit there and hand-configure thresholds and baselines, or set up a custom data monitoring system, for each of the data streams feeding into the model. It’s common to add a feature, drop a feature, or change how one is computed, and adding more work to the ML development loop will only slow you and your team down.
Now that we understand some of the current challenges around monitoring and fixing data quality issues, what can we do about it? To start, teams need to start keeping track of how the quality of their data affects the end performance of their model.
Leverage Historical Information
Ultimately the model’s performance is what we care about, and it’s very possible that the quality of some data streams matters more than that of others. To avoid manually creating baselines and thresholds for each data stream, teams need a history of data to look at, either from training sets or from historical production data.
Once these historical distributions have been determined, your monitoring system can have a better idea about what it should consider an outlier in a numerical stream, and generate alerts when a categorical stream has strongly deviated from its historical distribution. From these distributions, intelligent baselines and thresholds can be created to balance how “noisy” or likely to fire these alerts are, giving power to the model team to balance risk vs reward.
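As one illustration of deriving a baseline from history instead of configuring it by hand, the sketch below builds outlier bounds for a numerical stream from historical quartiles (Tukey-style fences). The multiplier `k` plays the role of the noisiness dial: a larger `k` means fewer alerts. The function names, the default `k=1.5`, and the sample ages are assumptions, not a prescribed method.

```python
import statistics


def iqr_bounds(history, k=1.5):
    """Derive (low, high) alert bounds from historical values via quartile fences."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def outliers(values, bounds):
    """Return the values that fall outside the (low, high) bounds."""
    low, high = bounds
    return [v for v in values if not (low <= v <= high)]


# Hypothetical history of ages seen in training data.
historical_ages = list(range(18, 80))
bounds = iqr_bounds(historical_ages)
flagged = outliers([25, 64, 300], bounds)
```

Because the bounds are computed from the data, the same two functions can be applied to every numerical stream without per-stream configuration.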
On top of setting up automatic alerting systems for all of your data streams, your data quality monitoring system should also allow you to enforce type checks to protect against downstream errors in your model and avoid potential typecasting issues.
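A type check of this kind can be as simple as comparing each incoming row against an expected schema. The sketch below uses a hypothetical `EXPECTED_SCHEMA` with made-up feature names and reports both missing features and wrong types; it deliberately rejects booleans, so it would need adjusting for a schema that expects them.

```python
# Hypothetical schema: feature name -> expected Python type.
EXPECTED_SCHEMA = {"age": int, "temperature_f": float, "pet": str}


def schema_violations(row, schema=EXPECTED_SCHEMA):
    """Return a human-readable problem for every missing or mistyped feature."""
    problems = []
    for name, expected in schema.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# A row where age arrived as a string and the pet feature is absent.
row_problems = schema_violations({"age": "42", "temperature_f": 71.5})
```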
Hill Climb using Model Performance
Lastly, by keeping track of your model’s end performance, your monitoring system should also allow you to test out different imputation methods for your data and give you the performance impact for this new imputation strategy. This provides confidence that the choices you are making are positively impacting the end performance of the model.
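The idea of hill climbing on model performance can be sketched with a toy setup: score each candidate imputation strategy by the error of the model on validation data, then keep the best one. Everything here is hypothetical, including the stand-in `model` function, the validation rows, and the three candidate fill values; in practice the model would be your trained model and the metric whatever you already track.

```python
import statistics


def model(age):
    """Stand-in for a trained model (hypothetical): predicts spend from age."""
    return 2.0 * age + 10.0


# Hypothetical validation rows of (age, actual_spend); None marks a missing age.
validation = [(20, 50.0), (None, 72.0), (40, 90.0), (None, 68.0), (30, 70.0)]
observed_ages = [age for age, _ in validation if age is not None]

# Candidate fill values for the missing feature.
strategies = {
    "mean": statistics.mean(observed_ages),
    "min": min(observed_ages),
    "zero": 0.0,
}


def mean_absolute_error(fill):
    """Model error on the validation set when missing ages are replaced by fill."""
    errors = [
        abs(model(fill if age is None else age) - actual)
        for age, actual in validation
    ]
    return sum(errors) / len(errors)


scores = {name: mean_absolute_error(fill) for name, fill in strategies.items()}
best_strategy = min(scores, key=scores.get)
```

The point of the design is that the choice of imputation is judged by its effect on the model, not by how plausible the filled value looks in isolation.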
As fast as machine learning has progressed and made its way into some of our most crucial products and services, the tooling to support these experiences has lagged behind. These core features of a modern data quality monitoring system bring back control to the ML engineer and remove a large amount of guesswork, which unfortunately has crept very deeply into the art of productionizing machine learning.