A Quick Start To Data Quality Monitoring For Machine Learning
Data is quickly becoming the lifeblood of our current technologies enabling companies to build, measure, and improve new experiences for their customers. Today this is not just limited to the technologies on the cutting edge; rather it is becoming exceedingly common across many sectors of business to collect and utilize data.
Now with the rise of machine learning making new customer experiences possible, a renewed reliance on data is emerging. In this new context of ML-powered systems, building and maintaining high-quality data sources has never been more important. Today’s ML systems require copious amounts of data to perform well, and handling this volume of data is causing real problems in the companies that have adopted these technologies.
In practice today, a model is often only as good as the data it is trained on. Data quality doesn’t stop being important after the model is trained, but continues to remain important as the model is deployed in production. The quality of the model’s predictions is highly dependent on the quality of the data sources powering the model’s features. In this piece, I’ll give into why your team should be paying close attention to the quality of your data and the impact to your model’s end performance.
What do we Mean by Data Quality?
Data quality is a broad term and can cover a wide variety of issues in your data. To start, let’s define what we aren’t going to talk about in this piece. In this piece, we are not going to concern ourselves with “slow bleed” failures such as gradual drift in your data over time. If you are interested in learning about this extremely important topic, you can take a look at some of our earlier pieces where we go more in-depth around this.