ML Monitoring & Data Science Glossary

Learn more about data science and machine learning (ML) monitoring with this glossary from Arize. Find definitions to common terms below.

Accuracy

Accuracy measures the share of predictions that a model gets right. It is calculated as the number of correct predictions divided by the total number of predictions: accuracy = correct predictions / all predictions

Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent. If your model predicts that 95 transactions are legitimate and 5 transactions are fraudulent, and all 5 flagged transactions are truly fraudulent, its accuracy is:
95% = (90 correct legitimate + 5 correct fraudulent) / 100 transactions
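
As a rough sketch, the example above can be reproduced in a few lines of Python. The label lists are assumptions built to match the counts in the example (every transaction the model flags as fraud is truly fraudulent), and `accuracy` is a hypothetical helper, not an Arize API.

```python
def accuracy(actuals, predictions):
    """Fraction of predictions that match the actual labels."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    correct = sum(a == p for a, p in zip(actuals, predictions))
    return correct / len(actuals)

# 90 legitimate and 10 fraudulent transactions (ordering assumed for illustration).
actuals = ["legit"] * 90 + ["fraud"] * 10
# The model labels 95 as legitimate and 5 as fraudulent, so 5 frauds are missed.
predictions = ["legit"] * 95 + ["fraud"] * 5

print(accuracy(actuals, predictions))  # 0.95
```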

Additional resources

ML Observability 101 Ebook
Best practices for observing ML models in production
Monitoring Model Performance
The playbook to monitoring your model's performance in production

Baseline

A baseline is the reference data or benchmark against which model performance is compared for monitoring purposes. Baselines can include training data, validation data, a prior time period of production data, or a previous model version.

Additional resources

Definitive ML Observability Checklist
A buyer's guide on potential product and technical requirements for ML observability platforms
Arize Platform Demo
Covers baseline setup

Baseline Distribution

A baseline distribution refers to a model dataset used as a reference or comparison to the model’s current (production) distribution. Baseline distributions in Arize AI’s platform can be training datasets, validation datasets, or prior time periods of production.

Additional resources

Definitive ML Observability Checklist
A buyer's guide on potential product and technical requirements for ML observability platforms
Setting a Baseline
Baseline setup with Arize

Binary Classification Model

Binary classification refers to machine learning tasks that have exactly two class labels. In general, binary classification involves one “positive” and one “negative” class.

Additional resources

ML Observability 101 Ebook
Best practices in ML observability for binary classification models
A Quick Start Guide To Data Quality Monitoring for ML
Identifying hard failures in your data pipelines

Binning

Binning is a way to group a number of continuous values together into smaller cohorts or “bins”. The technique helps reduce the cardinality of data by representing the points in intervals — for example, age ranges.
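
A minimal sketch of binning ages into ranges, using only the Python standard library. The bin edges and labels below are hypothetical choices made for illustration; real bins depend on the data.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: each edge is the lower bound of the next bin.
EDGES = [18, 35, 50, 65]
LABELS = ["under 18", "18-34", "35-49", "50-64", "65+"]

def bin_age(age):
    """Map a continuous age to the label of the interval it falls in."""
    return LABELS[bisect_right(EDGES, age)]

ages = [4, 17, 18, 22.5, 34, 35, 49, 50, 64, 65, 90]
counts = Counter(bin_age(a) for a in ages)
print(counts)  # 11 distinct values reduced to 5 bins
```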

Additional resources

Using Statistical Distances In ML
An overview of the types of bins and the use of binning in machine learning

Canary Deployment

A method of testing a new model or model version where only a small subset of production data flows through this model to verify response performance before making a complete cutover. This technique allows for deeper analysis and understanding of model behavior and can minimize risk that regressions severely impact the business or customers.

Additional resources

Move Fast Without Breaking Things In ML
An overview on managing change, including use of canary deployments
Model Deployment and Serving
Things to look for in model servers

Classification Model

Classification models are used to predict categories or assign a class label. Any given data is classified into a set of categories or groups to determine its further use or for processing needs.

Data that can be classified into one category or a second category is known as binary data. For example, fraud or not fraud, male or female.

Data that can be classified into more than two categories or groups is known as multi-class data. For example, education level or household income bracket.

Additional resources

The Model's Shipped; What Could Possibly Go Wrong?
Identifying model failure modes with classification models

Confusion Matrix

A confusion matrix summarizes the prediction results of a classification problem. It shows the count of correct and incorrect predictions for each class (see True Positive, True Negative, False Positive, False Negative). By summarizing all possible outcomes, the confusion matrix shows the ways in which your classification model gets confused when making predictions. It helps identify both the errors and the types of errors made by the model, and thus helps improve the classification model.
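
The four cells of a binary confusion matrix can be tallied with a short loop. This is an illustrative sketch: the `confusion_counts` helper, the choice of "fraud" as the positive class, and the sample labels are all assumptions.

```python
from collections import Counter

def confusion_counts(actuals, predictions, positive="fraud"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for actual, predicted in zip(actuals, predictions):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}

actuals     = ["fraud", "fraud", "legit", "legit", "legit", "fraud"]
predictions = ["fraud", "legit", "legit", "fraud", "legit", "fraud"]
print(confusion_counts(actuals, predictions))
# {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```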

Additional resources

Understanding Bias in ML Models
Utilizing a confusion matrix

Current Distribution

Current distribution refers to the statistical distribution, or shape, of the dataset being generated by a machine learning model in production. The distributions of datasets in machine learning are represented as functions that show the relationships between observations, and are typically visualized as curves or graphs.

Additional resources

Using Statistical Distances In ML
Measuring drift
Model Drift
One-stop shop for all things drift-related.

Data Quality

Data quality refers to the integrity and consistency of the data sets used. In monitoring machine learning performance, data quality measures include missingness, out-of-range values, the 1st and 99th percentiles (P1 and P99), and type mismatches, among others.

Additional resources

Data Quality Monitoring
A quick start guide
Solving Data Quality
Ensuring high quality for structured data with ML observability

Deep Learning

Machine learning is a subset of AI, and it consists of the techniques that enable computers to figure things out from data and deliver AI applications. Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems using an approach inspired by the human brain’s network of neurons.

Deep Learning Model

A deep learning model normally refers to a neural network, typically with more than two layers. Deep learning is usually used for computationally dense tasks like computer vision (images) and natural language processing.

Drift

Drift is defined as a change in the data over time. It can also refer to a change in the properties of the target variable, caused by unpredictable or unforeseen events, over time.

Data drift is a change in the distribution of data between real-time (production) data and the baseline data that was set beforehand.
Concept drift is a change in the relationship between the inputs and the output.

Drift can be in any form. It can be gradual, recurring, or sudden. It can be a positive or negative drift. The change in data over time can affect model outcomes, making drift an important metric to monitor when it comes to model performance.

Additional resources

Take My Drift Away
How to troubleshoot and resolve the underlying issue when drift occurs
Model Drift
A one-stop shop for all things drift-related
Monitor Drift
How to troubleshoot drift with Arize

Embedding

In natural language processing (see definition of ‘Natural Language Processing’), an embedding is a representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of a word such that words that are closer in the vector space are expected to be similar in meaning.
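
To illustrate what "closer in the vector space" means, here is a small sketch that compares vectors by cosine similarity, one common closeness measure. The three-dimensional vectors are made-up values for illustration only; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not output from any real model.
embeddings = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["banana"])
print(related > unrelated)  # True
```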

Evaluation (Inference) Store

A machine learning infrastructure tool to monitor and improve model performance. Think of them as the ledger or log of model activities/inferences. Evaluation stores are used to:

Surface up performance metrics in aggregate (or slice) for any model, in any environment — production, validation, training
Monitor and identify drift, data quality issues, or anomalous performance degradations using baselines
Enable teams to connect changes in performance to why they occurred
Provide a platform to help deliver models continuously with high quality and feedback loops for improvement — compare production to training
Provide an experimentation platform to A/B test model versions

Additional resources

The Only Three ML Tools You Need
More on the evaluation store in context

Evaluation Metric

The way the performance of a predictive model is quantified and calculated is known as the evaluation metric. It is used to evaluate the accuracy and the performance of the model used.

Evaluation Window

The evaluation window is the period of time over which a metric is calculated, for instance the previous 30 days. Any evaluation metric that can be calculated over a duration of time can be visualized across an evaluation window.

Explainability

The extent to which the internal mechanics of a machine learning model can be explained in human-understandable terms. Put simply, it is the process of explaining the reasons behind a model's outputs.
See ‘SHAP’.

Additional resources

Explainability Primer
A deeper dive into global, cohort and local explainability across ML lifecycle
Using Explainability
How to tackle explainability with Arize

F-score

The F-score is the harmonic mean of precision and recall. It combines the two into a single number for a better understanding of model performance. The general F-beta form yields variants such as F0.5, F1, and F2, depending on how much weight is given to recall relative to precision: F0.5 favors precision, F1 weights both equally, and F2 favors recall.
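
A minimal sketch of the general F-beta formula, F = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The function name and the sample precision and recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision, beta > 1 favors recall, beta = 1 gives F1.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with high precision but lower recall.
precision, recall = 0.8, 0.5
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

Because precision is higher than recall in this example, the score drops as beta increases and recall gets more weight.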

Additional resources

Monitoring Model Performance
Monitoring model performance in production using F1

False Negative

When a model mistakenly predicts the negative class for a value that belongs to the positive class.

Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a fraudulent transaction.

Additional resources

Best Practices In ML Observability for Monitoring, Mitigating and Preventing Fraud
Real-world use-cases: fraud models
Best Practices In ML Observability for Click-Through Rate Models
Real-world use-cases: click through rate models

False Positive

When a model mistakenly predicts the positive class for a value that belongs to the negative class.

Example: A model flags a credit card transaction as ‘fraud’ when it was not actually a fraudulent transaction.

Feature Importance

Feature importance refers to a class of techniques that take in all the features used to make a model prediction and assign a score to each feature, reflecting how much or how little it impacted the outcome. These scores can then be used to better understand the internal logic of a model, make changes that improve its accuracy, and remove unnecessary inputs.

Additional resources

Model Explainability
A primer on global, cohort, and local explainability

Feature Performance Heat Map

A feature performance heat map is a visual representation of the performance of each feature in a given model. It enables users to quickly see slices of performance or features that perform significantly better or worse than others for faster triangulation of issues. Heat maps are especially useful when troubleshooting.

Feature Store

A machine learning infrastructure tool that handles offline and online feature transformations. Think of them as the interface between your models and data. Feature stores are used to:

Serve as the central source for feature transformations
Allow for the same feature transformations to be used in both offline training and online serving
Enable team members to share their transformations for experimentation
Provide a strong versioning for feature transformation code

Additional resources

Feast + Arize Partnership
Supercharging feature management and model monitoring for MLOps

JS Distance

JS distance is a symmetric derivation of KL divergence, and it is used to measure drift. In addition to being an actual metric (as opposed to KL), it is bounded by the square root of ln(2). For two distributions P and Q, the formula for JS distance is shown below. Use JS distance to compare distributions with low variance.

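$D_{JS}(P \,\|\, Q) = \sqrt{\tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)}$, where $M = \tfrac{1}{2}(P + Q)$ and $D_{KL}$ is the KL divergence.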
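
As a sketch under stated assumptions, JS distance for two discrete (binned) distributions can be computed as the square root of the average KL divergence of each distribution from their mixture M = (P + Q) / 2. The helper names are hypothetical, and natural logarithms are used so that the upper bound is the square root of ln(2).

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats; bins where p is 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between P and Q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_divergence, 0.0))  # guard against tiny negatives

baseline   = [0.25, 0.25, 0.25, 0.25]  # hypothetical binned baseline
production = [0.10, 0.20, 0.30, 0.40]  # hypothetical binned production data

print(js_distance(baseline, production))
print(js_distance([1.0, 0.0], [0.0, 1.0]))  # non-overlapping: the upper bound
```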

Additional resources

Using Statistical Distances for Machine Learning
Common Metrics for Drift and Where To Use Them
When I Drift, You Drift, We Drift
Understanding Different Types of Model Drift
Take My Drift Away
Strategies for Responding to Model Drift

KL Divergence

The Kullback-Leibler (KL) divergence measures how one probability distribution differs from a reference probability distribution. It is not symmetric, so it is not a true distance metric. KL divergence is sometimes referred to as ‘relative entropy’ and is best used when one distribution has a much smaller sample and a large variance.

Additional resources

Using Statistical Distances for Machine Learning
When and where to use KL Divergence
Take My Drift Away
How to measure drift
Model Monitoring for Drift
Drift analysis techniques

Kolmogorov Smirnov Test

Useful in drift monitoring, the Kolmogorov Smirnov test (KS test) is a widely used nonparametric technique for comparing a sample with a baseline or reference probability distribution. The KS test is an efficient way to measure whether two distributions differ significantly from one another. The Kolmogorov Smirnov statistic quantifies the maximum distance between two cumulative distribution functions.
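
The statistic itself is simple to sketch: build the empirical cumulative distribution function (CDF) of each sample and take the largest gap between them. The code below computes only the two-sample statistic, not the significance test (p-value); the function name and sample values are assumptions.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

baseline   = [1, 2, 3, 4, 5]  # hypothetical baseline feature values
production = [3, 4, 5, 6, 7]  # hypothetical shifted production values
print(ks_statistic(baseline, production))
```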

LIME

LIME, or “Local Interpretable Model-Agnostic Explanations,” is an explainability method that attempts to provide local ML explainability. At a high level, LIME attempts to understand how perturbations in a model’s inputs affect the end-prediction of the model. Since it makes no assumptions about how the model reaches the prediction, it can be used with any model architecture, hence the “model-agnostic” part of LIME. The LIME approach takes a single prediction and perturbs the input values around it. It then fits a linear model to the perturbed inputs and the resulting predictions, where the coefficients are the feature importances for that local prediction.

Additional resources

Model Explainability
A deeper dive into global, cohort and local explainability across ML lifecycle including LIME
LIME Paper
Explaining the predictions of any classifier

Logarithmic Loss

Logarithmic loss (log loss) measures how far a model's predicted class probabilities are from the actual class labels, and it penalizes confident predictions that turn out to be wrong most heavily. Lower log loss values indicate better-calibrated, more accurate predictions.
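
A minimal sketch of binary log loss, assuming labels of 0 or 1 and predicted probabilities for the positive class. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def log_loss(actuals, probabilities, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(actuals, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actuals)

actuals = [1, 0]
confident_right = log_loss(actuals, [0.9, 0.1])
hesitant_right  = log_loss(actuals, [0.6, 0.4])
confident_wrong = log_loss(actuals, [0.1, 0.9])
print(confident_right < hesitant_right < confident_wrong)  # True
```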

MAE

Mean Absolute Error is a regression loss measure that looks at the absolute difference between a model’s predictions and ground truth, averaged across the dataset. Unlike MSE, MAE is weighted on a linear scale and therefore doesn’t put as much weight on outliers. This provides a more even measure of performance, but means each unit of error counts the same whether it comes from a large error or a small one. Keep this in mind when choosing a metric for your specific model use case.
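
The difference in how MAE and MSE treat outliers can be seen in a small sketch. The values below are made up for illustration: both prediction sets have the same MAE, but the set with a single large error has a much higher MSE.

```python
def mean_absolute_error(actuals, predictions):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

def mean_squared_error(actuals, predictions):
    """Average squared difference between predictions and ground truth."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)

actuals             = [10.0, 12.0, 11.0, 13.0]
even_predictions    = [11.0, 11.0, 12.0, 12.0]  # every prediction is off by 1
outlier_predictions = [10.0, 12.0, 11.0, 17.0]  # one prediction is off by 4

print(mean_absolute_error(actuals, even_predictions))     # 1.0
print(mean_absolute_error(actuals, outlier_predictions))  # 1.0
print(mean_squared_error(actuals, even_predictions))      # 1.0
print(mean_squared_error(actuals, outlier_predictions))   # 4.0
```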

MAPE

Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE. MAPE measures the average magnitude of error produced by a model, or how far off predictions are on average. See MAE for considerations when using this metric.
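
A short sketch of MAPE, with hypothetical values. One practical caveat the sketch makes explicit: the metric is undefined when any actual value is zero, because each error is divided by the actual value.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, expressed as a percentage."""
    if any(a == 0 for a in actuals):
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs((a - p) / a) for a, p in zip(actuals, predictions)]
    return 100 * sum(ratios) / len(ratios)

actuals     = [100.0, 200.0, 50.0]
predictions = [110.0, 180.0, 50.0]  # off by 10%, 10%, and 0%
print(round(mape(actuals, predictions), 2))  # 6.67
```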

Model Performance

The performance of a machine learning model indicates its usability and ability to provide accurate results. Performance is usually measured in terms of metrics that apply to the specific type of machine learning model concerned. Here are some common metrics used according to the type of model:

Regression models – MSPE, MSAE, R Squared and Adjusted R Squared
Classification models – Precision-Recall, ROC-AUC, Accuracy, Log Loss
Unsupervised models – Rand index, Mutual information

Additional resources

Performance Monitoring
The playbook for monitoring model performance in production
Monitoring Model Performance
The only 3 ML tools you need

Model Store

A machine learning infrastructure tool that serves as a central model registry and tracks experiments. Think of it as the library or catalog of your models. Model stores are used to:

Serve as a central repository of all models and model versions
Allow for reproducibility of every model version
Track the lineage and history of models

Monitor Threshold

Monitor threshold refers to the value set for a model monitor, beyond which the monitor is triggered. The threshold value can be set on any specific performance metric such as accuracy, MSE, MAPE, etc.

MSE

Mean Square Error is a regression loss measure. The MSE is calculated as the difference between the model’s predictions and ground truth, squared and averaged across the dataset. It is used to check how close the predicted values are to the actual values. As with RMSE, a lower value indicates a better fit, and it heavily penalizes large errors or outliers.

Natural Language Processing (NLP)

Natural language processing (NLP) refers to models that work with human language. The inputs to these models are typically sentences, such as: “This definition is so informative.” These inputs are broken up into tokens: “This” “definition” “is” “so” “informative.” Most commonly, a classification model runs on top of NLP.
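
The tokenization step in the example above can be approximated with a whitespace split. This is a deliberately naive sketch; production NLP systems typically use trained tokenizers that also handle punctuation and subword units.

```python
sentence = "This definition is so informative."

# Naive tokenization: split on whitespace only, so the final period
# stays attached to the last token.
tokens = sentence.split()
print(tokens)  # ['This', 'definition', 'is', 'so', 'informative.']
```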

Performance Impact Score

Performance impact score is a measure of how much worse your metric of interest is on a slice compared to the average. Sorting by performance impact score enables you to narrow in on a slice that is hurting performance (e.g., accuracy).

Additional resources

What Is ML Performance Tracing?
See definitions and real-world examples of ML performance tracing and performance impact score.

Performance Slice

A performance slice is a subset of model values of interest in performance analysis and troubleshooting. Slices can be formed from any model dimension, including specific periods of time, set of features, etc. Performance slice analysis is useful when the goal is to understand or troubleshoot a cohort of interest, such as with bias detection, where the generalized dataset might mask statistical nuances.

Population Stability Index (PSI)

Population Stability Index looks at the magnitude by which a variable has changed or shifted in distribution between two samples over a given period of time. PSI is calculated by summing over all bins: PSI = Σ (% Actual – % Expected) x ln(% Actual / % Expected)

The larger the PSI, the less similar your distributions are, which allows you to set up thresholding alerts on the drift in your distributions. PSI is a great metric for both numeric and categorical features where distributions are fairly stable.
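
A minimal sketch of the PSI calculation, summing the per-bin term over all bins. The inputs are assumed to be bin proportions that each sum to 1; the small `eps` floor is an assumption added so that empty bins do not cause a division by zero or a logarithm of zero.

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index over matching bins of two distributions."""
    total = 0.0
    for expected, actual in zip(expected_pct, actual_pct):
        expected, actual = max(expected, eps), max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total

baseline   = [0.25, 0.25, 0.25, 0.25]  # hypothetical "% Expected" per bin
production = [0.10, 0.20, 0.30, 0.40]  # hypothetical "% Actual" per bin

print(psi(baseline, baseline))    # 0.0 for identical distributions
print(psi(baseline, production))  # grows as the distributions diverge
```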

Additional resources

Using Statistical Distances for Machine Learning
A white paper on commonly used statistical distance metrics, including PSI

PR Curve

The Precision-Recall curve plots the trade-off between precision and recall at different cut-off values, with the cut-off values being set according to the particular model.

Precision

Precision is the fraction of values that actually belong to the positive class out of all the values predicted to belong to that class. precision = true positives / (true positives + false positives)

Example: There are 100 credit card transactions; 80 transactions are legitimate (positive class) and 20 transactions are fraudulent. If your model predicts that 85 transactions are legitimate, its precision is: 94.12% = 80 true positives / (80 true positives + 5 false positives)
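
The example above can be checked with a short sketch. It assumes, as the example implies, that all 80 legitimate transactions are among the 85 predicted legitimate; `precision` is a hypothetical helper.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that are actually positive."""
    return true_positives / (true_positives + false_positives)

# 85 transactions predicted legitimate: 80 truly legitimate, 5 actually fraud.
print(round(100 * precision(true_positives=80, false_positives=5), 2))  # 94.12
```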

Additional resources

Monitoring Model Performance
Playbook for performance analysis of production models

Prediction Drift Impact

The product of feature importance and drift (population stability index — PSI).

Additional resources

Model Drift
Learn what constitutes model drift, how to monitor for drift in machine learning models, and drift resolution techniques.
Using Statistical Distances for Machine Learning
Use cases for statistical distance checks across model inputs, model outputs and actuals.
Take My Drift Away
How to troubleshoot model drift.
When I Drift, You Drift, We Drift
The difference between concept, data, and model drift.

Quantile

Quantiles are the points dividing the range of a probability distribution into intervals with equal probabilities.
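
As a small illustration, Python's standard library can compute the cut points directly. The data below is a hypothetical sample; `statistics.quantiles` uses its default "exclusive" method here, and other methods can give slightly different cut points.

```python
import statistics

data = list(range(1, 101))  # hypothetical sample: the integers 1 through 100

# n=4 returns the three cut points that split the data into quartiles.
quartiles = statistics.quantiles(data, n=4)

# n=100 returns 99 cut points; the first and last correspond to P1 and P99.
percentiles = statistics.quantiles(data, n=100)
p1, p99 = percentiles[0], percentiles[-1]

print(quartiles)
print(p1, p99)
```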

Additional resources

Statistical Distances for ML
A white paper on commonly used statistical distance metrics

Recall

Recall is the fraction of values correctly predicted to be of the positive class out of all the values that truly belong to the positive class (true positives plus false negatives). recall = true positives / (true positives + false negatives)

Additional resources

Monitoring Model Performance
The playbook for using recall and other model metrics to monitor your model's performance in production
ML Observability 101 Ebook
Best practices for observing ML models in production

Regression

Regression analysis is a fundamental concept in data science and machine learning. It helps quantify the relationship between the inputs into a model and its outputs. Essentially, it is an estimation of how a set of independent variables affects a dependent variable.

RMSE

Root Mean Square Error (also known as root mean square deviation, RMSD) is a measure of the average magnitude of error in quantitative data predictions. It can be thought of as the normalized distance between the vector of predicted values and the vector of observed (or actual) values.

Because errors are squared before they are averaged, this measure gives higher weight to large errors, and is therefore useful in cases where you want to penalize models accordingly.
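
A sketch of RMSE that also shows the "normalized distance" reading: RMSE equals the Euclidean distance between the predicted and actual vectors, divided by the square root of the number of observations. The sample values are hypothetical.

```python
import math

def rmse(actuals, predictions):
    """Square root of the mean squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actuals     = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 7.0]  # errors of 1, 0, -2, 0

direct = rmse(actuals, predictions)
# Same value via the Euclidean distance between the two vectors.
via_distance = math.dist(actuals, predictions) / math.sqrt(len(actuals))
print(direct, via_distance)
```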

ROC – AUC

The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. Area Under the Curve (AUC) is an aggregate measure of performance across all possible classification thresholds. Together, ROC – AUC represents the degree of separability, or how capable a model is of distinguishing between classes.

The higher the AUC (i.e. closer to 1), the better the model is at predicting 0 class as 0, and 1 class as 1.
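
One way to make the separability interpretation concrete: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise definition rather than by integrating the curve; the function name and sample scores are assumptions, and the quadratic loop is meant for clarity, not for large datasets.

```python
def roc_auc(actuals, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(actuals, scores) if y == 1]
    negatives = [s for y, s in zip(actuals, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

actuals = [0, 0, 1, 1]
scores  = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(actuals, scores))  # 0.75
```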

Additional resources

What Is AUC?
A quick-and-intuitive explanation of AUC -- includes gifs that animate how the ROC curve is constructed.

Score Models

Score models generate a numeric value as their prediction or output. For example, the likelihood that an input belongs to a category.

Sensitivity

Sensitivity is the fraction of actual positive cases that a model correctly identifies as positive. It is also called the true positive rate, and is equivalent to recall.
sensitivity = true positives / (true positives + false negatives)

Shadow Deployment

A method of testing a candidate model for production where production data runs through the model without the model actually returning predictions to the service or customers. Essentially, simulating how the model would perform in the production environment.

Additional resources

Move Fast Without Breaking Things In ML
An overview on managing change, including use of shadow deployments

SHAP

SHAP stands for “Shapley Additive Explanations”, a concept derived from game theory and used to explain the output of machine learning models (see definition of ‘Explainability’). SHAP values help interpret how much a given feature or input contributes, positively or negatively, to the target outcome or prediction. See ‘Feature Importance’.

Additional resources

Model Explainability
A primer on global, cohort, and local explainability
Using Explainability
How to use SHAP with Arize

Specificity

Specificity is the fraction of values correctly predicted to be of the negative class out of all the values that truly belong to the negative class (true negatives plus false positives). This measure is similar to recall, but describes how well the model correctly predicts negative values. It is also called the true negative rate.

specificity = true negatives / (true negatives + false positives)

Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent (negative class). If your model flags only 5 of the 10 fraudulent transactions as fraudulent and predicts the other 5 as legitimate, its specificity is:
50% = 5 true negatives / (5 true negatives + 5 false positives)
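
A minimal sketch of the specificity formula with hypothetical counts. With fraud as the negative class, suppose the model flags 5 of the 10 fraudulent transactions as fraud (true negatives) and predicts the other 5 as legitimate (false positives).

```python
def specificity(true_negatives, false_positives):
    """True negative rate: share of actual negatives predicted as negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts: 10 actual negatives, 5 caught and 5 missed.
print(specificity(true_negatives=5, false_positives=5))  # 0.5
```

Note that only values that truly belong to the negative class enter the denominator; legitimate transactions wrongly flagged as fraud are false negatives and affect recall instead.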

Tabular Data

Data in a table format, with columns and rows. Model inputs are arranged in a table (e.g., an Excel spreadsheet), where columns might be feature inputs (e.g., city, state, charge amount). NLP and image data do not fit this format, since the inputs are sentences or images.

True Negative

When a model correctly predicts the negative class for a value that belongs to the negative class.

Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a legitimate transaction.

Additional resources

Best Practices In ML Observability for Click-Through Rate Models
Real-world use-cases: click through rate (CTR) models
Best Practices in ML Observability for Monitoring, Mitigating and Preventing Fraud
Real-world use-cases: fraud models

True Positive

When a model correctly predicts the positive class for a value that belongs to the positive class.

Example: A model flags a credit card transaction as ‘fraud’ when it is actually a fraudulent transaction.

Additional resources

Best Practices for Monitoring Fraud Models
Real-world use-cases: fraud models
Best Practices for Monitoring Click Through Rate Models
Real-world use-cases: click through rate (CTR) models
