ML Monitoring & Data Science Glossary
Learn more about data science and machine learning (ML) monitoring with this glossary from Arize. Find definitions of common terms below.
Accuracy
Accuracy is the fraction of predictions that the model gets right. It is calculated as the number of correct predictions divided by the total number of predictions.
accuracy = correct predictions / all predictions
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent. If your model predicts that 95 transactions are legitimate and 5 transactions are fraudulent, and all 5 flagged transactions really are fraudulent, its accuracy is:
95% = (90 correct legitimate + 5 correct fraudulent) / 100 transactions
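As a rough illustration, the calculation above can be sketched in Python. The function name `accuracy` and the label strings are assumptions made for this example, and the data assumes that every flagged transaction really is fraudulent.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)

# 90 legitimate and 10 fraudulent transactions (illustrative labels).
y_true = ["legit"] * 90 + ["fraud"] * 10
# The model flags only 5 of the 10 fraudulent transactions.
y_pred = ["legit"] * 90 + ["fraud"] * 5 + ["legit"] * 5

score = accuracy(y_true, y_pred)
print(f"{score:.0%}")  # 95%
```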
Additional resources

ML Observability 101 Ebook
Best practices for observing ML models in production

Monitoring Model Performance
The playbook to monitoring your model's performance in production
Baseline
A baseline is the reference data or benchmark used to compare model performance against for monitoring purposes. Baselines can be training data, validation data, prior time periods of production data, a previous model version, among others.
Additional resources

Definitive ML Observability Checklist
A buyer's guide on potential product and technical requirements for ML observability platforms

Arize Platform Demo
Covers baseline setup
Baseline Distribution
A baseline distribution refers to a model dataset used as a reference or comparison to the model’s current (production) distribution. Baseline distributions in Arize AI’s platform can be training datasets, validation datasets, or prior time periods of production.
Additional resources

Definitive ML Observability Checklist
A buyer's guide on potential product and technical requirements for ML observability platforms

Setting a Baseline
Baseline setup with Arize
Binary Classification Model
Binary classification refers to machine learning tasks that have exactly two class labels. In general, binary classification involves one “positive” and one “negative” class.
Additional resources

ML Observability 101 Ebook
Best practices in ML observability for binary classification models

A Quick Start Guide To Data Quality Monitoring for ML
Identifying hard failures in your data pipelines
Binning
Binning is a way to group a number of continuous values together into smaller cohorts or “bins”. The technique helps reduce the cardinality of data by representing the points in intervals — for example, age ranges.
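A minimal sketch of binning ages into ranges, using only the Python standard library. The bin edges, labels, and the helper name `assign_bin` are hypothetical choices made for illustration.

```python
from bisect import bisect_right
from collections import Counter

def assign_bin(value, edges):
    """Return the index of the bin that contains `value`.

    Index 0 is everything below the first edge; index len(edges) is
    everything at or above the last edge.
    """
    return bisect_right(edges, value)

ages = [17, 22, 29, 30, 41, 58, 65, 73]
edges = [18, 30, 45, 65]                      # lower bounds of each bin after the first
labels = ["<18", "18-29", "30-44", "45-64", "65+"]

binned = [labels[assign_bin(age, edges)] for age in ages]
counts = Counter(binned)
print(counts)
```

Eight distinct ages collapse into five categories, which is the cardinality reduction described above.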
Additional resources

Using Statistical Distances In ML
An overview of the types of bins and the use of binning in machine learning
Canary Deployment
A method of testing a new model or model version where only a small subset of production data flows through this model to verify response performance before making a complete cutover. This technique allows for deeper analysis and understanding of model behavior and can minimize risk that regressions severely impact the business or customers.
Additional resources

Move Fast Without Breaking Things In ML
An overview on managing change, including use of canary deployments

Model Deployment and Serving
Things to look for in model servers
Classification Model
Classification models are used to predict categories or assign a class label. Any given data is classified into a set of categories or groups to determine its further use or for processing needs.
Data that can be classified into one category or a second category is known as binary data. For example, fraud or not fraud, male or female.
If the set of data can be classified into a number of categories or groups, each based on a different criterion, such data is known as multiclass data. For example, education level, household income.
Additional resources

The Model's Shipped; What Could Possibly Go Wrong?
Identifying model failure modes with classification models
Confusion Matrix
A confusion matrix provides a summary of all prediction results of a classification problem. Each result is shown with its corresponding number of correct/incorrect predictions (see True Positive, True Negative, False Positive, False Negative), count values and classification criteria. By providing a neat summary of all possible results, the confusion matrix lets you know the ways your classification model could get confused when making the predictions. It helps identify errors and the type of errors made by the model and thus helps improve the accuracy of the classification model.
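The four cells of a binary confusion matrix can be tallied with a short Python sketch. The function name `confusion_counts` and the sample labels are assumptions for illustration only.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_counts(y_true, y_pred)
print(matrix)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```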
Additional resources

Understanding Bias in ML Models
Utilizing a confusion matrix
Current Distribution
Current distribution refers to the statistical distribution, or shape, of the dataset being generated by a machine learning model in production. Distributions of datasets in machine learning are represented as functions that show the relationships between the various observations, and are usually presented visually as curves or graphs.
Additional resources

Using Statistical Distances In ML
Measuring drift

Model Drift
A one-stop shop for all things drift-related
Data Quality
Data quality refers to the integrity and consistency of the data sets used. In monitoring machine learning performance, data quality measures include attributes such as missingness, out of range, P1 and P99, type mismatch, among others.
Additional resources

Data Quality Monitoring
A quick start guide

Solving Data Quality
Ensuring high quality for structured data with ML observability
Deep Learning
Machine learning is a subset of AI, and it consists of the techniques that enable computers to figure things out from the data and deliver AI applications. Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems inspired by the human brain’s network of neurons.
Deep Learning Model
A deep learning model normally refers to a neural network, typically with more than two layers. Deep learning is usually used for computationally dense tasks like computer vision (images) and natural language processing.
Drift
Drift is the change in data over time. It can also refer to a change in the properties of the target variable over time, due to unpredictable or unforeseen changes.
 Data drift is a change in the distribution of data between real-time (production) data and the baseline data that was set beforehand.
 Concept drift is a change in the relationship between a model’s inputs and its outputs.
Drift can be in any form. It can be gradual, recurring, or sudden. It can be a positive or negative drift. The change in data over time can affect model outcomes, making drift an important metric to monitor when it comes to model performance.
Additional resources

Take My Drift Away
How to troubleshoot and resolve the underlying issue when drift occurs

Model Drift
A one-stop shop for all things drift-related

Monitor Drift
How to troubleshoot drift with Arize
Embedding
In natural language processing (see definition of ‘Natural Language Processing’), embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that words that are closer in the vector space are expected to be similar in meaning.
Evaluation (Inference) Store
A machine learning infrastructure tool to monitor and improve model performance. Think of them as the ledger or log of model activities/inferences. Evaluation stores are used to:
 Surface performance metrics in aggregate (or by slice) for any model, in any environment: production, validation, or training
 Monitor and identify drift, data quality issues, or anomalous performance degradations using baselines
 Enable teams to connect changes in performance to why they occurred
 Provide a platform to help deliver models continuously with high quality and feedback loops for improvement — compare production to training
 Provide an experimentation platform to A/B test model versions
Additional resources

The Only Three ML Tools You Need
More on the evaluation store in context
Evaluation Metric
The way the performance of a predictive model is quantified and calculated is known as the evaluation metric. It is used to evaluate the accuracy and the performance of the model used.
Evaluation Window
The evaluation window is a plot of the metric being calculated against a period or duration of time, for instance, the previous 30 days. Any evaluation metric that can be calculated over a duration of time can be visualized as an evaluation window.
Explainability
The extent to which the internal mechanics of a machine learning model can be explained in human-understandable terms. Put simply, it is the process of explaining the reasons behind a model’s output.
See ‘SHAP’.
Additional resources

Explainability Primer
A deeper dive into global, cohort and local explainability across ML lifecycle

Using Explainability
How to tackle explainability with Arize
F-score
The F-score is the harmonic mean of precision and recall. It combines the two measures into a single number for a better understanding of the accuracy of the model. The general form can be tuned into variants such as F0.5, F1, and F2 depending on the weight given to precision relative to recall: F0.5 favors precision, F1 weights both equally, and F2 favors recall.
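A small Python sketch of the weighted variants, using the standard F-beta formula (1 + beta^2) x precision x recall / (beta^2 x precision + recall). The function name `f_beta` and the precision and recall values are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

precision, recall = 0.5, 1.0  # illustrative values

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # balanced
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall
print(round(f_half, 4), round(f_one, 4), round(f_two, 4))
```

Because recall is the stronger of the two values here, the score rises as beta shifts weight toward recall.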
Additional resources

Monitoring Model Performance
Monitoring model performance in production using F1
False Negative
When a model mistakenly predicts the negative class for a value that actually belongs to the positive class.
Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a fraudulent transaction.
Additional resources

Best Practices In ML Observability for Monitoring, Mitigating and Preventing Fraud
Real-world use cases: fraud models

Best Practices In ML Observability for Click-Through Rate Models
Real-world use cases: click-through rate models
False Positive
When a model mistakenly predicts the positive class for a value that actually belongs to the negative class.
Example: A model flags a credit card transaction as ‘fraud’ when it was not actually a fraudulent transaction.
Feature Importance
Feature importance is a class of techniques that take in all the features related to making a model prediction and assign a score to each feature to weigh how much or how little it impacted the outcome. These scores can then be used to better understand the internal logic of a model, make necessary changes to the model to improve its accuracy, and reduce unnecessary inputs.
Additional resources

Model Explainability
A primer on global, cohort, and local explainability
Feature Performance Heat Map
A feature performance heat map is a visual representation of the performance of each feature in a given model. It enables users to quickly see slices of performance or features that perform significantly better or worse than others for faster triangulation of issues. Heat maps are especially useful when troubleshooting.
Feature Store
A machine learning infrastructure tool that handles offline and online feature transformations. Think of them as the interface between your models and data. Feature stores are used to:
 Serve as the central source for feature transformations
 Allow for the same feature transformations to be used in both offline training and online serving
 Enable team members to share their transformations for experimentation
 Provide a strong versioning for feature transformation code
Additional resources

Feast + Arize Partnership
Supercharging feature management and model monitoring for MLOps
JS Distance
JS distance is a symmetric measure derived from KL divergence, and it is used to measure drift. Unlike KL divergence, it is a true metric, and it is bounded by the square root of ln(2). For two distributions P and Q, with M = (P + Q) / 2 as their mixture, the JS distance is:
JS(P, Q) = sqrt( 0.5 x KL(P || M) + 0.5 x KL(Q || M) )
Use JS distance to compare distributions with low variance.
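A minimal sketch of JS distance for two discrete (already binned) distributions, written as the square root of the average KL divergence of each distribution from their mixture. The function names and the sample distributions are assumptions for illustration; natural logarithms are used so that the upper bound is the square root of ln(2).

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between P and Q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))      # guard against rounding below zero

baseline = [0.25, 0.25, 0.25, 0.25]    # hypothetical binned baseline
production = [0.10, 0.20, 0.30, 0.40]  # hypothetical binned production data

distance = js_distance(baseline, production)
upper_bound = math.sqrt(math.log(2))
print(distance, upper_bound)
```

Swapping the arguments gives the same value, and two distributions with no overlap reach the upper bound.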
Additional resources

Using Statistical Distances for Machine Learning
Common Metrics for Drift and Where To Use Them

When I Drift, You Drift, We Drift
Understanding Different Types of Model Drift

Take My Drift Away
Strategies for Responding to Model Drift
KL Divergence
The Kullback-Leibler (KL) divergence measures how one probability distribution differs from a reference probability distribution. KL divergence is sometimes referred to as ‘relative entropy’ and is best used when one distribution is much smaller in sample and has a large variance.
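A short Python sketch of KL divergence for discrete distributions, which also shows that the result depends on which distribution is treated as the reference. The function name and the sample values are illustrative assumptions, and the sketch assumes the reference has nonzero probability wherever the other distribution does.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # hypothetical production distribution
q = [0.9, 0.1]  # hypothetical reference distribution

forward = kl_divergence(p, q)   # roughly 0.51
backward = kl_divergence(q, p)  # roughly 0.37
print(forward, backward)
```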
Additional resources

Using Statistical Distances for Machine Learning
When and where to use KL Divergence

Take My Drift Away
How to measure drift

Model Monitoring for Drift
Drift analysis techniques
Kolmogorov-Smirnov Test
Useful in drift monitoring, the Kolmogorov-Smirnov test (KS test) is a widely used nonparametric technique for comparing a sample with a baseline or reference probability distribution. The KS test is an efficient way to measure whether two distributions differ significantly from one another. The Kolmogorov-Smirnov statistic quantifies the maximum distance between two cumulative distribution functions.
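The statistic itself can be sketched in a few lines of Python by comparing the empirical cumulative distribution functions of two samples. This computes only the maximum distance, not a p-value; the function name `ks_statistic` and the samples are assumptions for illustration.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

baseline = [1, 2, 3, 4]    # hypothetical baseline sample
production = [3, 4, 5, 6]  # hypothetical production sample

print(ks_statistic(baseline, production))  # 0.5
```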
LIME
LIME, or “Local Interpretable Model-Agnostic Explanations,” is an explainability method that attempts to provide local ML explainability. At a high level, LIME attempts to understand how perturbations in a model’s inputs affect the end prediction of the model. Since it makes no assumptions about how the model reaches the prediction, it can be used with any model architecture, hence the “model-agnostic” part of LIME. The LIME approach takes a single prediction and perturbs the inputs around that data point. It then fits a linear model to the perturbed inputs and the model’s outputs, where the coefficients are the feature importances for this local prediction.
Additional resources

Model Explainability
A deeper dive into global, cohort and local explainability across ML lifecycle including LIME

LIME Paper
Explaining the predictions of any classifier
Logarithmic Loss
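Logarithmic loss (log loss) tracks incorrect labelling of the data class by the model and penalizes the model when its predicted probabilities deviate from the true labels, with confident but wrong predictions penalized most heavily. Low log loss values generally correspond to high accuracy values.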
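A minimal sketch of binary log loss in Python, using the standard formula of the negative average of y x ln(p) + (1 - y) x ln(1 - p). The function name, the clipping constant `eps`, and the sample probabilities are assumptions for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0]
confident_and_right = log_loss(labels, [0.9, 0.1])
confident_and_wrong = log_loss(labels, [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

The confidently wrong predictions produce a much larger loss than the confidently right ones.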
MAE
Mean Absolute Error is a regression loss measure that takes the absolute value of the difference between a model’s predictions and ground truth, averaged across the dataset. Unlike MSE, MAE is weighted on a linear scale and therefore doesn’t put as much weight on outliers. This provides a more even measure of performance, but it means each error counts only in proportion to its size, so large errors are not penalized more heavily than small ones. This is something to consider depending on your specific model use case.
MAPE
Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE. MAPE measures the average magnitude of error produced by a model, or how far off predictions are on average. See MAE for considerations when using this metric.
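MAE and MAPE can be sketched side by side in Python. The function names and sample values are assumptions for illustration, and the MAPE sketch assumes that no actual value is zero.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actuals and predictions."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute error expressed as a fraction of each actual value."""
    return sum(abs(a - p) / abs(a) for a, p in zip(y_true, y_pred)) / len(y_true)

actuals = [100, 200, 300, 400]
predictions = [110, 190, 330, 360]

print(mae(actuals, predictions))            # 22.5
print(f"{mape(actuals, predictions):.2%}")  # 8.75%
```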
Model Performance
The performance of a machine learning model indicates its usability and ability to provide accurate results. Performance is usually measured in terms of metrics that apply to the specific type of machine learning model concerned. Here are some common metrics used according to the type of model:
 Regression-based machine learning models – MSPE, MSAE, R Squared and Adjusted R Squared
 Classification – Precision-Recall, ROC-AUC, Accuracy, log loss
 Unsupervised models – Rand index, Mutual information
Additional resources

Performance Monitoring
The playbook for monitoring model performance in production

Monitoring Model Performance
The only 3 ML tools you need
Model Store
A machine learning infrastructure tool that serves as a central model registry and tracks experiments. Think of it as the library or catalog of your models. Model stores are used to:
 Serve as a central repository of all models and model versions
 Allow for reproducibility of every model version
 Track the lineage and history of models
Monitor Threshold
Monitor threshold refers to the value set for a model monitor, beyond which the model’s monitoring status will be triggered accordingly. The threshold value can be set on any specific performance metric such as accuracy, MSE, MAPE, etc.
MSE
Mean Squared Error is a regression loss measure. The MSE is measured as the difference between the model’s predictions and ground truth, squared and averaged across the dataset. It is used to check how close the predicted values are to the actual values. As with RMSE, a lower value indicates a better fit, and it heavily penalizes large errors or outliers.
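The effect of squaring can be seen in a small Python sketch that compares two sets of predictions with the same total absolute error. RMSE is included as the square root of MSE; the function names and values are assumptions for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between actuals and predictions."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

actuals = [100, 200, 300, 400]
small_errors = [110, 190, 310, 390]  # every prediction is off by 10
one_outlier = [100, 200, 300, 440]   # a single prediction is off by 40

print(mse(actuals, small_errors), rmse(actuals, small_errors))  # 100.0 10.0
print(mse(actuals, one_outlier), rmse(actuals, one_outlier))    # 400.0 20.0
```

Both prediction sets have a total absolute error of 40, yet the set with a single large miss scores four times worse on MSE.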
Natural Language Processing (NLP)
Natural language processing (NLP) models work with text. The inputs to these models are typically sentences, such as: “This definition is so informative.” These inputs are broken up into tokens: “This” “definition” “is” “so” “informative.” Most commonly, a classification model runs on top of NLP.
Performance Impact Score
Performance impact score is a measure of how much worse your metric of interest is on the slice compared to the average. Sorting by performance impact score enables you to narrow in on a slice impacting performance (e.g., accuracy).
Additional resources

What Is ML Performance Tracing?
See definitions and real-world examples of ML performance tracing and performance impact score.
Performance Slice
A performance slice is a subset of model values of interest in performance analysis and troubleshooting. Slices can be formed from any model dimension, including specific periods of time, set of features, etc. Performance slice analysis is useful when the goal is to understand or troubleshoot a cohort of interest, such as with bias detection, where the generalized dataset might mask statistical nuances.
Population Stability Index (PSI)
Population Stability Index measures the magnitude by which a variable’s distribution has changed or shifted between two samples over a given period of time. PSI is calculated by summing over each bin (or category) of the variable:
PSI = Σ (% Actual - % Expected) x ln(% Actual / % Expected)
The larger the PSI, the less similar your distributions are, which allows you to set up thresholding alerts on the drift in your distributions. PSI is a great metric for both numeric and categorical features where distributions are fairly stable.
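A minimal Python sketch of PSI over pre-binned distributions, summing the per-bin terms. The function name `psi`, the small floor `eps` used to avoid dividing by zero for empty bins, and the sample percentages are assumptions for illustration.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over matching bins of two distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to keep the log finite
        total += (a - e) * math.log(a / e)
    return total

expected = [0.25, 0.25, 0.25, 0.25]  # hypothetical baseline bin percentages
actual = [0.10, 0.20, 0.30, 0.40]    # hypothetical production bin percentages

print(round(psi(expected, actual), 3))  # roughly 0.228
```

Identical distributions score 0, and the score grows as the bins diverge, which is what makes a fixed alerting threshold workable.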
Additional resources

Using Statistical Distances for Machine Learning
A white paper on commonly used statistical distance metrics, including PSI
PR Curve
The Precision-Recall (PR) curve plots precision against recall at different cutoff (threshold) values, with the cutoff values being set according to the particular model.
Precision
Precision is the fraction of values that actually belong to a positive class out of all the values which were predicted to belong to that class.
precision = true positives / (true positives + false positives)
Example: There are 100 credit card transactions; 80 transactions are legitimate (positive class) and 20 transactions are fraudulent. If your model predicts that 85 transactions are legitimate, including all 80 that really are, its precision is:
94.12% = 80 true positives / (80 true positives + 5 false positives)
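The same example can be reproduced from label lists in a short Python sketch. The function name `precision` and the label strings are assumptions for illustration, and the data assumes all 80 legitimate transactions are among the 85 predicted as legitimate.

```python
def precision(y_true, y_pred, positive):
    """True positives divided by everything predicted as positive."""
    actual_for_predicted = [a for a, p in zip(y_true, y_pred) if p == positive]
    if not actual_for_predicted:
        return 0.0  # no positive predictions were made
    true_positives = sum(1 for a in actual_for_predicted if a == positive)
    return true_positives / len(actual_for_predicted)

# 80 legitimate (positive class) and 20 fraudulent transactions.
y_true = ["legit"] * 80 + ["fraud"] * 20
# The model predicts 85 legitimate: the 80 real ones plus 5 fraudulent ones.
y_pred = ["legit"] * 85 + ["fraud"] * 15

value = precision(y_true, y_pred, positive="legit")
print(f"{value:.2%}")  # 94.12%
```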
Additional resources

Monitoring Model Performance
Playbook for performance analysis of production models
Prediction Drift Impact
The product of feature importance and drift (population stability index — PSI).
Additional resources

Model Drift
Learn what constitutes model drift, how to monitor for drift in machine learning models, and drift resolution techniques.

Using Statistical Distances for Machine Learning
Use cases for statistical distance checks across model inputs, model outputs and actuals.

Take My Drift Away
How to troubleshoot model drift.

When I Drift, You Drift, We Drift
The difference between concept, data, and model drift.
Quantile
Quantiles are the points dividing the range of a probability distribution into intervals with equal probabilities.
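For a quick illustration, quartiles (the cut points that split data into four equal-probability intervals) can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sample data is an assumption for illustration.

```python
import statistics

data = list(range(1, 12))  # the integers 1 through 11

# n=4 returns the three cut points that divide the data into quartiles.
quartiles = statistics.quantiles(data, n=4)
print(quartiles)  # [3.0, 6.0, 9.0]
```

The middle cut point is the median; requesting `n=10` or `n=100` would return decile or percentile cut points instead.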
Additional resources

Statistical Distances for ML
A white paper on commonly used statistical distance metrics
Recall
Recall is the fraction of values correctly predicted as positive out of all the values that truly belong to the positive class (true positives plus false negatives).
recall = true positives / (true positives + false negatives)
Example: There are 100 credit card transactions; 90 transactions are legitimate (positive class) and 10 transactions are fraudulent. If your model predicts that 80 transactions are legitimate and 20 transactions are fraudulent, and all 80 of the legitimate predictions are correct, its recall is:
88.89% = 80 true positives / (80 true positives + 10 false negatives)
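A matching Python sketch for recall, built from label lists. The function name `recall` and the label strings are assumptions for illustration, and the data assumes every transaction predicted as legitimate really is legitimate.

```python
def recall(y_true, y_pred, positive):
    """True positives divided by everything that is actually positive."""
    predicted_for_actual = [p for a, p in zip(y_true, y_pred) if a == positive]
    if not predicted_for_actual:
        return 0.0  # the data contains no actual positives
    true_positives = sum(1 for p in predicted_for_actual if p == positive)
    return true_positives / len(predicted_for_actual)

# 90 legitimate (positive class) and 10 fraudulent transactions.
y_true = ["legit"] * 90 + ["fraud"] * 10
# The model predicts 80 legitimate and 20 fraudulent.
y_pred = ["legit"] * 80 + ["fraud"] * 20

value = recall(y_true, y_pred, positive="legit")
print(f"{value:.2%}")  # 88.89%
```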
Additional resources

Monitoring Model Performance
The playbook for using recall and other model metrics to monitor your model's performance in production

ML Observability 101 Ebook
Best practices for observing ML models in production
Regression
Regression analysis is a fundamental concept in data science and machine learning. It helps quantify the relationship between the inputs into a model and its outputs. Essentially, it is an estimation of how a set of independent variables affects a dependent variable.
RMSE
Root Mean Square Error (also known as root mean square deviation, RMSD) is a measure of the average magnitude of error in quantitative data predictions. It can be thought of as the normalized distance between the vector of predicted values and the vector of observed (or actual) values.
Because errors are squared before they are averaged, this measure gives higher weight to large errors and is therefore useful in cases where you want to penalize models accordingly.
ROC – AUC
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. Area Under the Curve (AUC) is an aggregate measure of performance across all possible classification thresholds. Together, ROC – AUC represents the degree of separability, or how much a model is capable of distinguishing between classes.
The higher the AUC (i.e. closer to 1), the better the model is at predicting 0 class as 0, and 1 class as 1.
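One way to compute AUC without drawing the curve uses the equivalent pairwise view: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below takes that approach; the function name `roc_auc` and the sample scores are assumptions for illustration, and the pairwise loop is only practical for small datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 0.75
```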
Additional resources

What Is AUC?
A quick-and-intuitive explanation of AUC that includes GIFs animating how the ROC curve is constructed.
Score Models
Score models generate a numeric value as their prediction or output. For example, the likelihood that an input belongs to a category.
Sensitivity
Sensitivity is the fraction of actual positive cases that a model correctly predicts as positive. It is also called the true positive rate.
sensitivity = true positives / (true positives + false negatives)
Shadow Deployment
A method of testing a candidate model for production where production data runs through the model without the model actually returning predictions to the service or customers. Essentially, it simulates how the model would perform in the production environment.
Additional resources

Move Fast Without Breaking Things In ML
An overview on managing change, including use of shadow deployments
SHAP
SHAP stands for “Shapley Additive Explanations”, a concept derived from game theory and used to explain the output of machine learning models (see definition of ‘Explainability’). SHAP values help interpret how much a given feature or input contributes, positively or negatively, to the target outcome or prediction. See ‘Feature Importance’.
Additional resources

Model Explainability
A primer on global, cohort, and local explainability

Using Explainability
How to use SHAP with Arize
Specificity
Specificity is the fraction of values correctly predicted as negative out of all the values that truly belong to the negative class (true negatives plus false positives). This measure is similar to recall, but describes how well the model correctly predicts negative values. It is also called the true negative rate.
specificity = true negatives / (true negatives + false positives)
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent (negative class). If your model flags only 5 of the 10 fraudulent transactions as fraudulent and predicts the other 5 as legitimate, its specificity is:
50% = 5 true negatives / (5 true negatives + 5 false positives)
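Sensitivity and specificity are both simple ratios over confusion-matrix counts, as this Python sketch shows. The function names and the counts are illustrative assumptions, not figures taken from a real model.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: share of actual positives predicted as positive."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """True negative rate: share of actual negatives predicted as negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts: 90 actual positives and 10 actual negatives.
counts = {"tp": 80, "fn": 10, "tn": 5, "fp": 5}

sens = sensitivity(counts["tp"], counts["fn"])
spec = specificity(counts["tn"], counts["fp"])
print(f"{sens:.2%} {spec:.0%}")  # 88.89% 50%
```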
Tabular Data
Data in a table format, with columns and rows. Inputs of the model are in a table format (e.g., an Excel spreadsheet), where columns might be feature inputs (e.g., city, state, charge amount). NLP and image inputs do not fit in an Excel sheet, since the inputs are sentences or images.
Tags
Often used alongside model features, tags enable metadata support for slicing and cohorting. They are a convenient workaround to analyze groups of metadata that are important, but that an ML team might not want to send as an input to a model. In other words, tags avoid conflating two separate entities — features and other metadata — while empowering deep model analysis across cohorts of any kind.
True Negative
When a model correctly predicts the negative class for a value that belongs to the negative class.
Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a legitimate transaction.
Additional resources

Best Practices In ML Observability for Click-Through Rate Models
Real-world use cases: click-through rate (CTR) models

Best Practices in ML Observability for Monitoring, Mitigating and Preventing Fraud
Real-world use cases: fraud models
True Positive
When a model correctly predicts the positive class for a value that belongs to the positive class.
Example: A model flags a credit card transaction as ‘fraud’ when it is actually a fraudulent transaction.
Additional resources

Best Practices for Monitoring Fraud Models
Real-world use cases: fraud models

Best Practices for Monitoring Click Through Rate Models
Real-world use cases: click-through rate (CTR) models