
ML Monitoring & Data Science Glossary

Learn more about data science and machine learning (ML) monitoring with this glossary from Arize. Find definitions of common terms below.


Accuracy

Accuracy is the share of a model’s predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions: accuracy = correct predictions / all predictions

Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent. If your model predicts that 95 transactions are legitimate (all 90 legitimate ones plus 5 fraudulent ones) and correctly flags the other 5 as fraudulent, its accuracy is:
95% = (90 correct legitimate + 5 correct fraudulent) / 100 transactions
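
The same calculation can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the `accuracy` helper and the label strings are assumptions made for the example.

```python
def accuracy(actuals, predictions):
    """Fraction of predictions that match the actual labels."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    correct = sum(a == p for a, p in zip(actuals, predictions))
    return correct / len(actuals)


# 90 legitimate and 10 fraudulent transactions (hypothetical labels).
actuals = ["legit"] * 90 + ["fraud"] * 10
# All 90 legitimate ones are predicted correctly; 5 frauds are missed, 5 are caught.
predictions = ["legit"] * 90 + ["legit"] * 5 + ["fraud"] * 5

print(accuracy(actuals, predictions))  # 0.95
```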



Baseline

A baseline is the reference data or benchmark against which model performance is compared for monitoring purposes. A baseline can be training data, validation data, a prior period of production data, or a previous model version, among others.


Baseline Distribution

A baseline distribution refers to a model dataset used as a reference or comparison to the model’s current (production) distribution. Baseline distributions in Arize AI’s platform can be training datasets, validation datasets, or prior time periods of production.


Binary Classification Model

Binary classification refers to machine learning classification tasks that have exactly two class labels. In general, binary classification involves one “positive” and one “negative” class.


Binning

Binning is a way to group a number of continuous values together into smaller cohorts or “bins”. The technique helps reduce the cardinality of data by representing the points as intervals, for example age ranges.
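
As a rough sketch of the idea, the snippet below groups ages into ranges using Python’s standard `bisect` module. The bin edges and labels are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical bin edges: each edge is the lower bound of the next bin.
AGE_EDGES = [18, 30, 45, 65]
AGE_LABELS = ["<18", "18-29", "30-44", "45-64", "65+"]


def bin_age(age):
    """Map a continuous age value to the label of the bin it falls into."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [12, 18, 29, 30, 52, 70]
binned = [bin_age(a) for a in ages]
print(binned)  # ['<18', '18-29', '18-29', '30-44', '45-64', '65+']
```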


Canary Deployment

A method of testing a new model or model version where only a small subset of production data flows through this model to verify response performance before making a complete cutover. This technique allows for deeper analysis and understanding of model behavior and can minimize risk that regressions severely impact the business or customers.


Classification Model

Classification models are used to predict categories or assign a class label. Any given data is classified into a set of categories or groups to determine its further use or for processing needs.

Data that can be classified into one of exactly two categories is known as binary data. For example, fraud or not fraud, male or female.

Data that can be classified into more than two categories or groups is known as multi-class data. For example, education level or household income bracket.

Confusion Matrix

A confusion matrix summarizes the prediction results of a classification problem. It shows the count of correct and incorrect predictions for each class (see True Positive, True Negative, False Positive, False Negative). By laying out every possible outcome, the confusion matrix shows the ways your classification model gets confused when making predictions. It helps identify the errors, and the types of errors, made by the model, which in turn helps improve the classification model.
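
For a binary problem, the four cells of the matrix can be counted directly. The sketch below is illustrative only: the `confusion_counts` helper and the fraud/legit labels are assumptions, with ‘fraud’ treated as the positive class.

```python
from collections import Counter


def confusion_counts(actuals, predictions, positive):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for actual, predicted in zip(actuals, predictions):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "TN", "FN")}


# Hypothetical data: 10 fraudulent and 90 legitimate transactions.
actuals = ["fraud"] * 10 + ["legit"] * 90
# The model catches 6 frauds, misses 4, and wrongly flags 3 legitimate ones.
predictions = ["fraud"] * 6 + ["legit"] * 4 + ["fraud"] * 3 + ["legit"] * 87

matrix = confusion_counts(actuals, predictions, positive="fraud")
print(matrix)  # {'TP': 6, 'FP': 3, 'TN': 87, 'FN': 4}
```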


Current Distribution

Current distribution refers to the statistical distribution, or shape, of the dataset being generated by a machine learning model in production. The distribution of a dataset is represented as a function that shows the relationships between the various observations, and is usually presented visually as a curve or graph.

Data Quality

Data quality refers to the integrity and consistency of the datasets used. In monitoring machine learning performance, data quality measures include attributes such as missingness, out-of-range values, P1 and P99 (the 1st and 99th percentile values), and type mismatches, among others.

Deep Learning

Machine learning is a subset of AI, and it consists of the techniques that enable computers to figure things out from the data and deliver AI applications. Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems inspired by the human brain’s network of neurons.


Deep Learning Model

A deep learning model normally refers to a neural network, typically with more than two layers. Deep learning is usually used for computationally dense tasks like computer vision (images) and natural language processing.



Drift

Drift is the change in data over time. It also covers changes in the properties of the target variable over time due to unpredictable or unforeseen circumstances.

  • Data drift is a change in the distribution of data between the real-time (production) data and the baseline data that was set beforehand.
  • Concept drift is a change in the relationship between a model’s inputs and its outputs.

Drift can take any form: it can be gradual, recurring, or sudden, and it can be positive or negative. Because changes in data over time can affect model outcomes, drift is important to monitor when tracking model performance.


Embedding

In natural language processing (see definition of ‘Natural Language Processing’), an embedding is a representation of words for text analysis, typically a real-valued vector that encodes the meaning of a word such that words that are closer together in the vector space are expected to be similar in meaning.
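
One common way to measure closeness in the vector space is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are produced by a trained model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not the output of any real model.
embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["banana"])
print(related > unrelated)  # True
```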


Evaluation (Inference) Store

A machine learning infrastructure tool used to monitor and improve model performance. Think of it as the ledger or log of model activities/inferences. Evaluation stores are used to:

  • Surface performance metrics in aggregate (or by slice) for any model, in any environment: production, validation, training
  • Monitor and identify drift, data quality issues, or anomalous performance degradations using baselines
  • Enable teams to connect changes in performance to why they occurred
  • Provide a platform to help deliver models continuously with high quality and feedback loops for improvement — compare production to training
  • Provide an experimentation platform to A/B test model versions

Evaluation Metric

The way the performance of a predictive model is quantified and calculated is known as the evaluation metric. It is used to evaluate the accuracy and the performance of the model used.

Evaluation Window

The evaluation window is a plot of the metric being calculated over a period of time, for instance, the previous 30 days. Any evaluation metric that can be calculated over a period of time can be visualized as an evaluation window.


Explainability

The extent to which the internal mechanics of a machine learning model can be explained in human-understandable terms. Put simply, it is the process of explaining the reasons behind a model’s outputs.
See ‘SHAP’.


F-score

F-score combines precision and recall into a single measure for a better understanding of the accuracy of the model. The F1 score is the harmonic mean of precision and recall. The score can be adjusted, commonly to F0.5, F1, or F2, based on the weight given to precision relative to recall: F0.5 weights precision more heavily, F1 weights both equally, and F2 weights recall more heavily.
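
A minimal sketch of the general F-beta formula, F = (1 + beta²) x precision x recall / (beta² x precision + recall), is shown below. The `f_beta` function name and the sample precision and recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# A hypothetical model with high precision but modest recall.
precision, recall = 0.9, 0.5

f_half = f_beta(precision, recall, beta=0.5)  # rewards the high precision
f_one = f_beta(precision, recall, beta=1.0)   # harmonic mean of the two
f_two = f_beta(precision, recall, beta=2.0)   # penalizes the low recall

print(f_half > f_one > f_two)  # True
```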


False Negative

When a model mistakenly predicts a negative class, when the value belongs to the positive class.

Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a fraudulent transaction


False Positive

When a model mistakenly predicts a positive class, when the value belongs to the negative class.

Example: A model flags a credit card transaction as ‘fraud’ when it was not actually a fraudulent transaction.


Feature Importance

Feature importance is a compilation of a class of techniques that take in all the features related to making a model prediction and assign a certain score to each feature to weigh how much or how little it impacted the outcome. These scores can then be used to better understand the internal logic of a model, make necessary changes to the model to improve its accuracy, and also reduce unnecessary inputs.


Feature Performance Heat Map

A feature performance heat map is a visual representation of the performance of each feature in a given model. It enables users to quickly see slices of performance or features that perform significantly better or worse than others for faster triangulation of issues. Heat maps are especially useful when troubleshooting.


Feature Store

A machine learning infrastructure tool that handles offline and online feature transformations. Think of it as the interface between your models and data. Feature stores are used to:

  • Serve as the central source for feature transformations
  • Allow for the same feature transformations to be used in both offline training and online serving
  • Enable team members to share their transformations for experimentation
  • Provide a strong versioning for feature transformation code

JS Distance

Jensen-Shannon (JS) distance is a symmetric measure derived from KL divergence, and it is used to measure drift. Unlike KL divergence, it is a true metric, and it is bounded above by the square root of ln(2). For two distributions P and Q, the formula for JS distance is shown below. Use JS distance to compare distributions with low variance.

JS distance(P, Q) = sqrt( (KL(P || M) + KL(Q || M)) / 2 ), where M = (P + Q) / 2
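
As a hedged sketch, JS distance can be computed from KL divergence for two discrete distributions over the same bins. The function names are illustrative, natural logarithms are assumed (which gives the sqrt(ln 2) upper bound), and each input is assumed to be a list of probabilities summing to 1.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between P and Q."""
    # M is the pointwise average of the two distributions.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt((kl_divergence(p, m) + kl_divergence(q, m)) / 2)


baseline = [0.5, 0.3, 0.2]    # hypothetical binned baseline distribution
production = [0.2, 0.3, 0.5]  # hypothetical binned production distribution

print(js_distance(baseline, production))
```

Because each distribution is compared against the average M, the result is the same whichever distribution is passed first, and distributions with no overlap reach the upper bound.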


KL Divergence

The Kullback-Leibler divergence (KL divergence) measures how much one probability distribution differs from a reference probability distribution. KL divergence is sometimes referred to as ‘relative entropy’ and is best used when one distribution has a much smaller sample size and a large variance.

Kolmogorov Smirnov Test

Useful in drift monitoring, the Kolmogorov Smirnov test (KS test) is a widely used nonparametric technique for comparing a sample with a baseline or reference probability distribution. The KS test is an efficient drift metric for measuring whether two distributions differ significantly from one another. The Kolmogorov Smirnov statistic quantifies the maximum distance between two cumulative distribution functions.
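
The statistic itself can be sketched for two samples by comparing their empirical cumulative distribution functions. This is a simplified illustration with a hypothetical `ks_statistic` helper; it computes only the statistic, not the p-value a full KS test would report.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so check those.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


baseline = [1, 2, 3, 4, 5]
shifted = [6, 7, 8, 9, 10]

print(ks_statistic(baseline, baseline))  # 0.0
print(ks_statistic(baseline, shifted))   # 1.0
```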



LIME

LIME, or “Local Interpretable Model-Agnostic Explanations,” is an explainability method that attempts to provide local ML explainability. At a high level, LIME attempts to understand how perturbations in a model’s inputs affect the end-prediction of the model. Since it makes no assumptions about how the model reaches the prediction, it can be used with any model architecture, hence the “model-agnostic” part of LIME. The LIME approach takes a single prediction and perturbs the input values around it. It then fits a linear model to the perturbed inputs and the resulting predictions, where the coefficients are the feature importances for this local prediction.

Additional resources
  • Model Explainability

    A deeper dive into global, cohort and local explainability across ML lifecycle including LIME

  • LIME Paper

    Explaining the predictions of any classifier


Logarithmic Loss

Logarithmic loss (log loss) measures how far a classification model’s predicted probabilities are from the actual class labels, and it penalizes confident predictions that turn out to be wrong most heavily. Lower log loss values generally correspond to more accurate models.
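
For a binary classifier, log loss is the average of -(y x ln(p) + (1 - y) x ln(1 - p)), where y is the actual label (0 or 1) and p is the predicted probability of class 1. The sketch below is illustrative; the function name and the clipping constant `eps` are assumptions.

```python
import math


def log_loss(actuals, probabilities, eps=1e-15):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actuals, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actuals)


actuals = [1, 0]
confident_and_right = log_loss(actuals, [0.9, 0.1])
confident_and_wrong = log_loss(actuals, [0.1, 0.9])

print(confident_and_right < confident_and_wrong)  # True
```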


MAE

Mean Absolute Error is a regression loss measure that looks at the absolute difference between a model’s predictions and ground truth, averaged across the dataset. Unlike MSE, MAE is weighted on a linear scale and therefore doesn’t put as much weight on outliers. This provides a more even measure of performance, but it means large errors are penalized only in proportion to their size, which is something to consider depending on your specific model use case.
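
The difference in how MAE and MSE treat outliers can be seen in a small sketch. The helper functions and the sample values below are hypothetical.

```python
def mean_absolute_error(actuals, predictions):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(p - a) for a, p in zip(actuals, predictions)) / len(actuals)


def mean_squared_error(actuals, predictions):
    """Average squared difference between predictions and ground truth."""
    return sum((p - a) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)


actuals = [10, 10, 10, 10]
evenly_off = [11, 11, 11, 11]   # every prediction is off by 1
one_outlier = [10, 10, 10, 14]  # a single prediction is off by 4

# MAE is identical for both sets of predictions...
print(mean_absolute_error(actuals, evenly_off))   # 1.0
print(mean_absolute_error(actuals, one_outlier))  # 1.0
# ...while MSE is four times larger when the error sits in one outlier.
print(mean_squared_error(actuals, evenly_off))    # 1.0
print(mean_squared_error(actuals, one_outlier))   # 4.0
```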



MAPE

Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE. MAPE measures the average magnitude of error produced by a model, expressed as a percentage of the actual values, or how far off predictions are on average. See MAE for considerations when using this metric.
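
A minimal sketch of the calculation follows. The `mape` helper and the sample values are assumptions; note that the metric is undefined when an actual value is zero, which this sketch does not handle.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error; assumes no actual value is zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actuals, predictions)]
    return 100 * sum(ratios) / len(ratios)


actuals = [100, 200, 50]
predictions = [110, 180, 55]  # each prediction is off by 10% of the actual

print(round(mape(actuals, predictions), 2))  # 10.0
```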


Model Performance

The performance of a machine learning model indicates its usability and its ability to provide accurate results. Performance is usually measured with metrics that apply to the specific type of model concerned. Here are some common metrics by model type:

  • Regression models – MSPE, MSAE, R Squared and Adjusted R Squared
  • Classification models – Precision-Recall, ROC-AUC, Accuracy, Log Loss
  • Unsupervised models – Rand index, Mutual information

Model Store

A machine learning infrastructure tool that serves as a central model registry and tracks experiments. Think of it as the library or catalog of your models. Model stores are used to:

  • Serve as a central repository of all models and model versions
  • Allow for reproducibility of every model version
  • Track the lineage and history of models

Monitor Threshold

Monitor threshold refers to the value set for a model monitor, beyond which the model’s monitoring status will be triggered accordingly. The threshold value can be set on any specific performance metric such as accuracy, MSE, MAPE, etc.



MSE

Mean Square Error is a regression loss measure. The MSE is calculated by taking the difference between the model’s predictions and ground truth, squaring it, and averaging across the dataset. It is used to check how close the predicted values are to the actual values. As with RMSE, a lower value indicates a better fit, and it heavily penalizes large errors or outliers.

MSE = (1 / n) x Σ (predicted value – actual value)², summed over all n observations

Natural Language Processing (NLP)

Natural language processing (NLP) refers to models that work with human language. The inputs to these models are typically sentences, such as: “This definition is so informative.” These inputs are broken up into tokens: “This” “definition” “is” “so” “informative.” Most commonly, a classification model runs on top of NLP.

Performance Impact Score

Performance impact score is a measure of how much worse your metric of interest is on the slice compared to the average. Sorting by performance impact score enables you to narrow in on a slice that is hurting performance (e.g. accuracy).

Performance Slice

A performance slice is a subset of model values of interest in performance analysis and troubleshooting. Slices can be formed from any model dimension, including specific periods of time, set of features, etc. Performance slice analysis is useful when the goal is to understand or troubleshoot a cohort of interest, such as with bias detection, where the generalized dataset might mask statistical nuances.


Population Stability Index (PSI)

Population Stability Index measures how much a variable’s distribution has shifted between two samples over a given period of time. PSI is calculated by summing over every bin (or category) of the variable: PSI = Σ (% Actual – % Expected) x ln(% Actual / % Expected)

The larger the PSI, the less similar your distributions are, which allows you to set up thresholding alerts on the drift in your distributions. PSI is a great metric for both numeric and categorical features where distributions are fairly stable.
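
As a sketch under stated assumptions, PSI can be computed from two lists of per-bin proportions (each summing to 1) by summing the per-bin terms of the formula. The `psi` function name, the sample proportions, and the small `eps` floor used to avoid taking the log of zero for empty bins are all illustrative choices.

```python
import math


def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index summed over matching bins."""
    total = 0.0
    for expected, actual in zip(expected_pct, actual_pct):
        # Floor empty bins at a tiny value so the logarithm is defined.
        expected = max(expected, eps)
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


baseline = [0.25, 0.25, 0.50]      # hypothetical share of rows per bin
stable = [0.25, 0.25, 0.50]
drifted = [0.60, 0.30, 0.10]

print(psi(baseline, stable))       # 0.0
print(psi(baseline, drifted) > 0)  # True
```

Each term in the sum is non-negative, since the difference and the log ratio always share the same sign, so any shift in a bin increases the total.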


PR Curve

The Precision-Recall curve plots precision against recall at different cut-off values (classification thresholds), showing the trade-off between the two. The cut-off values are set according to the particular model.


Precision

Precision is the fraction of values that actually belong to the positive class out of all the values that were predicted to belong to that class: precision = true positives / (true positives + false positives)

Example: There are 100 credit card transactions; 80 transactions are legitimate (positive class) and 20 transactions are fraudulent. If your model predicts that 85 transactions are legitimate, including all 80 that actually are, its precision is: 94.12% = 80 true positives / (80 true positives + 5 false positives)
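
The example can be reproduced with a short sketch. The `precision` helper and label strings are assumptions, and the predictions are constructed so that all 80 legitimate transactions plus 5 fraudulent ones are predicted legitimate.

```python
def precision(actuals, predictions, positive):
    """True positives divided by all predicted positives."""
    tp = sum(a == positive and p == positive for a, p in zip(actuals, predictions))
    fp = sum(a != positive and p == positive for a, p in zip(actuals, predictions))
    return tp / (tp + fp)


# 80 legitimate (positive class) and 20 fraudulent transactions.
actuals = ["legit"] * 80 + ["fraud"] * 20
# 85 predicted legitimate: the 80 real ones plus 5 frauds that slip through.
predictions = ["legit"] * 80 + ["legit"] * 5 + ["fraud"] * 15

print(round(precision(actuals, predictions, positive="legit") * 100, 2))  # 94.12
```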


Prediction Drift Impact

The product of feature importance and drift (population stability index — PSI).



Quantile

Quantiles are the points dividing the range of a probability distribution into intervals with equal probabilities.
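
For example, quartiles are the three cut points that split data into four equally likely intervals. The sketch below uses `statistics.quantiles` from the Python standard library (available in Python 3.8 and later) on hypothetical data, then counts how many values land in each interval.

```python
import statistics

values = list(range(1, 101))  # hypothetical data: the integers 1 through 100

# n=4 asks for quartiles, which returns the three cut points between intervals.
cut_points = statistics.quantiles(values, n=4)

# Count how many values fall into each of the four resulting intervals.
edges = [float("-inf")] + cut_points + [float("inf")]
counts = [sum(low < v <= high for v in values) for low, high in zip(edges, edges[1:])]

print(cut_points)
print(counts)  # [25, 25, 25, 25]
```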



Recall

Recall is the fraction of values correctly predicted to be of the positive class out of all the values that truly belong to the positive class (including false negatives): recall = true positives / (true positives + false negatives)
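
A minimal sketch of the calculation, using a hypothetical `recall` helper and made-up fraud data in which ‘fraud’ is the positive class:

```python
def recall(actuals, predictions, positive):
    """True positives divided by all actual positives."""
    tp = sum(a == positive and p == positive for a, p in zip(actuals, predictions))
    fn = sum(a == positive and p != positive for a, p in zip(actuals, predictions))
    return tp / (tp + fn)


# 10 fraudulent (positive class) and 90 legitimate transactions.
actuals = ["fraud"] * 10 + ["legit"] * 90
# The model catches 6 of the 10 frauds and misses the other 4.
predictions = ["fraud"] * 6 + ["legit"] * 4 + ["legit"] * 90

print(recall(actuals, predictions, positive="fraud"))  # 0.6
```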



Regression

Regression analysis is a fundamental concept in data science and machine learning. It helps quantify the relationship between the inputs into a model and its outputs. Essentially, it is an estimation of how a set of independent variables affects a dependent variable.


RMSE

Root Mean Square Error (also known as root mean square deviation, RMSD) is a measure of the average magnitude of error in quantitative data predictions. It can be thought of as the normalized distance between the vector of predicted values and the vector of observed (or actual) values.

Because errors are squared before they are averaged, this measure gives higher weight to large errors and is therefore useful in cases where you want to penalize models accordingly.

RMSE = sqrt( (1 / n) x Σ (predicted value – actual value)² ), summed over all n observations
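
The “normalized distance” reading can be checked with a small sketch: RMSE equals the Euclidean distance between the predicted and actual vectors divided by the square root of the number of observations. The `rmse` helper and sample values are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math


def rmse(actuals, predictions):
    """Square root of the mean squared difference."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actuals = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 7.0, 13.0]  # errors of -1, 0, 0, and +4

# Euclidean distance between the two vectors, normalized by sqrt(n).
normalized_distance = math.dist(predictions, actuals) / math.sqrt(len(actuals))

print(math.isclose(rmse(actuals, predictions), normalized_distance))  # True
```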


ROC-AUC

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. Area Under the Curve (AUC) is an aggregate measure of performance across all possible classification thresholds. Together, ROC-AUC represents the degree of separability, or how well a model is capable of distinguishing between classes.

The higher the AUC (i.e. the closer to 1), the better the model is at predicting the 0 class as 0 and the 1 class as 1.
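
One way to build intuition for separability: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC through that pairwise comparison rather than by tracing the curve; the `roc_auc` helper and sample scores are illustrative, and the approach is only practical for small datasets.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(labels, scores))  # 0.75
```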

Additional resources
  • What Is AUC?

    A quick-and-intuitive explanation of AUC -- includes gifs that animate how the ROC curve is constructed.


Score Models

Score models generate a numeric value as their prediction or output, for example, the likelihood that an input belongs to a category.


Sensitivity

Sensitivity is the fraction of actual positive cases that a model correctly predicts as positive. It is also called the true positive rate or recall.
sensitivity = true positives / (true positives + false negatives)

Shadow Deployment

A method of testing a candidate model for production in which production data runs through the model without the model actually returning predictions to the service or customers. Essentially, it simulates how the model would perform in the production environment.


SHAP

SHAP stands for “Shapley Additive Explanations”, a concept derived from game theory and used to explain the output of machine learning models (see definition of ‘Explainability’). SHAP values help interpret how much a given feature or input contributes, positively or negatively, to the target outcome or prediction. See ‘Feature Importance’.


Specificity

Specificity is the fraction of values correctly predicted to be of the negative class out of all the values that truly belong to the negative class (including false positives). This measure is similar to recall, but it describes how well the model correctly predicts negative values. It is also called the true negative rate.

specificity = true negatives / (true negatives + false positives)

Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent (negative class). If your model predicts that 20 transactions are fraudulent, including all 10 that actually are, its specificity is:
100% = 10 true negatives / (10 true negatives + 0 false positives)
The 10 legitimate transactions that were predicted to be fraudulent are false negatives, so they do not affect specificity.
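
The sketch below applies the specificity formula to a hypothetical model that catches only 5 of 10 fraudulent transactions, with ‘fraud’ as the negative class. The `specificity` helper and the data are assumptions for illustration. It also shows that wrongly flagging legitimate transactions changes the false negative count but leaves specificity unchanged.

```python
def specificity(actuals, predictions, negative):
    """True negatives divided by all actual negatives."""
    tn = sum(a == negative and p == negative for a, p in zip(actuals, predictions))
    fp = sum(a == negative and p != negative for a, p in zip(actuals, predictions))
    return tn / (tn + fp)


# 90 legitimate (positive class) and 10 fraudulent (negative class) transactions.
actuals = ["legit"] * 90 + ["fraud"] * 10

# The model catches 5 frauds and lets the other 5 through as legitimate.
misses_half = ["legit"] * 90 + ["fraud"] * 5 + ["legit"] * 5
# Same as above, but it also wrongly flags 10 legitimate transactions.
also_overflags = ["fraud"] * 10 + ["legit"] * 80 + ["fraud"] * 5 + ["legit"] * 5

print(specificity(actuals, misses_half, negative="fraud"))     # 0.5
print(specificity(actuals, also_overflags, negative="fraud"))  # 0.5
```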


Tabular Data

Data in a table format, with columns and rows. The inputs of the model are in a table format (e.g. an Excel spreadsheet), where columns might be feature inputs (e.g. city, state, charge amount). NLP and image data do not fit in an Excel sheet, since the inputs are sentences or images.


Tags

Often used alongside model features, tags enable metadata support for slicing and cohorting. They are a convenient way to analyze groups of metadata that are important, but that an ML team might not want to send as an input to a model. In other words, tags avoid conflating two separate entities (features and other metadata) while empowering deep model analysis across cohorts of any kind.

True Negative

When a model correctly predicts a negative class, when the value belongs to the negative class.

Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a legitimate transaction.


True Positive

When a model correctly predicts a positive class, when the value belongs to the positive class.

Example: A model flags a credit card transaction as ‘fraud’ when it is actually a fraudulent transaction.

