R Squared: Understanding the Coefficient of Determination

Ejiro Onose, Contributor | Published August 08, 2023

Introduction

The R-squared metric (R², or the coefficient of determination) is used to measure how well a model fits data and how well it can predict future outcomes. Simply put, it tells you how much of the variation in your data can be explained by your model. The closer the R-squared value is to one, the better your model fits the data.

In this article, we will discuss what R-squared is, the math behind the metric, and where it is and is not useful. We’ll also look at some practical examples of using the R-squared metric in machine learning. By the end of this article, you should have a better understanding of when and why to use R-squared when evaluating your machine learning models.

What Is R Squared

R-squared, also known as the coefficient of determination, is a statistical measure used in machine learning to evaluate the quality of a regression model. It measures how well the model fits the data by assessing the proportion of variance in the dependent variable explained by the independent variables.

R-squared indicates the percentage of variation in your target variable that can be explained by your independent variables. In other words, it measures how predictable the dependent variable is from its respective set of independent variables.

The R-squared value typically ranges from 0 to 1, with a value of 1 indicating a perfect fit of the model to the data, while a value of 0 indicates that the model does not explain any of the variability in the dependent variable. (On held-out data, R-squared can even be negative when a model performs worse than simply predicting the mean.)

A high R-squared value would usually indicate that the model explains a large proportion of the variability in the data, while a low R-squared value suggests that the model does not explain much of the variability.

How do you calculate the coefficient of determination?

The R-squared metric is a derived metric that is calculated from other quantities. R-squared is calculated mathematically by comparing the Sum of Squared Errors (SSE), also known as the Sum of Squared Residuals (SSR), to the Total Sum of Squares (SST). Note: in this context, SSE and SSR can be used interchangeably.

R-squared can be calculated using the following formula:
R² = 1 – (SSE/SST)

SSE is the sum of the squared differences between the actual dependent variable values and the predicted values from the regression model. It represents the variability that is not explained by the independent variables.

SST is the total variation in the dependent variable and is calculated by summing the squared differences between each actual dependent variable value and the mean of all dependent variable values.

Once SSE and SST are calculated, R-squared is determined by dividing SSE by SST and then subtracting the result from 1. The resulting value represents the proportion of the total variation in the dependent variable that is explained by the independent variables.
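To make the calculation concrete, here is a minimal NumPy sketch with made-up values; y_true and y_pred stand in for your actual and predicted values:

import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 7.1, 8.6])

sse = np.sum((y_true - y_pred) ** 2)         # variation not explained by the model
sst = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r_squared = 1 - sse / sst
print(r_squared)  # 0.985: the model explains about 98.5% of the variation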

To interpret an R-squared value, look at how close it is to 1. A value of 0 indicates that your model does not explain any of the variability of your data, while a value of 1 suggests that your model perfectly explains all of the variability in your data. Values between 0 and 1 suggest that your model is able to explain some variability in your data, but not all.

In order to decide whether a predictive model has been accurately fitted, consider the R-squared value alongside other metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Both MAE and RMSE measure how far off your predictions are from actual values. While they measure the magnitude of prediction errors rather than the proportion of variance explained, they can be used in conjunction with R-squared to make an overall assessment of a model’s quality.

By comparing different models with different coefficients and parameters, analysts can identify which model best fits their data by assessing its R-squared value. The higher the R-squared value, the more closely the model matches the observed data, making it a useful first indicator of how well a given model may predict future outcomes.

The R-squared value can also be calculated using software packages such as Python, R, or Excel.

In Python, R-squared can be computed with the scikit-learn library. Here’s an example:

from sklearn.metrics import r2_score

# y_test holds the actual values and y_pred the model's predictions
r2 = r2_score(y_test, y_pred)
print(f"R-squared value = {r2}")

Limitations and Misconceptions of the Coefficient of Determination

While the R-squared metric is a useful tool for measuring the accuracy of a machine learning model, it has some limitations. One of the main drawbacks of R-squared is that it assumes that all variables in the model are independent, which is not always the case.

Another drawback is that the metric has trouble capturing non-linear relationships and can give misleading results when working with smaller datasets.

Ultimately, R-squared is only one measure of model quality; other metrics such as Mean Absolute Error or Root Mean Square Error may be more appropriate in certain contexts.

Several prominent misconceptions exist about R-squared. The first is that a high value of R-squared implies that the regression model is useful for predicting new observations. However, this is not necessarily true, as R-squared only represents the proportion of variation in the sample data that is explained by the regression model, and may not accurately reflect the proportion of variation in the population that can be explained by the model. The accuracy of R-squared as an estimate of the population proportion is affected by the technique used to select terms for the model. If the selection process allows insignificant terms into the model, then R-squared may be biased toward high values, leading to an overfitted model that does not generalize well to new observations from the population.

The second misconception is that a low value of R-squared implies that the regression model is not useful. However, this is not necessarily true either, as a low value of R-squared may reflect the exclusion of significant terms (Type II error) from the model. In this case, R-squared may have a bias toward low values, leading to an underfitted model that does not capture important relationships in the data.

Additionally, a higher R-squared value does not always equate to better predictions; as a rule of thumb, values over 0.8 should be treated with caution.

Ultimately, the best way to use and understand R-squared is to experiment with different models and compare the results. With practice and experience, you will soon become familiar with this powerful metric and be able to leverage it for robust machine learning solutions.

Adjusted R² Definition and Formula

The adjusted R-squared metric takes into account the number of parameters used in the model. Like R-squared, it subtracts the proportion of unexplained variance from 1, but it first scales that proportion by a factor that grows with the number of predictors.

The formula for adjusted R-squared is:

Adjusted R² = 1 – (1 – R²) × (n – 1) / (n – p – 1)
where n is the number of observations and p is the number of independent variables in the model.
The adjusted R-squared can be a better measure of predictive power than R-squared because it penalizes additional parameters and discourages overfitting. With the adjusted metric, a more complex model only scores higher if its extra parameters explain enough additional variance to offset the penalty; otherwise its score drops.

This helps to identify models that have high predictive power without adding unnecessary parameters that do not contribute significantly to the explanation of variance in the data.
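Since the formula only needs R-squared, the number of observations, and the number of predictors, it is easy to compute by hand. Here is a small sketch; the sample values are made up for illustration:

def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same raw R-squared of 0.90 is penalized more as predictors are added
print(adjusted_r_squared(0.90, n=50, p=3))   # ~0.893
print(adjusted_r_squared(0.90, n=50, p=15))  # ~0.856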

In Action: Comparing Regression Models

The R-squared metric and the adjusted R-squared metric allow you to easily compare regression models and determine which one performs better.

While the R-squared metric measures the proportion of variance in the dependent variable that can be explained by the independent variables, the adjusted R-squared compensates for additional independent variables by factoring in the ratio of observations to variables. The former is calculated by comparing the sum of squared errors (SSE) to the total sum of squares (SST) and is often expressed as a percentage; a higher number indicates a better fit. The latter helps to determine whether adding more variables improves the model’s accuracy and whether the increase in explanatory power justifies the additional variables.

Both metrics are useful in gauging how well your regression models are performing so you can get an accurate representation of your data. By comparing multiple models side-by-side with both metrics, you can easily identify which model has a better fit and make informed decisions about how to use it for predictive analytics.
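As a quick illustration of a side-by-side comparison, the sketch below fits two models (a plain linear fit and a degree-5 polynomial) on synthetic data and compares their held-out R-squared scores; all data and model choices here are made up for demonstration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("linear", LinearRegression()),
    ("degree-5 polynomial", make_pipeline(PolynomialFeatures(5), LinearRegression())),
]:
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))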

Where is R Squared Most and Least Useful?

R-squared is most useful in industries that rely on regression models, such as finance, economics, marketing, and engineering.

In finance, R-squared can be used to evaluate the performance of asset pricing models. In marketing, R-squared might be used to measure the effectiveness of advertising campaigns. In engineering, R-squared can be used to evaluate the accuracy of predictive maintenance models.

The advantage of using this metric lies in its ability to measure variability: it describes how much variation is explained by the model itself compared to the base case that could have been used. This makes it invaluable for determining whether an algorithm is performing accurately and efficiently, as well as for comparing two different algorithms’ performances side-by-side.

R-squared is not ideal for certain machine learning models, such as those involving non-linear regression or time series prediction. Another metric, the root mean squared error (RMSE), can be used as an alternative in some cases.

RMSE is a measure of model accuracy that takes into account the size of the errors in predictions made by a machine learning model. It is the square root of the average squared difference between predicted and actual values and can be helpful for comparing machine learning models. Since it penalizes large errors much more than small errors, it allows for better differentiation between similarly performing models, with large error values being less favored than smaller ones. This makes it an excellent metric for scenarios where the cost of incorrect predictions needs to be considered, such as predicting electricity demand.
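To see why RMSE penalizes large errors more than MAE does, consider two made-up prediction sets with the same total absolute error, where only one contains a single large miss:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 10.0, 10.0, 10.0])
preds_even = y_true + np.array([1.0, 1.0, 1.0, 1.0])   # four small errors
preds_spike = y_true + np.array([0.0, 0.0, 0.0, 4.0])  # one large error, same MAE

for y_pred in (preds_even, preds_spike):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# MAE = 1.00, RMSE = 1.00  (evenly spread errors)
# MAE = 1.00, RMSE = 2.00  (the single large miss doubles the RMSE)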

Hands-On Exercise: Evaluating a Linear Regression Model Using the R-squared Metric

Let’s consider a simple linear regression model that predicts sales based on the money spent on different marketing platforms, and use R-squared to evaluate its performance with Arize. Check out the model here.

Prerequisites

Before we begin, make sure you have the following:

  • An Arize account. If you don’t have one, you can sign up at https://app.arize.com/auth/login.
  • Your Arize space key and API key. These can be found in your Arize account settings or API documentation.

Utilizing Arize with a Regression Model

Let’s go through the process step by step.

Step 1: Install the Arize Python Library

To get started, we need to install the Arize Python library. In your Jupyter notebook, run the following command:

!pip install arize

Import the Required Libraries

In the Jupyter notebook, import the necessary libraries. We need the arize.pandas.logger.Client class from the Arize library to log our data in Arize, along with NumPy and pandas for data handling.

import datetime

import numpy as np
import pandas as pd

from arize.pandas.logger import Client
from arize.utils.types import ModelTypes, Environments, Schema, Metrics

Load and Prepare the Data

Next, load your regression dataset into a pandas DataFrame and perform any necessary preprocessing or feature engineering steps. In this example, it is assumed that you already have a DataFrame named df containing the dataset.
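If you need a starting point, here is a minimal, hypothetical loading step; the file name advertising.csv and its columns (TV, Radio, Newspaper, Sales) are assumptions based on the columns used later in this walkthrough:

import pandas as pd

# Hypothetical file: marketing spend per channel plus the resulting Sales
df = pd.read_csv("advertising.csv")
print(df.head())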

Add Timestamps for Predictions

Arize allows you to track the timestamps of your predictions. Generate sample timestamps for each prediction using the datetime module. This step is optional but recommended. Here’s an example:

# Timestamps spanning the last 30 days, evenly spaced across the rows
current_time = datetime.datetime.now().timestamp()
earlier_time = (datetime.datetime.now() - datetime.timedelta(days=30)).timestamp()
optional_prediction_timestamps = np.linspace(earlier_time, current_time, num=df.shape[0])
df.insert(0, "prediction_ts", optional_prediction_timestamps.astype(int))

Build the Regression Model

After loading and preparing the data, the next step is to build the regression model. To build the model:

# Separate the features and the target (prediction_ts is a timestamp, not a feature)
x = df.drop(['Sales', 'prediction_ts'], axis=1)
y = df['Sales']

# Split the data for training and testing
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=5)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

# Model training
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(x_train, y_train)

Evaluating the Regression Model

After training the model, it is important to evaluate its performance using appropriate metrics.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = lin_model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
print(f"MSE = {mse}")
print(f"RMSE = {np.sqrt(mse)}")

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE = {mae}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared value = {r2}")

# Adjusted R-squared: n = number of test observations, p = number of features
adj_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - x.shape[1] - 1)
print(f"Adjusted R-squared value = {adj_r2}")

Define the Model Schema

Before logging the data in Arize, define the model schema using the Schema class from Arize. The schema specifies the column names and types of your data. Here is an example.

# Input feature columns; prediction_ts is logged separately as the timestamp
feature_column_names = ["TV", "Radio", "Newspaper"]
schema = Schema(
    prediction_id_column_name="Sales",
    timestamp_column_name="prediction_ts",
    prediction_score_column_name="y_prediction",  # column holding the model's predictions
    actual_score_column_name="Sales",
    feature_column_names=feature_column_names
)

Make sure to adjust the column names based on your dataset.

Create an Arize Client

Create an instance of the Client class from Arize by passing your Arize space key and API key. These keys can be obtained from your Arize account settings.

SPACE_KEY = "YOUR_SPACE_KEY"
API_KEY = "YOUR_API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

Log Data to Arize

Now we are ready to log our data to Arize.
In order to do this, use the log() method of the arize_client object. Pass the DataFrame, model schema, model ID, model version, model type, metrics validation, and other relevant information. Here’s an example:

response = arize_client.log(
    dataframe=df,
    schema=schema,
    model_id="simple-regression-model",
    model_version="1.0.0",
    model_type=ModelTypes.REGRESSION,
    metrics_validation=[Metrics.REGRESSION],
    validate=True,
    environment=Environments.PRODUCTION
)

Verify the Logging Status

After logging the data, you can check the status of the logging operation by inspecting the response object. If the status code is 200, it means the data was successfully logged. Otherwise, there may be an error. You can print the status code and any error message to troubleshoot the issue.

if response.status_code == 200:
    print("Data logged successfully!")
else:
    print(f"Logging failed with status code {response.status_code} and message: {response.text}")

Once successful, you will get a confirmation like the image below, containing a link to the model and the data in Arize.

[Image: successfully logged data to Arize]

Click the link to open your Arize space where the model and data have been logged.

Now we can set up a monitor for the model, perform root cause analysis, and also find the slice causing a dip in performance.

Monitor Your Model in Arize

Once you open the link, it takes you to your Arize space, which should look like the image below. Arize provides three ways to configure monitors; we will be using managed monitors here.

Once your Arize dashboard opens, select your model.

[Image: select your model]

[Image: simple regression model monitoring]

Navigate to the Monitor tab on your Arize dashboard. Select Setup Monitors and you will see a host of pre-built monitors. For this exercise, we will enable the R-Squared monitor.

[Image: set up monitors in Arize]

After selecting the monitor(s), go to the Monitor Listing to see the enabled monitor(s).

[Image: see enabled monitors]

Then, select the R Squared Monitor to configure it.

[Image: configure R-squared monitor]

When configuring, Arize allows you to set the evaluation window, alert notifications, and model baseline. After setup, your monitor is ready to track the R-squared metric!

In the code, the regression model was run with timestamp data to mimic a model in production, since we’d like to monitor and investigate the R-squared metric of the regression model over time. For this example, the average R-squared value is 0.9092, which is a good score.

[Image: Arize R-squared monitor example]

We can also compare different dates and see their values. To do this:

  • Click the Performance Tracing tab >> Add comparison. Then add/edit the dates to compare with.

[Image: compare dates in regression model monitoring]

Comparing the R-squared metric in April and May, we observe that the metric drops by 2.16%, from 0.9092 to 0.8899.

[Image: regression model comparison in the Arize ML observability platform]

Digging deeper, there are certain dates where the R-squared metric drops. For example, on the 29th of April, the metric dropped by 11.23% from the average R-squared metric.

[Image: root cause analysis in a model monitoring platform]

With these investigations, we can then look into the model and data for the cause.

Note: it is important to use the R-squared metric alongside other performance metrics during these investigations to get a comprehensive understanding of what is going on with your model in production.

Conclusion

The R-squared metric is an important tool in the machine learning arsenal. It is especially useful for evaluating regression models, which predict a continuous variable (like sales prices) from training data.