
# F1 Score: Harmonize Your Model’s Performance

Natasha Sharma, Contributor | Published March 03, 2023

F1 score is the harmonic mean of precision and recall. Commonly used as an evaluation metric in binary and multi-class classification, the F1 score integrates precision and recall into a single metric that gives a better understanding of model performance.

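The more general F-score can be tuned to your use case by changing how much weight is given to precision relative to recall: F0.5 weights precision more heavily than recall, F1 weights the two equally, and F2 weights recall more heavily than precision.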
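For reference, the general F-score is usually written as below, where β (a symbol not used elsewhere in this article) controls how much more weight recall receives than precision:

$$
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}
$$

Setting β = 1 gives the F1 score, β = 0.5 gives F0.5, and β = 2 gives F2.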

## Why Is F1 Score Useful?

There are a number of metrics useful for measuring performance of classification models. Accuracy is one of the simplest to understand. The accuracy metric measures the number of times a model was able to correctly identify the data class across the entire dataset. Accuracy is a reliable metric as long as your dataset has an equal number of samples for each class.

However, this scenario is rare in real-world datasets. Imagine you have a dataset where 99% of the records belong to Class A and the rest (1%) to Class B, a common occurrence in areas like credit card fraud where almost all transactions are non-fraudulent. A model that simply predicts Class A for every record will score 99% accuracy on this dataset while never identifying a single Class B record. Relying on accuracy alone is not ideal in this scenario.
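A minimal sketch of this effect in plain Python. The 1,000-record dataset and the always-predict-Class-A "model" are made-up illustrations, not the article's data.

```python
# Illustrative labels: 0 = Class A (majority), 1 = Class B (minority)
labels = [0] * 990 + [1] * 10

# A trivial "model" that predicts the majority class for every record
predictions = [0] * len(labels)

# Accuracy: share of records where the prediction matches the label
correct = sum(pred == actual for pred, actual in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on Class B: share of actual Class B records the model found
found_class_b = sum(
    pred == 1 and actual == 1 for pred, actual in zip(predictions, labels)
)
recall_class_b = found_class_b / sum(labels)

print(f"accuracy = {accuracy:.2f}, recall on Class B = {recall_class_b:.2f}")
```

Accuracy looks excellent even though the model never finds the minority class, which is the gap F1 score is meant to expose.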

For imbalanced data, F1 score comes to the rescue. It takes into account the type of errors (false positives and false negatives), not just the number of predictions that were incorrect.

## How Do Precision and Recall Relate to F1 Score?

F1 score combines precision and recall into a single average to which both metrics contribute equally. The best value of F1 score is 1 and the worst is 0. What does this mean? A perfect model will have an F1 score of 1, meaning all of its predictions were correct.

Because precision and recall are both rates, F1 score uses the harmonic mean rather than the usual arithmetic mean. Let’s see how it uses both of these metrics and why.
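To see why the choice of mean matters, here is a small sketch comparing the two means on a few made-up precision and recall pairs (the values are illustrative assumptions). It uses only the standard library's `statistics` module.

```python
from statistics import harmonic_mean, mean

# Illustrative (precision, recall) pairs, from balanced to very unbalanced
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]

results = {}
for precision, recall in pairs:
    arithmetic = mean([precision, recall])
    harmonic = harmonic_mean([precision, recall])  # same value as the F1 score
    results[(precision, recall)] = (arithmetic, harmonic)
    print(
        f"precision={precision}, recall={recall}: "
        f"arithmetic={arithmetic:.2f}, harmonic={harmonic:.2f}"
    )
```

The harmonic mean stays close to the smaller of the two values, so a model cannot hide a poor recall behind a strong precision, or the other way around.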

### Precision

Precision is a model performance metric that corresponds to the fraction of values that actually belong to a positive class out of all of the values which are predicted to belong to that class. Precision is also known as the positive predictive value (PPV).

F1 score uses precision to get the rate of true positive records among the total records classified as positive by a machine learning model. Precision is calculated as below:

Precision = True Positives / (True Positives + False Positives)

A model with low precision may find a lot of the positives, but it also wrongly flags records that aren’t actually positive. A model with high precision may fail to find all the positives, but the ones it identifies as positive are likely to be correct.

### Recall

Recall is a performance metric that corresponds to the fraction of values predicted to be of a positive class out of all the values that truly belong to the positive class (including false negatives).

F1 score uses recall to get the fraction of true positive records among the total of actual positive records. Recall is calculated as below:

Recall = True Positives / (True Positives + False Negatives)

A machine learning model with high recall finds most of the positive cases in the data, though it may also identify some negative cases as positive. Low recall indicates that the model misses many of the positive cases.
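The two definitions can be sketched in plain Python by counting outcomes directly. The helper `confusion_counts` and the tiny label lists are hypothetical, introduced only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # correct share of everything predicted positive
recall = tp / (tp + fn)     # found share of everything actually positive

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Note that true negatives appear in neither formula, which is why both metrics stay informative when the negative class dominates.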

## How Is the F1 Score Calculated Using Precision and Recall?

F1 score combines precision and recall and treats them symmetrically in the formula:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

As mentioned above, F1 scores can range from 0 to 1, with 1 representing a model that perfectly classifies each observation into the correct class and 0 representing a model that does not correctly identify a single positive observation.

Suppose you have a machine learning model to predict if a credit card transaction was fraudulent or not. The following confusion matrix summarizes the prediction made by the model:

| Total No. of Records: 165 | Actual Positives | Actual Negatives |
| --- | --- | --- |
| Predicted Positives | True Positives: 50 | False Positives: 10 |
| Predicted Negatives | False Negatives: 5 | True Negatives: 100 |

You can calculate precision and recall as below:

Precision = 50 / (50 + 10) = 0.83

Recall = 50 / (50 + 5) = 0.91

Let’s put these together in the F1 score formula:

F1 Score = 2 * (0.83 * 0.91) / (0.83 + 0.91) = 0.87

Our model has an F1 score of 0.87, which is close to 1 (the score of a model whose predictions are all correct). Because F1 score uses the harmonic mean, it rewards models whose precision and recall are both high and similar in value. For a given arithmetic average, the more the precision and recall scores deviate from each other, the lower the F1 score will be.
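The arithmetic in this worked example can be checked with a few lines of Python. The counts come from the confusion matrix above; the variable names and the equivalent count-based form of the formula are additions for illustration.

```python
# Counts from the confusion matrix in the worked example
tp, fp, fn, tn = 50, 10, 5, 100

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

# Accuracy, for comparison, also uses the true negatives
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Working from the unrounded precision and recall gives the same 0.87 after rounding, so the rounding in the hand calculation does not change the result here.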

## In Which Real-World Applications is F1 Score Most Useful?

F1 score is a common metric to use in binary and multiclass classification problems as it balances precision and recall. Below are a few examples of where it can be useful.

### Healthcare

In healthcare, F1 score can sometimes be useful for models that aim to suggest a diagnosis (e.g., from scanning electronic medical records). A high F1 score indicates that the model finds most of the true positive cases while raising few false alarms, which can be important for minimizing misdiagnosis and ensuring patients receive proper treatment.

### Fraud Detection

As a general rule, fraud accounts for a relatively small portion of transactions in the real world. In healthcare billing, for example, fraud is estimated to be 3% of total transactions. As a result, traditional accuracy metrics can be misleading. F1 score can be useful in such cases. However, practitioners may want to adjust the weight of the F score. In credit card fraud, for example, a misclassified fraudulent transaction (a false negative — predicting “not fraud” for a transaction that is indeed fraud) is often more costly given its direct impacts on revenue than a misclassified legitimate transaction (a false positive — predicting fraud for a transaction that is not fraud), where a customer is merely inconvenienced.
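As a rough sketch of what adjusting the weight can look like, the snippet below compares two hypothetical fraud models whose precision and recall are mirror images of each other. The function `f_beta` and the model numbers are assumptions made up for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical models with mirrored precision and recall
model_a = {"precision": 0.9, "recall": 0.6}  # misses more fraud
model_b = {"precision": 0.6, "recall": 0.9}  # flags more legitimate transactions

scores = {}
for name, model in (("A", model_a), ("B", model_b)):
    f1 = f_beta(model["precision"], model["recall"], beta=1)
    f2 = f_beta(model["precision"], model["recall"], beta=2)
    scores[name] = (f1, f2)
    print(f"Model {name}: F1={f1:.2f}, F2={f2:.2f}")
```

F1 cannot separate the two models, while F2 prefers the one with higher recall, which matches the case where a missed fraudulent transaction costs more than an inconvenienced customer.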

### Spam Classification

In email spam classification, it is important to correctly identify spam emails while minimizing false positives (legitimate emails falsely flagged as spam). F1 score is often preferred over other metrics, such as accuracy, precision, or recall alone, for spam classification for that reason, and because it is an imbalanced classification problem where the number of spam emails is much smaller than the number of non-spam emails.

## When Should F1 Score Be Avoided in Favor of Other Metrics?

It’s important to note F1 score’s limitations. While the F1 score offers a way to compare classifiers with a single metric, important differences can get obscured in the process since it assigns an equal weight to both precision and recall. In some domains, such as identifying faulty components for a crane or corrosion on a bridge, false negatives are unacceptable. In these scenarios, it would make sense to have a model incorrectly flag dozens of perfectly safe components rather than risk letting even one unsafe component through the process. In such a scenario, recall might be the more useful metric, since maximizing it minimizes false negatives.
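One common way to favor recall in practice is to lower the decision threshold applied to the model's scores. The sketch below uses made-up scores and labels, and the helper `precision_recall_at` is hypothetical, to show the trade-off.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores (probability of "faulty") and true labels (1 = faulty)
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With these numbers, lowering the threshold catches every faulty component at the cost of more safe components being flagged, which is the trade described above.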

## Hands-On Exercise: Create a Classifier and Measure Its Performance with F1 Score

In the previous sections, we got familiar with F1 score and learned how it can be used as a performance metric for a model when we have imbalanced data. In this section, we will go through the steps of creating a classifier and measuring its performance using F1 score.

### Create A Classification Model

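STEP ONE: Load the data from Kaggle

```python
# Load the data from Kaggle

!pip install kaggle

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Assumption: creditcardfraud.zip has already been downloaded (for example with
# the Kaggle CLI); the download step is not shown here.
!unzip creditcardfraud.zip

import pandas as pd

# Assumption: the archive extracts to a file named creditcard.csv
df = pd.read_csv("creditcard.csv")
print(df.columns)
```

STEP TWO: Confirm the “Class” distribution

```python
# Check class distribution in the data

class_names = {0: 'Not Fraud', 1: 'Fraud'}
print(df.Class.value_counts().rename(index=class_names))
```

As you can see, we only have 492 records classified as “Fraudulent.” Notice how imbalanced the dataset is. If we train on this dataset as is, our model might end up biased toward the majority class and classify most of the data as “Not Fraud.”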

STEP THREE: Most of the data is scaled, except Time and Amount. Let’s scale these:

```python
from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the interquartile range,
# so it is less sensitive to outliers than standard scaling
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

# Drop the unscaled columns
df.drop(['Time','Amount'], axis=1, inplace=True)

# Move the scaled columns to the front of the DataFrame
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']

df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)
```

STEP FOUR: Next, let’s split the data into training, validation and testing datasets to be able to train the model on different sub-samples of the dataset.

```python
from sklearn.model_selection import train_test_split

X = df.drop('Class', axis=1)
y = df['Class']

# Hold out a test set, then carve a validation set out of the remaining training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42)

print(f"Length of X_train is: {len(X_train)}")
print(f"Length of X_test is: {len(X_test)}")
print(f"Length of X_val is: {len(X_val)}")

print(f"Length of y_train is: {len(y_train)}")
print(f"Length of y_test is: {len(y_test)}")
print(f"Length of y_val is: {len(y_val)}")
```

STEP FIVE: Create a classification model

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=3000, verbose=False).fit(X_train, y_train)
```

STEP SIX: Now, let’s make some predictions and see how our model performed

```python
# Use the model to generate predictions
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

from sklearn.metrics import f1_score, recall_score, accuracy_score

# Store the results under new names so the imported metric functions are not overwritten
f1 = round(f1_score(y_test, y_test_pred), 2)
recall = round(recall_score(y_test, y_test_pred), 2)
accuracy = accuracy_score(y_test, y_test_pred)

print(f"Sensitivity/Recall for Logistic Regression Model 1 : {recall}")

print(f"F1 Score for Logistic Regression Model 1 : {f1}")

print(f"Accuracy Score for Logistic Regression Model 1 : {accuracy}")
```

If you look at the accuracy score, as suspected it shows 99% accuracy, but that figure is inflated by the data imbalance rather than reflecting how well the model detects fraud. Now look at the F1 score, which gives a more realistic picture of the model’s actual performance. This can be our benchmark as we continue to run the model on new data and retrain it.

## How Can You Monitor and Improve Your Model Performance with F1 Score?

Monitoring your model performance is an integral part of MLOps. The work doesn’t end with creating a model and predicting its performance. When the model goes live in production, it needs to be continuously monitored. But how can you do it?

One way to monitor your model is on Arize, diagnosing any issues as they come. Arize provides production ML analytics and workflows to quickly catch model and data issues, diagnose the root cause, and continuously improve performance for your products and business.

Let’s have a look. Here is how you can deploy the model and set monitors in the Arize app:

STEP ONE: Set up the Arize client

```python
!pip install arize -q
from arize.pandas.logger import Client

# Replace these values with the space key and API key of your own Arize account
SPACE_KEY = '36a4e28'
API_KEY = 'f8cd8f45c45d8299558'
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
```

STEP TWO: Group and combine all the data (features/predictions/actuals) for each environment (train/validation/test) into one pd.DataFrame object.

```python
import uuid

# df for training env
train_df = X_train.reset_index(drop=True)
train_df["prediction_label"] = y_train_pred
train_df["actual_label"] = list(y_train)
train_df["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(y_train))]

# df for validation env
val_df = X_val.reset_index(drop=True)
val_df["prediction_label"] = y_val_pred
val_df["actual_label"] = list(y_val)
val_df["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(y_val))]

# df for production env
test_df = X_test.reset_index(drop=True)
test_df["prediction_label"] = y_test_pred
test_df["actual_label"] = list(y_test)
test_df["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(y_test))]
```

STEP THREE: Ingest training, validation and testing inferences into Arize

```python
from arize.utils.types import Environments, ModelTypes, Schema

model_id = 'cc_fraud_prediction_ns'
model_version = '1.0'

# Training environment
train_schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=train_df.columns.drop(
        ["prediction_id", "prediction_label", "actual_label"]),
)

train_res = arize_client.log(
    dataframe=train_df,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.BINARY_CLASSIFICATION,
    environment=Environments.TRAINING,
    schema=train_schema,
)
if train_res.status_code != 200:
    print(f"❌ future failed with response code {train_res.status_code}, {train_res.text}")
else:
    print(f"✅ future completed with response code {train_res.status_code}")

# Validation environment (logs the validation DataFrame, not the test one)
val_schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=val_df.columns.drop(
        ["prediction_id", "prediction_label", "actual_label"]),
)

val_res = arize_client.log(
    dataframe=val_df,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.BINARY_CLASSIFICATION,
    environment=Environments.VALIDATION,
    batch_id="validation_CC_model",
    schema=val_schema,
)

if val_res.status_code != 200:
    print(f"❌ future failed with response code {val_res.status_code}, {val_res.text}")
else:
    print(f"✅ future completed with response code {val_res.status_code}")
```

```python
# Production environment (the held-out test set stands in for production traffic)
test_schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction_label",
    actual_label_column_name="actual_label",
    feature_column_names=test_df.columns.drop(
        ["prediction_id", "prediction_label", "actual_label"]),
)

test_res = arize_client.log(
    dataframe=test_df,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.BINARY_CLASSIFICATION,
    environment=Environments.PRODUCTION,
    schema=test_schema,
)
if test_res.status_code != 200:
    print(f"❌ future failed with response code {test_res.status_code}, {test_res.text}")
else:
    print(f"✅ future completed with response code {test_res.status_code}")
```

STEP FOUR: The ingested data will be available within 10 minutes in the Arize UI

STEP FIVE: Set the monitors on F1 Score and some features

The current value for F1-score is around 70% and the threshold is set to 50%. At any point, if F1-score goes lower than 50%, Arize will send an alert notification.
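The logic behind a threshold monitor of this kind can be sketched locally in plain Python. This is only an illustration of the idea, not Arize's API or implementation; the daily counts, the helper `f1_from_counts`, and the constant name are all hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score computed directly from confusion-matrix counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


F1_ALERT_THRESHOLD = 0.5  # alert when F1 drops below 50%

# Hypothetical daily production counts: (true positives, false positives, false negatives)
daily_counts = {
    "day_1": (70, 30, 30),
    "day_2": (60, 40, 45),
    "day_3": (30, 50, 60),
}

f1_by_day = {day: f1_from_counts(*counts) for day, counts in daily_counts.items()}
alerts = [day for day, f1 in f1_by_day.items() if f1 < F1_ALERT_THRESHOLD]

for day, f1 in f1_by_day.items():
    status = "ALERT" if day in alerts else "ok"
    print(f"{day}: F1={f1:.2f} ({status})")
```

In this made-up series the score starts near the 70% baseline, drifts down, and crosses the 50% threshold on the third day, which is the point at which an alert would fire.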

Imagine your F1-score decreased by 40% and you got an alert. Here you will have to start troubleshooting prediction and feature drift and try to find what exactly is causing issues.

Once the root cause is identified (maybe a particular feature’s values need to be normalized because the data distribution is skewed), we can perform the necessary steps and retrain the model. Once done, your model is ready to redeploy.

To recap: in this section, we saw how we can create our own classifier. Because the data was imbalanced, accuracy gave a misleading picture of the model’s performance, while F1 score provided a better metric for this use case. In order to continue monitoring the performance of the model, we then set up monitors on Arize.

## Summary

In the real world, we rarely get balanced data, and most of the time we have to train our models using imbalanced data. There are ways to deal with imbalanced data (e.g. SMOTE), but how can you make sure that those techniques are also working? In practice, we often need F1-score as it takes both recall and precision into account.

In this article, we became familiar with F1-score, how it is calculated, and the applications where it is and is not useful. We also saw how to create a model, generate predictions, measure the performance of that model using different metrics, and monitor it.