Binary Cross Entropy: Where To Use Log Loss In Model Monitoring

Amber Roberts, Machine Learning Engineer | Published January 01, 2023

What Is Binary Cross Entropy?

Binary cross entropy (also known as logarithmic loss or log loss) is a model metric that measures how far a model's predicted probabilities stray from the actual class labels, penalizing the model in proportion to how confidently it mislabels the data. Low log loss values equate to high accuracy values. For a single prediction, binary cross entropy is equal to −log(likelihood): the negative log of the probability the model assigned to the actual class.

$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\,\log\left(p(y_i)\right) + (1-y_i)\,\log\left(1-p(y_i)\right)\right]$$

Here, y_i represents the actual class (1 or 0) and p(y_i) is the model's predicted probability that y_i equals 1:

  • p(y_i) is the predicted probability of class 1 (“click”)
  • 1 − p(y_i) is the predicted probability of class 0 (“no click”)
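
To make the formula concrete, here is a minimal sketch of binary cross entropy in NumPy. The function name and the clipping epsilon are illustrative choices, not part of the definition; clipping simply keeps log() finite when a probability is exactly 0 or 1:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Negative average of y*log(p) + (1-y)*log(1-p) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithm stays finite
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```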

When Is Log Loss Used In Model Monitoring?

Log loss can be used in training as the logistic regression cost function and in production as a performance metric for binary classification. This piece focuses on how to leverage log loss in a production setting.

Let’s say you are a machine learning engineer who built a click-through rate (CTR) model to predict whether or not a user will click on a given product listing. You can frame this as a binary classification problem, since there are ultimately two possible outcomes:

  • User does click on product → 1
  • User does NOT click on product → 0

In order to predict whether or not a user will click on the product, you need to predict the probability that the user will click. For example, if you predict there is an 80% chance the user will click the product, then you are fairly confident the actual (or ground truth) outcome will be a “1” (user does click on product). However, if the user ends up not clicking on the product, then you were likely too confident and your model may need to be adjusted. Log loss is a good metric to use if you want to penalize your model for being confidently wrong.
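
As a quick worked example of that asymmetry: with a predicted click probability of 0.8, the per-prediction loss is $-\log(0.8) \approx 0.22$ if the user clicks, but $-\log(1 - 0.8) = -\log(0.2) \approx 1.61$ if the user does not — roughly seven times the penalty for being confidently wrong.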

From your CTR model, you have the following data:

Table 1: Calculating log loss using a trained classifier for a balanced dataset

| User ID | Prediction | Predicted Probability of ‘Click’ | Actual | Corrected Probability | Log(Corrected Probability) |
|---|---|---|---|---|---|
| sd459 | 1 | 0.80 | 1 | 0.80 | -0.22 |
| br575 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| sd325 | 1 | 0.65 | 1 | 0.65 | -0.43 |
| ef345 | 1 | 0.78 | 1 | 0.78 | -0.25 |
| bw678 | 1 | 0.91 | 1 | 0.91 | -0.09 |
| aq765 | 0 | 0.20 | 0 | 0.80 | -0.22 |
| df837 | 1 | 0.65 | 0 | 0.35 | -1.05 |
| lk948 | 1 | 0.87 | 1 | 0.87 | -0.14 |
| os274 | 0 | 0.22 | 0 | 0.78 | -0.25 |
| ye923 | 0 | 0.33 | 0 | 0.67 | -0.40 |

Keep in mind:

  • Column 1: User ID. Identifies individual users
  • Column 2: Prediction. The model’s prediction of whether or not the user will click the product (1 = click, 0 = no click)
  • Column 3: Predicted Probability of ‘Click.’ The model’s predicted probability that the user will click
  • Column 4: Actual. Whether or not the user actually clicked on the product
  • Column 5: Corrected Probability. The predicted probability of the outcome that actually occurred. If your model predicted a 90% chance that user br575 wouldn’t click on the product, it predicted a 10% chance that the user would click; since br575 did not click, the corrected probability is 0.90. Working with corrected probabilities simplifies the log loss equation.
  • Column 6: Log(Corrected Probability). The logarithm of the Column 5 values. The logarithm used is the natural logarithm (base e).

The model is trained to classify whether or not a user will click on a given product, and Column 3 holds its predicted probability of a click. The predicted probability that user sd459 will click the product is 0.80 and the predicted probability that user df837 will click is 0.65; since user df837 didn’t click the product (actual = 0), the corrected probability for that row is 1 − 0.65 = 0.35.

Going back to the original formula: when the actual class is 1, only the first term survives (the second is multiplied by zero), and when the actual class is 0, only the second term survives. Either way, what remains is the log of the corrected probability, which yields the simplified equation below. All that is left is to take the negative average of the Column 6 values to compute the log loss.

$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(p_{\text{corrected},\,i}\right)$$

Figure: log loss as a function of the corrected probability. Log loss approaches 0 as the corrected probability approaches 1 and grows without bound as it approaches 0.

The figure above shows that the closer the corrected probability is to 1, the lower the log loss value. In Table 1, the model’s best prediction was that user bw678 would click the product with 91% certainty (the user did click), and its worst was that user df837 would click the product with 65% certainty (the user did not click). The total log loss value, the negative average of Column 6, comes out to 0.316. Since this is a low log loss value, it likely indicates strong model performance.
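
One way to reproduce this number is with scikit-learn’s log_loss, passing Column 3 (the predicted probability of ‘click’) and Column 4 (the actuals) from Table 1:

```python
from sklearn.metrics import log_loss

# Columns 3 and 4 of Table 1
p_click = [0.80, 0.10, 0.65, 0.78, 0.91, 0.20, 0.65, 0.87, 0.22, 0.33]
actual  = [1,    0,    1,    1,    1,    0,    0,    1,    0,    0]

print(log_loss(actual, p_click))  # ≈ 0.316
```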

What Are the Limitations of Using Log Loss?

Log loss is useful in that it penalizes a model in proportion to its confidence. If your model predicts with low certainty that the user will click the product and the user doesn’t click, log loss won’t penalize this nearly as much as if your model was very certain the user would click and the user then didn’t. However, the usefulness of log loss as a performance indicator depends on your data.

Imbalanced Data

In Column 4 of Table 1, five out of the ten users clicked on the product. It is worth noting that a real dataset would rarely have as many “click” actions as “no click” actions. Let’s say that instead of a CTR model, you had a fraud use case where you are trying to predict fraudulent activity. Predicting whether or not a transaction is fraud is historically difficult due to the class imbalance in fraud datasets. This issue arises because a transaction is far more likely to be legitimate than fraudulent, so the positive “fraud” class is typically much smaller than the negative “not fraud” class.

Evaluating models that are trained on imbalanced datasets can be tricky due to the misinterpretation that often occurs, and log loss is no different. While log loss is better than accuracy for imbalanced datasets since it takes into account the certainty of each prediction, it is still impacted by heavy imbalance. To determine how well a model is predicting, you need to compare its performance to that of a baseline, no-skill (naïve) model on the same dataset.

Looking at the example from above, assume the dataset is perfectly balanced between users who clicked on products and users who didn’t. The naïve classifier says every user is equally likely to click or not click, assigning each a probability of 0.5.

Table 2: Calculating log loss using a naïve classifier for a balanced dataset

| User ID | Prediction | Predicted Probability of ‘Click’ | Actual | Corrected Probability | Log(Corrected Probability) |
|---|---|---|---|---|---|
| sd459 | 1 | 0.50 | 1 | 0.50 | -0.69 |
| br575 | 0 | 0.50 | 0 | 0.50 | -0.69 |
| sd325 | 1 | 0.50 | 1 | 0.50 | -0.69 |
| ef345 | 1 | 0.50 | 1 | 0.50 | -0.69 |
| bw678 | 1 | 0.50 | 1 | 0.50 | -0.69 |
| aq765 | 0 | 0.50 | 0 | 0.50 | -0.69 |
| df837 | 0 | 0.50 | 0 | 0.50 | -0.69 |
| lk948 | 1 | 0.50 | 1 | 0.50 | -0.69 |
| os274 | 0 | 0.50 | 0 | 0.50 | -0.69 |
| ye923 | 0 | 0.50 | 0 | 0.50 | -0.69 |

Ultimately, a performance metric is only informative if it helps you correctly interpret the results. If you calculate the baseline log loss for a balanced dataset, then you would want your model’s log loss to be lower than that baseline for the model to be considered a good classifier. Here, a baseline log loss of 0.693 tells you that your model’s log loss of 0.316 is better – and that this model is better at determining whether or not a user will click than a coin flip.
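
The same scikit-learn call confirms the coin-flip baseline; since every corrected probability is 0.5, the result is $-\log(0.5) = \ln 2 \approx 0.693$ regardless of the actual labels:

```python
from sklearn.metrics import log_loss

actual = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
print(log_loss(actual, [0.5] * len(actual)))  # ln(2) ≈ 0.693
```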

What Can the Log Loss Score Say About the Strength of Our Classifier?

In order to see what the log loss value can tell us about our model, let’s change the distribution to reflect an imbalanced dataset. Say one out of ten users clicked on a product, and we use a baseline classifier that predicts a 0.1 probability of clicking for every user.

Table 3: Calculating log loss using a naïve classifier for an imbalanced dataset

| User ID | Prediction | Predicted Probability of ‘Click’ | Actual | Corrected Probability | Log(Corrected Probability) |
|---|---|---|---|---|---|
| sd459 | 0 | 0.10 | 1 | 0.10 | -2.30 |
| br575 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| sd325 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| ef345 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| bw678 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| aq765 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| df837 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| lk948 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| os274 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| ye923 | 0 | 0.10 | 0 | 0.90 | -0.11 |

As shown above in Table 3, the log loss score of this naïve model is 0.325. Notice that no matter how many data points (users) you have, as long as the ratio stays the same (1:10), the log loss (the negative average of Column 6) will be 0.325. Against this 0.325 baseline, the trained classifier’s 0.316 calculated above would count as only slightly better, and only if both were evaluated on the same data distribution. This is why teams should evaluate a model performance metric like log loss against a baseline to gain insights, as opposed to looking at the raw score alone.
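
A short sketch makes the ratio invariance explicit (naive_log_loss is a hypothetical helper for a classifier that always predicts the positive-class rate):

```python
import numpy as np

def naive_log_loss(n_pos, n_neg):
    """Log loss of always predicting p = n_pos / (n_pos + n_neg)."""
    p = n_pos / (n_pos + n_neg)
    return -(n_pos * np.log(p) + n_neg * np.log(1 - p)) / (n_pos + n_neg)

print(naive_log_loss(1, 9))      # ≈ 0.325
print(naive_log_loss(100, 900))  # same 1-in-10 ratio, same ≈ 0.325
```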

Table 4: Calculating log loss using a trained classifier for an imbalanced dataset

| User ID | Prediction | Predicted Probability of ‘Click’ | Actual | Corrected Probability | Log(Corrected Probability) |
|---|---|---|---|---|---|
| sd459 | 1 | 0.80 | 1 | 0.80 | -0.22 |
| br575 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| sd325 | 0 | 0.20 | 0 | 0.80 | -0.22 |
| ef345 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| bw678 | 0 | 0.15 | 0 | 0.85 | -0.16 |
| aq765 | 0 | 0.22 | 0 | 0.78 | -0.25 |
| df837 | 0 | 0.10 | 0 | 0.90 | -0.11 |
| lk948 | 0 | 0.20 | 0 | 0.80 | -0.22 |
| os274 | 0 | 0.15 | 0 | 0.85 | -0.16 |
| ye923 | 0 | 0.12 | 0 | 0.88 | -0.13 |

For a highly imbalanced dataset, the baseline log loss will be close to zero because only a small number of events have a large impact on the score; predicting many low probabilities that turn out correct yields a low log loss value. Like all metrics, log loss needs to be compared to a baseline before a model developer can be convinced of a model’s decision-making capabilities. The log loss here for the imbalanced dataset is 0.169, which is better than the baseline of 0.325. That is all that can be said with certainty; the classifiers in Tables 1 and 2 cannot be compared to those in Tables 3 and 4 because the underlying data distribution has changed.

Distributions Matter

If you are comparing the performance of two models using log loss, be sure that you are comparing model versions trained on the same data. Model A is a better prediction model than Model B only if its log loss is lower and the data distributions are the same.

Skewed Data

The above example takes the negative average of the Column 6 values to arrive at the log loss. This is the correct step to take for normally distributed data, but what happens when a numeric feature in your dataset, such as a user’s distance to the nearest city, is heavily skewed? Since cities hold the largest resident populations in a country, binning users by equal distances would likely lead to a skewed distribution, with far more users in the first several bins. In this case, taking the average can result in a higher log loss value than taking the median or mode.

Partial Credit

If you are looking for a performance metric that is black and white on a binary prediction, log loss may not be ideal. Log loss says that if your model predicts a 60% chance a user will click on a product and they don’t, that is worse than if your model predicts a 40% chance and they don’t. Either way, the user didn’t click on the product.
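
Concretely: for a user who doesn’t click, a 60% click prediction costs $-\log(0.40) \approx 0.92$, while a 40% click prediction costs $-\log(0.60) \approx 0.51$. Log loss grades on confidence even when both predictions miss; if all you care about is the miss itself, a thresholded metric like accuracy or F1 treats the two identically.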

What Is the Best Way To Leverage Log Loss?

For datasets with few clicks, conversions, or other rare events, here are some techniques to help balance your dataset and make log loss more useful (a sketch of the first two follows the list):

  • Downsampling: undersampling the majority (negative) class
  • Upsampling: oversampling the minority (positive) class
  • Weighted Sampling: weighting the minority class so that it has a higher probability of being selected, based on its weight
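
Here is a sketch of the first two techniques using scikit-learn’s resample; the dataframe, its clicked column, and the 1:9 split are assumptions for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 1 click per 9 non-clicks
df = pd.DataFrame({"clicked": [1] + [0] * 9, "feature": range(10)})

minority = df[df["clicked"] == 1]  # rare positive class
majority = df[df["clicked"] == 0]  # common negative class

# Downsampling: shrink the majority class to match the minority
down = pd.concat([minority,
                  resample(majority, replace=False,
                           n_samples=len(minority), random_state=0)])

# Upsampling: duplicate minority rows (with replacement) to match the majority
up = pd.concat([majority,
                resample(minority, replace=True,
                         n_samples=len(majority), random_state=0)])
```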

What If I Have a Multiclass Classification Problem?

You can still use log loss as a performance evaluation metric for a multiclass classification problem; you just need the generalized equation:

$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{i,c}\,\log\left(p_{i,c}\right)$$

where M is the number of classes, y_{i,c} is 1 if sample i belongs to class c (and 0 otherwise), and p_{i,c} is the predicted probability that sample i belongs to class c.
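
scikit-learn’s log_loss implements this generalization directly; the three classes and probabilities below are a made-up example:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                   # actual class per sample
y_prob = np.array([[0.7, 0.2, 0.1],  # per-class probabilities, rows sum to 1
                   [0.1, 0.3, 0.6],
                   [0.2, 0.6, 0.2]])

# -(ln 0.7 + ln 0.6 + ln 0.6) / 3
print(log_loss(y_true, y_prob))  # ≈ 0.459
```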

What If I Have An Unstructured Classification Problem?

While log loss is still a popular choice for unstructured use cases (i.e. an NLP sentiment classification model), in cases of heavy class imbalance it is advisable to set a prior for the rare class at the start of training to improve training stability.
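
One common way to apply such a prior (popularized by the focal loss paper) is to initialize the final layer’s bias to the log-odds of the rare class’s frequency, so the untrained model starts out predicting that frequency; the 1% prior below is an assumed figure:

```python
import numpy as np

pi = 0.01                          # assumed prior: rare class is 1% of data
bias_init = np.log(pi / (1 - pi))  # ≈ -4.6, the log-odds of the prior

# Sanity check: a sigmoid output unit with this bias predicts p ≈ pi initially
print(1 / (1 + np.exp(-bias_init)))  # ≈ 0.01
```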

Conclusion

Here’s a TL;DR for ML practitioners on log loss:

  • Fundamentals: Log loss tells you how close your predicted probability is to the corresponding actual outcome (i.e. “click”). If your model predicted each actual outcome with 100% certainty, for example, your log loss would be a perfect score of 0.0.
  • Interpret accurately: The more imbalanced your dataset, the closer a naïve baseline log loss gets to 0.0; be sure to interpret your log loss results accordingly.
  • Tread carefully with skewed data: Log loss takes the negative average of the log(corrected probabilities). If your dataset is skewed, it might be better to use a median value, re-examine your dataset, or select a different metric.
  • Know Thy Metric: Understand the purpose of your model and be sure the metric is representative of your goal. Ultimately, that is the most important factor in choosing a metric for model evaluation.