What Is PR AUC?
In the latest edition of “The Slice,” a blog series from Arize that explains the essence of ML concepts in a digestible question-and-answer format, we dive into how to calculate PR AUC and where it’s most useful.
PR AUC, short for area under the precision-recall (PR) curve, is a common way to summarize a model’s overall performance. For a perfect classifier, PR AUC = 1 because the model’s scores perfectly separate the positive and negative classes. Since precision-recall curves do not consider true negatives, PR AUC is commonly used for heavily imbalanced datasets where you are optimizing for the positive class.
In this blog post, we will:
- Show you with visualizations how to calculate the best and worst cases of PR AUC
- Give real world examples on when PR AUC succeeds and when it fails
- Compare PR AUC to ROC AUC
- Tell you where you should choose PR AUC and why
How do you calculate PR AUC?
To calculate the PR AUC, we need a dataset with two columns: prediction score and actual label (see Table 1). Since the actual label is binary in this case, we use +1 and -1 to denote the positive and negative classes, respectively. The important thing to note here is that the actual labels are determined by the real world, whereas the prediction scores are determined by a model.
To see the PR AUC calculation clearly, we first visualize Table 1 as Figure 1, where each data point is plotted with its prediction score on the x-axis and the actual label on the y-axis. We do this to divide the data points into two groups identified by their actual labels, because the next step is to generate prediction labels and calculate the two quantities shown in Table 2. The denominator of recall is the count of data points with positive actual labels, while the denominator of precision is the count of data points with positive prediction labels.
| Axis | Metric | Formula | In words | Denominator |
| --- | --- | --- | --- | --- |
| X-axis | Recall / true positive rate (TPR) | TP / (TP + FN) | # true positives / # actual positives | Number of data points with positive actual labels |
| Y-axis | Precision / positive predictive value (PPV) | TP / (TP + FP) | # true positives / # predicted positives | Number of data points with positive prediction labels |
Table 2. Quantities to calculate. Remember the denominators.
To generate prediction labels, we choose a cutoff (threshold) for the prediction score.
- Data points with prediction scores above the cutoff are given positive prediction labels.
- Data points with prediction scores below the cutoff are given negative prediction labels.
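As a minimal sketch of this step, the snippet below applies one cutoff and computes the two quantities from Table 2. The toy scores and labels are hypothetical (they are not the article's Table 1), and treating a score exactly equal to the cutoff as positive is an assumption, since the rule above only covers scores strictly above or below it.

```python
def precision_recall_at(scores, labels, cutoff):
    """Return (precision, recall) for one cutoff; labels are +1 / -1."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff  # assumption: ties count as positive
        if predicted_positive and label == 1:
            tp += 1  # true positive
        elif predicted_positive and label == -1:
            fp += 1  # false positive
        elif not predicted_positive and label == 1:
            fn += 1  # false negative
    # Convention (assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: prediction scores and actual labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, -1, 1, -1, -1]

print(precision_recall_at(scores, labels, cutoff=0.5))
```

With a cutoff of 0.5, four points are predicted positive and three of them are actually positive, so precision is 3/4 and recall is 3/3.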
A PR curve is traced out by enumerating all such cutoffs. Each point on the PR curve corresponds to the pair of quantities in Table 2 (recall, precision) calculated at one cutoff.
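To make the enumeration concrete, here is a hedged sketch that lowers the cutoff past one data point at a time and records a (recall, precision) point at each step. The area is then computed with a step-wise sum, often called average precision. That is only one of several ways to turn the points into an area, and it is an assumption here rather than the method used in the animations. The function names and toy data are hypothetical, and the code assumes at least one positive label.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, one per distinct cutoff."""
    n_pos = sum(1 for y in labels if y == 1)  # assumes at least one positive
    # Sort by descending score so lowering the cutoff admits points in order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last of any tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / n_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Step-wise area: sum of (increase in recall) * (precision at that point)."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Hypothetical toy data: prediction scores and actual labels (+1 / -1).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, -1, 1, -1, -1]

points = pr_curve(scores, labels)
for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
print("PR AUC (average precision):", average_precision(points))
```

A trapezoidal rule over the same points would give a slightly different number, so it is worth checking which convention a given tool uses before comparing values.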
When is PR AUC optimal?
Let’s say that you are a machine learning engineer at a healthcare startup and you build a classifier to determine if a patient has a disease. A perfect classifier would be able to discriminate between people with and without the disease every time.
For a dataset with 20 data points, the animation below demonstrates how a perfect PR curve is constructed. PR AUC is calculated as the area under the PR curve, using an interpolation method to connect the points.
Figure 2: Animation of the AUC calculation process when AUC = 1
If the two groups are perfectly separated by their prediction scores, then PR AUC = 1 and the model scores are doing a perfect job of detecting positive samples.
A key takeaway here is that PR AUC measures the degree of separation between these two groups of data points – identified by their actual labels +1 and -1 – when their prediction scores are plotted on the x-axis. Table 3 summarizes how the movement on the PR curve corresponds to each data point’s actual label.
| When the cutoff moves past a data point with an actual label that is… | The PR curve moves… |
| --- | --- |
| Positive | Rightward / Upward |
| Negative | Downward |
Table 3. How the PR curve moves as the cutoff passes each data point.
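The movement of the curve can also be traced in code. The sketch below lowers the cutoff past one data point at a time and reports how recall and precision change. The function name and toy data are hypothetical, and the code assumes no tied scores and at least one positive label.

```python
def trace_pr_moves(scores, labels):
    """Return (label, recall, precision) after the cutoff drops past each point."""
    n_pos = sum(1 for y in labels if y == 1)  # assumes at least one positive
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    steps = []
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        steps.append((y, tp / n_pos, tp / (tp + fp)))
    return steps


# Hypothetical toy data: prediction scores and actual labels (+1 / -1).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, -1, 1, -1, -1]

steps = trace_pr_moves(scores, labels)
prev_recall, prev_precision = 0.0, 1.0  # starting point before any cutoff
for y, recall, precision in steps:
    print(
        f"label={y:+d}  "
        f"recall {prev_recall:.2f} -> {recall:.2f}  "
        f"precision {prev_precision:.2f} -> {precision:.2f}"
    )
    prev_recall, prev_precision = recall, precision
```

Passing a positive data point increases recall and never lowers precision. Passing a negative data point leaves recall unchanged and lowers precision.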
When does PR AUC fail?
We have seen that an ideal PR curve comes from a model that does a perfect job of detecting positive samples. As the model starts misclassifying the positive class, the curve drops and PR AUC falls. Additionally, since PR AUC doesn’t take true negatives into account, it loses favor to metrics like ROC AUC if teams care about the positive and negative classes equally. Going back to our example from earlier, if your disease classifier cannot discriminate between people with and without the disease, then it is only as useful as a coin flip. If the two groups are perfectly commingled in a balanced dataset (equal numbers of the positive and negative classes), then PR AUC = 0.5 and the model scores are not doing a good job of detecting positive samples.
Figure 3: Animation of the AUC calculation process when AUC = 0.5
Here, 0.5 is the baseline: a horizontal line at the lowest value of precision, which equals the fraction of positive examples in the dataset. Your classifier would need a PR AUC above this baseline to be viable. Note that the PR AUC baseline decreases as the imbalance between the two classes increases, so for an imbalanced dataset the baseline sits below 0.5.
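One way to see where the baseline comes from is to score every data point identically, so the model cannot separate the groups at all. In the sketch below, which uses hypothetical label sets and a step-wise (average precision) area as an assumed convention, the resulting PR AUC equals the fraction of positives in the dataset.

```python
def average_precision(scores, labels):
    """Step-wise PR AUC; labels are +1 / -1, tied scores share one cutoff."""
    n_pos = sum(1 for y in labels if y == 1)  # assumes at least one positive
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Only evaluate after the last of any tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            recall = tp / n_pos
            area += (recall - prev_recall) * (tp / (tp + fp))
            prev_recall = recall
    return area


def positive_fraction(labels):
    """Fraction of data points whose actual label is positive."""
    return sum(1 for y in labels if y == 1) / len(labels)


# Hypothetical datasets: one balanced, one with 10% positives.
balanced = [1] * 10 + [-1] * 10
imbalanced = [1] * 2 + [-1] * 18

for name, labels in (("balanced", balanced), ("imbalanced", imbalanced)):
    scores = [0.5] * len(labels)  # uninformative model: same score everywhere
    print(name, positive_fraction(labels), average_precision(scores, labels))
```

The balanced dataset gives 0.5, while the dataset with 10% positives gives 0.1, which is why a PR AUC value should always be read against the positive fraction rather than against 0.5.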
What is the difference between PR AUC and ROC AUC?
As mentioned earlier, PR AUC tends to be used for heavily imbalanced datasets where you are optimizing for the positive class. If correctly detecting positive* samples is not your main focus, you should look at ROC AUC. The difference comes from the false positive rate (FPR) on the ROC curve’s x-axis: its denominator, FP + TN, includes true negatives, so a large number of true negatives keeps the FPR low and makes the ROC curve look better. PR AUC uses precision, which emphasizes the positive class, so you should use it when you care more about the positive class of an imbalanced dataset.
*We assign the positive label to the minority class and the negative label to the majority class since we care most about the rare, interesting events.
For example, in our disease classifier, we want to use PR AUC because we care about:
- Precision – of the people predicted to have the disease, how many actually have it
- Recall – of the people who actually have the disease, how many the model finds
- A class imbalance – a majority of folks in the dataset do not have the disease
The PR AUC score doesn’t incorporate true negatives (people without the disease who are correctly predicted as such) and therefore is less likely to exaggerate model performance for imbalanced datasets. If you add more true negatives to your dataset, the ROC AUC will increase, giving a misleading picture of the skill of your classification model on imbalanced datasets.
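The effect of extra true negatives can be checked with a small experiment. The sketch below uses hypothetical scores, computes ROC AUC as the fraction of (positive, negative) pairs ranked correctly, and computes PR AUC as a step-wise average precision; both are assumed conventions. It then adds 90 low-scoring negatives that the model already classifies correctly.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise PR AUC; labels are +1 / -1, tied scores share one cutoff."""
    n_pos = sum(1 for y in labels if y == 1)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            recall = tp / n_pos
            area += (recall - prev_recall) * (tp / (tp + fp))
            prev_recall = recall
    return area


# Hypothetical model scores: 5 positives and 5 negatives that overlap.
scores = [0.9, 0.8, 0.6, 0.55, 0.5, 0.85, 0.7, 0.65, 0.4, 0.3]
labels = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

# Add 90 easy negatives that score below every positive (true negatives).
more_scores = scores + [0.1] * 90
more_labels = labels + [-1] * 90

roc_before, roc_after = roc_auc(scores, labels), roc_auc(more_scores, more_labels)
ap_before = average_precision(scores, labels)
ap_after = average_precision(more_scores, more_labels)
print(f"ROC AUC: {roc_before:.3f} -> {roc_after:.3f}")
print(f"PR AUC:  {ap_before:.3f} -> {ap_after:.3f}")
```

In this sketch the ROC AUC rises sharply even though the model is no better at finding positives. The PR AUC stays the same because the added negatives all score below every positive, so they never enter the precision calculation before recall reaches 1.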
For ML teams, the choice of which metric to use depends on the distribution of the dataset, the type of model being used, and the business use case. AUC metrics are frequently used because they give a general, visual overview of model performance. PR AUC is a metric to leverage if the goal of your model is to correctly identify as many minority events as possible. Take fraud detection, for example, where the goal is to identify all fraudulent activity. As always, it’s worth keeping in mind that every model metric has tradeoffs. Additional context, and knowing what business outcome you’re optimizing toward, can make a big difference in deciding whether to use PR AUC or another metric to determine whether a model is underperforming. Lastly, if your PR AUC value is not significantly above the baseline, you might consider a technique to deal with your imbalanced data, such as upsampling the minority class or downsampling the majority class.
Questions? Please feel free to reach out in the Arize community!