Calibrating Classification Probabilities the Right Way

In previous articles, I pointed out the importance of knowing how sure a model is about its predictions.
For classification problems, it is not helpful to only know the final class. We need more information to make well-informed decisions in downstream processes. A classification model that only outputs the final class hides important information: we do not know how sure the model is and how much we can trust its prediction.
How can we achieve more trust in the model?
Two approaches can give us more insight into classification problems.
We could turn our point prediction into a prediction set. The goal of the prediction set is to guarantee that it contains the true class with a given probability. The size of the prediction set then tells us how sure our model is about its prediction. The fewer classes the prediction set contains, the surer the model is.
But that does not tell us anything about the probability of a specific class being the true class. For example, we would like something such as "This picture shows a cat with 20% probability". These class probabilities give us more insight than a prediction set. We can use the probabilities to get a better view of the benefits and costs of the prediction.
Now, you could say, "Lucky us. Most ML models have a `predict_proba` method that gives us exactly what we want." But that is not true. The `predict_proba` method returns misleading values, at least if we care about the actual probabilities. We should not trust them because these "probabilities" are not calibrated. Calibrated means that the predicted probabilities reflect the real underlying probabilities. For example, say we predict that a picture shows a cat with 80% probability. We then expect that out of ten pictures for which our model predicts "cat" with 80%, eight actually show a cat.
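To make this concrete, we can check calibration with a reliability curve. Here is a minimal sketch using scikit-learn's `calibration_curve`; the labels and predicted probabilities are made up for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical true labels and predicted "probabilities" of a classifier
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.3, 0.9, 0.6, 0.4, 0.8, 0.1, 0.7])

# For a calibrated model, the observed frequency of the positive class in
# each bin should match the mean predicted probability of that bin.
frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=5)
print(frac_positives)
print(mean_predicted)
```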
But if `predict_proba` is not the correct method, what is?
Let me introduce you to Venn-ABERS predictors.
Venn-ABERS predictors are part of the Conformal Prediction family and thus can operate on top of any ML classifier. We can use them to turn the output of a scoring classifier into a calibrated probability.
How does it work?
Venn-ABERS predictors return an interval of the class probability instead of a single probability. For example, instead of saying "This picture shows a cat with 80% probability", we would say "This picture shows a cat with a probability between 75 and 85%."
For this, Venn-ABERS predictors fit two isotonic regressions, g_0 and g_1, using a calibration set. To calibrate a new sample, we fit g_0 on the calibration data plus the new sample tentatively labeled as class 0, and g_1 on the calibration data plus the new sample tentatively labeled as class 1.
By using isotonic regression, we do not assume any particular functional form for the probabilities. Platt scaling, for example, assumes a sigmoid function. Isotonic regression gives us more freedom and usually higher-quality calibrated probabilities.
If you are not familiar with calibration sets, please check out my introduction article on Conformal Prediction.
Each regression maps the predicted classification score of the sample to a probability of class 1: g_0 gives us p_0, the estimate under the assumption that the sample belongs to class 0, and g_1 gives us p_1, the estimate under the assumption that it belongs to class 1. Together, p_0 and p_1 form an interval that is guaranteed to contain a perfectly calibrated probability of class 1.
Venn-ABERS predictors inherit the validity guarantee of Venn predictors. However, Venn-ABERS predictors are computationally more efficient and easier to use than Venn predictors, as we only need to fit one underlying model instead of many to calibrate the class probabilities. Despite this simplification, the calibrated probabilities remain valid.
But why does Venn-ABERS fit two isotonic regressions?
Using two regressions separates the calibration for the two possible outcomes of the sample. Because we calibrate under both assumptions, class 0 and class 1, the resulting interval covers the calibrated probability no matter which class the sample actually belongs to. Compared to a single isotonic regression, this also reduces the chance of overfitting to the calibration set. The width of the interval shows us how sure the model is for an individual sample: the larger the difference between p_0 and p_1, the higher our uncertainty.
Usually, the interval is narrower for large data sets and wider for small or more challenging data sets. This is because the quality of p_0 and p_1 improves the more samples we have in our calibration set. With more samples, we can create more granular groups and derive a better representation of the true probability. However, there is a trade-off between the number of groups and the number of samples in each group: the fewer samples in a group, the less accurate our probability estimate will be, because we are more prone to outliers in that group.
However, we might want a single class probability instead of a range, for example to compare the result with other approaches. We can merge p_0 and p_1 into a single probability via p = p_1 / (1 - p_0 + p_1), which minimizes the log loss.
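As a quick illustration, here is a minimal sketch of that combination with made-up p_0 and p_1 values:

```python
import numpy as np

# Hypothetical lower/upper probabilities returned by a Venn-ABERS predictor
p0 = np.array([0.72, 0.15, 0.48])
p1 = np.array([0.81, 0.22, 0.60])

# Merge the interval into a single probability of class 1
# (the combination that minimizes log loss)
p = p1 / (1.0 - p0 + p1)
print(p)  # -> [0.74, 0.21, 0.54] (rounded)
```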
But why do we get calibrated probabilities by fitting two isotonic regressions?
Understanding how we fit an isotonic regression line to our data is essential to understanding Venn-ABERS predictors.
How does isotonic regression work?
In general, isotonic regression fits a non-decreasing function. The fitted line can stay at the same level or go up. But never down. This behavior is fundamental.
As the function is non-decreasing, it respects the original ordering of the predicted class scores. This ensures that the predicted scores and the estimated probabilities match up correctly.
The line maps any predicted score to an estimate of the true probability. Here, the probability is the mean of the true labels for that given predicted score in our calibration set.
For this, we must understand how we fit the line to the data.
Let's assume we have a trained binary classifier and a calibration set that the classifier did not see during training. We collect the classifier's score for each sample in the calibration set and sort these scores from lowest to highest. If different samples have the same score, we group them. For each group, we determine how often the positive class occurred: we divide the number of samples with class 1 by the total number of samples with that predicted score.
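A minimal sketch of this grouping step, with made-up scores and labels:

```python
import numpy as np

# Hypothetical calibration scores (already sorted) and their true labels
scores = np.array([0.2, 0.2, 0.4, 0.4, 0.4, 0.7, 0.7, 0.9])
labels = np.array([0,   1,   0,   1,   1,   1,   0,   1])

# Group samples with identical scores and compute the frequency of class 1
for s in np.unique(scores):
    group = labels[scores == s]
    print(f"score {s}: frequency of class 1 = {group.mean():.2f}")
```

Note the dip at score 0.7, where the frequency drops below that of score 0.4; this is exactly the kind of violation the next step takes care of.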

If our trained classifier is good, then our predicted scores should align well with the observed frequencies of the true labels.
But there will likely be cases in which the frequency of the positive class goes down even though the predicted score increases. In this case, we merge the samples of the two adjacent scores into one group and determine the frequency of the positive class for the merged group. In other words, we take the average (weighted by group size) of the frequencies of both predicted scores.

We continue doing this until the fitted line only stays at the same level or goes up. Essentially, we smooth out the predictions to ensure they align better with the actual outcomes.

The isotonic regression assigns a probability to each group of scores, reflecting the average of the true labels in that group. This ensures that if we map two predicted scores to probabilities, the higher score leads to the same or a higher probability. But never a lower one. Hence, the isotonic regression respects the ordering of the scores.
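We do not have to implement this pooling ourselves. Here is a minimal sketch using scikit-learn's `IsotonicRegression` on the same made-up calibration scores as above; the resulting step function maps any score to a non-decreasing probability estimate.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical calibration scores from a trained binary classifier and true labels
scores_cal = np.array([0.2, 0.2, 0.4, 0.4, 0.4, 0.7, 0.7, 0.9])
y_cal      = np.array([0,   1,   0,   1,   1,   1,   0,   1])

# Fit a non-decreasing step function mapping scores to pooled class-1 frequencies
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, y_cal)

# Map new predicted scores to calibrated probabilities (never decreasing in the score)
new_scores = np.array([0.3, 0.5, 0.8])
print(iso.predict(new_scores))
```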
Now that we understand the high-level theory behind Venn-ABERS predictors, let's put them to use.
How can we use it?
Luckily, we do not need to code the logic ourselves. Instead, we can get everything from the [venn-abers](https://github.com/ip200/venn-abers) library. The library contains an implementation of Venn-ABERS for binary and multiclass classification problems.
The library provides us with a `VennAbersCalibrator` class, which contains all the methods we need to calibrate our class probabilities. We can use the class in two ways: we can wrap it around a scikit-learn algorithm, or we can treat our classification algorithm and the Venn-ABERS calibrator separately.
To show you both approaches, I will extend the example given in the library.
Let's begin with the first approach: wrapping the Venn-ABERS predictor around a scikit-learn algorithm.
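A minimal sketch along the lines of the library's example could look like this. Note that the exact constructor arguments are my assumption; in particular, I assume `cal_size` controls the share of the training data held out for calibration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from venn_abers import VennAbersCalibrator  # assuming this import path

# Toy binary classification data set
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Wrap a scikit-learn classifier with an inductive Venn-ABERS predictor (IVAP);
# cal_size=0.2 (assumed parameter name) reserves 20% of the training data for calibration
clf = RandomForestClassifier(random_state=0)
va = VennAbersCalibrator(estimator=clf, inductive=True, cal_size=0.2)

# Fit on the training set, then get calibrated class probabilities on the test set
va.fit(X_train, y_train)
p_calibrated = va.predict_proba(X_test)
```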
Let's go through it in a bit more detail. First, we create a toy data set and split it into a training and a test set. Then, we define the Venn-ABERS predictor. We pass in a scikit-learn algorithm and choose the type of Venn-ABERS we want to use: inductive Venn-ABERS (IVAP) or cross Venn-ABERS (CVAP). The difference between IVAP and CVAP lies in how they run the calibration. IVAP uses a single calibration set and fits the Venn-ABERS predictor once. CVAP, in contrast, uses cross-validation, fitting the predictor multiple times. The results from the validation folds are combined to create the final predictor.
In the example, I chose the IVAP with a calibration set of 20% of the training set. To use CVAP, we can set the `inductive` parameter to False and add the `n_splits` parameter, which defines the number of splits for the cross-validation.
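For reference, the CVAP configuration might look like this (again a sketch, reusing the classifier from above):

```python
# Cross Venn-ABERS (CVAP): calibrate via 5-fold cross-validation instead of a single split
va_cv = VennAbersCalibrator(estimator=clf, inductive=False, n_splits=5)
va_cv.fit(X_train, y_train)
p_calibrated_cv = va_cv.predict_proba(X_test)
```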
After defining the Venn-ABERS predictor, we fit it on the training set. Then, we derive the calibrated class probabilities on the test set using the `predict_proba` method.
The second approach is very similar. Instead of passing a scikit-learn classifier when defining the Venn-ABERS predictor, we handle the classifier and the calibrator separately. For this, we create a calibration set and fit the classifier independently of the Venn-ABERS predictor.
Then, we use the `predict_proba` method of the `VennAbersCalibrator` class to calibrate the class probabilities. For this, we pass in the predicted scores of the classifier on the calibration set `p_cal`, the true values of the calibration set `y_cal`, and the predicted scores on the test set `p_test`.
If we are interested in the probability range of p_0 and p_1, we can set the `p0_p1_output` parameter to True.
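Putting it together, a sketch of the second approach could look like this, continuing with the data split from the first example. The keyword arguments `p_cal`, `y_cal`, `p_test`, and `p0_p1_output` follow the library's documentation; the assumption that the method returns the calibrated probabilities together with the p_0/p_1 pairs when `p0_p1_output=True` is mine.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from venn_abers import VennAbersCalibrator  # assuming this import path

# Hold out part of the training data as a calibration set
X_train_proper, X_cal, y_train_proper, y_cal = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

# Fit the underlying classifier on the proper training set only
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train_proper, y_train_proper)

# Uncalibrated scores on the calibration and test sets
p_cal = clf.predict_proba(X_cal)
p_test = clf.predict_proba(X_test)

# Calibrate the test scores; p0_p1_output=True additionally returns the p_0/p_1 interval
va = VennAbersCalibrator()
p_calibrated, p0_p1 = va.predict_proba(
    p_cal=p_cal, y_cal=y_cal, p_test=p_test, p0_p1_output=True
)
```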
Conclusion
In this article, I have shown you another approach to quantifying the uncertainty in classification problems. Instead of using prediction sets, we can calibrate the class probabilities.
If you have stayed until here, you should now
- understand the high-level theory behind Venn-ABERS predictors and
- be able to use Venn-ABERS predictors in practice.
If you want to dive deeper into Venn-ABERS predictors, check out the Venn-ABERS paper. Otherwise, comment and/or see you in my next article.