All You Need Is Conformal Prediction
We must know how certain a model is about its predictions because wrong predictions carry risk. Without quantifying the model's uncertainty, an accurate prediction and a wild guess look the same. For example, a self-driving car must be certain that the driving path is free of obstacles. I have written about this in another article.
But how can we quantify the uncertainty of our model?
This is where Conformal Prediction comes into the picture. Conformal Prediction is a framework for uncertainty quantification. The approach can turn any point prediction into statistically valid prediction regions. The region can be a set of classes for classification or an interval for regression problems.
How do we turn a point forecast into a prediction region using Conformal Prediction?
Conformal Prediction uses past experience to determine the uncertainty of new predictions. To apply Conformal Prediction, we need a non-conformity score, a significance level alpha, and a calibration set.
The non-conformity score turns a heuristic uncertainty score, such as the soft-max score in classification, into a rigorous score. Rigorous means that the output has probabilistic guarantees of covering the true label. Turning a heuristic score into a rigorous score is called calibration.
During inference, we use the non-conformity score to determine our prediction regions. The non-conformity score has a strong impact on the results, as it decides the quality of our prediction regions. Hence, choosing the right non-conformity score is an important design choice.
The significance level restricts the frequency of errors in the Conformal Prediction algorithm. Choosing a significance level alpha of 0.1 means that our true value may lie outside our prediction set at most 10 % of the time.
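For reference, this statement can be written as a single inequality, the so-called marginal coverage guarantee (here in LaTeX notation):

```latex
% The prediction region C(X) contains the true label Y with probability
% of at least 1 - alpha, marginally over calibration and test data.
\mathbb{P}\left( Y_{\text{test}} \in C(X_{\text{test}}) \right) \ge 1 - \alpha
```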
Based on the significance level and the calibration set, we determine a threshold of the non-conformity score. During inference, this threshold decides which scores will be part of the prediction region. To get reliable results, we must ensure that the model has not seen the calibration set during training.
The above explanation might sound a bit vague. Let's make it more concrete and run through an example.
The intuition behind Conformal Prediction
Let's say we have a dataset of 10,000 animal pictures and we want to classify the animal in each picture. However, classifying each picture is not enough. We also want to be certain, with a probability of 90 %, that our prediction contains the animal actually shown in the picture. To guarantee this coverage, we will use Conformal Prediction.
Hence, we need a non-conformity score, a significance level alpha, and a calibration set.
For the non-conformity score, we need a heuristic score. As we have a multi-class classification problem, we will use the soft-max score of the model. Many models return this score as a "probability." We turn this "probability" into a non-conformity score by taking 1 – soft-max score.
For the significance level alpha, we choose 0.1 as we want coverage of 90 %. The coverage is equal to 1 – alpha.
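As a minimal sketch (assuming we work with NumPy arrays; the function name is just illustrative), these two ingredients could look like this:

```python
import numpy as np

# Significance level: we want a coverage of 90 %, so alpha = 1 - 0.9.
alpha = 0.1

def nonconformity_score(softmax_of_true_class: np.ndarray) -> np.ndarray:
    """Turn the heuristic soft-max score of the true class into a
    non-conformity score: the higher the score, the less certain the model."""
    return 1.0 - softmax_of_true_class
```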
Usually, we would split the dataset into a train and test set. However, as we need to calibrate our model to get valid prediction sets, we also need a calibration set. Hence, we end up with three sets (a minimal splitting sketch follows the list):
- A training set, which we use to train our model
- A calibration set, which we use to calibrate our model
- A test set, which we use to test the model and prediction sets
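A minimal splitting sketch, assuming the data already lives in arrays X and y and using scikit-learn's train_test_split (the exact proportions are just an example):

```python
from sklearn.model_selection import train_test_split

# First split off a test set, then split the remainder into training and calibration data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cal, y_train, y_cal = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# This yields roughly 60 % training, 20 % calibration, and 20 % test data.
```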
I will skip the entire training part of the model as this is unimportant for now. Let's assume we have a trained multi-class classification model.
We take the trained model and predict all samples in our calibration set. Note that the model must not have seen this data during training.
For example, we show the model a picture of a cow and receive a soft-max score of 0.9 for the picture being a cow. For a picture of a chicken, we get a score of 0.7, and for a harder picture of a cat, we get 0.4.
In this step, we only evaluate the soft-max values of the model for the true class. We do not care about wrong predictions. We then turn the soft-max values into our non-conformity score, in this case, 1 – soft-max score of the true class.
Then, we sort the non-conformity scores. The lower the value, the more certain the model is about its prediction. We will end up with a distribution of our model's uncertainty.
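Continuing the sketch from above, and assuming the trained model exposes soft-max scores via predict_proba (as scikit-learn classifiers do) and that the labels in y_cal are integer class indices, this step could look like this:

```python
# Soft-max scores for the calibration set, shape (n_cal, n_classes).
cal_softmax = model.predict_proba(X_cal)

# Keep only the soft-max score of the TRUE class of each sample
# and turn it into a non-conformity score (1 - soft-max score).
cal_scores = nonconformity_score(cal_softmax[np.arange(len(y_cal)), y_cal])

# Sorting is not needed for the next step, but it shows the
# distribution of the model's uncertainty on the calibration set.
cal_scores_sorted = np.sort(cal_scores)
```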

We can use this distribution to define a threshold on the non-conformity score. The threshold separates the predictions the model is certain about from those it is uncertain about. Where this threshold is placed depends on the significance level alpha. In our case, alpha is 0.1 as we want a guaranteed coverage of 90 % (1 – alpha).
Hence, we want 90 % of the values below the threshold and 10 % above it. So, we need to compute the 0.9 quantile of our distribution. This quantile guarantees that at least 90 % of the calibration samples have a non-conformity score at or below the threshold.
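A sketch of the threshold computation; note that split Conformal Prediction usually applies a small finite-sample correction to the quantile level, which reduces to roughly 0.9 for large calibration sets (the method argument of np.quantile assumes a recent NumPy version):

```python
n = len(cal_scores)

# Finite-sample corrected quantile level instead of plain 1 - alpha.
q_level = np.ceil((n + 1) * (1 - alpha)) / n

# The threshold on the non-conformity score that calibrates the model.
threshold = np.quantile(cal_scores, q_level, method="higher")
```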

Once we have determined the threshold, our model is calibrated.
Let's predict samples from the test set. We show the model two pictures: a dog and a whale. This time, we compute the non-conformity score for all classes. Every class with a score at or below our threshold becomes part of the prediction set. That's it. It is this simple.
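Continuing the sketch, the prediction step for the test pictures could look like this:

```python
# Soft-max scores for the test pictures, shape (n_test, n_classes).
test_softmax = model.predict_proba(X_test)

# This time, compute the non-conformity score for EVERY class.
test_scores = 1.0 - test_softmax

# A class joins the prediction set if its score is at or below the threshold.
prediction_sets = test_scores <= threshold  # boolean matrix, shape (n_test, n_classes)

# The prediction set of the first picture as a list of class indices.
first_prediction_set = np.where(prediction_sets[0])[0]
```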

Now you might wonder, what happens if the underlying model and heuristic score are bad? How can Conformal Prediction guarantee coverage if we calibrate our model using only the non-conformity score for the true class?
In the above example, we only control the coverage, not the efficiency, i.e., the size of the prediction set. We only guarantee that the prediction set contains the true label with the desired probability. If the underlying model is bad, we could end up with all possible labels in our prediction set, resulting in meaningless prediction sets.
Hence, we should not only care about coverage but also about the efficiency of our prediction sets. Efficiency means that the prediction region is as small as possible.
How can we do this?
As we can see, the quality of our prediction set mainly depends on the score function. The score contains almost all the information about our underlying model and data. Moreover, all our decisions are based on the score.
Hence, to reach a high efficiency, we need to choose a non-conformity score that tells us when our model is wrong, i.e., a score that does not only consider the true class. For example, we could use the cumulative sum of the soft-max scores, sorted from most to least likely, until we reach the true label.
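As a hedged sketch of such a score (often called an adaptive prediction set score; the function name is illustrative), the cumulative sum for the true class of one sample could be computed like this:

```python
def cumulative_nonconformity_score(softmax_row: np.ndarray, true_class: int) -> float:
    """Sort the soft-max scores from most to least likely and sum them
    until (and including) the true class. A well-ranked true class yields
    a small score, a badly-ranked one a large score."""
    order = np.argsort(softmax_row)[::-1]              # classes from most to least likely
    ranked_scores = softmax_row[order]
    true_rank = int(np.where(order == true_class)[0][0])
    return float(np.sum(ranked_scores[: true_rank + 1]))
```

During calibration and prediction, a score like this would replace the simple 1 – soft-max score from the example above.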
Hence, choosing the right non-conformity score is an important decision in Conformal Prediction.
As you saw, applying Conformal Prediction is simple and easy to understand. But there are more advantages I want to show you.
Advantages
The main advantage of Conformal Prediction is that it guarantees marginal coverage: the prediction region contains the true label with the chosen probability, averaged over all data points. Many other approaches lack such guarantees.
Conformal Prediction is model agnostic. We can use Conformal Prediction with any underlying model, as we only calibrate a point forecast. However, the better the underlying prediction, the better the prediction region. Here, the saying garbage in, garbage out also applies.
Besides being model agnostic, Conformal Prediction is also distribution-free. It works with any dataset. We do not need to know any prior probabilities (unlike in Bayesian learning) or assume any distribution. Conformal Prediction only assumes that the data points are exchangeable (a slightly weaker assumption than i.i.d.). Roughly speaking, this means that the calibration and test data come from the same distribution.
Thus, Conformal Prediction has a broad range of applications. We can apply the approach to regression, classification, or time series forecasting tasks. We can even use it to turn other uncertainty methods, such as multi-class probabilities or Bayesian posteriors, into trustworthy tools.
The approach is simple to implement and easy to use, as the intuition behind Conformal Prediction is easy to grasp. We do not need to retrain our model to calibrate its point forecast. Hence, we can easily wrap Conformal Prediction around existing models.
A recipe with many implementations
When you get started with Conformal Prediction, you will see that there are many algorithms. It can be overwhelming.
However, all algorithms follow the same steps. Once you understand the intuition behind Conformal Prediction, understanding the different algorithms is easy.
There are five steps:
1. Choose significance level/coverage and non-conformity score
Choose a significance level alpha that defines the guaranteed coverage. Also, choose a heuristic uncertainty score and the non-conformity score you derive from it.
2. Split dataset
We need a dataset the model has not seen during training to calibrate the model's predictions.
3. Train model
Train the model on the training subset of your dataset.
4. Calibrate model
Calibrate the model on the calibration subset using your chosen Conformal Prediction algorithm. For this, calculate the heuristic scores and make them rigorous through calibration.
5. Predict on unseen data
Make predictions on unseen data and use the calibrated threshold from step 4 to determine the prediction region.
Conformal Prediction algorithms can differ in detail. They might choose a different approach to splitting the dataset or a different non-conformity score. However, they still follow the five steps above.
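Putting the five steps together, a compact end-to-end sketch of split Conformal Prediction for classification might look like this (the function name, split sizes, and the logistic-regression model are illustrative assumptions; labels are assumed to be integer class indices 0, ..., K-1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def split_conformal_classification(X, y, alpha=0.1, random_state=42):
    # Step 1: significance level alpha and non-conformity score (1 - soft-max of true class).
    # Step 2: split the dataset into training, calibration, and test sets.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    X_train, X_cal, y_train, y_cal = train_test_split(X_rest, y_rest, test_size=0.25, random_state=random_state)

    # Step 3: train the underlying model (any classifier with predict_proba would do).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Step 4: calibrate: compute non-conformity scores on the calibration set
    # and take the finite-sample corrected quantile as the threshold.
    cal_scores = 1.0 - model.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    threshold = np.quantile(cal_scores, q_level, method="higher")

    # Step 5: predict on unseen data: every class whose score stays at or below
    # the threshold becomes part of the prediction set.
    test_scores = 1.0 - model.predict_proba(X_test)
    prediction_sets = test_scores <= threshold
    return prediction_sets, y_test
```

This helper mirrors the snippets from the intuition section; on real data, you would check the empirical coverage on the test set against 1 – alpha.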
In this article, I have shown you why you should prefer Conformal Prediction over other uncertainty quantification methods. I showed you the intuition behind Conformal Prediction, giving you the foundation to understand any Conformal Prediction approach for any problem.
However, I have only scratched the surface. There is much more to learn. I will dive deeper into Conformal Prediction in future articles. In the meantime, please feel free to comment or check out the paper "A Gentle Introduction to Conformal Prediction." Otherwise, see you in my next article.