Another (Conformal) Way to Predict Probability Distributions

In a previous article, we explored the capabilities of Catboost's multi-quantile loss function, which allows for the prediction of multiple quantiles using a single model. This approach elegantly overcomes one of the limitations of traditional quantile regression: the need to develop a separate model for each quantile, or to store the entire training set in the model. However, quantile regression has another disadvantage, which we will discuss in this article: predicted quantiles may be biased, leaving no guarantee of calibration or coverage. This article demonstrates a way to overcome this with conformal multi-quantile regression. I would encourage anyone who hasn't been following this series to read the following articles first:
A New Way to Predict Probability Distributions
Understanding Noisy Data and Uncertainty in Machine Learning
How To Predict Risk-Proportional Intervals With "Conformal Quantile Regression"
Recap: Why Multi-Quantile Regression?
Multi-quantile regression enables us to use a single model to predict multiple target quantiles. Because there is no computational constraint necessitating one model per quantile, nor the limitation of storing the entire training set in the model (e.g. KNN, Quantile Regression Forests), we can efficiently predict more quantiles and get a better feel for what the conditional target distribution looks like.
Using traditional quantile regression, generating a 95% prediction interval would require one model for the 2.5th quantile, one for the 97.5th quantile, and possibly a third for the expected value or the 50th quantile. A single prediction from each model would look something like this:

Assuming these quantiles are calibrated, they reveal a few insights. The first is that the probability the target is less than or equal to 3.6, given the features, is around 0.50, or 50%. Similarly, the probability that the target value is between 3.25 and 4.38, given the features, is roughly 0.95, or 95%.
While the models' output is good and precisely what we required, we may want to adjust our risk tolerance dynamically. For instance, what if we need to be more conservative and require a 99% prediction interval? Similarly, what if we are more risk-seeking and can tolerate a 90% or 80% prediction interval? What if we want answers to questions like "given the features, what is the probability that the target is greater than y1?" We might also want to ask questions like "given the features, what is the probability that the target is between y1 and y2?" Multi-quantile regression facilitates answering these questions by predicting as many quantiles as specified:

The more quantiles that can be accurately predicted, the more the risk tolerance can be adjusted on the fly, and the more we can answer general probability questions about the conditional target distribution.
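For example, once a dense grid of quantiles has been predicted, a question like "what is the probability that the target exceeds y1?" can be answered approximately by locating y1 among the predicted quantile values. Below is a minimal sketch of this idea; the quantile grid, the prob_target_leq helper, and the stand-in predicted_values array are all hypothetical, not part of the model built later in this article:

import numpy as np

# Hypothetical inputs: the quantile levels a model was trained on and the
# corresponding predicted quantile values for one test instance.
quantiles = np.array([q / 200 for q in range(1, 200)])     # 0.005, 0.010, ..., 0.995
predicted_values = np.linspace(3.0, 4.5, len(quantiles))   # stand-in for a real prediction

def prob_target_leq(y1, quantiles, predicted_values):
    """Approximate P(target <= y1 | features) from a grid of predicted quantiles."""
    # The predicted quantile values should be (nearly) monotone in the quantile level,
    # so the left-tail probability is roughly the largest quantile level whose
    # predicted value does not exceed y1.
    below = quantiles[predicted_values <= y1]
    return below.max() if below.size > 0 else 0.0

# P(target > y1) and P(y1 < target <= y2) follow from the same grid
y1, y2 = 3.6, 4.2
print("P(target <= y1) ~", prob_target_leq(y1, quantiles, predicted_values))
print("P(target >  y1) ~", 1 - prob_target_leq(y1, quantiles, predicted_values))
print("P(y1 < target <= y2) ~", prob_target_leq(y2, quantiles, predicted_values) - prob_target_leq(y1, quantiles, predicted_values))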
Note that single decision tree models have been used to generate multiple quantile predictions. However, this relies on the trees storing all target values in the leaf nodes. At prediction time, a quantile is specified and computed empirically from the data in the leaf node, requiring the model to store the entire training set. This also means deep trees could have very few examples to work with in the leaf nodes.
Catboost is fundamentally different because it stores only the specified quantiles in its terminal nodes. Moreover, the loss function is optimized to predict each specified quantile. We also enjoy the performance gains Catboost offers through its underlying architecture.
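For reference, the multi-quantile loss is specified through Catboost's loss_function string. Here is a minimal sketch, assuming a recent Catboost version that supports the MultiQuantile loss and using a toy synthetic dataset rather than the data from this article:

import numpy as np
from catboost import CatBoostRegressor

# Toy data: one noisy feature-target relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(1000, 1))
y = 2 * x.ravel() + rng.normal(0, 1, size=1000)

# A single model trained to predict the 2.5th, 50th, and 97.5th quantiles
model = CatBoostRegressor(
    loss_function='MultiQuantile:alpha=0.025,0.5,0.975',
    iterations=100,
    verbose=0
)
model.fit(x, y)

# One row of output per instance, one column per requested quantile
preds = model.predict(x[:5])
print(preds.shape)  # (5, 3)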
The Problem with Quantile Regression
In traditional and multi-quantile regression, there isn't always a statistical guarantee that quantiles are unbiased. This means, for a model trained to predict the 95th quantile of the target distribution, there's no guarantee that 95% of observations will actually be less than or equal to the prediction. This is problematic in high-risk applications where accurate probability representations are required to make critical decisions.
Quantile regression can also produce prediction intervals that are too conservative and subsequently uninformative. In general, prediction intervals should be as narrow as possible while maintaining the desired coverage level.
Conformal Multi-Quantile Regression
The idea behind conformal quantile regression is to adjust predicted quantiles to accurately reflect the desired risk tolerance and interval length. This is accomplished through a "calibration" step that computes "conformity scores" to correct the predicted quantiles. More details about conformal quantile regression can be found in this paper and this article. For conformal multi-quantile regression, we will utilize the following theorem:

Don't worry if this seems overly abstract; the steps are actually straightforward:
- Create a training, calibration, and testing set. Fit the multi-quantile model on the training set to predict all quantiles of interest.
- Make predictions on the calibration set. For each calibration instance and predicted quantile, compute the difference between the predicted quantile and the corresponding target value. These are the conformity scores.
- For each testing example and predicted quantile (say q), subtract the 1-q quantile of the conformity scores corresponding to quantile q from the model's predicted quantile. These are the new predicted quantiles (sketched in notation after this list).
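In symbols, the adjustment in the last step can be sketched as follows. This is my paraphrase of the procedure above, not the theorem from the original paper; here \hat{q}_{\alpha} denotes the model's predicted quantile at level \alpha and n_{\mathrm{cal}} the number of calibration points:

s_i^{(\alpha)} = \hat{q}_{\alpha}(x_i) - y_i, \quad i = 1, \dots, n_{\mathrm{cal}}

\tilde{q}_{\alpha}(x) = \hat{q}_{\alpha}(x) - \mathrm{Quantile}_{1-\alpha}\left( \{ s_i^{(\alpha)} \}_{i=1}^{n_{\mathrm{cal}}} \right)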
We can implement this logic in a Python class:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, CatBoostError
from typing import Iterable


class ConformalMultiQuantile(CatBoostRegressor):

    def __init__(self, quantiles: Iterable[float], *args, **kwargs):
        """
        Initialize a ConformalMultiQuantile object.

        Parameters
        ----------
        quantiles : Iterable[float]
            The list of quantiles to use in multi-quantile regression.
        *args
            Variable length argument list.
        **kwargs
            Arbitrary keyword arguments.
        """
        kwargs['loss_function'] = self.create_loss_function_str(quantiles)
        super().__init__(*args, **kwargs)
        self.quantiles = quantiles
        self.calibration_adjustments = None

    @staticmethod
    def create_loss_function_str(quantiles: Iterable[float]):
        """
        Format the quantiles as a string for Catboost

        Parameters
        ----------
        quantiles : Iterable[float]
            The quantiles to predict

        Returns
        -------
        The loss function definition for multi-quantile regression
        """
        quantile_str = str(quantiles).replace('[', '').replace(']', '')
        return f'MultiQuantile:alpha={quantile_str}'

    def calibrate(self, x_cal, y_cal):
        """
        Calibrate the multi-quantile model

        Parameters
        ----------
        x_cal : ndarray
            Calibration inputs
        y_cal : ndarray
            Calibration targets
        """
        # Ensure the model is fitted
        if not self.is_fitted():
            raise CatBoostError('There is no trained model to calibrate. Use fit() to train the model, then call calibrate().')

        # Make predictions on the calibration set
        uncalibrated_preds = self.predict(x_cal)

        # Compute the difference between the uncalibrated predicted quantiles and the target
        conformity_scores = uncalibrated_preds - np.array(y_cal).reshape(-1, 1)

        # Store the 1-q quantile of the conformity scores for each quantile q
        self.calibration_adjustments = \
            np.array([np.quantile(conformity_scores[:, i], 1 - q) for i, q in enumerate(self.quantiles)])

    def predict(self, data, prediction_type=None, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None, task_type="CPU"):
        """
        Predict using the trained model.

        Parameters
        ----------
        data : pandas.DataFrame or numpy.ndarray
            Data to make predictions on
        prediction_type : str, optional
            Type of prediction result, by default None
        ntree_start : int, optional
            Number of trees to start prediction from, by default 0
        ntree_end : int, optional
            Number of trees to end prediction at, by default 0
        thread_count : int, optional
            Number of parallel threads to use, by default -1
        verbose : bool or int, optional
            Verbosity, by default None
        task_type : str, optional
            Type of task, by default "CPU"

        Returns
        -------
        numpy.ndarray
            The predicted values for the input data.
        """
        preds = super().predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)

        # Adjust the predicted quantiles according to the quantiles of the conformity scores
        if self.calibration_adjustments is not None:
            preds = preds - self.calibration_adjustments

        return preds
Example: The Superconductivity Dataset
We'll implement conformal multi-quantile regression on the superconductivity dataset available on the UCI Machine Learning Repository. This dataset provides 21,263 instances of 81 superconductor features with their critical temperature (the target). The data is split so that ~64% is allocated for training, ~16% for calibration, and 20% for testing.
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor, CatBoostError
from sklearn.model_selection import train_test_split
from typing import Iterable
pd.set_option('display.max.columns', None)
sns.set()
# Read in superconductivity dataset
data = pd.read_csv('train.csv')
# Predicting critical temperature
target = 'critical_temp'
# 80/20 train/test split
x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), data[target], test_size=0.20)
# Hold out 20% of the training data for calibration
x_train, x_cal, y_train, y_cal = train_test_split(x_train, y_train, test_size=0.20)
print("Training shape:", x_train.shape) # Training shape: (13608, 81)
print("Calibration shape:", x_cal.shape) # Calibration shape: (3402, 81)
print("Testing shape:", x_test.shape) # Testing shape: (4253, 81)
We'll specify a set of quantiles to predict. To illustrate the power of multi-quantile regression, the model will predict 199 quantiles, evenly spaced from 0.005 to 0.995 – this is probably a bit excessive in practice. Next, we'll fit the conformal multi-quantile model, make uncalibrated predictions, calibrate the model on the calibration set, and make calibrated predictions.
# Store quantiles 0.005 through 0.995 in a list
quantiles = [q/200 for q in range(1, 200)]
# Instantiate the conformal multi-quantile model
conformal_model = ConformalMultiQuantile(iterations=100,
                                         quantiles=quantiles,
                                         verbose=10)
# Fit the conformal multi-quantile model
conformal_model.fit(x_train, y_train)
# Get predictions before calibration
preds_uncalibrated = conformal_model.predict(x_test)
preds_uncalibrated = pd.DataFrame(preds_uncalibrated, columns=[f'pred_{q}' for q in quantiles])
# Calibrate the model
conformal_model.calibrate(x_cal, y_cal)
# Get calibrated predictions
preds_calibrated = conformal_model.predict(x_test)
preds_calibrated = pd.DataFrame(preds_calibrated, columns=[f'pred_{q}' for q in quantiles])
preds_calibrated.head()
The resulting predictions should look something like this:

On the testing set, we can measure how well the uncalibrated and calibrated predictions align with the left-tail probabilities they're intended to represent. For instance, if the quantiles are calibrated, 40% of target values should be less than or equal to predicted quantile 0.40, 90% of target values should be less than or equal to predicted quantile 0.90, etc. The code below computes the mean absolute error (MAE) between the desired left-tail probability and the actual left-tail probability encompassed by the predicted quantiles:
# Initialize an empty DataFrame
comparison_df = pd.DataFrame()
# For each predicted quantile
for i, quantile in enumerate(quantiles):

    # Compute the proportion of testing observations that were less than or equal
    # to the uncalibrated predicted quantile
    actual_prob_uncal = np.mean(y_test.values <= preds_uncalibrated[f'pred_{quantile}'])

    # Compute the proportion of testing observations that were less than or equal
    # to the calibrated predicted quantile
    actual_prob_cal = np.mean(y_test.values <= preds_calibrated[f'pred_{quantile}'])

    comparison_df_curr = pd.DataFrame({
        'desired_probability': quantile,
        'actual_uncalibrated_probability': actual_prob_uncal,
        'actual_calibrated_probability': actual_prob_cal}, index=[i])

    comparison_df = pd.concat([comparison_df, comparison_df_curr])
comparison_df['abs_diff_uncal'] = (comparison_df['desired_probability'] - comparison_df['actual_uncalibrated_probability']).abs()
comparison_df['abs_diff_cal'] = (comparison_df['desired_probability'] - comparison_df['actual_calibrated_probability']).abs()
print("Uncalibrated quantile MAE:", comparison_df['abs_diff_uncal'].mean())
print("Calibrated quantile MAE:", comparison_df['abs_diff_cal'].mean())
# Uncalibrated quantile MAE: 0.02572999018133225
# Calibrated quantile MAE: 0.007850550660662823
The uncalibrated quantiles were off by about 0.026 and the calibrated quantiles by 0.008, on average. Hence, the calibrated quantiles were more aligned with the desired left-tail probabilities.
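To visualize this, a calibration plot comparing desired and actual left-tail probabilities can be drawn for both models. This is a sketch assuming the comparison_df assembled above and the matplotlib import from earlier in the article:

# Visualize desired vs. actual left-tail probabilities for both models
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(comparison_df['desired_probability'],
        comparison_df['desired_probability'],
        label='Perfect Calibration')

ax.scatter(comparison_df['desired_probability'],
           comparison_df['actual_uncalibrated_probability'],
           color='orange',
           label='Uncalibrated Model')

ax.scatter(comparison_df['desired_probability'],
           comparison_df['actual_calibrated_probability'],
           color='green',
           label='Calibrated Model')

ax.set_xlabel('Desired Left-Tail Probability')
ax.set_ylabel('Actual Left-Tail Probability')
ax.set_title('Desired vs Actual Left-Tail Probability')
ax.legend()
plt.show()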

This may not seem like a dramatic difference in calibration; however, the error in the uncalibrated model becomes much clearer when we analyze actual vs. desired coverage:
Python">coverage_df = pd.DataFrame()
for i, alpha in enumerate(np.arange(0.01, 0.41, 0.01)):
lower_quantile = round(alpha/2, 3)
upper_quantile = round(1 - alpha/2, 3)
# Compare actual to expected coverage for both models
lower_prob_uncal = comparison_df[comparison_df['desired_probability'] == lower_quantile]['actual_uncalibrated_probability'].values[0]
upper_prob_uncal = comparison_df[comparison_df['desired_probability'] == upper_quantile]['actual_uncalibrated_probability'].values[0]
lower_prob_cal = comparison_df[comparison_df['desired_probability'] == lower_quantile]['actual_calibrated_probability'].values[0]
upper_prob_cal = comparison_df[comparison_df['desired_probability'] == upper_quantile]['actual_calibrated_probability'].values[0]
coverage_df_curr = pd.DataFrame({'desired_coverage':1-alpha,
'actual_uncalibrated_coverage':upper_prob_uncal - lower_prob_uncal,
'actual_calibrated_coverage':upper_prob_cal - lower_prob_cal}, index=[i])
coverage_df = pd.concat([coverage_df, coverage_df_curr])
coverage_df['abs_diff_uncal'] = (coverage_df['desired_coverage'] - coverage_df['actual_uncalibrated_coverage']).abs()
coverage_df['abs_diff_cal'] = (coverage_df['desired_coverage'] - coverage_df['actual_calibrated_coverage']).abs()
print("Uncalibrated Coverage MAE:", coverage_df['abs_diff_uncal'].mean())
print("Calibrated Coverage MAE:", coverage_df['abs_diff_cal'].mean())
# Uncalibrated Coverage MAE: 0.03660674817775689
# Calibrated Coverage MAE: 0.003543616270867622
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(coverage_df['desired_coverage'],
        coverage_df['desired_coverage'],
        label='Perfect Calibration')

ax.scatter(coverage_df['desired_coverage'],
           coverage_df['actual_uncalibrated_coverage'],
           color='orange',
           label='Uncalibrated Model')

ax.scatter(coverage_df['desired_coverage'],
           coverage_df['actual_calibrated_coverage'],
           color='green',
           label='Calibrated Model')
ax.set_xlabel('Desired Coverage')
ax.set_ylabel('Actual Coverage')
ax.set_title('Desired vs Actual Coverage')
ax.legend()
plt.show()

The uncalibrated model tends to be too conservative and covers more examples than desired. The calibrated model, on the other hand, exhibits near-perfect alignment with each of the desired coverages.
Moreover, the average length of the prediction intervals generated by the calibrated model is less than that of the uncalibrated model. Thus, not only is coverage better for the calibrated model, but its prediction intervals are also more informative.
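As a rough check of this claim, the average interval widths can be compared directly. This is a sketch assuming the preds_uncalibrated and preds_calibrated DataFrames built earlier, with a 95% interval formed from the 'pred_0.025' and 'pred_0.975' columns of the quantile grid:

# Average width of the 95% prediction interval for each model
uncal_width = (preds_uncalibrated['pred_0.975'] - preds_uncalibrated['pred_0.025']).mean()
cal_width = (preds_calibrated['pred_0.975'] - preds_calibrated['pred_0.025']).mean()

print("Average uncalibrated 95% interval width:", uncal_width)
print("Average calibrated 95% interval width:", cal_width)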

One might ask what happens if we allow the uncalibrated model to include the calibration set as training data. This makes sense in practice because we wouldn't throw away good training data for no reason. Here are the results:
# Fit a model using the training and calibration data
regular_model = ConformalMultiQuantile(iterations=100,
                                       quantiles=quantiles,
                                       verbose=10)

regular_model.fit(pd.concat([x_train, x_cal]), pd.concat([y_train, y_cal]))

# Fit a model on the training data only
conformal_model = ConformalMultiQuantile(iterations=100,
                                         quantiles=quantiles,
                                         verbose=10)

conformal_model.fit(x_train, y_train)
# Get predictions before calibration
preds_uncalibrated = regular_model.predict(x_test)
preds_uncalibrated = pd.DataFrame(preds_uncalibrated, columns=[f'pred_{q}' for q in quantiles])
# Calibrate the model
conformal_model.calibrate(x_cal, y_cal)
# Get calibrated predictions
preds_calibrated = conformal_model.predict(x_test)
preds_calibrated = pd.DataFrame(preds_calibrated, columns=[f'pred_{q}' for q in quantiles])
comparison_df = pd.DataFrame()
# Compare actual to predicted left-tailed probabilities
for i, quantile in enumerate(quantiles):

    actual_prob_uncal = np.mean(y_test.values <= preds_uncalibrated[f'pred_{quantile}'])
    actual_prob_cal = np.mean(y_test.values <= preds_calibrated[f'pred_{quantile}'])

    comparison_df_curr = pd.DataFrame({
        'desired_probability': quantile,
        'actual_uncalibrated_probability': actual_prob_uncal,
        'actual_calibrated_probability': actual_prob_cal}, index=[i])

    comparison_df = pd.concat([comparison_df, comparison_df_curr])
comparison_df['abs_diff_uncal'] = (comparison_df['desired_probability'] - comparison_df['actual_uncalibrated_probability']).abs()
comparison_df['abs_diff_cal'] = (comparison_df['desired_probability'] - comparison_df['actual_calibrated_probability']).abs()
print("Uncalibrated quantile MAE:", comparison_df['abs_diff_uncal'].mean())
print("Calibrated quantile MAE:", comparison_df['abs_diff_cal'].mean())
# Uncalibrated quantile MAE: 0.023452756375340143
# Calibrated quantile MAE: 0.0061827359227361834
Even with less training data than the uncalibrated model, the calibrated model outputs better quantiles. What's more, the models perform similarly when we compare the expected values of the predicted quantiles to the target values:
from sklearn.metrics import r2_score, mean_absolute_error
print(f"Uncalibrated R2 Score: {r2_score(y_test, preds_uncalibrated.mean(axis=1))}")
print(f"Calibrated R2 Score: {r2_score(y_test, preds_calibrated.mean(axis=1))} n")
print(f"Uncalibrated MAE: {mean_absolute_error(y_test, preds_uncalibrated.mean(axis=1))}")
print(f"Calibrated MAE: {mean_absolute_error(y_test, preds_calibrated.mean(axis=1))} n")
# Uncalibrated R2 Score: 0.8060126144892599
# Calibrated R2 Score: 0.8053382438575666
# Uncalibrated MAE: 10.622258046774979
# Calibrated MAE: 10.557269513856014
Final Thoughts
There are no silver bullets in machine learning, and conformal quantile regression is no exception. The glue that holds conformal prediction theory together is the assumption that the underlying data is exchangeable. If, for instance, the distribution of the data drifts over time (as is often the case in real-world applications), then conformal prediction can no longer make strong probability guarantees. There are ways around this assumption, but they ultimately depend on the severity of the data drift and the nature of the learning problem. It may also be less than ideal to set aside valuable training data for calibration.
As always, the machine learning practitioner is responsible for understanding the nature of the data and applying appropriate techniques. Thanks for reading!
References
- Catboost Loss Functions – https://catboost.ai/en/docs/concepts/loss-functions-regression
- Conformalized Quantile Regression – https://arxiv.org/pdf/1905.03222.pdf
- Conformal Prediction Beyond Exchangeability – https://arxiv.org/pdf/2202.13415.pdf
- The Superconductivity Dataset – https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
- How to Predict Risk-Proportional Intervals with Conformal Quantile Regression – https://towardsdatascience.com/how-to-predict-risk-proportional-intervals-with-conformal-quantile-regression-175775840dc4
- How to Predict Full Probability Distributions using Machine Learning Conformal Prediction – https://valeman.medium.com/how-to-predict-full-probability-distribution-using-machine-learning-conformal-predictive-f8f4d805e420