Another (Conformal) Way to Predict Probability Distributions

In a previous article, we explored the capabilities of Catboost's multi-quantile loss function, which allows for the prediction of multiple quantiles using a single model. This approach elegantly overcomes one of the limitations of traditional quantile regression: the need to develop a separate model for each quantile, or to store the entire training set in the model. However, quantile regression has another disadvantage, which we will discuss in this article: predicted quantiles may be biased, leaving no guarantee of calibration or coverage. This article demonstrates a way to overcome this with conformal multi-quantile regression. I would encourage anyone who hasn't been following this series to read the following articles first:
A New Way to Predict Probability Distributions
Understanding Noisy Data and Uncertainty in Machine Learning
How To Predict Risk-Proportional Intervals With "Conformal Quantile Regression"
Recap: Why Multi-Quantile Regression?
Multi-quantile regression enables us to use a single model to predict multiple target quantiles. Because there is no computational constraint necessitating one model per quantile, nor the limitation of storing the entire training set in the model (e.g. KNN, Quantile Regression Forests), we can efficiently predict more quantiles and get a better feel for what the conditional target distribution looks like.
Using traditional quantile regression, generating a 95% prediction interval would require one model for the 2.5th quantile, one for the 97.5th quantile, and possibly a third for the expected value or the 50th quantile. A single prediction from each model would look something like this:

Assuming these quantiles are calibrated, they reveal a few insights. The first is that the probability the target is less than or equal to 3.6, given the features, is around 0.50, or 50%. Similarly, the probability that the target value is between 3.25 and 4.38, given the features, is roughly 0.95, or 95%.
While the models' output is good and precisely what we required, we may want to adjust our risk tolerance dynamically. For instance, what if we need to be more conservative and require a 99% prediction interval? Similarly, what if we are more risk-seeking and can tolerate a 90% or 80% prediction interval? What if we want answers to questions like "given the features, what is the probability that the target is greater than y1?" We might also want to ask questions like "given the features, what is the probability that the target is between y1 and y2?" Multi-quantile regression facilitates answering these questions by predicting as many quantiles as specified:

The more quantiles that can be accurately predicted, the more the risk tolerance can be adjusted on the fly, and the more we can answer general probability questions about the conditional target distribution.
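For example, once a dense grid of quantiles has been predicted, a question like "what is the probability that the target exceeds y1?" can be answered approximately by locating y1 among the predicted quantile values. Below is a minimal sketch of this idea; the quantile grid, the prob_target_leq helper, and the stand-in predicted_values array are all hypothetical, not part of the model built later in this article:

import numpy as np

# Hypothetical inputs: the quantile levels a model was trained on and the
# corresponding predicted quantile values for one test instance.
quantiles = np.array([q / 200 for q in range(1, 200)])     # 0.005, 0.010, ..., 0.995
predicted_values = np.linspace(3.0, 4.5, len(quantiles))   # stand-in for a real prediction

def prob_target_leq(y1, quantiles, predicted_values):
    """Approximate P(target <= y1 | features) from a grid of predicted quantiles."""
    # The predicted quantile values should be (nearly) monotone in the quantile level,
    # so the left-tail probability is roughly the largest quantile level whose
    # predicted value does not exceed y1.
    below = quantiles[predicted_values <= y1]
    return below.max() if below.size > 0 else 0.0

# P(target > y1) and P(y1 < target <= y2) follow from the same grid
y1, y2 = 3.6, 4.2
print("P(target <= y1) ~", prob_target_leq(y1, quantiles, predicted_values))
print("P(target >  y1) ~", 1 - prob_target_leq(y1, quantiles, predicted_values))
print("P(y1 < target <= y2) ~", prob_target_leq(y2, quantiles, predicted_values) - prob_target_leq(y1, quantiles, predicted_values))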
Note that single decision tree models have been used to generate multiple quantile predictions. However, this relies on the trees storing all target values in the leaf nodes. At prediction time, a quantile is specified and computed empirically from the data in the leaf node, requiring the model to store the entire training set. This also means deep trees could have very few examples to work with in the leaf nodes.
Catboost is fundamentally different because it stores only the specified quantiles in its terminal nodes. Moreover, the loss function is optimized to predict each specified quantile. We also enjoy the performance gains Catboost offers through its underlying architecture.
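For reference, the multi-quantile loss is specified through Catboost's loss_function string. Here is a minimal sketch, assuming a recent Catboost version that supports the MultiQuantile loss and using a toy synthetic dataset rather than the data from this article:

import numpy as np
from catboost import CatBoostRegressor

# Toy data: one noisy feature-target relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(1000, 1))
y = 2 * x.ravel() + rng.normal(0, 1, size=1000)

# A single model trained to predict the 2.5th, 50th, and 97.5th quantiles
model = CatBoostRegressor(
    loss_function='MultiQuantile:alpha=0.025,0.5,0.975',
    iterations=100,
    verbose=0
)
model.fit(x, y)

# One row of output per instance, one column per requested quantile
preds = model.predict(x[:5])
print(preds.shape)  # (5, 3)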
The Problem with Quantile Regression
In traditional and multi-quantile regression, there isn't always a statistical guarantee that quantiles are unbiased. This means, for a model trained to predict the 95th quantile of the target distribution, there's no guarantee that 95% of observations will actually be less than or equal to the prediction. This is problematic in high-risk applications where accurate probability representations are required to make critical decisions.
Quantile regression can also produce prediction intervals that are too conservative and subsequently uninformative. In general, prediction intervals should be as narrow as possible while maintaining the desired coverage level.
Conformal Multi-Quantile Regression
The idea behind conformal quantile regression is to adjust predicted quantiles to accurately reflect the desired risk tolerance and interval length. This is accomplished through a "calibration" step that computes "conformity scores" to correct the predicted quantiles. More details about conformal quantile regression can be found in this paper and this article. For conformal multi-quantile regression, we will utilize the following theorem:

Don't worry if this seems overly abstract; the steps are actually straightforward:
- Create a training, calibration, and testing set. Fit the multi-quantile model on the training set to predict all quantiles of interest.
- Make predictions on the calibration set. For each calibration instance and predicted quantile, compute the difference between the predicted quantile and the corresponding target value. These are the conformity scores.
- For each testing example and predicted quantile (say q), subtract the 1-q quantile of the conformity scores corresponding to quantile q from the model's predicted quantile. These are the new predicted quantiles (sketched in notation after this list).
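In symbols, the adjustment in the last step can be sketched as follows. This is my paraphrase of the procedure above, not the theorem from the original paper; here \hat{q}_{\alpha} denotes the model's predicted quantile at level \alpha and n_{\mathrm{cal}} the number of calibration points:

s_i^{(\alpha)} = \hat{q}_{\alpha}(x_i) - y_i, \quad i = 1, \dots, n_{\mathrm{cal}}

\tilde{q}_{\alpha}(x) = \hat{q}_{\alpha}(x) - \mathrm{Quantile}_{1-\alpha}\left( \{ s_i^{(\alpha)} \}_{i=1}^{n_{\mathrm{cal}}} \right)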
We can implement this logic in a Python class:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, CatBoostError
from typing import Iterable


class ConformalMultiQuantile(CatBoostRegressor):

    def __init__(self, quantiles: Iterable[float], *args, **kwargs):
        """
        Initialize a ConformalMultiQuantile object.

        Parameters
        ----------
        quantiles : Iterable[float]
            The list of quantiles to use in multi-quantile regression.
        *args
            Variable length argument list.
        **kwargs
            Arbitrary keyword arguments.
        """
        kwargs['loss_function'] = self.create_loss_function_str(quantiles)
        super().__init__(*args, **kwargs)
        self.quantiles = quantiles
        self.calibration_adjustments = None

    @staticmethod
    def create_loss_function_str(quantiles: Iterable[float]):
        """
        Format the quantiles as a string for Catboost

        Parameters
        ----------
        quantiles : Iterable[float]
            The quantiles to predict

        Returns
        -------
        The loss function definition for multi-quantile regression
        """
        quantile_str = str(quantiles).replace('[', '').replace(']', '')
        return f'MultiQuantile:alpha={quantile_str}'

    def calibrate(self, x_cal, y_cal):
        """
        Calibrate the multi-quantile model

        Parameters
        ----------
        x_cal : ndarray
            Calibration inputs
        y_cal : ndarray
            Calibration targets
        """
        # Ensure the model is fitted
        if not self.is_fitted():
            raise CatBoostError('There is no trained model to calibrate. Use fit() to train the model, then call calibrate().')

        # Make predictions on the calibration set
        uncalibrated_preds = self.predict(x_cal)

        # Compute the difference between the uncalibrated predicted quantiles and the target
        conformity_scores = uncalibrated_preds - np.array(y_cal).reshape(-1, 1)

        # Store the 1-q quantile of the conformity scores for each quantile q
        self.calibration_adjustments = \
            np.array([np.quantile(conformity_scores[:, i], 1 - q) for i, q in enumerate(self.quantiles)])

    def predict(self, data, prediction_type=None, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None, task_type="CPU"):
        """
        Predict using the trained model.

        Parameters
        ----------
        data : pandas.DataFrame or numpy.ndarray
            Data to make predictions on
        prediction_type : str, optional
            Type of prediction result, by default None
        ntree_start : int, optional
            Number of trees to start prediction from, by default 0
        ntree_end : int, optional
            Number of trees to end prediction at, by default 0
        thread_count : int, optional
            Number of parallel threads to use, by default -1
        verbose : bool or int, optional
            Verbosity, by default None
        task_type : str, optional
            Type of task, by default "CPU"

        Returns
        -------
        numpy.ndarray
            The predicted values for the input data.
        """
        preds = super().predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose, task_type)

        # Adjust the predicted quantiles according to the quantiles of the conformity scores
        if self.calibration_adjustments is not None:
            preds = preds - self.calibration_adjustments

        return preds
Example: The Superconductivity Dataset
We'll implement conformal multi-quantile regression on the superconductivity dataset available on the UCI Machine Learning Repository. This dataset provides 21,263 instances of 81 superconductor features with their critical temperature (the target). The data is split so that ~64% is allocated for training, ~16% for calibration, and 20% for testing.
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor, CatBoostError
from sklearn.model_selection import train_test_split
from typing import Iterable
pd.set_option('display.max.columns', None)
sns.set()
# Read in superconductivity dataset
data = pd.read_csv('train.csv')
# Predicting critical temperature
target = 'critical_temp'
# 80/20 train/test split
x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), data[target], test_size=0.20)
# Hold out 20% of the training data for calibration
x_train, x_cal, y_train, y_cal = train_test_split(x_train, y_train, test_size=0.20)
print("Training shape:", x_train.shape) # Training shape: (13608, 81)
print("Calibration shape:", x_cal.shape) # Calibration shape: (3402, 81)
print("Testing shape:", x_test.shape) # Testing shape: (4253, 81)
We'll specify a set of quantiles to predict. To illustrate the power of multi-quantile regression, the model will predict 199 quantiles, evenly spaced from 0.005 to 0.995 – this is probably a bit excessive in practice. Next, we'll fit the conformal multi-quantile model, make uncalibrated predictions, calibrate the model on the calibration set, and make calibrated predictions.
# Store quantiles 0.005 through 0.995 in a list
quantiles = [q/200 for q in range(1, 200)]
# Instantiate the conformal multi-quantile model
conformal_model = ConformalMultiQuantile(iterations=100,
                                         quantiles=quantiles,
                                         verbose=10)
# Fit the conformal multi-quantile model
conformal_model.fit(x_train, y_train)
# Get predictions before calibration
preds_uncalibrated = conformal_model.predict(x_test)
preds_uncalibrated = pd.DataFrame(preds_uncalibrated, columns=[f'pred_{q}' for q in quantiles])
# Calibrate the model
conformal_model.calibrate(x_cal, y_cal)
# Get calibrated predictions
preds_calibrated = conformal_model.predict(x_test)
preds_calibrated = pd.DataFrame(preds_calibrated, columns=[f'pred_{q}' for q in quantiles])
preds_calibrated.head()
The resulting predictions should look something like this:

On the testing set, we can measure how well the uncalibrated and calibrated predictions align with the left-tail probabilities they're intended to represent. For instance, if the quantiles are calibrated, 40% of target values should be less than or equal to predicted quantile 0.40, 90% of target values should be less than or equal to predicted quantile 0.90, etc. The code below computes the mean absolute error (MAE) between the desired left-tail probability and the actual left-tail probability encompassed by the predicted quantiles:
# Initialize an empty DataFrame
comparison_df = pd.DataFrame()
# For each predicted quantile
for i, quantile in enumerate(quantiles):

    # Compute the proportion of testing observations that were less than or equal
    # to the uncalibrated predicted quantile
    actual_prob_uncal = np.mean(y_test.values <= preds_uncalibrated[f'pred_{quantile}'])

    # Compute the proportion of testing observations that were less than or equal
    # to the calibrated predicted quantile
    actual_prob_cal = np.mean(y_test.values <= preds_calibrated[f'pred_{quantile}'])

    comparison_df_curr = pd.DataFrame({
        'desired_probability': quantile,
        'actual_uncalibrated_probability': actual_prob_uncal,
        'actual_calibrated_probability': actual_prob_cal}, index=[i])

    comparison_df = pd.concat([comparison_df, comparison_df_curr])
comparison_df['abs_diff_uncal'] = (comparison_df['desired_probability'] - comparison_df['actual_uncalibrated_probability']).abs()
comparison_df['abs_diff_cal'] = (comparison_df['desired_probability'] - comparison_df['actual_calibrated_probability']).abs()
print("Uncalibrated quantile MAE:", comparison_df['abs_diff_uncal'].mean())
print("Calibrated quantile MAE:", comparison_df['abs_diff_cal'].mean())
# Uncalibrated quantile MAE: 0.02572999018133225
# Calibrated quantile MAE: 0.007850550660662823
The uncalibrated quantiles were off by about 0.026 and the calibrated quantiles by 0.008, on average. Hence, the calibrated quantiles were more aligned with the desired left-tail probabilities.
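To visualize this, a calibration plot comparing desired and actual left-tail probabilities can be drawn for both models. This is a sketch assuming the comparison_df assembled above and the matplotlib import from earlier in the article:

# Visualize desired vs. actual left-tail probabilities for both models
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(comparison_df['desired_probability'],
        comparison_df['desired_probability'],
        label='Perfect Calibration')

ax.scatter(comparison_df['desired_probability'],
           comparison_df['actual_uncalibrated_probability'],
           color='orange',
           label='Uncalibrated Model')

ax.scatter(comparison_df['desired_probability'],
           comparison_df['actual_calibrated_probability'],
           color='green',
           label='Calibrated Model')

ax.set_xlabel('Desired Left-Tail Probability')
ax.set_ylabel('Actual Left-Tail Probability')
ax.set_title('Desired vs Actual Left-Tail Probability')
ax.legend()
plt.show()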

This may not seem like a dramatic difference in calibration; however, the error in the uncalibrated model becomes much clearer when we analyze actual vs. desired coverage:
Python">coverage_df = pd.DataFrame()
for i, alpha in enumerate(np.arange(0.01, 0.41, 0.01)):
lower_quantile = round(alpha/2, 3)
upper_quantile = round(1 - alpha/2, 3)
# Compare actual to expected coverage for both models
lower_prob_uncal = comparison_df[comparison_df['desired_probability'] == lower_quantile]['actual_uncalibrated_probability'].values[0]
upper_prob_uncal = comparison_df[comparison_df['desired_probability'] == upper_quantile]['actual_uncalibrated_probability'].values[0]
lower_prob_cal = comparison_df[comparison_df['desired_probability'] == lower_quantile]['actual_calibrated_probability'].values[0]
upper_prob_cal = comparison_df[comparison_df['desired_probability'] == upper_quantile]['actual_calibrated_probability'].values[0]
coverage_df_curr = pd.DataFrame({'desired_coverage':1-alpha,
'actual_uncalibrated_coverage':upper_prob_uncal - lower_prob_uncal,
'actual_calibrated_coverage':upper_prob_cal - lower_prob_cal}, index=[i])
coverage_df = pd.concat([coverage_df, coverage_df_curr])
coverage_df['abs_diff_uncal'] = (coverage_df['desired_coverage'] - coverage_df['actual_uncalibrated_coverage']).abs()
coverage_df['abs_diff_cal'] = (coverage_df['desired_coverage'] - coverage_df['actual_calibrated_coverage']).abs()
print("Uncalibrated Coverage MAE:", coverage_df['abs_diff_uncal'].mean())
print("Calibrated Coverage MAE:", coverage_df['abs_diff_cal'].mean())
# Uncalibrated Coverage MAE: 0.03660674817775689
# Calibrated Coverage MAE: 0.003543616270867622
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(coverage_df['desired_coverage'],
        coverage_df['desired_coverage'],
        label='Perfect Calibration')

ax.scatter(coverage_df['desired_coverage'],
           coverage_df['actual_uncalibrated_coverage'],
           color='orange',
           label='Uncalibrated Model')

ax.scatter(coverage_df['desired_coverage'],
           coverage_df['actual_calibrated_coverage'],
           color='green',
           label='Calibrated Model')
ax.set_xlabel('Desired Coverage')
ax.set_ylabel('Actual Coverage')
ax.set_title('Desired vs Actual Coverage')
ax.legend()
plt.show()

The uncalibrated model tends to be too conservative and covers more examples than desired. The calibrated model, on the other hand, exhibits near-perfect alignment with each of the desired coverages.
Moreover, the average length of the prediction intervals generated by the calibrated model is less than that of the uncalibrated model. Thus, not only is coverage better for the calibrated model, but its prediction intervals are also more informative.
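As a rough check of this claim, the average interval widths can be compared directly. This is a sketch assuming the preds_uncalibrated and preds_calibrated DataFrames built earlier, with a 95% interval formed from the 'pred_0.025' and 'pred_0.975' columns of the quantile grid:

# Average width of the 95% prediction interval for each model
uncal_width = (preds_uncalibrated['pred_0.975'] - preds_uncalibrated['pred_0.025']).mean()
cal_width = (preds_calibrated['pred_0.975'] - preds_calibrated['pred_0.025']).mean()

print("Average uncalibrated 95% interval width:", uncal_width)
print("Average calibrated 95% interval width:", cal_width)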

One might ask what happens if we allow the uncalibrated model to include the calibration set as training data. This makes sense in practice because we wouldn't throw away good training data for no reason. Here are the results:
# Fit a model using the training and calibration data
regular_model = ConformalMultiQuantile(iterations=100,
                                       quantiles=quantiles,
                                       verbose=10)

regular_model.fit(pd.concat([x_train, x_cal]), pd.concat([y_train, y_cal]))

# Fit a model on the training data only
conformal_model = ConformalMultiQuantile(iterations=100,
                                         quantiles=quantiles,
                                         verbose=10)

conformal_model.fit(x_train, y_train)
# Get predictions before calibration
preds_uncalibrated = regular_model.predict(x_test)
preds_uncalibrated = pd.DataFrame(preds_uncalibrated, columns=[f'pred_{q}' for q in quantiles])
# Calibrate the model
conformal_model.calibrate(x_cal, y_cal)
# Get calibrated predictions
preds_calibrated = conformal_model.predict(x_test)
preds_calibrated = pd.DataFrame(preds_calibrated, columns=[f'pred_{q}' for q in quantiles])
comparison_df = pd.DataFrame()
# Compare actual to predicted left-tailed probabilities
for i, quantile in enumerate(quantiles):

    actual_prob_uncal = np.mean(y_test.values <= preds_uncalibrated[f'pred_{quantile}'])
    actual_prob_cal = np.mean(y_test.values <= preds_calibrated[f'pred_{quantile}'])

    comparison_df_curr = pd.DataFrame({
        'desired_probability': quantile,
        'actual_uncalibrated_probability': actual_prob_uncal,
        'actual_calibrated_probability': actual_prob_cal}, index=[i])

    comparison_df = pd.concat([comparison_df, comparison_df_curr])
comparison_df['abs_diff_uncal'] = (comparison_df['desired_probability'] - comparison_df['actual_uncalibrated_probability']).abs()
comparison_df['abs_diff_cal'] = (comparison_df['desired_probability'] - comparison_df['actual_calibrated_probability']).abs()
print("Uncalibrated quantile MAE:", comparison_df['abs_diff_uncal'].mean())
print("Calibrated quantile MAE:", comparison_df['abs_diff_cal'].mean())
# Uncalibrated quantile MAE: 0.023452756375340143
# Calibrated quantile MAE: 0.0061827359227361834
Even with less training data than the uncalibrated model, the calibrated model outputs better quantiles. What's more, the models perform similarly when we compare the expected values of the predicted quantiles to the target values:
from sklearn.metrics import r2_score, mean_absolute_error
print(f"Uncalibrated R2 Score: {r2_score(y_test, preds_uncalibrated.mean(axis=1))}")
print(f"Calibrated R2 Score: {r2_score(y_test, preds_calibrated.mean(axis=1))} n")
print(f"Uncalibrated MAE: {mean_absolute_error(y_test, preds_uncalibrated.mean(axis=1))}")
print(f"Calibrated MAE: {mean_absolute_error(y_test, preds_calibrated.mean(axis=1))} n")
# Uncalibrated R2 Score: 0.8060126144892599
# Calibrated R2 Score: 0.8053382438575666
# Uncalibrated MAE: 10.622258046774979
# Calibrated MAE: 10.557269513856014
Final Thoughts
There are no silver bullets in machine learning, and conformal quantile regression is no exception. The glue that holds conformal prediction theory together is the assumption that the underlying data is exchangeable. If, for instance, the distribution of the data drifts over time (as is often the case in real-world applications), then conformal prediction can no longer make strong probability guarantees. There are ways around this assumption, but they ultimately depend on the severity of the data drift and the nature of the learning problem. It may also be less than ideal to set aside valuable training data for calibration.
As always, the machine learning practitioner is responsible for understanding the nature of the data and applying appropriate techniques. Thanks for reading!
References
- Catboost Loss Functions – https://catboost.ai/en/docs/concepts/loss-functions-regression
- Conformalized Quantile Regression – https://arxiv.org/pdf/1905.03222.pdf
- Conformal Prediction Beyond Exchangeability – https://arxiv.org/pdf/2202.13415.pdf
- The Superconductivity Dataset – https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
- How to Predict Risk-Proportional Intervals with Conformal Quantile Regression – https://towardsdatascience.com/how-to-predict-risk-proportional-intervals-with-conformal-quantile-regression-175775840dc4
- How to Predict Full Probability Distributions using Machine Learning Conformal Prediction – https://valeman.medium.com/how-to-predict-full-probability-distribution-using-machine-learning-conformal-predictive-f8f4d805e420