Machine Learning for Regression with Imbalanced Data
What is imbalanced data?
Many real-world datasets suffer from imbalance: certain types of samples are overrepresented, while others occur less often. Some examples are:
- When classifying credit card transactions as fraudulent or legitimate, the vast majority of transactions will belong to the latter category
- Severe rainfall occurs less often than moderate rainfall, but may cause more damage to people and infrastructure
- When trying to identify land use, there are more pixels that represent forests and agriculture than urban settlements
In this post, we give an intuitive explanation for why machine learning algorithms struggle with imbalanced data, show how to quantify your algorithm's performance using quantile evaluation, and present three strategies to improve its performance.

Example dataset for regression: California housing
Dataset imbalance is often illustrated for classification problems, where a majority class overshadows a minority class. Here, we focus on regression, where the target is a continuous numerical value. We are going to use the California Housing Dataset that is available with scikit-learn. The dataset contains more than 20,000 samples, each describing a Californian district with features such as the location, median house age, average numbers of rooms and bedrooms, population, and median neighbourhood income. The target variable is the median house value of the district, expressed in units of 100,000 US-$. In order to see if the dataset is imbalanced, we plot the histogram of the target variable.
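A minimal sketch for loading the data and plotting this histogram could look as follows (the number of histogram bins is our choice, not taken from the original figure):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the features X and the target y (median house value per
# district, in units of 100,000 US-$)
X, y = fetch_california_housing(return_X_y=True)
print(f"{len(y)} samples, mean {y.mean():.2f}, std {y.std():.2f}")

# Histogram of the target variable to check for imbalance
plt.hist(y, bins=50)
plt.xlabel("Median house value (100,000 US-$)")
plt.ylabel("Number of samples")
plt.show()
```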

Clearly, not every median house value is equally represented. The mean of the target variable is 190,000 US-$, and the standard deviation is 98,000 US-$, but the values do not follow a normal distribution – the distribution is skewed towards expensive districts with median values beyond 400,000 US-$.
Mean squared error loss function
We implement a small neural network in Keras and use it to predict the median house value. Since we are dealing with a regression problem, the mean squared error is a good candidate for a loss function. For a given batch of $N$ samples with true values $y_i$ and predictions $\hat{y}_i$, the mean squared error (MSE) is calculated as

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

In this formula, the squared distance between the predicted and the actual value determines the loss.
The model trains quickly and the loss curves look reasonable. The final loss on the training set is 0.2562, and on the validation set 0.2584, so we seem to have hit the sweet spot of the bias-variance tradeoff.
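For reference, a minimal sketch of such a model is shown below; the exact architecture and training settings are not given in the text, so the ones here are assumptions:

```python
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the inputs; the normalization layer learns the mean
# and variance of the training features
norm = keras.layers.Normalization()
norm.adapt(X_train)

# A small fully connected network; layer sizes are an assumption
model = keras.Sequential([
    norm,
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=50, batch_size=32)
```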
Evaluation using quantiles of the target variable
Overall, our machine learning algorithm produces a mean squared error of 0.27 on the hold-out test set. But how is this error distributed across different housing prices? We divide the target variable into bins spanning 100,000 US-$ each and calculate the mean squared error for the samples in each bin separately.
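A sketch of this per-bin evaluation, assuming the model and test split from above:

```python
import numpy as np

y_pred = model.predict(X_test).ravel()

# Bin edges in units of 100,000 US-$; the last bin collects all
# districts with a median house value above 400,000 US-$
bin_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0, np.inf])
bin_idx = np.digitize(y_test, bin_edges) - 1

for b in range(len(bin_edges) - 1):
    mask = bin_idx == b
    mse = np.mean((y_test[mask] - y_pred[mask]) ** 2)
    print(f"Bin {b}: {mask.sum()} samples, MSE = {mse:.3f}")
```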

As we can see, our machine learning algorithm performs best on samples that are close to the mean of the target variable. In the highest bin, with house values exceeding 400,000 US-$, the error is almost 10 times higher!
Quantile evaluation is a great way of exploring how a machine learning algorithm performs on different regions of the dataset. This quick analysis can point directly to problems in the experimental setup, and you should always consider it before reporting only performance metrics averaged across the whole dataset.
Why does the model struggle to predict high house prices?
Shortcut learning
We cannot even blame the machine learning algorithm: it did exactly what we asked it to do. For the majority of samples, it holds good predictive power. It is just that districts with median house values above 400,000 US-$ are underrepresented – only 4% of the training data fall in that range – so there is little incentive for the algorithm to prioritize these samples.
We should always keep in mind that machine learning algorithms learn exactly the task we pose to them. They are prone to shortcut learning, which Geirhos et al. (2020) define as follows:
Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike.
Asking the right questions
So, let's say we are developing the algorithm for a high-end realtor who is mostly interested in estimating the median value of expensive houses. For this client, the current algorithm would not deliver the desired predictive power, since it performs poorly on exactly the kind of houses they are interested in.

Working with imbalanced data
Strategy 1: Increase the batch size
With a larger batch size, every batch seen during training is more likely to contain samples from the underrepresented group. We set the batch size to 512 and repeat the model training.
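In Keras, this is a one-argument change to the training call (epoch count as assumed above):

```python
# Same model and data as before; only the batch size changes
model.fit(X_train, y_train, validation_split=0.2,
          epochs=50, batch_size=512)
```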
Strategy 2: Introduce weights in the loss function
Here, we ask the machine learning algorithm to focus on the underrepresented samples during training by assigning them a higher weight. In Keras, weights can be calculated directly and passed to model.fit(..., sample_weight=...).
Explaining the calculation of weights in words:
- Count the occurrence of samples per bin
- Divide by the total number of samples – you obtain the frequency of the samples for the given bin
- The inverse of this frequency is the weight assigned to every sample in that bin
And in code – a minimal sketch, assuming the same 100,000-US-$ bins as in the evaluation above:
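```python
import numpy as np

# Reuse the bin edges from the per-bin evaluation above
bin_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0, np.inf])
bin_idx = np.digitize(y_train, bin_edges) - 1

# Count the occurrence of samples per bin and divide by the total
# number of samples to obtain the frequency per bin
counts = np.bincount(bin_idx, minlength=len(bin_edges) - 1)
freqs = counts / counts.sum()

# The inverse frequency of a sample's bin is its weight
weights = 1.0 / freqs[bin_idx]

model.fit(X_train, y_train, sample_weight=weights,
          validation_split=0.2, epochs=50, batch_size=32)
```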
Strategy 3: Cast target values to normal distribution
In this scenario, we transform the target variable so that it follows a normal distribution. A normally distributed target is well suited to the mean squared error loss, and the transformation pulls the outliers closer to the bulk of the data. Important – do not forget to transform your predicted values back to the original scale before evaluation!
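The post does not name the exact transformation, but one common choice is scikit-learn's QuantileTransformer with a normal output distribution:

```python
from sklearn.preprocessing import QuantileTransformer

# Map the target to a standard normal distribution
qt = QuantileTransformer(output_distribution="normal")
y_train_normal = qt.fit_transform(y_train.reshape(-1, 1)).ravel()

# Train on the transformed target
model.fit(X_train, y_train_normal, validation_split=0.2,
          epochs=50, batch_size=32)

# Map predictions back to the original scale before evaluating
y_pred = qt.inverse_transform(model.predict(X_test)).ravel()
```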
Evaluation
Which strategy performed best?
Now the time has come to compare the three strategies. We tracked the mean squared error per target-variable bin for each strategy, as summarized in the figure below. Focusing on the highest bin, the weighted loss performs best: while it reduces performance at average housing prices, it adds predictive power in the region we are most interested in. The other two strategies, increasing the batch size and transforming the target variable, even increase the mean squared error in the highest bin, so they have not proven helpful for our problem.

We would therefore recommend to our client, the high-end realtor, a machine learning algorithm that uses weights to emphasize the samples that are underrepresented in the dataset. Note, however, that machine learning is an empirical science, and for a different dataset you may find that a different strategy solves the imbalanced data problem best.