How to Handle Imbalanced Datasets in Machine Learning Projects
Imagine that you've trained a predictive model with an accuracy score as high as 0.9. Evaluation metrics like precision, recall and F1-score also appear promising. But your experience and intuition tell you that something isn't right, so you investigate further and find this:
(Image_1: classification report of the baseline model)
The model's seemingly strong performance is driven by the majority class 0 in its target variable. Because of the evident imbalance between the majority and minority classes, the model excels at predicting the majority class 0, while its performance on the minority class 1 is far from satisfactory. However, because class 1 represents only a small portion of the target variable, its poor performance has little impact on the overall scores of these evaluation metrics, which creates the illusion that the model is strong.
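To see why accuracy alone is misleading here, consider a toy example (a minimal sketch, not drawn from the Bank Marketing data): a classifier that always predicts the majority class on a 90/10 split already scores 0.9 accuracy while recalling none of the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
# 90% of the labels are 0 (majority), 10% are 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)
# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_pred))   # 0.9
print(recall_score(y_true, y_pred))     # 0.0 for the minority class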
This is not a rare case. On the contrary, data scientists frequently come across imbalanced datasets in real-world projects. An imbalanced dataset is one in which the classes or categories are not represented equally. We shouldn't ignore the imbalance, because it can lead to biased model performance, poor generalisation and misleading evaluation metrics.
This article will discuss techniques to address the challenges brought by imbalanced datasets. For demonstration purposes, I'll continue using the Bank Marketing dataset from the UCI Machine Learning Repository, which I used in another of my articles. You can find all the information about the dataset and download the data here. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows for sharing and adaptation for any purpose, provided that appropriate credit is given.
The Bank Marketing dataset contains 16 features and one binary target variable, which indicates whether the client subscribed to a term deposit. The target variable is highly imbalanced: the majority class 0 represents 88.3% of the data, while the minority class 1 makes up only 11.7%.
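Assuming df is the dataframe loaded from bank-full.csv (as in the script below) and the raw target column y still holds the strings 'yes'/'no', a quick check of the class distribution looks like this (a minimal sketch):
# Share of each class in the target variable
print(df['y'].value_counts(normalize=True))
# no     0.883   -> majority class
# yes    0.117   -> minority class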
If we don't handle the imbalance, we can simply use the script below to train a model that predicts whether a client will subscribe to a term deposit; the resulting evaluation metrics are shown in Image_1 at the beginning of this article.
# Import libraries
import pandas as pd
import numpy as np
import io
import requests
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Download the data
url = "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
response = requests.get(url)
# Open the dataset
with ZipFile(io.BytesIO(response.content)) as outer_zip:
    with outer_zip.open('bank.zip') as inner_zip_file:
        with ZipFile(io.BytesIO(inner_zip_file.read())) as inner_zip:
            with inner_zip.open('bank-full.csv') as csv_file:
                df = pd.read_csv(csv_file, sep=';')
# Initial EDA:
# Check for missing values and basic stats
print(df.isnull().sum()) # No missing values in this dataset
# Drop columns 'day' and 'month'
df = df.drop(columns=['day', 'month'])
# Loop One-Hot Encoding for categorical columns
categorical_columns = ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'poutcome']
encoder = OneHotEncoder(drop='first', sparse_output=False)  # sparse_output replaces the deprecated sparse argument
for column in categorical_columns:
    encoded_cols = encoder.fit_transform(df[[column]])
    encoded_df = pd.DataFrame(encoded_cols, columns=[f"{column}_{cat}" for cat in encoder.categories_[0][1:]])
    df = pd.concat([df.drop(columns=[column]), encoded_df], axis=1)
# Separate features (X) and the target variable (y)
X = df.drop('y', axis=1) # 'y' is the target variable
y = df['y'].apply(lambda x: 1 if x == 'yes' else 0) # Convert target to binary
# Split the data into training and testing sets without using SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardizing numerical features
scaler = StandardScaler()
numerical_columns = X_train.select_dtypes(include=['int64', 'float64']).columns
for column in numerical_columns:
    X_train[column] = scaler.fit_transform(X_train[[column]])
    X_test[column] = scaler.transform(X_test[[column]])
# Train the RandomForestClassifier without any method to handle imbalance
rf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight=None)
rf.fit(X_train, y_train)
# Make predictions
y_pred = rf.predict(X_test)
y_pred_proba = rf.predict_proba(X_test)[:, 1]
# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print evaluation results
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_rep)
print(f"ROC-AUC: {roc_auc}")
In the next sections, I'll introduce the most frequently used methods for handling imbalanced datasets and apply several suitable techniques to this Bank Marketing dataset.
Common Methods to Handle Imbalanced Datasets
Random UnderSampling
Random Undersampling is a method to remove samples from the majority class to balance the class distribution. It's often used when the majority class is significantly larger, and the developers can afford to lose some information due to the reduction of data.
- Advantages: Simple and reduces training time.
- Disadvantages: Potentially removes important information and may lead to underfitting.
Python Example:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.datasets import make_classification
# Create a mock imbalanced dataset
X, y = make_classification(n_classes=2, weights=[0.99, 0.01], n_samples=1000, random_state=42)
print('Original class distribution:', Counter(y))
# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print('Resampled class distribution:', Counter(y_res))
Random OverSampling
Opposite to Random Undersampling, Random Oversampling duplicates samples from the minority class to balance the dataset. It's usually used when data is limited, and the developers want to retain all samples while addressing imbalance.
- Advantages: Keeps all original samples.
- Disadvantages: May lead to overfitting by repeating the same information.
Python Example:
from imblearn.over_sampling import RandomOverSampler
# Apply random oversampling
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print('Resampled class distribution:', Counter(y_res))
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic samples for the minority class by interpolating between existing samples. The key difference from Random OverSampling (ROS) is that ROS simply replicates existing data points without introducing any new information, which can lead to overfitting, whereas SMOTE creates new, synthetic samples and therefore carries a lower risk of overfitting.
- Advantages: More robust than random oversampling and less prone to overfitting.
- Disadvantages: May introduce noise if the generated samples are not representative.
Python Example:
from imblearn.over_sampling import SMOTE
# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print('SMOTE class distribution:', Counter(y_smote))
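Conceptually, SMOTE picks a minority sample, selects one of its k nearest minority neighbours, and places a new synthetic point on the line segment between them. The snippet below is only an illustrative sketch of that interpolation step, not imblearn's actual implementation:
import numpy as np
rng = np.random.default_rng(42)
def smote_interpolate(x_i, x_neighbor):
    # The synthetic sample lies at a random position between the two points
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)
x_i = np.array([1.0, 2.0])     # an existing minority sample
x_nn = np.array([2.0, 3.0])    # one of its nearest minority neighbours
print(smote_interpolate(x_i, x_nn))   # e.g. [1.77 2.77]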
Cost-Sensitive Learning
Cost-Sensitive Learning adjusts the model directly by assigning different misclassification costs to each class, rather than altering the original dataset with resampling techniques. This method is often used when the minority class is of greater importance (e.g., fraud detection, medical diagnosis).
- Advantages: No need to modify the data.
- Disadvantages: Requires careful tuning of cost parameters.
Python Example:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# Train a cost-sensitive decision tree
model = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=42)
model.fit(X, y)
# Evaluate the model
y_pred = model.predict(X)
print(classification_report(y, y_pred))
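The cost ratio {0: 1, 1: 10} above is hand-picked; if you would rather let scikit-learn derive the weights from the data, class_weight='balanced' sets them inversely proportional to the class frequencies (a minimal sketch, again evaluated on the training data for brevity):
# Weights are computed as n_samples / (n_classes * class_counts)
balanced_model = DecisionTreeClassifier(class_weight='balanced', random_state=42)
balanced_model.fit(X, y)
print(classification_report(y, balanced_model.predict(X)))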
Balanced Random Forest
Balanced Random Forest combines random forests with balanced sampling of the classes: each tree is trained on a bootstrap sample in which the majority class is undersampled to match the minority class. Because every tree sees a different balanced sample, the ensemble stays robust without discarding most of the majority-class data up front.
- Advantages: Maintains model complexity while balancing the dataset.
- Disadvantages: Computationally intensive.
Python Example:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a Balanced Random Forest model
brf = BalancedRandomForestClassifier(random_state=42)
brf.fit(X_train, y_train)
# Evaluate
y_pred = brf.predict(X_test)
print('Balanced Random Forest Accuracy:', accuracy_score(y_test, y_pred))
Addressing the Imbalance of Bank Marketing Data
In order to tackle the imbalance in the Bank Marketing data mentioned at the beginning of this article, I applied several techniques: SMOTE, ADASYN, Balanced Random Forest (BRF) and Cost-Sensitive Learning. I then selected BRF because it improved the minority-class F1-score from 0.45 to 0.52, the largest improvement among all the methods. In addition, BRF is an ensemble method that balances the classes internally, which makes it well suited to this Bank Marketing dataset: we don't need to worry as much about overfitting from duplicated samples or underfitting from discarding data.
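ADASYN was not demonstrated in the previous section; it works like SMOTE but generates more synthetic samples in regions where the minority class is harder to learn. A hypothetical sketch of its usage on a training split (variable names assumed, not taken from the script below):
from imblearn.over_sampling import ADASYN
from collections import Counter
# Oversample the training split only, so synthetic samples never leak into the test set
adasyn = ADASYN(random_state=42)
X_train_res, y_train_res = adasyn.fit_resample(X_train, y_train)
print('ADASYN class distribution:', Counter(y_train_res))
The full script for the selected Balanced Random Forest approach follows.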
# Import libraries
import pandas as pd
import numpy as np
import io
import requests
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier # Import BRF
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, classification_report
# Download the data
url = "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
response = requests.get(url)
# Open the dataset
with ZipFile(io.BytesIO(response.content)) as outer_zip:
    with outer_zip.open('bank.zip') as inner_zip_file:
        with ZipFile(io.BytesIO(inner_zip_file.read())) as inner_zip:
            with inner_zip.open('bank-full.csv') as csv_file:
                df = pd.read_csv(csv_file, sep=';')
# Initial EDA:
# Check for missing values and basic stats
print(df.isnull().sum()) # No missing values in this dataset
# Drop columns 'day' and 'month'
df = df.drop(columns=['day', 'month'])
# Loop One-Hot Encoding for categorical columns
categorical_columns = ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'poutcome']
encoder = OneHotEncoder(drop='first', sparse_output=False)
for column in categorical_columns:
    encoded_cols = encoder.fit_transform(df[[column]])
    encoded_df = pd.DataFrame(encoded_cols, columns=[f"{column}_{cat}" for cat in encoder.categories_[0][1:]])
    df = pd.concat([df.drop(columns=[column]), encoded_df], axis=1)
# Separate features (X) and the target variable (y)
X = df.drop('y', axis=1) # 'y' is the target variable
y = df['y'].apply(lambda x: 1 if x == 'yes' else 0) # Convert target to binary
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardizing numerical features
scaler = StandardScaler()
numerical_columns = X_train.select_dtypes(include=['int64', 'float64']).columns
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])
# Feature Selection using BalancedRandomForestClassifier
selector = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
selector.fit(X_train, y_train)
model = SelectFromModel(selector, threshold='median', prefit=True)
selected_mask = model.get_support()
selected_columns = X_train.columns[selected_mask]
X_train_selected = model.transform(X_train)
X_test_selected = model.transform(X_test)
# Visualize feature importance of the selected features
importances = selector.feature_importances_
selected_importances = importances[selected_mask]
indices = np.argsort(selected_importances)[::-1]
selected_names_sorted = [selected_columns[i] for i in indices]
plt.figure(figsize=(12, 8))
plt.title("Feature Importance of Selected Features")
plt.barh(range(len(selected_importances)), selected_importances[indices])
plt.yticks(range(len(selected_importances)), selected_names_sorted)
plt.xlabel('Relative Importance')
plt.gca().invert_yaxis()
plt.show()
# Define parameter grid for BalancedRandomForest
n_estimators_options = [50, 100]
max_depth_options = [10, 20, 30]
best_f1_score = 0
best_accuracy = 0
best_params = {}
best_classification_report = ""
best_brf = None
# Nested loop to iterate through hyperparameters
for n_estimators in n_estimators_options:
    for max_depth in max_depth_options:
        brf = BalancedRandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )
        brf.fit(X_train_selected, y_train)
        # Make predictions on the test set
        y_pred = brf.predict(X_test_selected)
        # Calculate performance metrics for the test set
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        # If the current model has a better F1-score, update the best model details
        if f1 > best_f1_score or (f1 == best_f1_score and accuracy > best_accuracy):
            best_f1_score = f1
            best_accuracy = accuracy
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
            best_classification_report = classification_report(y_test, y_pred)
            best_brf = brf  # Store the best model
# Print the best model performance and hyperparameters
print(f"\nBest F1 Score: {best_f1_score:.4f}")
print(f"Best Accuracy: {best_accuracy:.4f}")
print(f"Best Parameters: {best_params}")
# Print the classification report of the best model
print("\nClassification Report for the Best Model:\n")
print(best_classification_report)
# Check for overfitting
y_train_pred = best_brf.predict(X_train_selected)
# Calculate metrics on the training set
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred, average='weighted')
print("\nTraining Set Performance:")
print(f"Accuracy: {train_accuracy:.4f}")
print(f"F1 Score: {train_f1:.4f}")
print("\nTest Set Performance:")
print(f"Accuracy: {best_accuracy:.4f}")
print(f"F1 Score: {best_f1_score:.4f}")
# Simple overfitting check
if train_accuracy > best_accuracy + 0.05:
    print("\nOverfitting Detected: The model performs significantly better on the training set.")
else:
    print("\nNo significant overfitting detected.")
The output is:

Conclusion
Is this choice a perfect solution? Although BRF improved the recall for the minority class from 0.36 to 0.86 and the F1-score from 0.45 to 0.52, precision decreased. Does that mean the solution is unsuccessful? Not necessarily. The effectiveness of techniques for handling imbalanced datasets depends on the following factors:
- Degree of Imbalance: The more severe the imbalance, the harder it can be for these methods to make significant improvements.
- Model Adaptability: If a model fails to capture all the underlying patterns, it might not fully benefit from techniques designed to handle imbalanced data.
- Evaluation Metrics: Small gains in certain metrics can be considered significant depending on the specific application.
For this Bank Marketing dataset, improving the minority-class F1-score from 0.45 to 0.52 is a significant enhancement. Since Balanced Random Forest ensures that each tree receives a balanced view of the data, it improves the model's ability to learn from the minority class. Although the increased false positives might lead to higher marketing costs, the greatly improved recall can bring better conversion opportunities, enhanced campaign efficiency, and even greater customer engagement. It's therefore important to recognise the trade-off between precision and recall; in this specific case, the improved F1-score justifies it. A sketch of how the decision threshold could be tuned to rebalance that trade-off is shown below.
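If higher precision is needed without giving up all of the recall gain, one practical lever is the model's decision threshold. A hypothetical sketch (assuming the fitted best_brf and the selected feature matrices from the script above; this step was not part of the original analysis):
from sklearn.metrics import precision_score, recall_score
# Raise the threshold above the default 0.5 to trade some recall back for precision
y_proba = best_brf.predict_proba(X_test_selected)[:, 1]
for threshold in [0.5, 0.6, 0.7]:
    y_pred_t = (y_proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, y_pred_t):.3f}, "
          f"recall={recall_score(y_test, y_pred_t):.3f}")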