Creating SMOTE Oversampling from Scratch

Author:Murphy | View: 26774 | Time: 2025-03-22 19:09:44

Synthetic Minority Oversampling Technique (SMOTE) is commonly used to handle class imbalances in datasets. Suppose there are two classes and one class has far more samples (majority class) than the other (minority class). In that case, SMOTE will generate more synthetic samples in the minority class so that it's on par with the majority class.

In the real world, we're not going to have balanced datasets for classification problems. Take for example a classifier that predicts whether a patient has sickle cell disease. If a patient has abnormal hemoglobin levels (6–11 g/dL), then that's a strong predictor of sickle cell disease. If a patient has normal hemoglobin levels (12 mg/dL), then that predictor alone doesn't indicate whether the patient has sickle cell disease.

However, about 100,000 patients in the USA are diagnosed with sickle cell disease. There are currently 334.9 million US citizens. If we have a dataset of every US citizen and label or not the patient has sickle cell disease, we have 0.02% of people who have the disease. We have a major class imbalance. Our model can't pick up meaningful features to predict this anomaly.

Furthermore, our model would have a 99.98% accuracy if it predicted that all patients don't have sickle cell disease. This doesn't help solve our health problem, and also a limitation to using accuracy as the sole metric for evaluating model performance. Class imbalance is frequent in healthcare datasets, thus hampering the detection of rare diseases/events. Most disease classification methods implicitly assume an equal occurrence of classes, which helps maximize the overall classification accuracy.

While other metrics can be used instead of accuracy (precision and recall), we would also want to oversample the minority class (patients with rare diseases) to make sure we have similar numbers of data points for each class label.

How does SMOTE work?

However, SMOTE doesn't create exact copies of the minority samples. Rather, it combines two algorithms: K-Nearest Neighbor and linear interpolation.

Below is the pseudocode for SMOTE that leverages these two algorithms.

SMOTE randomly selects a point in the minority sample
SMOTE finds its k nearest neighbors in that minority sample
SMOTE randomly picks one of those k neighbors
SMOTE creates a linear equation with the randomly selected point and its neighbor. Then, it generates a synthetic sample along that linear equation.

Below is a graph of how SMOTE works.

You can see that the synthetic sample (in orange) is a random point on the linear equation that is formed between the data point of interest and the neighbor selected. The synthetic sample coordinates have the following constraints.

Xs is a value between Xk and Xi. In this example, we assume Xs, Xk, and Xi are all integers.
Ys is the result of the linear equation if you plug in Xs. Xs is then multiplied by the neighbor_slope, followed by adding the y-intercept (in this case, Yk).

Below is how to use the Imbalance-Learn SMOTE package.

SMOTE for Imbalanced Classification with Python

But how would we create it from scratch?

Initialize Dataset

So let's have a dataset of 110 points. Each point has an x and y coordinate and both values fall in the range of -5 to 5. These x and y coordinates are also integers.

10 of these points have label 1
100 of these points have label 0

The graph of this dataset is shown below.

Matplotlib graph of dataset. Custom generated graph by author.

If we were to train a classifier on this dataset, it would get a 90% accuracy if it predicted only 0 for all points. Therefore, we want to use SMOTE to oversample the minority class (label 1) so it can distinguish between the two labels.

We want to add 90 more synthetic data points with label 1 so that we have a dataset of 200 points, with 100 points for each label.

Steps

Select Random Point In Minority Sample

The first step of the pseudocode is to filter out all the data points in the minority sample and select one of those points randomly.

Below is a graph of all the data points in the minority sample.

So let's take the data point at (1, 4).

Find k nearest neighbors

The next step is to find k nearest neighbors for that data point. For now, we'll assume k=1.

So the nearest neighbor for datapoint (1,4) is (-2,2).

Randomly pick one of those k neighbors

We then randomly pick one of the k neighbors of that data point. Since we assumed k=1, we'll use (-2, 2).

Create Linear Equation

From that data point and its neighbor, we'll create a linear equation. This equation is y = (2 x + 10) / 3

Generate Synthetic Sample

With that given equation, randomly generate a synthetic sample on that line. So if we set x = 0, y = (2 *0 + 10)/3 = 10/3 = 3.33

Our synthetic sample is point (0, 3.33).

Repeat

So we created 1 synthetic data point. That leaves 89 more synthetic data points left.

So we repeat the above steps with different data points until we get 90 synthetic data points.

Code for SMOTE

Python">import numpy as np
from sklearn.neighbors import NearestNeighbors

def custom_smote(samples, n, k):
    '''
    n = total number of synthetic samples to generate
    k = number of heighbos
    '''
    synthetic_shape = (n, samples.shape[1]) 
    synthetic = np.empty(shape=synthetic_shape)
    synthetic_index = 0

    nbrs = NearestNeighbors(n_neighbors=k,metric='euclidean',algorithm='kd_tree').fit(samples)

    for synthetic_index in range(synthetic.shape[0]):
       max_samples_index = samples.shape[0]
       A_idx = np.random.randint(low=0, high=max_samples_index)
       A_point = samples[A_idx]
       distances,knn_indices = nbrs.kneighbors(X=[A_point], n_neighbors=(k+1))
       neighbor_array = knn_indices[0]

       if A_idx in neighbor_array:
          condition = np.where(neighbor_array == A_idx)
          neighbor_array = np.delete(neighbor_array, condition)

       len_neighbor_array = len(neighbor_array)
       if len_neighbor_array > 0:
          B_idx = np.random.randint(low=0, high=len_neighbor_array)
       else:
          B_idx = 0

       B_point = samples[neighbor_array[B_idx]]
       m = (B_point[1] - A_point[1])/(B_point[0] - A_point[0])

       high_point = A_point[0] if A_point[0] > B_point[0] else B_point[0]
       low_point = A_point[0] if A_point[0] < B_point[0] else B_point[0]

       if m == m:
          random_x = np.random.uniform(low=low_point, high=high_point)
          random_y = m * (random_x-A_point[0]) + A_point[1]
       else:
          random_x = B_point[0]
          random_y = B_point[1]

       synthetic[synthetic_index] = (random_x, random_y)

    return synthetic

Results for SMOTE

With 1 neighbor, we would generate the following 90 points of synthetic data with label=1

If we append our synthetic samples to our existing dataset, we can see how they map below.

We don't see a lot of variance from our synthetic samples. They're nicely grouped for label=1, as opposed to being spread out for label=0. This is because we chose to randomly select one neighbor from one nearest neighbor of the minority class data point.

If we increase the number of neighbors to randomly select from, we'll have more varied samples for the minority class.

You can see the samples being more distributed as we increase the number of neighbors to select from. Matplotlib graph of dataset. Custom generated graph by author.

Code to Generate Plots

from collections import Counter
import matplotlib.pyplot as plt
import os    

def summarize_data(regular, graph_title, filename):
    plt.figure(figsize=(8,6))
    x = regular[:, 0]
    y = regular[:, 1]
    plt.scatter(x, y)
    plt.title(graph_title)
    plt.savefig(filename)

def summarize_data_with_legend(X, y, graph_title, filename):
    plt.figure(figsize=(8,6))
    counter = Counter(y)
    for label, _ in counter.items():
        row_idx = np.where(y == label)[0]
        plt.scatter(X[row_idx, 0], X[row_idx, 1], label=str(label))
    plt.legend()
    plt.title(graph_title)
    plt.savefig(filename)

k=9

small_sample_label_1 = np.random.randint(-5, 5, (10,2))
small_sample_label_0 = np.random.randint(-5, 5, (100,2))
print(small_sample_label_1 )
file_path = os.path.dirname(__file__)
image_output_path = file_path + '/small_sample.jpg'

# Create plot of 10 label 1 points and 100 label 0 points
summarize_data(small_sample_label_1, "Original Sample With Label=1", file_path + '/original_sample.jpg')
X = np.concatenate((small_sample_label_0, small_sample_label_1), axis=0)
y = np.concatenate((np.zeros(100), np.ones(10)), axis=0)
summarize_data_with_legend(X, y,  "Original Sample with labels", file_path + '/original_with_labels.jpg')

# Create 90 synthetic samples wit label 1, and plot them in single plot
small_synthetic_label_1 = custom_smote(small_sample_label_1 , 90, k)
summarize_data(small_synthetic_label_1 , "SMOTE of Sample With Label=1, k=%i"%k, file_path + '/synthetic_sample_k_%i.jpg'%k)

# Join 90 synthetic samples with rest of dataset, and plot
X = np.concatenate((small_sample_label_0, small_sample_label_1, small_synthetic_label_1), axis=0)
y = np.concatenate((np.zeros(100), np.ones(100)), axis=0)
summarize_data_with_legend(X, y,  "SMOTE Sample with labels, k=%i"%k, file_path + '/SMOTE_with_labels_k_%i.jpg'%k)

Variations of SMOTE

Now that you have the code to implement SMOTE from scratch, there are other ways to tweak it to improve variance in sampling (besides increasing the number of neighbors to select from).

Use a radial kernel instead of a linear equation to generate synthetic samples.
Use an algorithm other than K Nearest Neighbors to select a point of comparison. For example, use K-Means to find clusters of the minority sample. From there, get the centroids of such clusters. Use the centroids as points of comparison with the selected data point (as opposed to an actual data point from K Nearest Neighbors).
Pick more than one K Nearest Neighbor. Instead of using a linear equation to generate synthetic samples, use a linear hyperplane of the selected datapoint and more than one of its K Nearest Neighbors.

This piece from KDNuggets below goes over 7 other SMOTE variations. They are all in Python libraries, but the descriptions should give you an idea of how to incorporate such changes in your own custom SMOTE.

https://www.kdnuggets.com/2023/01/7-smote-variations-oversampling.html

Conclusion

We went over an oversampling technique and its applications in the real world. We described how to implement this from scratch and provided Python logic. We then discussed other forms of SMOTE that can increase the variance of the minority sample, and how to implement such changes.

Thanks for reading! If you want to read more of my work, view my Table of Contents.

If you're not a Medium paid member, but are interested in subscribing just to read tutorials and articles like this, click here to enroll in a membership. Enrolling in this link means I get paid for referring you to Medium.

Tags: Data Science Machine Learning Oversampling Python Smote