Oversampling and Undersampling, Explained: A Visual Guide with Mini 2D Dataset

Author: Murphy  |  Views: 20577  |  Time: 2025-03-22 19:54:06

DATA PREPROCESSING

⛳️ More [Data Preprocessing](https://medium.com/@samybaladram/list/data-preprocessing-17a2c49b44e4), explained: · [Missing Value Imputation](https://towardsdatascience.com/missing-value-imputation-explained-a-visual-guide-with-code-examples-for-beginners-93e0726284eb) · [Categorical Encoding](https://towardsdatascience.com/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae) · [Data Scaling](https://towardsdatascience.com/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb) · [Discretization](https://towardsdatascience.com/discretization-explained-a-visual-guide-with-code-examples-for-beginners-f056af9102fa?gi=c1bf25229f86) ▶ [Oversampling & Undersampling](https://towardsdatascience.com/oversampling-and-undersampling-explained-a-visual-guide-with-mini-2d-dataset-1155577d3091) · [Data Leakage in Preprocessing](https://towardsdatascience.com/data-leakage-in-preprocessing-explained-a-visual-guide-with-code-examples-33cbf07507b7)

Collecting a dataset where each class has exactly the same number of examples can be a challenge. In reality, things are rarely perfectly balanced, and when you are building a classification model, this can be an issue. A model trained on such a dataset, where one class has more examples than the other, usually becomes better at predicting the bigger group and worse at predicting the smaller one. To help with this issue, we can use tactics like oversampling and undersampling – creating more examples of the smaller group or removing some examples from the bigger group.

There are many different oversampling and undersampling methods out there (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So, here, we will use one simple 2D dataset to show the changes that occur in the data after applying each method, so we can see how different their outputs are. You will see in the visuals that these various approaches give different solutions, and who knows, one might be suitable for your specific Machine Learning challenge!

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Definition

Oversampling

Oversampling makes a dataset more balanced when one group has far fewer examples than the other. It works by making more copies of the examples from the smaller group, which helps the dataset represent both groups more equally.
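The duplication idea can be sketched in plain Python. The tiny 2D dataset below is invented for illustration (libraries such as imbalanced-learn offer `RandomOverSampler` for the same idea on real data):

```python
import random

random.seed(0)

# A toy imbalanced 2D dataset: (x, y, label)
majority = [(1.0, 2.0, "A"), (1.5, 1.8, "A"), (2.0, 2.2, "A"),
            (1.2, 2.5, "A"), (1.8, 1.5, "A"), (2.2, 2.0, "A")]
minority = [(5.0, 5.0, "B"), (5.5, 4.8, "B")]

# Random oversampling: draw from the smaller group with replacement
# until it has as many examples as the bigger group.
extra = [random.choice(minority)
         for _ in range(len(majority) - len(minority))]
balanced = majority + minority + extra

counts = {}
for *_, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'A': 6, 'B': 6}
```

Note that the minority copies are exact duplicates of existing points; smarter methods like SMOTE instead synthesize new points between minority neighbors.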

Undersampling

On the other hand, undersampling works by deleting some of the examples from the bigger group until it is almost the same size as the smaller group. The resulting dataset is smaller, sure, but both groups end up with a more similar number of examples.
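A minimal sketch of random undersampling, again on an invented toy dataset (imbalanced-learn's `RandomUnderSampler` does the equivalent):

```python
import random

random.seed(0)

# Same toy imbalanced 2D dataset: (x, y, label)
majority = [(1.0, 2.0, "A"), (1.5, 1.8, "A"), (2.0, 2.2, "A"),
            (1.2, 2.5, "A"), (1.8, 1.5, "A"), (2.2, 2.0, "A")]
minority = [(5.0, 5.0, "B"), (5.5, 4.8, "B")]

# Random undersampling: keep only a sample (without replacement)
# of the bigger group, the same size as the smaller group.
kept = random.sample(majority, k=len(minority))
balanced = kept + minority

print(len(balanced))  # 4 examples, 2 per class
```

The trade-off is visible even in this sketch: four of the six majority points are thrown away, so information is lost along with the imbalance.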

Hybrid Sampling

Combining oversampling and undersampling is called "hybrid sampling". It increases the size of the smaller group by making more copies of its examples while also removing some examples from the bigger group. The result is a dataset that is more balanced – not too big and not too small.
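One simple way to "meet in the middle" is to pick a target size between the two group sizes, then undersample the bigger group down to it and oversample the smaller group up to it. This is a hand-rolled sketch on the same invented dataset, not any library's specific hybrid method (imbalanced-learn's hybrids, like `SMOTETomek`, combine smarter techniques):

```python
import random

random.seed(0)

# Same toy imbalanced 2D dataset: (x, y, label)
majority = [(1.0, 2.0, "A"), (1.5, 1.8, "A"), (2.0, 2.2, "A"),
            (1.2, 2.5, "A"), (1.8, 1.5, "A"), (2.2, 2.0, "A")]
minority = [(5.0, 5.0, "B"), (5.5, 4.8, "B")]

# Target size per class: halfway between the two group sizes.
target = (len(majority) + len(minority)) // 2  # -> 4

# Undersample the bigger group down to the target...
kept = random.sample(majority, k=target)
# ...and oversample the smaller group up to the target.
extra = [random.choice(minority)
         for _ in range(target - len(minority))]
balanced = kept + minority + extra

print(len(balanced))  # 8 examples, 4 per class
```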

Tags: Data Preprocessing Data Science Machine Learning Oversampling Undersampling
