Data Leakage in Preprocessing, Explained: A Visual Guide with Code Examples
DATA PREPROCESSING
⛳️ More [Data Preprocessing](https://medium.com/@samybaladram/list/data-preprocessing-17a2c49b44e4), explained: · [Missing Value Imputation](https://towardsdatascience.com/missing-value-imputation-explained-a-visual-guide-with-code-examples-for-beginners-93e0726284eb) · [Categorical Encoding](https://towardsdatascience.com/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae) · [Data Scaling](https://towardsdatascience.com/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb) · [Discretization](https://towardsdatascience.com/discretization-explained-a-visual-guide-with-code-examples-for-beginners-f056af9102fa?gi=c1bf25229f86) · [Oversampling & Undersampling](https://towardsdatascience.com/oversampling-and-undersampling-explained-a-visual-guide-with-mini-2d-dataset-1155577d3091) ▶ [Data Leakage in Preprocessing](https://towardsdatascience.com/data-leakage-in-preprocessing-explained-a-visual-guide-with-code-examples-33cbf07507b7)
In my experience teaching machine learning, students often come to me with the same problem: "My model was performing great – over 90% accuracy! But when I submitted it for testing on the hidden dataset, it performed nowhere near as well. What went wrong?" This situation almost always points to data leakage.
Data leakage happens when information from the test data sneaks (or leaks) into your training data during data preparation. It often occurs during routine preprocessing tasks without you noticing it. When it does, the model learns from test data it wasn't supposed to see, making the test results misleading.
Let's look at common preprocessing steps and see exactly what happens when data leaks, so you can hopefully avoid these "pipeline issues" in your own projects.

Definition
Data leakage is a common problem in machine learning that occurs when data that's not supposed to be seen by a model (like test data or future data) is accidentally used to train the model. Because the model has effectively peeked at the data it will be evaluated on, its test scores become overly optimistic, and it often performs poorly on genuinely new, unseen data.
Now, let's focus on data leakage during the following data preprocessing steps. For each step, we'll also see the specific scikit-learn preprocessing method names, and the full code examples will come at the very end of this article.

Missing Value Imputation
When working with real data, you often run into missing values. Rather than removing these incomplete data points, we can fill them in with reasonable estimates. This helps us keep more data for analysis.
Simple ways to fill missing values include:
- Using `SimpleImputer(strategy='mean')` or `SimpleImputer(strategy='median')` to fill with the average or middle value from that column
- Using `KNNImputer()` to look at similar data points and use their values
- Using pandas' `ffill()` or `bfill()` to fill with the value that comes before or after in the data (forward/backward fill is a pandas operation rather than a `SimpleImputer` strategy)
- Using `SimpleImputer(strategy='constant', fill_value=value)` to replace all missing spots with the same number or text
This process is called imputation, and while it's useful, we need to be careful about how we calculate these replacement values to avoid data leakage.
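For orientation, here is a minimal sketch of these imputation options on a small, made-up DataFrame (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# A tiny toy DataFrame with missing values (hypothetical example data)
df = pd.DataFrame({'humidity': [65.0, np.nan, 75.0, 80.0],
                   'temp':     [25.0, 27.0, np.nan, 31.0]})

# Mean imputation: fill with the column's average value
mean_imputed = SimpleImputer(strategy='mean').fit_transform(df)

# KNN imputation: borrow values from the most similar rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)

# Forward / backward fill (pandas): copy the previous or next value
ffill_imputed = df.ffill()
bfill_imputed = df.bfill()

# Constant imputation: replace every missing spot with the same value
const_imputed = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df)
```

Each of these has to compute or look up something from the data, and that is exactly where leakage can creep in.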
Data Leakage Case: Simple Imputation (Mean)
When you fill missing values using the mean of all your data, the mean itself contains information from both the training and test sets. This combined mean differs from the one you would get using the training data alone. Since that value is written into your training data, your model ends up learning from test-set information it was never supposed to see.
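To make the difference concrete, here is a minimal sketch (on a made-up feature matrix) contrasting the leaky approach of fitting the imputer on all the data before splitting with the correct approach of fitting it on the training set only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing values
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
y = np.array([0, 0, 1, 1, 0, 1])

# LEAKY: the mean is computed from ALL rows, so test-set values
# influence what gets filled into the training rows
X_leaky = SimpleImputer(strategy='mean').fit_transform(X)
X_train_leaky, X_test_leaky = train_test_split(X_leaky, random_state=42)

# CORRECT: split first, then fit the imputer on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
imputer = SimpleImputer(strategy='mean').fit(X_train)  # mean from training rows only
X_train_clean = imputer.transform(X_train)
X_test_clean = imputer.transform(X_test)               # reuse the training mean
```

The key point is that `fit` (which computes the mean) only ever sees training rows; the test set is then transformed with that training mean. To summarize: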