How to Clean Your Data for Your Real-Life Data Science Projects

Author: Murphy | View: 24335 | Time: 2025-03-22 19:11:24

Data science made easy

We often hear: "Oh, there are packages available to do everything! It takes only 10 minutes to run the models using the packages." Yes, there are packages, but they work only if you have a clean dataset ready to go with them. And how long does it take to create, curate, and clean a dataset from multiple sources so that it is fit for purpose? Ask a data scientist who is struggling to create one. Everyone who has spent hours cleaning data, researching, reading, writing code, failing, and rewriting it again will agree with me! This brings us to the point:

‘Real-life data science is 70% data cleaning and 30% actual modeling or analysis’

Hence, I thought, let's go back to basics for a bit and learn how to clean datasets and make them usable for solving business problems more efficiently. We will start this series with missing value treatment. Here is the agenda:

What are missing values
What are the causes of missing values in a dataset
Why are missing values important
Approach to deal with missing values
Guide in Python for missing value treatment – some examples with a real dataset

Let's get started…

1. What are missing values

Missing values are values of a variable that are absent from the data. For example, suppose a variable ‘Product line’ depicts the type of product, such as ‘Health or beauty’ or ‘Sports and travel’. Missing values in ‘Product line’ might indicate that certain transactions were not mapped to any particular product group or category.

Another example is a variable like ‘income’, which describes the demographics of a customer. Its values might be missing because a particular customer did not disclose their income, or because the customer does not have any income, like a Gen Z customer under 18 years old.
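To make this concrete, here is a minimal sketch of how such gaps typically show up in Python, assuming the data is loaded into a pandas DataFrame. The four rows below are made up purely for illustration; only the column names ‘Product line’ and ‘income’ come from the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Product line": ["Health or beauty", None, "Sports and travel", None],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# isna() flags both None and NaN as missing.
missing_counts = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

Checking the count and the share per column is usually a sensible first step before deciding how to treat the gaps.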

As you can see, there can be various reasons why certain values of a variable are missing. This makes a nice transition to our next section on the causes of these missing values.

2. What are the causes of missing values in a dataset

Missing data is usually categorized into three main types, based on the mechanism that causes the values to be missing, with a fourth that is sometimes added.

a) MCAR (Missing completely at random): The probability that a value is missing does not depend on any variable in the dataset, neither the other variables nor the variable itself. Because the missingness is unrelated to the data, it does not introduce bias (though it does reduce the sample size), but this rarely occurs in practice.

E.g. say that during data collection, due to some technical glitch, information on a variable like ‘income’ was not recorded for some respondents, and hence some of the values are missing.
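The "technical glitch" scenario can be simulated with a short sketch. Everything here is synthetic and assumed for illustration: the income distribution, the age range, and the 20% glitch rate are made-up numbers, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic respondents; all numbers are assumptions for illustration.
mcar_df = pd.DataFrame({
    "age": rng.integers(16, 70, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
true_mean = mcar_df["income"].mean()

# MCAR: every income value is lost with the same 20% probability,
# regardless of age or of the income value itself.
glitch = rng.random(n) < 0.20
mcar_df.loc[glitch, "income"] = np.nan

observed_mean = mcar_df["income"].mean()  # mean() skips NaN by default
print(f"true mean: {true_mean:.0f}, observed mean: {observed_mean:.0f}")
```

Because the glitch hits rows at random, the mean of the remaining values should stay close to the mean of the full data; we lose rows, but the estimate is not pushed in any particular direction.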

b) MAR (Missing at random): Here the probability that a value is missing is related to other observed variables in the dataset, but not to the missing value itself.

E.g. taking the same example of ‘income’: it is more likely to be missing for Gen Z (i.e. the younger generation) than for the older generation, because they might not be earning yet. So here, whether income is missing is affected by another variable, i.e. ‘age’.
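A rough sketch of this age-driven pattern is shown below, again on synthetic data. The cut-off at 25 years and the 60% / 5% missing rates are assumptions chosen only to make the effect visible, and the ‘age_band’ column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

# Synthetic customers; all numbers are assumptions for illustration.
mar_df = pd.DataFrame({
    "age": rng.integers(16, 70, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# MAR: the chance that income is missing depends on the observed age,
# not on the income value itself.
p_missing = np.where(mar_df["age"] < 25, 0.60, 0.05)
mar_df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Hypothetical helper column to compare the two age groups.
mar_df["age_band"] = np.where(mar_df["age"] < 25, "under 25", "25 and over")
missing_rate_by_age = mar_df["income"].isna().groupby(mar_df["age_band"]).mean()
print(missing_rate_by_age)
```

Comparing the missing rate across groups of another variable, as done here with age, is one simple way to spot that missingness may not be completely at random.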

c) MNAR (Missing not at random): The missing values are not random but related to the value of that particular variable.

E.g. extending the ‘income’ example: customers with a high income may be more likely to skip the question on income, resulting in a missing value.
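To see why MNAR is the troublesome case, here is a small simulation sketch. It assumes, purely for illustration, that customers in the top income quartile skip the question 50% of the time and everyone else 5% of the time; the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

# Synthetic incomes; all numbers are assumptions for illustration.
income = pd.Series(rng.normal(50_000, 10_000, size=n), name="income")
mnar_true_mean = income.mean()

# MNAR: the chance of skipping the question depends on the income itself.
threshold = income.quantile(0.75)
p_skip = np.where(income > threshold, 0.50, 0.05)

# mask() replaces the values where the condition is True with NaN.
observed_income = income.mask(rng.random(n) < p_skip)
mnar_observed_mean = observed_income.mean()  # mean() skips NaN by default

print(f"true mean: {mnar_true_mean:.0f}, observed mean: {mnar_observed_mean:.0f}")
```

Since high earners drop out more often, the mean of the observed incomes ends up below the true mean. In a real dataset we cannot run this check, because the true values are exactly what is missing.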

There can be another cause, structured missing data, but we will park that discussion for now. If interested, do let me know in the comments.

Tags: Data Cleaning, Data Science, Handling Missing Values, Missing Data, Python Programming