2 Simple Steps To Reduce the Memory Usage of Your Pandas Dataframe

"If I fits, I sits" – Fitting a large dataset into RAM (Image hand-drawn by the author)

Data analysts and scientists work with large datasets on a day-to-day basis. The dataset size can quickly become an issue if it consumes all the available memory or when you need to manipulate the data in memory.

In this article, we will look at two simple tricks you can apply when loading a large dataset from a CSV file into a Pandas dataframe so that it fits into your available RAM:

  1. Explore Which Columns Are Relevant For You
  2. Reduce Data Types (Downcasting)

Disclaimer: These steps can help reduce a dataframe's memory footprint, but they cannot guarantee that your dataset will be small enough to fit into your RAM after applying them.

Step 1: Explore Which Columns Are Relevant for You

You will seldom need all the columns of a dataset in memory. That is why the first step is to review which columns are relevant to you.

Since the whole dataset does not fit in your RAM at this point, the simple trick is to load only the first few rows (e.g., nrows = 1000) of the dataset.

import pandas as pd

# Load only the first 1,000 rows to get an overview of the columns
df = pd.read_csv(file_path, 
                 nrows = 1000) 

The number of rows nrows should be

  • small enough to fit the dataframe in your RAM and
  • large enough to decide for each column if it is relevant for you or not.

Once you have evaluated which columns you need for your purposes, you can then proceed to load only those into the dataframe:

# Load only the columns you actually need
df = pd.read_csv(file_path, 
                 usecols = ["column_1", ...,  "column_n"]) 

Step 2: Reduce Data Types (Downcasting)

Since Pandas loads columns into the widest data type (e.g., integers as int64) by default, your initial dataframe might be larger than necessary. Thus, the second step is to evaluate whether you can reduce the data type to a narrower one.

For example, if you can reduce one column from int64 (8 bytes) to int8 (1 byte) in a dataframe with 1 million rows, you will save roughly 7 MB ((8 - 1) bytes * 1,000,000 rows). Now imagine doing this for multiple columns.
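
As a rough illustration, here is a minimal sketch of such a downcast, using a made-up column of small random integers:

import numpy as np
import pandas as pd

# Hypothetical column of 1 million small integers, stored as int64
df = pd.DataFrame({"column_1": np.random.randint(0, 100, size=1_000_000, dtype=np.int64)})

print(df["column_1"].memory_usage(deep=True))  # roughly 8,000,000 bytes

# All values fit into the int8 range (-128 to 127), so we can downcast
df["column_1"] = df["column_1"].astype(np.int8)

print(df["column_1"].memory_usage(deep=True))  # roughly 1,000,000 bytes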

For this step, it is essential to load the entire column to explore its minimum and maximum values. If the dataset with the reduced number of columns from step 1 still does not fit into your memory, you can load subsets of the columns for this step.


Before we dive in, here are some handy Pandas commands to evaluate the columns' data types and their memory usage:

Data types – You can use the df.info() method or the df.dtypes attribute to find each column's data type.

Memory usage – To find how many bytes one column and the whole dataframe are using, you can use the following commands:

  • df.memory_usage(deep = True): How many bytes does each column use?
  • df.memory_usage(deep = True).sum(): How many bytes does the whole dataframe use?
  • df.info(memory_usage = "deep"): How many bytes does the whole dataframe use?
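
Here is a minimal sketch of these commands on a small, made-up dataframe:

import pandas as pd

# Small, made-up example dataframe
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "score": [0.5, 0.7, 0.9],
    "country": ["DE", "US", "JP"],
})

print(df.dtypes)                         # data type of each column
print(df.memory_usage(deep=True))        # bytes used by each column
print(df.memory_usage(deep=True).sum())  # bytes used by the whole dataframe
df.info(memory_usage="deep")             # column overview incl. total memory usage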

Now, how do you choose the correct data type?

First, answer the question "What generic data type do you have?" – A boolean, an integer, a floating point number, or a string?

Boolean

If you have a boolean variable, it would be a shame to use an 8 byte int64 to store zeros and ones.

Make sure that you are using one of the following two:

  • bool for False or True
  • uint8 for 0 or 1

Both only take up 1 byte.
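
For example, a 0/1 flag column (the column name below is made up) could be converted like this:

import numpy as np
import pandas as pd

# Hypothetical flag column that was loaded as int64
df = pd.DataFrame({"is_active": [0, 1, 1, 0]})

# Option 1: store as bool (False/True), 1 byte per value
df["is_active_bool"] = df["is_active"].astype(bool)

# Option 2: keep 0/1 but shrink to uint8, also 1 byte per value
df["is_active_uint8"] = df["is_active"].astype(np.uint8)

print(df.dtypes)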

Integers

For integers, you need to look at the entire value range of the variable. Specifically, you need to determine the minimum and maximum values.

# Get the statistical properties of the full dataframe
df.describe()

# Or specifically get minimum and maximum values
df['column_1'].min()
df['column_1'].max()

If you are more of a visual type, you can alternatively plot the boxplot of that column:

import seaborn as sns

sns.boxplot(data = df['column_1']);

If the minimum value is greater than or equal to 0, you will use an unsigned integer. Then select the narrowest data type according to the maximum value.

  • uint8: 1 byte unsigned integer (0 to 255)
  • uint16: 2 byte unsigned integer (0 to 65,535)
  • uint32: 4 byte unsigned integer (0 to 4,294,967,295)
  • uint64: 8 byte unsigned integer (0 to 18,446,744,073,709,551,615)

If the minimum value is smaller than 0, you will use a signed integer. Then select the narrowest data type according to the minimum and maximum values (or let Pandas pick it for you, as shown in the snippet after this list).

  • int8: 1 byte signed integer (-128 to 127)
  • int16: 2 byte signed integer (-32,768 to 32,767)
  • int32: 4 byte signed integer (-2,147,483,648 to 2,147,483,647)
  • int64: 8 byte signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
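
If you prefer not to look up the value ranges by hand, pd.to_numeric with the downcast argument lets Pandas pick the narrowest integer type for you. A minimal sketch with made-up values:

import pandas as pd

s_unsigned = pd.Series([0, 42, 255], dtype="int64")
print(pd.to_numeric(s_unsigned, downcast="unsigned").dtype)  # uint8

s_signed = pd.Series([-5, 42, 127], dtype="int64")
print(pd.to_numeric(s_signed, downcast="integer").dtype)  # int8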

Floating point numbers

Similarly to integers, you need to know the minimum and maximum values to determine the required width of a float data type. Additionally, you need to know the required precision points:

  • float16 (half-precision): 2 byte floating-point number (approximately -65,504 to 65,504 with approximately 4 significant decimal digits)
  • float32 (single-precision): 4 byte floating-point number (approximately -3.4 x 10^38 to 3.4 x 10^38 with approximately 7 significant decimal digits)
  • float64 (double-precision): 8 byte floating-point number (approximately -1.8 x 10^308 to 1.8 x 10^308 with approximately 16 significant decimal digits)

Please note that the precision may be limited to fewer decimal digits due to the limited number of bits available to represent the number.
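
A minimal sketch of downcasting a float column (the values are made up; check that the lost precision is acceptable for your use case):

import numpy as np
import pandas as pd

s = pd.Series([0.1, 2.5, 3.75], dtype="float64")

# Manual downcast to float32
s_float32 = s.astype(np.float32)

# Or let Pandas downcast for you (goes down to float32 at most)
s_auto = pd.to_numeric(s, downcast="float")

print(s_float32.dtype, s_auto.dtype)  # float32 float32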

Strings and Objects

Since strings have variable lengths, they are stored as object in a dataframe by default. You could instead store them as fixed-width (byte) strings with a maximum length:

# Convert the object column to a fixed-width byte string of at most 4 characters
df['column_1'] = df['column_1'].astype('|S4') 

However, a fixed-width string of length N still uses N bytes per value.

Thus, you might want to review if any of the following options apply:

Categorical variables – If you have a categorical variable, you could label encode or one hot encode the column.
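
For instance, label encoding a string column can be done with Pandas' category dtype; a minimal sketch with a made-up column:

import pandas as pd

# Hypothetical column with a small number of repeated string values
df = pd.DataFrame({"color": ["red", "green", "red", "blue", "green"]})

# Label encode: one small integer code per value (int8 when there are few categories)
df["color_code"] = df["color"].astype("category").cat.codes

# Alternatively, keep the category dtype itself, which also saves memory
df["color_cat"] = df["color"].astype("category")

print(df.dtypes)
print(df.memory_usage(deep=True))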

Identifiers – A common type of string variable is an identifier. If applicable, you could experiment with converting the string to an integer.

For example, you could take the last 16 characters of the identifier string and interpret them as a hexadecimal number, as shown below. Just make sure that each identifier stays unique in this process and that the resulting number fits into the int64 value range.

df['column_id'] = df['column_id'].apply(lambda x: int(x[-16:], 16)).astype('int64')

In the process of downcasting or narrowing the data types, you need to be aware of the loss of precision for floating point numbers and overflow for integers.

For example, let's see what happens when we cast a uint16 variable with a = 256 to a uint8 (value range of 0 to 255).

The issue of an integer overflow when downcasting (Image by the author)

As you can see, the largest number that a uint8 can hold in 1 byte (or 8 bits) is 255. If you want to store the number 256, you need at least 9 bits. Since a uint8 will only look at the first 8 bits, the 9th bit will be cut off.

Thus, if you downcast a uint16 with a = 256 to a uint8, a becomes 0.
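
You can reproduce this overflow with a small, made-up example:

import numpy as np
import pandas as pd

s = pd.Series([255, 256], dtype="uint16")

# Downcasting keeps only the lowest 8 bits, so 256 wraps around to 0
print(s.astype(np.uint8).tolist())  # [255, 0]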


Now that you know which data type each relevant column should have, you can downcast them. Since a CSV file does not contain any information about the data types of its columns, you have to specify the data type for each column during loading.

import numpy as np

df = pd.read_csv(file_path, 
                 usecols = ["column_1", ...,  "column_n"], 
                 dtype = {"column_1" : np.int8,
                          ...,
                          "column_n" : np.float32,
                         }
                 )

You can also change the data type of a column after a dataframe is loaded:

df['column_1'] = df['column_1'].astype(np.int8)

Summary

This article explained how you could reduce the memory size of a Pandas dataframe when loading a large dataset from a CSV file. We learned that reducing the number of columns and then downcasting their data types can significantly reduce the amount of required memory. By utilizing these techniques, we can ensure that we use the available RAM efficiently.

After you apply these tricks, you can also save the dataframe in a file format other than CSV that preserves the columns' data types (e.g., Parquet or Feather).
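
A minimal sketch, assuming df is your downcast dataframe and using a made-up file name (writing Parquet requires the pyarrow or fastparquet package):

import numpy as np
import pandas as pd

# Hypothetical downcast dataframe from the previous steps
df = pd.DataFrame({"column_1": np.array([1, 2, 3], dtype=np.int8)})

# Parquet stores the column data types alongside the data
df.to_parquet("dataset.parquet")

# Reloading restores the reduced data types without specifying them again
print(pd.read_parquet("dataset.parquet").dtypes)  # column_1: int8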


Additionally, to speed up processing on a large dataset, you could also have a look at the Pandas alternative Polars.


Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.

Find me on LinkedIn, Twitter, and Kaggle!
