Need for Speed: cuDF Pandas vs. Pandas

Image by Author (Dalle-3)

What is cuDF Pandas?

If you're a user of the Pandas library in Python and you want or need to minimise your program run times, you have a few options available to you. Most of them revolve around external libraries that replace existing Pandas operations and are optimised for data processing at scale and speed. Examples of these libraries are Vaex, Polars, DuckDB and others. The issue with these is that, in general, they require you to rewrite your code to a greater or lesser extent, which may not be something you want, or have the ability, to do.

If you are lucky enough to have a GPU on your system, then another, more recent option is cuDF.pandas.

cuDF.pandas is built upon cuDF, a Python GPU DataFrame library (based on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.

To use cuDF Pandas, you simply supply a flag if running Python from the command line, or load an extension if running Python via a Jupyter Notebook. When GPU computation is supported (e.g. there's an NVIDIA GPU available, and cuDF knows how to run the Pandas code), your code will run on the GPU. In cases where this is not possible, cuDF automatically switches to running on the CPU. You don't need to write two versions of your code, and you don't need to handle switching between GPU and CPU manually.
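
As a quick illustration of both routes (the script name my_script.py below is just a placeholder), here's what enabling cuDF Pandas looks like:

# In a Jupyter Notebook, load the extension before importing pandas
%load_ext cudf.pandas
import pandas as pd  # pd now runs on the GPU where it can, and falls back to the CPU where it can't

# From the command line, the equivalent is the cudf.pandas module flag:
#   python -m cudf.pandas my_script.py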

cuDF is part of RAPIDS, an open-source suite of software libraries and APIs created and maintained by NVIDIA, the pioneering company in GPU (Graphics Processing Unit) technology and the creator of CUDA (Compute Unified Device Architecture). RAPIDS is designed to enable data scientists and engineers to leverage the power of GPU acceleration for data exploration and analytics pipelines. It does this by using the parallel processing capability of GPUs and allows for significant performance improvements compared to traditional CPU-based data processing.

In this article, we'll pit regular Pandas against cuDF Pandas and see what all the fuss is about. We'll source a large input data set, read it into a Pandas dataframe and perform some manipulations on it. At each step, we'll time how long regular Pandas takes to do the operations versus cuDF Pandas.

Pre-requisites

  • Install WSL Ubuntu Linux

As I use a Windows-based system, and CUDA installation on that platform is more complicated, I opted to do my installation under Linux. Luckily, the Windows Subsystem for Linux (WSL) comes in handy for that purpose.

To install it, open a Powershell command window, then you can simply type in:-

(base) PS C:\Users\thoma> wsl --install

Installing: Windows Subsystem for Linux
Windows Subsystem for Linux has been installed.
Installing: Ubuntu
Ubuntu has been installed.
The requested operation is successful. Changes will not be effective until the system is rebooted.

Alternatively, you can download a suitable WSL Linux distribution from the Microsoft Store (it's free!) and follow the installation instructions from there. Here's a link.

Ubuntu – Official app in the Microsoft Store

Next, reboot your PC and the WSL should kick in automatically and you will be asked to set up a username and password. If all goes OK, your command window should look like this:

Image by Author

To quit WSL Linux, type exit at the prompt. After installation, to invoke Ubuntu again, type ubuntu at a regular Powershell command window to go back into a Linux shell.

  • Install the latest Nvidia drivers for your GPU and system

Go to the Nvidia website and install the latest drivers relevant to your system and GPU spec.

Download the latest official NVIDIA drivers

  • Install Miniconda on WSL

Once you have installed WSL and started it up, type in the following commands to download the Miniconda installer and run it.

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ bash Miniconda3-latest-Linux-x86_64.sh
  • Install RAPIDS

You can install a RAPIDS environment via conda, pip or a Docker file, but we will be using conda. Click on the RAPIDS guide below for full installation instructions:-

Installation Guide – RAPIDS Docs

Near the top of the install guide, you will see a screen that looks like this,

Image from Nvidia RAPIDS Install guide

As you click on each of the boxes in the top half of the screen, the COMMAND output box at the bottom will change to reflect what you've chosen. So, click on each relevant box depending on your requirements. When you are done, copy the command contained in the COMMAND output box and type that into your WSL terminal. It will take a few minutes to complete. My COMMAND looked like this, but yours will likely be different.

conda create --solver=libmamba -n rapids-23.12 -c rapidsai -c conda-forge \
    -c nvidia rapids=23.12 python=3.10 cuda-version=12.0

Once it's done, we can activate our new environment using,

$ conda activate rapids-23.12
  • Install Jupyter

To run our speed tests we'll install Jupyter so we can run a notebook.

(rapids-23.12)$ pip install jupyter

Ok, we're pretty much done with our setup now. Let's get on to the good stuff.

Note that I ran these tests on an HP Envy Desktop TE02 running Windows 11 under Ubuntu WSL, with an NVIDIA GeForce RTX 4070 Ti GPU and a 13th Gen Intel Core i9-13900K (3000 MHz, 24 cores, 32 logical processors).

The version of Pandas used was 1.5.3.
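
If you want to confirm the versions in your own environment, a quick sanity check in a notebook cell (not part of the timed tests) is:

import pandas as pd
import cudf

print(pd.__version__)    # 1.5.3 in my case
print(cudf.__version__)  # the cuDF version that came with your RAPIDS release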

Our Input data

For input data, we'll use an open-source dataset hosted on Kaggle. The data I've chosen is the NYC FHV (Uber/Lyft) Trip Data Expanded (2019–2022) and can be found here:

NYC FHV (Uber/Lyft) Trip Data Expanded (2019-2022)

This data set was originally sourced from the nyc.gov website and is governed under the nyc.gov Terms of Use. The dataset is made available by NYC Open Data, which makes its data available under the Creative Commons CC0: Public Domain license as cited under the company's Kaggle account dataset Metadata section. This means that anyone can "… copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."

I downloaded the data to my local desktop. There were several ancillary and look-up files, but the main data set consisted of 46 Parquet files with a total size of 18 GB. During initial testing, the whole data set proved too big to read into a dataframe, so I copied just the files for the first 6 months of 2021 and used them as my inputs (see the sketch after the column table below). Here is a description of the columns and their meaning.

Field Name                Description
----------                -------------
Hvfhs_license_num         The TLC license number of the HVFHS base or business
                          As of September 2019, the HVFHS licensees are the following:
                              • HV0002: Juno
                              • HV0003: Uber
                              • HV0004: Via
                              • HV0005: Lyft
Dispatching_base_num      The TLC Base License Number of the base that dispatched the trip
Pickup_datetime           The date and time of the trip pick-up
DropOff_datetime          The date and time of the trip drop-off
PULocationID              TLC Taxi Zone in which the trip began
DOLocationID              TLC Taxi Zone in which the trip ended
originating_base_num      base number of the base that received the original trip request
request_datetime          date/time when passenger requested to be picked up
on_scene_datetime         date/time when driver arrived at the pick-up location 
                          (Accessible Vehicles-only)
trip_miles                total miles for passenger trip
trip_time                 total time in seconds for passenger trip
base_passenger_fare       base passenger fare before tolls, tips, taxes, and fees
tolls                     total amount of all tolls paid in trip
bcf                       total amount collected in trip for Black Car Fund
sales_tax                 total amount collected in trip for NYS sales tax
congestion_surcharge      total amount collected in trip for NYS congestion surcharge
airport_fee               $2.50 for both drop off and pick up at LaGuardia, 
                          Newark, and John F. Kennedy airports
tips                      total amount of tips received from passenger
driver_pay                total driver pay (not including tolls or tips and net of commission, 
                          surcharges, or taxes)
shared_request_flag       Did the passenger agree to a shared/pooled ride, regardless 
                          of whether they were matched? (Y/N)
shared_match_flag         Did the passenger share the vehicle with another passenger who 
                          booked separately at any point during the trip? (Y/N)
access_a_ride_flag        Was the trip administered on behalf of the Metropolitan 
                          Transportation Authority   (MTA)? (Y/N)
wav_request_flag          Did the passenger request a wheelchair-accessible 
                          vehicle (WAV)? (Y/N)
wav_match_flag            Did the trip occur in a wheelchair-accessible 
                          vehicle (WAV)? (Y/N)
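
As mentioned above, I only used the files for the first six months of 2021. Here's a minimal sketch of how you might copy just those files out of the full download; the source and destination paths are placeholders for wherever you keep the data:

import glob
import shutil

src = "/mnt/d/fhvhv_full"   # hypothetical folder holding all 46 Parquet files
dst = "/mnt/d/test2"        # the folder the tests below read from

# The monthly files follow the fhvhv_tripdata_YYYY-MM.parquet naming convention,
# so a character range picks up January to June 2021
for f in sorted(glob.glob(f"{src}/fhvhv_tripdata_2021-0[1-6].parquet")):
    shutil.copy(f, dst)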

Start up your Jupyter Notebook.

(rapids-23.12)$ jupyter notebook

Your notebook may start automatically in the browser, but if it doesn't, at the end of the above command output, you'll see a couple of links you can use to paste into your browser to start the notebook manually, like this:-


    To access the server, open this file in a browser:
        file:///home/tom/.local/share/jupyter/runtime/jpserver-897-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/tree?token=703515a870b45b58eed9b73b69621b3f0f1cb790fefaa1b0
        http://127.0.0.1:8888/tree?token=703515a870b45b58eed9b73b69621b3f0f1cb790fefaa1b0

Copy one of the URLs and paste it into a browser and the notebook should start.

OK, it's time for some coding, but before we do anything we should verify that our system has an Nvidia GPU. In a notebook cell type in and run the command !nvidia-smi

!nvidia-smi

Thu Dec 21 10:57:54 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.01              Driver Version: 546.01       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  | 00000000:01:00.0  On |                  N/A |
| 32%   24C    P8               9W / 285W |    579MiB / 12282MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Great, we can start the test. The same code is used for both cuDF Pandas and regular Pandas. The only difference is that, before running the cuDF Pandas version, we need to load an extension using the %load_ext cudf.pandas Jupyter Notebook magic command.

Test 1: Reading the input data into a Pandas dataframe

First, read the data using regular Pandas.

!dir /mnt/d/test2

fhvhv_tripdata_2021-01.parquet     fhvhv_tripdata_2021-04.parquet
fhvhv_tripdata_2021-02.parquet     fhvhv_tripdata_2021-05.parquet
fhvhv_tripdata_2021-03.parquet     fhvhv_tripdata_2021-06.parquet
%%time

# regular Pandas first

import pandas as pd

print(pd)

df=pd.read_parquet("/mnt/d/test2/")
print('df shape = ' , df.shape)

Pandas Output.



df shape =  (81542237, 24)

CPU times: user 32.6 s, sys: 9.9 s, total: 42.5 s
Wall time: 1min 6s

Now we can load our extension (in a cell of its own). After that, any Pandas operations will use cuDF Pandas processing automatically where possible.

%load_ext cudf.pandas

%%time

import pandas as pd

# Now we're using cuDF Pandas
print(pd)

df=pd.read_parquet("/mnt/d/test2/")
print('df shape = ' , df.shape)

cuDF Pandas Output



df shape =  (81542237, 24)

CPU times: user 6.41 s, sys: 344 ms, total: 6.75 s
Wall time: 19.3 s

A great start: we saved around two-thirds of the time it took to load our data set compared with regular Pandas. OK, let's do some data wrangling on our dataframe.

Test 2: Find the time of day where most pickups happened

%%time
df['pickup_hour'] = df['pickup_datetime'].dt.hour
df['pickup_minute'] = df['pickup_datetime'].dt.minute
pickup_time_hour = df['pickup_hour'].value_counts()
pickup_time_minute = df['pickup_minute'].value_counts()
busiest_pickup_time = str(pickup_time_hour.idxmax()).zfill(2) + ':' + str(pickup_time_minute.idxmax()).zfill(2)
print("The busiest pick-up time was ",busiest_pickup_time)

Pandas Output.

The busiest pick-up time was 18:05

CPU times: user 2.32 s, sys: 341 ms, total: 2.67 s
Wall time: 2.89 s

cuDF Pandas Output.

The busiest pick-up time was 18:05

CPU times: user 477 ms, sys: 11.9 ms, total: 488 ms
Wall time: 611 ms

Almost a 5x speed-up that time.
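
As an aside, the code above takes the most common hour and the most common minute independently. If you wanted the single busiest hour:minute combination instead, a sketch along these lines (not timed here) would do it:

# Count pick-ups per exact hour:minute of the day and take the most frequent one
busiest_pickup_time = df['pickup_datetime'].dt.strftime('%H:%M').value_counts().idxmax()
print("The busiest pick-up time was ", busiest_pickup_time)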

Test 3: Find the total revenue for each base

%%time
total_revenue = df.groupby('dispatching_base_num')[
    ['base_passenger_fare', 'tolls', 'tips', 'congestion_surcharge']].sum()
print(total_revenue)

Pandas Output

                     base_passenger_fare        tolls         tips  congestion_surcharge
dispatching_base_num                                                                     
B02395                       1.474796e+07    574488.16    402665.20             618686.75
B02510                       4.716340e+08  19379214.63  19215713.10           21879060.50
B02512                       6.186856e+06    236174.51    262159.19             396786.50
B02617                       3.836497e+07   1515676.50   1153720.09            1708146.00
B02682                       4.526493e+07   1847330.64   1334644.86            2144103.50
B02764                       1.469802e+08   5713613.94   4361999.91            7030078.00
B02765                       8.166917e+07   3334981.85   2384550.98            3844106.75
B02800                       1.622731e+07     90408.83    308746.87             987402.25
B02835                       2.867439e+07   1137756.51    821118.32            1370536.75
B02836                       1.931810e+07    775884.67    571422.48             941426.75
B02844                       5.292548e+05     17624.72     19706.52              20352.75
B02864                       4.199323e+07   1655849.89   1122411.34            1843405.50
B02865                       1.177934e+07    469413.75    335015.62             521320.25
B02866                       4.880302e+07   1897281.99   1452356.19            2301458.50
B02867                       2.884545e+07   1157424.65    865592.92            1350514.00
B02869                       6.353139e+07   2598507.32   1825205.33            2926896.50
B02870                       1.389362e+07    562184.92    380304.22             606680.25
B02871                       4.311818e+07   1598683.00   1265205.02            1935304.25
B02872                       1.285820e+08   4825429.91   3727557.88            6387887.00
B02875                       9.814507e+07   3542208.00   2754228.97            4623036.00
B02876                       3.214636e+07   1289244.87    932017.00            1512857.50
B02877                       2.794710e+07   1125953.11    750703.83            1206526.75
B02878                       4.112076e+07   1654463.46   1086647.72            1758366.50
B02879                       2.887706e+07   1152262.66    773781.26            1220865.25
B02880                       1.757034e+07    722508.22    511862.75             806575.00
B02882                       3.406335e+07   1379719.91    979974.21            1546639.25
B02883                       3.696824e+07   1516183.09   1043751.52            1675511.75
B02884                       3.559980e+07   1457659.51   1015496.71            1621966.50
B02887                       4.677254e+07   1911234.21   1298297.31            2048898.50
B02888                       2.586880e+07   1051767.65    764963.45            1195620.25
B02889                       2.111926e+07    836332.75    592278.19             922916.50
B03136                       2.370921e+05       270.42      5050.47              22882.75

CPU times: user 2.19 s, sys: 587 ms, total: 2.78 s
Wall time: 2.94 s

cuDF Pandas Output

# The data output was the same as above; here is the timing
#

CPU times: user 425 ms, sys: 0 ns, total: 425 ms
Wall time: 453 ms

This time the speed-up was even bigger: roughly 6.5x.

Test 4: Average fare per day

%%time

df['pickup_day_of_week'] = df['pickup_datetime'].dt.day_name()
average_fare_per_day = df.groupby('pickup_day_of_week')['base_passenger_fare'].mean()
print(average_fare_per_day)

Pandas Output

pickup_day_of_week
Friday       21.277356
Monday       20.333885
Saturday     21.122636
Sunday       21.445824
Thursday     20.702871
Tuesday      20.141152
Wednesday    20.375923
Name: base_passenger_fare, dtype: float64

CPU times: user 8.55 s, sys: 1.41 s, total: 9.96 s
Wall time: 10.2 s

cuDF Pandas Output

pickup_day_of_week
Friday       21.277356
Monday       20.333885
Saturday     21.122636
Sunday       21.445824
Thursday     20.702871
Tuesday      20.141152
Wednesday    20.375923
Name: base_passenger_fare, dtype: float64

CPU times: user 8.77 s, sys: 1.28 s, total: 10 s
Wall time: 10 s

I was interested to find out why there was no real difference between the regular Pandas and cuDF Pandas run times for this last test. Luckily, there are some profilers you can use. Let's see one in action.

%%cudf.pandas.line_profile

df['pickup_day_of_week'] = df['pickup_datetime'].dt.day_name()
average_fare_per_day = df.groupby('pickup_day_of_week')['base_passenger_fare'].mean()
print(average_fare_per_day)

This produced the following output.


pickup_day_of_week
Friday       21.277356
Monday       20.333885
Saturday     21.122636
Sunday       21.445824
Thursday     20.702871
Tuesday      20.141152
Wednesday    20.375923
Name: base_passenger_fare, dtype: float64

                                             Total time elapsed: 9.975 seconds                                     

                                                           Stats                                                   

Line no.  Line                                                                    GPU TIME(s)   CPU TIME(s)
--------  ----                                                                    -----------   -----------
1         df['pickup_day_of_week'] = df['pickup_datetime'].dt.day_name()         2.059699706   7.484354633
2         average_fare_per_day = df.groupby('pickup_day_of_week')['base_pass...  0.219562568
3         print(average_fare_per_day)                                             0.005559790

It's clear from this output that the first line of code ran mostly on the CPU rather than the GPU. I'm assuming that's because this particular datetime operation isn't (fully) supported on the GPU, so cuDF Pandas falls back to regular CPU processing for it.
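
If that assumption is correct, one way to keep more of the work on the GPU might be to derive a numeric weekday instead of a string column, and only attach the day names after aggregation, when there are just seven rows left. This is a rough, untested sketch rather than something I benchmarked:

# Numeric day of week (0 = Monday) instead of a string column
df['pickup_dow'] = df['pickup_datetime'].dt.dayofweek
average_fare_per_day = df.groupby('pickup_dow')['base_passenger_fare'].mean().sort_index()

# Map the small, aggregated result back to day names
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
average_fare_per_day.index = [day_names[i] for i in average_fare_per_day.index]
print(average_fare_per_day)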

Summary

In summary, the total time taken for all tests with regular Pandas was 82.3 seconds, and the total time using cuDF Pandas was 30.36 seconds, an improvement of just over 60%. Not too shabby, albeit the majority of the time saved was in the loading of the input data.

One of the most important points about using cuDF Pandas is that you don't need to change your Pandas code at all. A simple Notebook extension load, or an extra flag on the Python call before running your program, is all you need. The initial setting up of your system may be a bit involved, but it's a one-off operation, and I think the potential benefits in your run times will make up for that.

_That's all from me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

Click on these links for more of my articles related to subjects covered in this story that I think you'll enjoy.

DuckDB

Pandas 2.0 – Numpy vs Arrow back-end

Tags: cuDF Data Engineering GPU Pandas Python
