Train your ML models on a GPU by changing just one line of code

Author: Murphy  |  2025-03-23
Photo by Thomas Foster on Unsplash

Introduction

Graphics Processing Units (GPUs) can significantly accelerate calculations for preprocessing steps and for training machine learning models. Training typically involves compute-intensive matrix multiplications and other operations that can take advantage of a GPU's massively parallel architecture. Training on a large dataset can take hours on a single processor, but offloading those tasks to a GPU can reduce the training time to minutes.

In this story, we'll show you how to use the ATOM library to easily train your machine learning pipeline on a GPU. ATOM is an open-source Python package designed to help data scientists speed up the exploration of machine learning pipelines. Read this story if you want a gentle introduction to the library.


Set up

ATOM uses cuML as its backend library for GPU training. cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. Unfortunately, cuML cannot be installed through pip and is therefore not installed as a dependency of ATOM. Read here how to install it.

Requirements for cuML to take into account:

  • Operating System: Ubuntu 18.04/20.04 or CentOS 7/8 with gcc/g++ 9.0+, or Windows 10+ with WSL2
  • GPU: NVIDIA Pascal™ or better with Compute capability 6.0+
  • Drivers: CUDA & NVIDIA Drivers of versions 11.0, 11.2, 11.4 or 11.5

Tip: See this repo to install cuML on SageMaker Studio Lab.


Example

Training transformers and models in atom on a GPU is as easy as initializing atom with the parameter device="gpu". The device parameter accepts any string that follows the SYCL_DEVICE_FILTER filter selector. Examples are:

  • device="cpu" (use CPU)
  • device="gpu" (use default GPU)
  • device="gpu:0" (use first GPU)
  • device="gpu:1" (use second GPU)

Note: ATOM does not support multi-GPU training. If there is more than one GPU on the machine and the device parameter does not specify which one to use, the first one is used by default.
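To make the device-string convention concrete, here is a minimal sketch of how a SYCL-style string such as "gpu:1" splits into a device type and an optional index. The `parse_device` helper is our own illustration of the format, not part of ATOM's API.

```python
def parse_device(device: str) -> tuple[str, int]:
    """Split a SYCL-style device string like "gpu:1" into a
    device type and an index (defaulting to the first device)."""
    dev, _, idx = device.partition(":")
    return dev.lower(), int(idx) if idx else 0

# "gpu" selects the default (first) GPU; "gpu:1" selects the second one
print(parse_device("gpu"))    # ('gpu', 0)
print(parse_device("gpu:1"))  # ('gpu', 1)
```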

Use the engine parameter to choose between the cuML and sklearnex execution engines. In this story, we will focus on cuML. The XGBoost, LightGBM and CatBoost models come with their own GPU engine. Setting device="gpu" is sufficient to accelerate them with GPU, regardless of the engine parameter. Click here for an overview of the transformers and models that support GPU acceleration.

Tip: If you don't have access to a GPU, you can use online cloud services like Google Colab or Sagemaker Studio Lab to try it out. Make sure to choose the GPU compute type. See this notebook to get you started.

Let's get started with the example.

from atom import ATOMClassifier
from sklearn.datasets import make_classification

# Create a dummy dataset
X, y = make_classification(n_samples=100000, n_features=40)

atom = ATOMClassifier(X, y, device="gpu", engine="cuml", verbose=2)

Not only models, but also transformers can benefit from GPU acceleration. For example, let's scale the features to mean=0 and std=1.

atom.scale()

print(atom.dataset)
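For readers without a GPU: cuML's scaler mirrors scikit-learn's StandardScaler, so the same transformation can be sketched on CPU as below. This is a plain sklearn illustration (no ATOM or cuML assumed), confirming the features end up with mean 0 and std 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=1000, n_features=40, random_state=0)

# StandardScaler is the CPU counterpart of cuML's StandardScaler
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```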

Since we stated that we want to use the cuML engine, ATOM automatically selects the transformer from that library whenever available.

print(f"Scaler used: {atom.standard}")
print(f"Scaler's module: {atom.standard.__class__.__module__}")

Let's train three models: the Random Forest, which is available in cuML; the Stochastic Gradient Descent, which is not; and XGBoost, which has its own GPU implementation.

atom.run(models=["RF", "SGD", "XGB"])
atom.results

Note the massive difference in training time between the models!

If we check the underlying estimators, we'll see that the RF model was indeed trained on GPU, the SGD model wasn't (since it's not available in cuML, ATOM falls back to the default sklearn implementation), and the XGB model trained on GPU using its native module.

for m in atom.models:
    print(f"{m}'s module: {atom[m].estimator.__class__.__module__}")
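The same `__module__` inspection works outside ATOM too. As a CPU-only illustration (no cuML assumed), the sketch below prints the module each plain scikit-learn estimator class comes from; a cuML-backed estimator would instead report a module starting with "cuml".

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

for est in (RandomForestClassifier(), SGDClassifier()):
    # The module name reveals which library backs the estimator
    print(f"{est.__class__.__name__}: {est.__class__.__module__}")
```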

Lastly, analyzing the results is as easy as always.

atom.evaluate()
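If you want to reproduce a similar evaluation without ATOM, the sketch below scores a model on a held-out set with plain scikit-learn metrics. The metric choice here is our own and does not mirror the exact output of `atom.evaluate()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training split, score on the held-out test split
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"f1: {f1_score(y_test, y_pred):.3f}")
```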

Conclusion

We have shown how to use the ATOM package to train your machine learning pipelines on GPU. ATOM is also capable of acceleration on CPU. Read this story to learn how.

For further information about ATOM, have a look at the package's documentation. For bugs or feature requests, don't hesitate to open an issue on GitHub or send me an email.

References:

  • All plots and images (except the featured image) are created by the author.

Related stories:

ATOM: A Python package for fast exploration of machine learning pipelines

Exploration of Deep Learning pipelines made easy

Make your sklearn models up to 100 times faster

Tags: Gpu Machine Learning Modeling Python Packages Sklearn
