Training XGBoost On A 1TB Dataset
As Machine Learning continues to evolve, we're seeing larger models with more and more parameters. At the same time, datasets are growing incredibly large, and at the end of the day any model is only as good as the data it's trained on. Working with large models and datasets is computationally expensive, which makes it difficult to iterate and experiment in a timely manner. In this article we'll focus on the large-dataset side of the problem. Specifically, we'll look at Distributed Data Parallel training and how to utilize Amazon SageMaker to optimize and reduce training time on a dataset at real-world scale.
For today's example we'll train the SageMaker XGBoost algorithm on an artificially generated 1 TB dataset. Along the way we'll get a deeper understanding of how to prepare and structure the data source for faster training, as well as how to kick off distributed training with SageMaker's built-in data parallelism capabilities.
NOTE: This article assumes basic knowledge of AWS and SageMaker-specific features such as SageMaker Training, as well as of interacting with SageMaker and AWS via the SageMaker Python SDK and the Boto3 AWS Python SDK. For a proper introduction and overview of SageMaker Training, I would reference this article.
What is Distributed Data Parallel? Why do we need it?
Before we can get started with the implementation, it's crucial to understand distributed data parallel training. With large datasets it's really difficult to optimize training time because training is so computationally intensive: you have to account for getting the dataset onto the machine as well as for all the training and hyperparameter computations the machine has to perform. A single machine can handle this if it's powerful enough, but it's inefficient from a time standpoint and makes experimentation a nightmare.
With Distributed Data Parallel you work with a cluster of instances, each of which can contain multiple CPUs/GPUs. Building this setup from the ground up is challenging, and there is a lot of node-to-node communication overhead that needs to be addressed. To simplify matters we can utilize the built-in SageMaker Distributed Data Parallel library. Here the hard work of building and optimizing node-to-node communication is abstracted away, and you can focus on model development.
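As a quick illustration of how that library is switched on (this is separate from the built-in XGBoost path we follow later in this article, where distribution is driven purely by the instance count), here is a minimal sketch for a SageMaker framework estimator. The training script name and instance choices are placeholders, not part of this article's example.
import sagemaker
from sagemaker.pytorch import PyTorch

# minimal sketch: enabling the SageMaker Distributed Data Parallel library
# on a framework (PyTorch) estimator; train.py is a hypothetical script
ddp_estimator = PyTorch(
    entry_point="train.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",  # the library requires supported GPU instances
    instance_count=2,
    # this distribution flag turns on the data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)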
Why does data source matter with SageMaker?
Where and how our data is provided is essential to optimizing training time. With SageMaker the de-facto storage service has always been S3, and that's still an option here. In the vanilla setup you upload your dataset directly to S3; this is known as File Mode, where SageMaker downloads your dataset onto the training instance before training kicks off. For today's example we'll work with an optimized S3 mode known as Fast File Mode. With Fast File Mode the dataset is streamed to the instance in real time, so we avoid the overhead of downloading the entire dataset. This raises the question: how should I provide and split my dataset? For this example we'll split our dataset into many smaller files that add up to 1 TB, which once again helps with download (or, in our case, streaming) time given how large the dataset is.
Outside of S3 there are also options to work with Elastic File System (EFS) and Amazon FSx for Lustre on SageMaker. If your training data already resides on EFS, it's easy to mount onto SageMaker. FSx for Lustre can scale to greater throughput than the other options, but there is operational overhead in setting up the VPC networking it requires. A sketch of what those inputs look like is shown below.
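If your data already lives on one of these file systems, a hedged sketch of the input configuration with the SageMaker Python SDK's FileSystemInput might look like the following. The file system IDs and directory paths are placeholders, and the estimator also needs the corresponding VPC subnets and security groups configured.
from sagemaker.inputs import FileSystemInput

# hypothetical EFS input; the file system ID and path are placeholders
efs_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="EFS",
    directory_path="/abalone-data",
    file_system_access_mode="ro",
)

# hypothetical FSx for Lustre input; the directory path includes the mount name
fsx_input = FileSystemInput(
    file_system_id="fs-0fedcba9876543210",
    file_system_type="FSxLustre",
    directory_path="/fsx/abalone-data",
    file_system_access_mode="ro",
)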
At the end of the day there are numerous factors to consider when deciding between the different data source options for SageMaker Training. The two major ones are the size of the dataset and how you can shard it; these factors, combined with where your dataset currently lives, will help you make the right choice to optimize training time. For a more comprehensive guide, please reference this article on data sources with SageMaker Training.
Dataset Creation
For this example we'll utilize the Abalone dataset and run the SageMaker XGBoost algorithm on it as a regression model. You can download the dataset from the publicly available Amazon sample datasets.
# retrieve data
aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv .
The dataset itself is only about 100 KB, so we need to make numerous copies of it to create a 1 TB dataset. For this dataset preparation I utilized an EC2 instance (r6a.48xlarge) for development; this high-memory, high-compute instance allows for quick preparation of the dataset. Once set up, we run the following script to blow the dataset up into a larger 100 MB file. You can splice your actual data as needed; this is not a set recipe or size that must be followed.
import pandas as pd
import sys

# ~110KB initial file
df = pd.read_csv("abalone_dataset1_train.csv")
print(sys.getsizeof(df))

# concatenate 700 copies to create a ~104MB file
df_larger = pd.concat([df] * 700, ignore_index=True)
print(sys.getsizeof(df_larger))
df_larger.to_csv("abalone-100mb.csv")
With the 100 MB file we can make 10,000 copies to create our 1 TB dataset. We then upload these 10,000 copies to S3, which serves as our data source for Fast File Mode.
%%sh
# replace with your S3 bucket to upload to
s3_bucket='sagemaker-us-east-1-474422712127'

# 10,000 copies of the ~100MB file add up to roughly 1TB
for i in {0..9999}
do
  aws s3 cp abalone-100mb.csv s3://$s3_bucket/xgboost-1TB/abalone-$i.csv
done
This script takes about two hours to run; if you'd like to speed up the operation, you can parallelize the upload with some multiprocessing or multithreading in Python using Boto3.
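For reference, a rough sketch of what that parallelized upload could look like with Boto3 and a thread pool is below. The bucket name is a placeholder and the worker count should be tuned to the machine you're running on.
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
bucket = "sagemaker-us-east-1-474422712127"  # replace with your bucket

def upload_copy(i):
    # each copy gets its own key so the files can later be sharded across instances
    s3.upload_file("abalone-100mb.csv", bucket, f"xgboost-1TB/abalone-{i}.csv")

# tune max_workers to your instance's resources
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(upload_copy, range(10000)))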
Training Setup
Before we can run a SageMaker Training Job, we need to set up the proper clients and configuration. Here we specifically define our training instance type; you can find an extensive list of options at the following page. When choosing an instance type, consider the type of model you're dealing with and the domain you're in. For NLP and Computer Vision use cases, GPU instances have generally proven to be a better fit; in this case we use a large CPU instance with ample memory for our XGBoost algorithm.
import boto3
import sagemaker
from sagemaker.estimator import Estimator
boto_session = boto3.session.Session()
region = boto_session.region_name
sagemaker_session = sagemaker.Session()
base_job_prefix = 'xgboost-example'
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix
training_instance_type = 'ml.m5.24xlarge'
Next we prepare our TrainingInput. Here we specify that we're utilizing Fast File Mode (otherwise it defaults to File Mode), and we set the distribution to "ShardedByS3Key", which indicates that we want to spread the different S3 files across the instances. Otherwise a copy of every data file is loaded onto each and every instance, leading to a much longer training time.
from sagemaker.inputs import TrainingInput
#replace with your S3 Bucket with data
training_path = 's3://sagemaker-us-east-1-474422712127/xgboost-1TB/'
# set distribution to ShardedByS3Key, otherwise a copy of every file is loaded onto each instance
# we also enable FastFile mode here, whereas the default is File mode
train_input = TrainingInput(
    training_path,
    content_type="text/csv",
    input_mode="FastFile",
    distribution="ShardedByS3Key",
)
We then prepare our XGBoost estimator; to get a deeper understanding of the algorithm and its available hyperparameters, please reference this article. The other key point is that we specify an instance count of 25 (note that you may need to request a limit increase depending on the instance type). Our 10,000 data files will be distributed across these 25 ml.m5.24xlarge instances. Once we specify a count greater than one, SageMaker runs the built-in algorithm in a distributed, data parallel fashion.
model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=25,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)
We can then kick off a training job by fitting the algorithm on the training input.
xgb_train.fit({'train': train_input})
With our current setup this training job takes approximately 11 hours to complete.
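Once the job finishes you can verify the actual (and billable) run time yourself. Below is a small sketch using Boto3's describe_training_job call, assuming the estimator from the cells above; both time fields are populated once the job completes.
import boto3

sm_client = boto3.client("sagemaker")
job_name = xgb_train.latest_training_job.name  # job created by the fit() call above
desc = sm_client.describe_training_job(TrainingJobName=job_name)

print("Training time (s):", desc["TrainingTimeInSeconds"])
print("Billable time (s):", desc["BillableTimeInSeconds"])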


How can we further optimize?
We can tune this training time in a few different ways. One option is simply scaling horizontally by increasing the instance count. Another is going the GPU instance route, which may require a smaller instance count, but this isn't always a straightforward win and depends on the workload.
Outside of tuning the hardware behind the training job, we can revisit the data source format we discussed earlier. You can evaluate FSx for Lustre, which can scale to hundreds of GB/s of throughput, or shard the dataset differently to try various combinations of file count and file size.
Pricing
For SageMaker Training Jobs you can estimate cost at the following page. In essence, training jobs are billed based on instance type and duration. An ml.m5.24xlarge comes out to $5.53 per hour; with 25 instances and a run time of 11 hours, this training job comes out to approximately $1,500, so please keep this in mind if you run the example. You can further reduce this by testing different instances on smaller subsets of the data to estimate an approximate training time before training on your entire corpus.
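As a quick back-of-the-envelope check of that number (the hourly rate is the on-demand training price quoted above and can vary by region):
instance_count = 25
hours = 11
hourly_rate = 5.53  # USD per instance-hour for ml.m5.24xlarge (approximate, region-dependent)

print(f"Estimated cost: ${instance_count * hours * hourly_rate:,.2f}")  # roughly $1,520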
Credits/Additional Resources
- https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
- https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html
Conclusion
GitHub – RamVegiraju/distributed-xgboost-sagemaker: Example of training XGBoost algorithm on a 1TB…
You can find the entire code for the example above. SageMaker Distributed Training offers the ability to train at scale. I also encourage you to explore Model Parallel which, as the name indicates, handles model parallelism across multiple instances. I hope this article was a useful introduction to SageMaker Training and its distributed capabilities; please feel free to follow up with any questions.
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you're new to Medium, sign up using my Membership Referral.