PyTorch Model Performance Analysis and Optimization

Training deep learning models, especially large ones, can be a costly endeavor. One of the main methods we have at our disposal for managing these costs is performance optimization. Performance optimization is an iterative process in which we consistently search for opportunities to increase the performance of our application and then take advantage of those opportunities. In previous posts (e.g., here) we have stressed the importance of having appropriate tools for conducting this analysis. The tools of choice will likely depend on a number of factors, including the type of training accelerator (e.g., GPU, HPU, or other) and the training framework.

The focus in this post will be on training in PyTorch on GPU. More specifically, we will focus on PyTorch's built-in performance analyzer, PyTorch Profiler, and on one of the ways to view its results, the PyTorch Profiler TensorBoard plugin.
This post is not meant to be a replacement for the official PyTorch documentation on either PyTorch Profiler or the use of the TensorBoard plugin for analyzing the profiler results. Our intention is rather to demonstrate how these tools might be used during the course of one's daily development. In fact, if you haven't already, we recommend that you take a look over the official documentation before reading this post.
For a while, I have been intrigued by one portion of the TensorBoard-plugin tutorial in particular. The tutorial introduces a classification model (based on the ResNet architecture) that is trained on the popular CIFAR10 dataset. It proceeds to demonstrate how PyTorch Profiler and the TensorBoard plugin can be used to identify and fix a bottleneck in the data loader. Performance bottlenecks in the input data pipeline are not uncommon, and we have discussed them at length in some of our previous posts (e.g., here). What is surprising about the tutorial is the final (post-optimization) results that are presented (as of the time of this writing), which we have pasted in below:

If you look closely, you will see that the post-optimization GPU utilization is 40.46%. Now there is no way to sugarcoat this: These results are absolutely abysmal and should keep you up at night. As we have expanded on in the past (e.g., here), the GPU is the most expensive resource in our training machine and our goal should be to maximize its utilization. A 40.46% utilization result usually represents a significant opportunity for training acceleration and cost savings. Surely, we can do better! In this blog post we will try to do better. We will start by attempting to reproduce the results presented in the official tutorial and see whether we can use the same tools to further improve the training performance.
Toy Example
The code block below contains the training loop defined by the TensorBoard-plugin tutorial, with two minor modifications:
- We use a fake dataset with the same properties and behaviors as the CIFAR10 dataset that was used in the tutorial. The motivation for this change can be found here.
- We initialize the torch.profiler.schedule with the warmup flag set to 4 and the repeat flag set to 1. We found that this slight increase in the number of warm-up steps improves the stability of the profiling results.
import numpy as np
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
from torchvision.datasets.vision import VisionDataset
from PIL import Image


class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0, high=256, size=(10000, 32, 32, 3), dtype=np.uint8)
        self.targets = np.random.randint(low=0, high=10, size=(10000,), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)


transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True)

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()


# train step
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 4 + 3) * 1:
            break
        train(batch_data)
        prof.step()  # Need to call this at the end of each step
The GPU that was used in the tutorial was a Tesla V100-DGXS-32GB. In this post we attempt to reproduce, and improve on, the performance results from the tutorial using an Amazon EC2 p3.2xlarge instance that contains a Tesla V100-SXM2-16GB GPU. Although they share the same architecture, there are some differences between the two GPUs which you can learn about here. We ran the training script using an AWS PyTorch 2.0 Docker image. The performance results of the training script, as displayed in the overview page of the TensorBoard viewer, are captured in the image below:

We first note that, contrary to the tutorial, the Overview page in our experiment (torch-tb-profiler version 0.4.1) combined the three profiling steps into one. Thus, the true average step time is 80 milliseconds, not the 240 milliseconds that are reported. This can be seen clearly in the Trace view (which, in our experience, almost always provides a more accurate report), where each step takes ~80 milliseconds.

Note that our starting point of 31.65% GPU utilization and a step time of 80 milliseconds differs from the starting point presented in the tutorial of 23.54% and 132 milliseconds, respectively. This is likely a result of differences in the training environment, including the GPU type and the PyTorch version. We also note that while the tutorial baseline results clearly diagnose the performance issue as a bottleneck in the DataLoader, our results do not. We have often found that data loading bottlenecks will disguise themselves as a high percentage of "CPU Exec" or "Other" in the Overview tab.
Optimization #1: Multi-process Data Loading
Let's start by applying the multi-process data loading described in the tutorial. Since the Amazon EC2 p3.2xlarge instance has 8 vCPUs, we set the number of DataLoader workers to 8 for maximum performance:
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8)
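The value 8 is tied to this particular instance type. If you run the same script on a different machine, one way to pick a starting value is to derive it from the number of CPUs visible to the process, as in the sketch below (the num_data_workers name and the cap of 8 are our own illustrative choices, not part of the tutorial); the optimal value is still best found empirically:
import os

# illustrative heuristic (an assumption, not from the tutorial): start with one
# DataLoader worker per visible CPU, capped here at 8, and tune from there
num_data_workers = min(8, os.cpu_count() or 1)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True,
                                           num_workers=num_data_workers)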
The results of this optimization are displayed below:

This single-line code change increased the GPU utilization by more than a factor of two (from 31.65% to 72.81%) and more than halved our training step time (from 80 milliseconds down to 37).
This is where the optimization process in the tutorial comes to an end. Although our GPU utilization (72.81%) is quite a bit higher than the result in the tutorial (40.46%), I have no doubt that, like us, you find these results to still be quite unsatisfactory.
Personal commentary that you should feel free to skip: Imagine how much global money could be saved if PyTorch applied multi-process data loading by default when training on GPU! True, there may be some unwanted side effects to using multiprocessing. Nevertheless, some form of auto-detection algorithm could presumably rule out the presence of potentially problematic scenarios and apply this optimization where it is safe to do so.
Optimization #2: Memory Pinning
If we analyze the Trace view of our last experiment, we can see that a significant amount of time (10 out of 37 milliseconds) is still spent on loading the training data into the GPU.

To address this, we will apply another PyTorch-recommended optimization for streamlining the data input flow: memory pinning. Using pinned memory can increase the speed of host-to-GPU data copies and, more importantly, allows us to make them asynchronous. This means that we can load the next training batch into the GPU in parallel with running the training step on the current batch. It is important to note that although asynchronous execution will generally increase performance, it can also reduce the accuracy of time measurements. For the purposes of this blog post we will continue to use the measurements reported by PyTorch Profiler. See here for instructions on how to obtain precise measurements. For additional details on memory pinning and its side effects, please see the PyTorch documentation.
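As an aside, below is a minimal sketch of what such a precise measurement might look like, using CUDA events with an explicit synchronization so that asynchronous copies and kernels are fully accounted for. The train and batch_data names come from the training loop above; the rest is our own illustration, not part of the tutorial:
# time a single training step in the presence of asynchronous GPU execution
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
train(batch_data)
end_event.record()

# wait for all queued GPU work (including async copies) to finish before measuring
torch.cuda.synchronize()
step_time_ms = start_event.elapsed_time(end_event)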
This memory-pinning optimization requires changes to two lines of code. First, we set the pin_memory flag of the DataLoader to True.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)
Then we modify the host-to-device memory transfer (in the train function) to be non-blocking:
inputs, labels = data[0].to(device=device, non_blocking=True), \
                 data[1].to(device=device, non_blocking=True)
The results of the memory pinning optimization are displayed below:

Our GPU utilization now stands at a respectable 92.37%, and our step time has further decreased. But we can still do better. Note that despite this optimization, the performance report continues to indicate that we are spending a lot of time copying data into the GPU. We will come back to this in optimization #4 below.
Optimization #3: Increase Batch Size
For our next optimization, we turn our attention to the Memory View of the last experiment:

The chart shows that out of 16 GB of GPU memory, we are peaking at less than 1 GB of utilization. This is an extreme example of resource under-utilization that often (though not always) indicates an opportunity to boost performance. One way to increase the memory utilization is to increase the batch size. In the image below we display the performance results when we increase the batch size to 512 (and the memory utilization to 11.3 GB).

Although the GPU utilization measure did not change much, our training speed has increased considerably, from 1200 samples per second (46 milliseconds for batch size 32) to 1584 samples per second (324 milliseconds for batch size 512).
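For reference, the only code change relative to the previous experiment is the batch_size argument of the DataLoader (a sketch that simply mirrors the settings from the previous optimizations):
train_loader = torch.utils.data.DataLoader(train_set, batch_size=512,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)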
Caution: Contrary to our previous optimizations, increasing the batch size could have an impact on the behavior of your training application. Different models exhibit different levels of sensitivity to a change in batch size. Some may require nothing more than some tuning to the optimizer settings. For others, adjusting to a large batch size may be more difficult or even impossible. See this previous post for some of the challenges involved in training on large batches.
Optimization #4: Reduce Host to Device Copy
You probably noticed the big red eyesore representing the host-to-device data copy in the pie chart from our previous results. The most direct way of trying to address this kind of bottleneck is to see if we can reduce the amount of data in each batch. Notice that in the case of our image input, we convert the data type from an 8-bit unsigned integer to a 32-bit float and apply normalization before performing the data copy. In the code block below, we propose a change to the input data flow in which we delay the data type conversion and normalization until the data is on the GPU:
# maintain the image input as an 8-bit uint8 tensor
transform = T.Compose(
    [T.Resize(224),
     T.PILToTensor()
     ])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1024,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()


# train step
def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    # convert to float32 and normalize
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
As a result of this change the amount of data being copied from the CPU to the GPU is reduced by 4x and the red eyesore virtually disappears:

We now stand at a new high of 97.51%(!!) GPU utilization and a training speed of 1670 samples per second! Let's see what else we can do.
Optimization #5: Set Gradients to None
At this stage we appear to be fully utilizing the GPU, but that doesn't mean that we can't utilize it more effectively. One popular optimization that is said to reduce memory operations in the GPU is to set the model parameters' gradients to None rather than zero in each training step. Please see the PyTorch documentation for more details on this optimization. All that is required to implement this optimization is to set the set_to_none flag of the optimizer.zero_grad call to True:
optimizer.zero_grad(set_to_none=True)
In our case this optimization did not boost our performance in any meaningful way.
Optimization #6: Automatic Mixed Precision
The GPU Kernel View displays the amount of time that the GPU kernels were active and can be a helpful resource for improving GPU utilization:

One of the most glaring details in this report is the lack of use of the GPU [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/). Available on relatively newer GPU architectures, Tensor Cores are dedicated processing units for matrix multiplication that can boost AI application performance significantly. Their lack of use may represent a major opportunity for optimization.
Since Tensor Cores are specifically designed for mixed-precision computing, one straightforward way to increase their utilization is to modify our model to use Automatic Mixed Precision (AMP). In AMP mode, portions of the model are automatically cast to lower-precision 16-bit floats and run on the GPU Tensor Cores.
Importantly, note that a full implementation of AMP may require gradient scaling, which we do not include in our demonstration. Be sure to see the documentation on mixed precision training before adopting it.
The modification to the training step required to enable AMP is demonstrated in the code block below.
def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    # Note - torch.cuda.amp.GradScaler() may be required
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
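For completeness, here is a sketch of what the same training step might look like with gradient scaling added via torch.cuda.amp.GradScaler. This follows the standard AMP recipe rather than the exact code we profiled in this post:
scaler = torch.cuda.amp.GradScaler()

def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()  # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)         # unscale gradients and apply the optimizer update
    scaler.update()                # adjust the scale factor for the next step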
The impact on Tensor Core utilization is displayed in the image below. Although it continues to indicate opportunity for further improvement, with just one line of code the utilization jumped from 0% to 26.3%.

In addition to increasing Tensor Core utilization, using AMP lowers GPU memory utilization, freeing up more space to increase the batch size. The image below captures the training performance results following the AMP optimization with the batch size set to 1024:

Although the GPU utilization has slightly decreased, our primary throughput metric has further increased by nearly 50%, from 1670 samples per second to 2477. We are on a roll!
Caution: Lowering the precision of portions of your model could have a meaningful effect on its convergence. As in the case of increasing the batch size (see above), the impact of using mixed precision will vary per model. In some cases, AMP will work with little to no effort. Other times you might need to work a bit harder to tune the gradient scaler. Still other times you might need to set the precision types of different portions of the model explicitly (i.e., manual mixed precision).
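As an illustration of that last option, the sketch below keeps one portion of the computation (the loss, chosen here purely for the sake of the example) in float32 by nesting a disabled autocast region. This is our own illustration and not something we applied in the experiments above:
with torch.autocast(device_type='cuda', dtype=torch.float16):
    outputs = model(inputs)  # runs in mixed precision on the Tensor Cores
    with torch.autocast(device_type='cuda', enabled=False):
        # opt this portion out of autocast and compute it in full precision
        loss = criterion(outputs.float(), labels)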
For more details on using mixed precision as a method for memory optimization please see our previous blog post on the topic.
Optimization #7: Train in Graph Mode
The final optimization we will apply is model compilation. Contrary to the default PyTorch eager-execution mode in which each PyTorch operation is run "eagerly", the compile API converts your model into an intermediate computation graph which it then compiles into low-level compute kernels in a manner that is optimal for the underlying training accelerator. For more on model compilation in PyTorch 2, check out our previous post on the topic.
The following code block demonstrates the change required to apply model compilation:
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
model = torch.compile(model)
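Note that torch.compile also accepts optional settings that may be worth experimenting with; for example, its mode argument trades longer compilation time for potentially faster kernels. This is a variation we did not benchmark in this post:
# an optional variation we did not benchmark here
model = torch.compile(model, mode="max-autotune")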
The results of the model compilation optimization are displayed below:

Model compilation further increases our throughput to 3268 samples per second compared to 2477 in the previous experiment, an additional 32% (!!) boost in performance.
The manner in which graph compilation changes the training step is very evident in the different views of the TensorBoard plugin. The Kernel View, for example, indicates the use of new (fused) GPU kernels, and the Trace View (shown below) displays a wholly different pattern than what we saw previously.

Interim Results
In the table below we summarize the results of the successive optimizations we have applied.

By applying our iterative approach of analysis and optimization using PyTorch Profiler and the TensorBoard plugin, we were able to boost training throughput to 817% of our baseline, an improvement of more than 8x!!
Is our work complete? Absolutely not! Each optimization that we implement uncovers new potential opportunities for performance improvement. These opportunities are presented in the form of resources being freed up (e.g., the way in which moving to mixed precision enabled us to increase the batch size) or in the form of newly uncovered performance bottlenecks (e.g., the way in which our final optimization uncovered a bottleneck in host-to-device data transfer). Furthermore, there are many other well-known forms of optimization that we did not attempt in this post (e.g., see [here](https://pytorch.org/docs/stable/notes/cuda.html) and here). And lastly, new library optimizations (e.g., the model compilation feature that we demonstrated in optimization #7) are released all the time, further enabling our performance improvement objectives. As we emphasized in the introduction, to fully leverage such opportunities, performance optimization must be an iterative and consistent part of your development workflow.
Summary
In this post we have demonstrated the significant potential of performance optimization on a toy classification model. Although there are other performance analyzers that you can use, each with their pros and cons, we chose PyTorch Profiler and the TensorBoard plugin due to their ease of integration.
We should emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations that we chose, and the order in which we chose to apply them, were somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.
Performance optimization of machine learning workloads is sometimes viewed as secondary, non-critical, and odious. I hope that we have succeeded in convincing you that the potential for savings in development time and cost warrants a meaningful investment in performance analysis and optimization. And, hey, you might even find it to be fun :).
What Next?
This was just the tip of the iceberg. There is a lot more to performance optimization than we have covered here. In a sequel to this post, we will dive into a performance issue that is quite common in PyTorch models in which portions of the computation are run on the CPU rather than the GPU, often unbeknownst to the developer. We also encourage you to check out our other posts on Medium, many of which cover different elements of performance optimization of machine learning workloads.