Matrix Multiplication on GPU

This blog came from a sudden realisation of how little I knew about how matrix multiplication works on the GPU. Having done so many ML projects, I feel like I ought to understand how the most important operation in ML works: What is this "Tensor Core" thing? Why does everyone say "data movement is the bottleneck"? How fast can GPUs actually go?

To answer these questions, I decided that I must go out of my PyTorch bubble and venture into the abyss of CUDA. I wrote this blog to document all that I have learnt, and hopefully anyone reading this won't have to go through the pain of digging through CUDA docs/code as I did.

If there is anything that I've learnt in this journey, it is that concurrent matrix multiplication is HARD. Efficient matrix multiplication depends heavily on the specific hardware you are using and the problem size you are trying to solve. There is no one-size-fits-all solution.

Enough rambling, let's dig in!

Recap on GPU architecture

Let's remind ourselves how (NVIDIA) GPUs work. A GPU achieves parallelism by running many threads. Each thread is executed on a single CUDA core, though at any given time only a subset of the threads are active, so there can be many more threads than available CUDA cores. Each thread, whether active or not, has its own set of registers.

A group of 32 threads is known as a warp. All threads in a warp must execute together (or be inactive together). In most cases, there are a lot more inactive warps than active warps, and the warp scheduler is responsible for choosing which warps to execute at a given time. This allows the GPU to hide the latency of memory accesses by scheduling other warps to execute while a warp is waiting for data.

A group of warps is known as a threadblock. All warps in a threadblock are executed in the same Streaming Multiprocessor (SM). Each threadblock has its own shared memory that can be accessed by all threads in the threadblock.
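To make this hierarchy concrete, here is a minimal CUDA sketch (the kernel, tile size, and data are made up purely for illustration) showing where threads, threadblocks, and shared memory appear in code:

```cuda
#include <cuda_runtime.h>

// Minimal sketch (illustrative names and sizes): each threadblock stages a tile of data
// in shared memory, then every thread processes its own element.
#define TILE 32

__global__ void scale_kernel(const float *in, float *out, int n, float alpha) {
    __shared__ float tile[TILE];                    // visible to all threads in this threadblock
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread

    if (i < n) tile[threadIdx.x] = in[i];           // each thread loads one element
    __syncthreads();                                // wait for the whole threadblock
    if (i < n) out[i] = alpha * tile[threadIdx.x];  // each thread writes one result
}

// Launch: a grid of threadblocks with TILE (= one warp of 32) threads each, e.g.
// scale_kernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, alpha);
```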

Note: Newer architectures

From the Volta architecture onwards, each thread also has its own program counter, call stack, etc. This means that threads in a warp can execute different instructions at the same time.

The Volta architecture also introduced Tensor Cores that are specialised for solving matrix multiplications of specific sizes. Each active warp has access to one Tensor Core.
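CUDA exposes Tensor Cores at the warp level through the `nvcuda::wmma` API. The sketch below is a minimal example that multiplies a single 16×16×16 half-precision tile; the kernel name and the choice of tile size/layout are assumptions for illustration:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: one warp multiplies a single 16x16x16 tile on a Tensor Core via the WMMA API.
// A and B are fp16, the accumulator is fp32; leading dimensions assume densely packed tiles.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C tile starts at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // the whole warp loads the A tile
    wmma::load_matrix_sync(b_frag, B, 16);           // ... and the B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on the Tensor Core
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```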

In the newest Hopper architecture, there is a concept of threadblock clusters that represent a group of threadblocks. It gives the user more fine-grained control over the scheduling of threadblocks and allows the shared memory of one threadblock to be accessed by other threadblocks in the same cluster.
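As a rough sketch of what this looks like with the cooperative groups cluster API (CUDA 12+ on Hopper), the example below lets each block in a two-block cluster read its neighbour's shared memory; the kernel name, cluster size, and buffer sizes are assumptions for illustration:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative sketch: a cluster of two threadblocks, each reading the shared memory of
// its peer ("distributed shared memory"). Launch with 32 threads per block.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int *out) {
    cg::cluster_group cluster = cg::this_cluster();
    __shared__ int smem[32];

    smem[threadIdx.x] = blockIdx.x;      // each block fills its own shared memory
    cluster.sync();                      // make it visible across the cluster

    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(smem, peer);  // pointer into the peer's shared memory
    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];
    cluster.sync();                      // keep the peer's memory alive until reads finish
}
```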

Parallelising matrix multiplication

Suppose we want to compute:

$$C = AB, \qquad A \in \mathbb{R}^{M \times K}, \quad B \in \mathbb{R}^{K \times N}, \quad C \in \mathbb{R}^{M \times N}$$

We say that the problem size is (M, N, K) in this case. To parallelise this operation, we can split A and B into smaller matrices, matrix multiply them individually, and concatenate the results to form C.

Specifically, we can partition A row-wise (i.e., split M into chunks of size M') and B column-wise (i.e., split N into chunks of size N') to give:

$$C_{ij} = A_i B_j, \qquad A_i \in \mathbb{R}^{M' \times K}, \quad B_j \in \mathbb{R}^{K \times N'}, \quad C_{ij} \in \mathbb{R}^{M' \times N'}$$

We can see that each sub-matrix of C is independent of the others, so we can easily parallelise the computation of each sub-matrix.
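One natural CUDA mapping (one of several possible ones; the names and sizes below are made up) is to assign each output tile to one threadblock, so the grid simply enumerates the tiles. A sketch of such a launch configuration, with the kernel itself left to the next section:

```cuda
#include <cuda_runtime.h>

// Hypothetical launch configuration: one threadblock per (BLOCK_M x BLOCK_N) tile of C.
// BLOCK_M and BLOCK_N play the role of M' and N' above.
constexpr int BLOCK_M = 128;
constexpr int BLOCK_N = 128;

void launch_matmul(const float *A, const float *B, float *C, int M, int N, int K) {
    dim3 grid((N + BLOCK_N - 1) / BLOCK_N,   // one block per column-chunk of B
              (M + BLOCK_M - 1) / BLOCK_M);  // one block per row-chunk of A
    dim3 block(256);                         // 8 warps cooperating on one C tile
    // matmul_kernel<<<grid, block>>>(A, B, C, M, N, K);   // hypothetical kernel
}
```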

In practice, K might be too large to load into memory and compute on directly. Instead, a typical implementation will also split K into chunks of size K', iterate over each chunk, and accumulate (by summing) the partial results. This is known as serial-K reduction (as opposed to parallel-K reduction, see the section below). Mathematically, this looks like:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}, \qquad A_{ik} \in \mathbb{R}^{M' \times K'}, \quad B_{kj} \in \mathbb{R}^{K' \times N'}$$

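To make this concrete, here is a textbook-style CUDA sketch combining the tiling above with the serial-K loop; the tile size, kernel name, and row-major layouts are assumptions for illustration, not a tuned implementation:

```cuda
#include <cuda_runtime.h>

// Sketch of serial-K reduction: each threadblock owns one TILE x TILE block of C, loops
// over K in chunks of TILE, and accumulates partial products in registers. A, B, C are
// assumed row-major; launch with dim3(TILE, TILE) threads per block and a grid of
// ceil(N/TILE) x ceil(M/TILE) blocks.
#define TILE 16

__global__ void tiled_matmul(const float *A, const float *B, float *C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];   // current K-chunk of this block's rows of A
    __shared__ float Bs[TILE][TILE];   // current K-chunk of this block's columns of B

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread produces
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread produces
    float acc = 0.0f;                            // running sum over all K-chunks

    for (int k0 = 0; k0 < K; k0 += TILE) {       // serial loop over the K dimension
        // stage one chunk in shared memory, zero-padding out-of-bounds elements
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < N && k0 + threadIdx.y < K) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)           // accumulate this chunk's partial product
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;  // write the fully reduced result once
}
```

Note how the accumulator lives in registers and is only written back to C once, after all the K-chunks have been summed.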
Note: Padding

At any point where the problem size is not divisible by the partition size, we need to add padding. This is typically done implicitly when we load the partitioned inputs into shared memory: out-of-bounds elements are padded with zeros so that they do not contribute to the result (see the guarded loads in the sketch above).
