Metal Programming in Julia


Introduction

Just last year, we were introduced to the Metal.jl Framework, a GPU backend for Apple Hardware. This is exciting news for Julia practitioners looking to leverage the full potential of their macOS M-series chips. In particular, data scientists and ML engineers can speed up their computational workflows by tapping into the parallel processing power of GPUs, resulting in faster training and inference times. The integration of Metal.jl into the Julia ecosystem signifies an important push towards aligning the language's capabilities with the continually evolving landscape of scientific computing and machine learning on Apple platforms.

In 2020, Apple began transitioning its Mac lineup from Intel-based processors to Apple Silicon, starting with the M1 chip. While this has been a historic and impressive achievement from Apple, it did come with its fair share of criticisms and issues. Since picking up my Mac Studio with its 32-core-GPU M1 Max chip, I have been looking to fully leverage the GPU and tinker with new applications. I must say, it hasn't been all fun and games. From ARM architecture compatibility issues to unsupported machine learning libraries, it has been a challenge at times to get a working environment. This is to be expected with any transition of this scale, and I remain hopeful: I have already seen major improvements across the board in stability and features.

In this article, we will preview the Metal.jl Framework in order to understand its capabilities. We will also demonstrate a practical example using Flux, a library for machine learning in Julia, with the Metal backend.

Here is the outline for the topics covered:

I. Project Setup
   i. Julia environment setup
   ii. Dependency overview

II. Leveraging the Metal API
   i. Kernel Functions
   ii. Benchmarking
   iii. Profiling

III. Working with Flux and Metal Backend
   i. Dataset overview
   ii. Simple neural network
   iii. Model evaluation

Readers who wish to follow along should have:

  1. Basic knowledge of the Julia programming language.
  2. High-level understanding of Machine Learning concepts.

Let's dive in!


PROJECT SETUP

i. Julia Environment Setup

It is considered good practice to set up an environment unique to your project. By doing so, you pin the exact versions of the packages required and make the environment easily reproducible for yourself and your team members. This is easily done in Julia, as seen below.
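For example, from the Julia REPL (a minimal sketch; the MetalDemo project name is just a placeholder for whatever your project folder is called):

julia> # press "]" to enter the package manager
(@v1.9) pkg> activate .
(MetalDemo) pkg> add Metal BenchmarkTools Flux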

ii. Dependency overview

Metal: This is the framework that makes it possible to program GPUs on macOS. As the contributors note, the package is a work in progress: there are still bugs, missing functionality, and performance that hasn't been fully optimized. Please ensure you have also met the following system requirements:

✔ Mac device with an M-series chip
✔ Julia 1.8–1.10
✔ macOS 13 (Ventura) or 14 (Sonoma)

BenchmarkTools: We will be using this library to execute benchmarks for some of the operations we send to our GPU via the Metal APIs. This package makes it easy to configure, execute, and analyze results.

Flux: Flux is an intuitive machine learning library for Julia; it is designed to provide a high-level, user-friendly interface for building and training neural networks. We will use it for the example, with the Metal backend driving our GPU.

Below are the versions for the dependencies at the time of this article.

julia> # press "]" to get into the package manager
(jl_jQfFj6) pkg> status
Status `/private/var/folders/33/mcgc5pgd2ls6yp3ggq6rszvc0000gn/T/jl_jQfFj6/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [587475ba] Flux v0.14.6
  [dde4c033] Metal v0.5.1

With our environment configured and a high-level understanding of the libraries in use, let's explore the Metal API.

LEVERAGING THE METAL API

Metal.jl interfaces with Apple's Metal graphics API, a low-level API developed by Apple for its various platforms (macOS, iOS, watchOS, …).

This gives users direct control over the GPU (Graphics Processing Unit) for tasks such as rendering graphics and parallel computations. Let's look at a basic example.

i. Kernel Functions

In the context of the Apple Metal framework, a kernel function is a special type of function that executes on the GPU. Natively, such functions are written in a shading language, the Metal Shading Language (MSL); with Metal.jl, we write them in plain Julia and they are compiled down to native Metal code.

"MSL allows users to write a shader program, which is graphics and data-parallel compute code that runs on the GPU. Shader programs run on different programmable units of the GPU. MSL is a single, unified language that allows tighter integration between the graphics and compute programs." ³

Prior to getting started, let's ensure we can monitor our GPU's load. Apple has built-in GPU history. In the Activity Monitor app on your Mac, choose Window > GPU History. You should see something similar to mine:

GPU History. Spikes to the top indicate max GPU usage | Image by Author

For this example, we will create a matrix multiplication kernel. To put real load on the GPU, we will intentionally inflate the amount of work by repeating the matrix operation N = 1 million times.

We will define A as an m × n matrix, B as an n × p matrix, and C = AB as the resulting m × p product. The resulting kernel function is shown below; other than a few tweaks to the kernel function itself and the matrix variables defined at the bottom, the code is quite similar to what you would expect in plain Julia.

(Kernel function source: https://gist.github.com/lausena/6b2f227c5dac99e612d8def0ae4caab9)

The subtle differences, or additions, are as follows. The thread_position_in_grid_1d() function returns the index of the current thread within its one-dimensional grid; essentially, it tells each thread which slice of the data it should operate on. This fine-grained control over thread assignment gives the user the power to squeeze the most out of their system.

When we initialize our matrices A, B, and C, we allocate them on the GPU using Metal so that they are suitable for GPU-accelerated computation. In addition, the storage=Shared parameter indicates that the matrix data should be stored in shared memory, accessible to both the CPU and GPU. Because of this, we need to synchronize before accessing that resource:

"Ensure that all changes you schedule on either the CPU or GPU for a resource that uses shared memory complete before accessing that resource on the other processor." ²

The last thing to mention is the C_cpu variable at the end. The unsafe_wrap function creates a CPU array, C_cpu, that shares the same underlying memory as the GPU array C. This lets us move data between the CPU and GPU in order to perform further computations later on. In the example below, I will show that we can operate on the resulting matrix C by taking its inverse on the CPU after all GPU operations have completed (inv(C_cpu)).
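Putting those pieces together, here is a minimal sketch of what such a kernel might look like (a rough illustration, not the exact gist: the problem sizes are made up, the gist additionally repeats the operation N times to load the GPU, and the exact unsafe_wrap signature can vary between Metal.jl versions):

using Metal

# Each thread computes one element of C = A*B, selected via its 1-D grid index.
function matrix_multiplication_kernel(A, B, C)
    i = thread_position_in_grid_1d()
    m, n = size(A)
    p = size(B, 2)
    if i <= m * p
        row = (i - 1) % m + 1
        col = (i - 1) ÷ m + 1
        acc = zero(eltype(C))
        for k in 1:n
            acc += A[row, k] * B[k, col]
        end
        C[row, col] = acc
    end
    return
end

m, n, p = 256, 256, 256

# Allocate on the GPU; storage=Shared keeps the buffers visible to both CPU and GPU.
A = MtlArray{Float32}(undef, m, n; storage=Shared); copyto!(A, rand(Float32, m, n))
B = MtlArray{Float32}(undef, n, p; storage=Shared); copyto!(B, rand(Float32, n, p))
C = MtlArray{Float32}(undef, m, p; storage=Shared)

# Wrap the GPU buffer in a CPU array that shares the same memory.
C_cpu = unsafe_wrap(Array{Float32}, C, size(C))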

Great! Now that we have our kernel ready, let's move on to do the computation and benchmarking.

ii. Benchmarking

(Benchmarking source: https://gist.github.com/lausena/635d503bd3f533e80725e4700875c4ad)

The @benchmark macro (macros in Julia start with "@") is used to measure the performance of the GPU kernel matrix_multiplication_kernel.

We wrap our begin...end block in Metal.@sync to ensure synchronization between the CPU and GPU, so that all GPU commands have completed before execution moves on to CPU code.

Within the @metal macro, we specify both the number of threads and groups. Each thread is responsible for performing part of the computation, and this number also determines how many operations occur in parallel. For instance, if we specify 256 threads, then each thread will be responsible for a portion of the computation. Threads are organized into groups; a threadgroup is a collection of threads that can work together and coordinate their parallel execution. A sketch of the benchmarked call is shown below.
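As a rough sketch (assuming the kernel and matrices from the previous section; the thread and group counts are the ones I experimented with):

using BenchmarkTools, Metal

n_threads = 256   # threads per threadgroup
n_groups  = 256   # number of threadgroups

# Metal.@sync blocks until all queued GPU work has finished, so the measurement
# covers the kernel execution rather than just the launch overhead.
@benchmark Metal.@sync begin
    @metal threads=n_threads groups=n_groups matrix_multiplication_kernel(A, B, C)
end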

In total, the M1 GPU contains up to 128 execution units or 1024 ALUs,⁴ which Apple says can execute up to 24,576 threads simultaneously, with a maximum single-precision floating-point (FP32) performance of 2.6 TFLOPS.⁵ ⁶

Here are the results after playing around with the N (100), threads (256), and groups (256) parameters.

Benchmark Results | Image by Author

iii. Profiling

In order to profile and view the results, you must have Xcode installed.

Profiling code, an often overlooked discipline, is a crucial aspect of software development and performance optimization. Profiling involves measuring various aspects of a program's execution, such as the time taken by different functions, how frequently functions are called, and memory usage or leaks. Profiling is also useful for quantifying the performance impact of any code changes you make.

If you've experienced a scenario where code deployed to production unexpectedly causes a significant slowdown in system performance, only to discover later that a junior programmer inadvertently introduced a nested loop with quadratic time complexity O(n²), you're not alone. While nested for-loops may seem acceptable in some situations, they become increasingly problematic as the number of values grows, leading to considerable performance challenges. Catch this early; catch it with profiling!

Profiling is done in two simple steps. First, place the Metal.@profile macro in front of our @metal macro. Next, ensure you have the following environment variable set: ENV["METAL_CAPTURE_ENABLED"] = 1

Now you can execute the following line:

Metal.@profile @metal threads=n_threads groups=n_groups matrix_multiplication_kernel(A, B, C)

From there, we get a resulting julia_capture_N.gputrace stored in the same directory as the project. To interact with it, open it in Xcode and replay the trace. We are presented with various useful metrics that we can dig into.

Xcode | Image by Author

At this point, we have covered how to interact with Apple's Metal graphics API through Metal.jl, giving us direct GPU control for parallel computations. We have introduced kernel functions, written in Julia and compiled for the GPU, along with a matrix multiplication kernel that intentionally increases the GPU workload. Additionally, we have introduced benchmarking tools and profiling capabilities through Metal's macros.

Let's move on to a practical scenario!

WORKING WITH FLUX AND METAL BACKEND

i. Dataset overview

We will be using a breast cancer database obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.¹⁰ ¹¹ The Class feature will be the target; it has two possible values (2 for benign and 4 for malignant).

For brevity, I will not demonstrate the preprocessing required for the dataset; instead, I will refer the reader to Appendix I: the Julia source code.

ii. Simple neural network

We will be building a simple neural network with Flux that leverages the Metal backend. Starting with v0.14, Flux doesn't force a specific GPU backend and the corresponding package dependencies on the users.⁷ Although we are using Metal as the backend, Flux supports other backends such as AMDGPU and CUDA.

Let's do a quick sanity check to ensure our environment is set up.

Testing Metal with Flux | Image by Author
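A minimal sketch of such a check (the exact calls are assumptions on my part and may differ slightly between Flux 0.14.x releases):

using Flux, Metal

Metal.functional()               # true if the Metal GPU is usable

Flux.gpu_backend!("Metal")       # store the backend preference (takes effect after restarting Julia)
device = Flux.get_device(; verbose=true)   # handle used to move data and models to the GPU

x = device(rand(Float32, 3, 3))  # quick test: x should now be an MtlMatrix{Float32}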

Now that we have the device variable, we will use it to move both the data and the model to the GPU. Be careful here: you can run all of your logic in Flux without ever moving the model to the GPU, meaning you're still using the CPU (yikes)!

Model Definition (This will be a Logistic Regression model for our 2-class problem) | Image by Author
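As a rough sketch of that definition (assuming 9 input features from the Wisconsin dataset and a single sigmoid output; the exact layer sizes and optimiser settings here are illustrative):

using Flux

# Logistic regression as a single Dense layer with a sigmoid output,
# moved onto the Metal GPU via the device handle from above.
model = Chain(Dense(9 => 1, sigmoid)) |> device

loss(m, x, y) = Flux.binarycrossentropy(m(x), y)
opt_state = Flux.setup(Adam(0.01), model)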

To prepare our data for the run, we will leverage Flux.DataLoader. This module handles iteration over mini-batches of data. I've kept it simple for this demo with batchsize=1, which means that if my data contains 800 rows, each batch corresponds to a single row. In a more efficient setup you would use a larger batch size so that rows are processed in groups.

Before jumping into the next section, it is important to note that I faced issues passing my DataFrame directly to the DataLoader, so here are a few workarounds I had to implement.

Data Prep | Image by Author
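One possible workaround looks roughly like this (assuming df is the preprocessed DataFrame with Class as the target column; the column names and label encoding are illustrative):

using Flux, DataFrames

# Convert the DataFrame into plain Float32 arrays before handing it to the DataLoader:
# a features × samples matrix for X, and a 1 × samples row vector for the labels.
X = Float32.(permutedims(Matrix(select(df, Not(:Class)))))
y = Float32.(reshape(df.Class .== 4, 1, :))    # 1 = malignant, 0 = benign

loader = Flux.DataLoader((X, y); batchsize=1, shuffle=true)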

iii. Model evaluation

The following code illustrates the model training and evaluation process. It is crucial to move the x_cpu and y_cpu variables (our rows and labels) to the GPU as they come back from the DataLoader; failing to do so will crash with compatibility errors, since our model expects GPU data to work with.
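Here is a minimal sketch of that loop plus a simple accuracy check (assuming the loader, model, loss, and opt_state sketched above; this is illustrative rather than the exact code from the demo):

# Training: move each mini-batch to the GPU, compute gradients, update the model.
for epoch in 1:10
    for (x_cpu, y_cpu) in loader
        x, y = device(x_cpu), device(y_cpu)
        grads = Flux.gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end

# Evaluation: predictions back on the CPU, thresholded at 0.5.
ŷ = Array(model(device(X))) .>= 0.5f0
accuracy = sum(ŷ .== (y .>= 0.5f0)) / length(y)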

Demo of project running and leveraging M1-Max GPU! | Image by Author

In this brief look into the Metal.jl framework, we have covered the basics of GPU programming on Apple hardware. By exploring three core concepts (kernel functions, benchmarking, and profiling), the reader now has a high-level understanding of how to get started and is encouraged to dive deeper into the API's capabilities. Additionally, we have demonstrated a practical example that uses Flux to build a simple neural network on top of the Metal backend. If you're a data scientist or ML engineer looking to tap into the power of your Apple M-series GPU, I hope this article has served as a starting point for accelerating your computational workflows. From here, I will leave the reader with a few more resources to check out:

Welcome · Flux

I hope you enjoyed this article, thank you for reading!
