Local LLM Fine-Tuning on Mac (M1 16GB)
This article is part of a larger series on using large language models (LLMs) in practice. In a previous post, I showed how to fine-tune an LLM using a single (free) GPU on Google Colab. While that example (and many others like it) runs readily on Nvidia hardware, it is not easily adapted to M-series Macs. In this article, I walk through an easy way to fine-tune an LLM locally on a Mac.

With the rise of open-source Large Language Models (LLMs) and efficient fine-tuning methods, building custom ML solutions has never been easier. Now, anyone with a single GPU can fine-tune an LLM on their local machine.
However, Mac users have been largely left out of this trend because of Apple's M-series chips. These chips use a unified memory architecture, in which the CPU and GPU share a single pool of memory rather than relying on a discrete GPU with its own VRAM. As a result, many (GPU-centric) open-source tools for running and training LLMs are not compatible with (or don't fully utilize) modern Mac computing power.
I had almost given up on my dreams of training LLMs locally until I discovered the MLX Python library.
MLX
MLX is a Python library developed by Apple's Machine Learning research team to run matrix operations efficiently on Apple silicon. This is important because matrix operations are the core computations underlying neural networks.
The key benefit of MLX is that it fully utilizes the M-series chips' unified memory paradigm, which enables modest systems (like my M1 with 16GB of RAM) to run fine-tuning jobs on large models (e.g., Mistral 7b Instruct).
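To give a feel for the library, here is a minimal sketch of MLX's NumPy-like array API (assuming MLX is installed via `pip install mlx`). It simply multiplies two matrices, the kind of operation that dominates neural network workloads.

```python
import mlx.core as mx

# Arrays live in unified memory, so the same buffers are visible to both CPU and GPU.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

c = a @ b      # matrix multiply; MLX builds the computation lazily
mx.eval(c)     # force evaluation
print(c.shape) # (1024, 1024)
```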
While the library doesn't have high-level abstractions for training models like Hugging Face does, Apple's mlx-examples repository includes an example implementation of LoRA that can be readily hacked and adapted to other use cases (a conceptual sketch follows below).
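To make the idea concrete, here is a hypothetical, stripped-down LoRA linear layer written with MLX's neural network module. This is an illustration of the technique the example implementation builds on, not the repository's actual code: the base weight stays frozen while two small low-rank matrices are trained.

```python
import math
import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dims: int, out_dims: int, rank: int = 8):
        super().__init__()
        # Frozen base projection (the pretrained weight in a real model)
        self.linear = nn.Linear(in_dims, out_dims, bias=False)
        # Low-rank adapters: only these small matrices are updated during fine-tuning
        self.lora_a = mx.random.normal((in_dims, rank)) * (1 / math.sqrt(in_dims))
        self.lora_b = mx.zeros((rank, out_dims))

    def __call__(self, x):
        # Base output plus the low-rank update: x @ A @ B
        return self.linear(x) + (x @ self.lora_a) @ self.lora_b
```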
This is exactly what I do in the example below.
Example Code: Fine-tuning Mistral 7b Instruct
This example is similar to one from a previous article. However, instead of using Hugging Face's Transformers library and Google Colab, I will use the MLX library and my local machine (2020 Mac Mini M1 16GB).
Similar to the previous example, I will fine-tune a quantized version of Mistral-7b-Instruct to respond to YouTube comments in my likeness, using the QLoRA parameter-efficient fine-tuning method. If you are unfamiliar with QLoRA, I have an overview of the method here.
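Before getting into the fine-tuning itself, here is a short sketch of loading a quantized Mistral-7b-Instruct model and generating text with the mlx-lm package. The model repo name below is an assumption for illustration; substitute whichever 4-bit MLX conversion you actually use.

```python
from mlx_lm import load, generate

# Assumed 4-bit MLX conversion of Mistral-7b-Instruct; replace with your model path
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

# Generate a response to an example YouTube comment
prompt = "Great video. Thanks for sharing!"
reply = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(reply)
```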