Efficient Model Fine-Tuning with Bottleneck Adapter

Photo by Karolina Grabowska: https://www.pexels.com/photo/set-of-modern-port-adapters-on-black-surface-4219861/

Fine-tuning is one of the most common things that we can do to gain better performance from a Deep Learning model on our specific task. The time we need to fine-tune a model normally corresponds to its size: the bigger the size of the model, the longer the time needed to fine-tune it.

I think we can agree that deep learning models, such as Transformer-based models, are becoming increasingly sophisticated nowadays. Overall this is a good thing to see, but it comes with a caveat: they tend to have a huge number of parameters. Fine-tuning large models is therefore becoming harder to manage, and we need a more efficient way to do it.

In this article, we're going to discuss one of several efficient fine-tuning methods called the bottleneck adapter. Although you can apply this method to any deep learning model, we'll focus our attention on its application to Transformer-based models.

The structure of this article is as follows: first, we're going to do a normal fine-tuning of a BERT model on a specific dataset. Then, we will insert some bottleneck adapters into our BERT model with the help of the adapter-transformers library to see how they can make the fine-tuning process more efficient.

Before we fine-tune the model, let's start with the dataset that we're going to use.


About the Dataset

The dataset we're about to use contains different types of text related to mental health collected from Reddit (licensed under CC-BY-4.0). The dataset is suitable for text classification tasks, where we predict whether any given text carries a depressive sentiment or not. Let's take a look at a sample of it.

!pip install datasets

from datasets import load_dataset

dataset = load_dataset("mrjunos/depression-reddit-cleaned")
print(dataset['train'][2])

'''
{'text': 'anyone else instead of sleeping more when depressed stay up all night to avoid the next day from coming sooner may be the social anxiety in me but life is so much more peaceful when everyone else is asleep and not expecting thing of you',
 'label': 1}
'''

As you can see, the dataset is very straightforward as we only have two fields: one for the text and another for the label. The label itself contains only two possible values: 1 if the text has a depressive sentiment and 0 otherwise. Our task is to fine-tune a pretrained BERT model to predict the sentiment of each text.

In total, there are 7,731 texts, and we're going to use 6,500 of them for training and the remaining 1,231 for validation during the fine-tuning process.

Let's create a dataloader to load the dataset in batches during the fine-tuning process that we'll see in the next section:

!pip install adapter-transformers

import torch
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

class Dataset(torch.utils.data.Dataset):

    def __init__(self, input_data):
        self.labels = [data for data in input_data['label']]
        self.texts = [tokenizer(data,
                               padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for data in input_data['text']]

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y
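
As a quick, optional sanity check (assuming the dataset loaded earlier), we can wrap a handful of samples in a DataLoader and inspect the shape of a single batch:

# Optional sanity check: wrap a few samples in a DataLoader
# and inspect the first batch.
sample_data = Dataset(dataset['train'][:4])
sample_loader = torch.utils.data.DataLoader(sample_data, batch_size=2)

texts, labels = next(iter(sample_loader))
print(texts['input_ids'].shape)  # torch.Size([2, 1, 512])
print(labels)                    # tensor containing the two labels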

Now that we have our data, we can start to talk about the main topic of this article. However, the concept behind bottleneck adapters is easier to understand if we're already familiar with the standard process of normal fine-tuning.

Thus, in the next section, we'll start with the concept of a normal fine-tuning process before expanding the concept to the application of bottleneck adapters.

We'll be using the adapter-transformers library for both normal fine-tuning and adapter-based fine-tuning. This library is a direct fork of the well-known transformers library from HuggingFace, which means it contains all of the functionality of transformers plus several additional model classes and methods that let us add adapters to models easily.

You can install adapter-transformers with the following command:

pip install adapter-transformers

Now let's start with the common procedure of a normal fine-tuning.


Normal BERT Fine-Tuning

Fine-tuning is a common technique in deep learning to gain better performance from a pretrained model on specific data and/or a specific task. The main idea is simple: we take the weights of a pretrained model, and then update those weights based on new, domain-specific data.

Normal fine-tuning process. Image by author.

The common procedure of a normal fine-tuning is as follows.

First, we pick a pretrained model, which in our case would be a BERT-base model. As a side note, we're not going to focus our attention on BERT in this article, but if you're new to BERT and want to find out more about it, you can check my article that talks about BERT here:

Text Classification with BERT in PyTorch

In a nutshell, BERT-base consists of 12 stacked Transformer encoder layers. During the fine-tuning process, we add a linear layer on top of the last layer that acts as a classifier. Since the label in our dataset has only two possible values, the output dimension of this linear layer is also two.

from torch import nn
from transformers import BertForSequenceClassification

class BertClassifier(nn.Module):

    def __init__(self, model_id='bert-base-cased', num_class=2):
        super(BertClassifier, self).__init__()
        self.bert = BertForSequenceClassification.from_pretrained(model_id, num_labels=num_class)

    def forward(self, input_id, mask):
        output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        return output

BERT architecture. Image by author.

Now that we have defined our model, we need to create the fine-tuning script. Below is the code snippet to fine-tune the model on our dataset.

from torch.optim import Adam
from tqdm import tqdm

def train(model, train_data, val_data, learning_rate, epochs):

    # Fetch training and validation data in batch
    train, val = Dataset(train_data), Dataset(val_data)
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)

    if use_cuda:
       model = model.to(device)
       criterion = criterion.to(device)

    for epoch_num in range(epochs):

        total_acc_train = 0
        total_loss_train = 0

        # Fine-tune the model
        for train_input, train_label in tqdm(train_dataloader):

            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)[0]

            batch_loss = criterion(output, train_label.long())
            total_loss_train += batch_loss.item()

            acc = (output.argmax(dim=1) == train_label).sum().item()
            total_acc_train += acc

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        total_acc_val = 0
        total_loss_val = 0

        # Validate the model
        with torch.no_grad():

            for val_input, val_label in val_dataloader:
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)[0]

                batch_loss = criterion(output, val_label.long())
                total_loss_val += batch_loss.item()
                acc = (output.argmax(dim=1) == val_label).sum().item()
                total_acc_val += acc

        print(
            f'Epochs: {epoch_num + 1} '
            f'| Train Loss: {total_loss_train / len(train): .3f} '
            f'| Train Accuracy: {total_acc_train / len(train): .3f} '
            f'| Val Loss: {total_loss_val / len(val): .3f} '
            f'| Val Accuracy: {total_acc_val / len(val): .3f}')

We will fine-tune our BERT model for 10 epochs with the learning rate set to 1e-7, as shown in the code below. I fine-tuned the model on a T4 GPU with a batch size of 2. Below is a snapshot of how the training and validation accuracy should look.

EPOCHS = 10
LR = 1e-7

model = BertClassifier()
data = dataset['train'].shuffle(seed=42)
train(model, data[:6500], data[6500:], LR, EPOCHS)

100%|███████████████████████████████████ 3250/3250 [11:56<00:00,  4.54it/s]
Epochs: 1 | Train Loss:  0.546 | Train Accuracy:  0.533 | Val Loss:  0.394 | Val Accuracy:  0.847
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 2 | Train Loss:  0.302 | Train Accuracy:  0.888 | Val Loss:  0.226 | Val Accuracy:  0.906
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 3 | Train Loss:  0.184 | Train Accuracy:  0.919 | Val Loss:  0.149 | Val Accuracy:  0.930
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 4 | Train Loss:  0.122 | Train Accuracy:  0.946 | Val Loss:  0.101 | Val Accuracy:  0.955
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 5 | Train Loss:  0.084 | Train Accuracy:  0.963 | Val Loss:  0.075 | Val Accuracy:  0.968
100%|███████████████████████████████████ 3250/3250 [11:56<00:00,  4.53it/s]
Epochs: 6 | Train Loss:  0.063 | Train Accuracy:  0.969 | Val Loss:  0.061 | Val Accuracy:  0.970
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 7 | Train Loss:  0.050 | Train Accuracy:  0.974 | Val Loss:  0.054 | Val Accuracy:  0.973
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 8 | Train Loss:  0.042 | Train Accuracy:  0.978 | Val Loss:  0.049 | Val Accuracy:  0.972
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 9 | Train Loss:  0.035 | Train Accuracy:  0.982 | Val Loss:  0.047 | Val Accuracy:  0.973
100%|███████████████████████████████████ 3250/3250 [11:57<00:00,  4.53it/s]
Epochs: 10 | Train Loss: 0.030 | Train Accuracy:  0.984 | Val Loss:  0.046 | Val Accuracy:  0.966

And that's it! We achieved a validation accuracy of 97.3% with BERT on our dataset. We can then proceed to use the fine-tuned model to make a prediction on unseen data.
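
For instance, a minimal inference sketch could look like the following; the example text is made up, and model and tokenizer are the objects defined above:

# Hypothetical unseen text; replace it with any input you like.
sample_text = "i finally went outside and met some friends today"

inputs = tokenizer(sample_text, padding='max_length', max_length=512,
                   truncation=True, return_tensors="pt")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

with torch.no_grad():
    logits = model(inputs['input_ids'].to(device),
                   inputs['attention_mask'].to(device))[0]

print(logits.argmax(dim=1).item())  # 1 = depressive sentiment, 0 = otherwise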

Overall, normal fine-tuning of a pretrained model wouldn't be a problem if the model has a small number of parameters, as is the case with the BERT model above. Let's check the total number of parameters that our BERT-base model has.

def print_trainable_parameters(model):

    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)
'''
trainable params: 108311810 || all params: 108311810 || trainable%: 100.0
'''

This model has close to 110 million parameters in total. Although that looks like a lot, it's still small compared to most Large Language Models nowadays, which can have billions of parameters. Notice also that the number of trainable parameters equals the total number of parameters of our BERT model. This means that during normal fine-tuning, we update the weights of all of the parameters of our BERT model.

With the help of a T4 GPU and the fact that our training set contains only 6,500 entries, we fortunately need only around 12 minutes per epoch to update all of the weights. Now imagine using a larger model and a larger dataset: the computation needed for normal fine-tuning would quickly become costly.

Also, normal fine-tuning carries a risk of so-called catastrophic forgetting if we're not careful with how we choose the learning rate, or when we try to fine-tune a pretrained model on several tasks/datasets. Catastrophic forgetting refers to the situation where a pretrained model 'forgets' the task it was originally trained on after we fine-tune it on a new task.

Thus, we definitely need a more efficient fine-tuning procedure. This is where parameter-efficient fine-tuning methods come in, the bottleneck adapter being one of them.


How Bottleneck Adapter Works

The main idea behind an adapter is that we introduce a small set of layers and place them inside the original architecture of a pretrained model. During the fine-tuning process, we freeze all of the parameters of the original pretrained model, so only the weights of these additional layers are updated.

A bottleneck adapter specifically consists of two ordinary feed-forward layers, with an optional normalization layer before and/or after them. One feed-forward layer downscales its input, while the other upscales it back to the original dimension. This is why the adapter is called a bottleneck adapter.

Common bottleneck adapters. Image by author.
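
To make this structure concrete, below is a minimal PyTorch sketch of a single bottleneck adapter block (not the adapter-transformers implementation): a down-projection, a non-linearity, an up-projection, and a residual connection back to the input. The hidden and bottleneck dimensions are only illustrative.

import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Minimal bottleneck adapter sketch: down-project, apply a
    non-linearity, up-project, then add a residual connection."""

    def __init__(self, hidden_dim=768, bottleneck_dim=48):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # downscale
        self.non_linearity = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # upscale

    def forward(self, hidden_states):
        residual = hidden_states
        x = self.up(self.non_linearity(self.down(hidden_states)))
        return x + residual  # residual keeps the original signal intact

# Example: a batch of 2 sequences, 512 tokens each, hidden size 768
adapter = BottleneckAdapter()
print(adapter(torch.randn(2, 512, 768)).shape)  # torch.Size([2, 512, 768])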

You can apply this adapter to any deep learning model, but as mentioned earlier, we'll focus our attention on its application to Transformer-based models.

Transformer-based models normally consist of several stacked Transformer layers. The BERT-base model that we use in this article, for example, has 12 stacked Transformer encoder layers. Each layer consists of the following components:

Transformer encoder stack. Image by author.

There are several places where we can put a bottleneck adapter inside this stack, but there are two common configurations: one proposed by Pfeiffer and another proposed by Houlsby.

The bottleneck adapter proposed by Pfeiffer is inserted after the final norm layer, while the bottleneck adapter proposed by Houlsby is inserted in two different places: one after the multi-head attention layer and another after the feed-forward layer, as you can see in the visualization below:

Difference between Pfeiffer and Houlsby adapter configuration. Image by author.

Since our BERT-base model has 12 Transformer encoder layers, we will have a total of 12 bottleneck adapters with the Pfeiffer configuration: one adapter in each layer. Meanwhile, we'll have a total of 24 bottleneck adapters with the Houlsby configuration: two adapters in each layer.

Although the Pfeiffer configuration adds fewer parameters than Houlsby's, the two configurations have been shown to perform on par with each other across 8 different tasks.
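
If you're curious how these two configurations look in the library we'll use later, you can load them by name and print them. This is only a quick sketch based on the adapter-transformers API shown later in this article; the exact default values may differ between library versions.

from transformers import AdapterConfig

# 'pfeiffer': one adapter after the feed-forward block of each layer.
# 'houlsby': adapters after both the attention and feed-forward blocks.
pfeiffer_config = AdapterConfig.load('pfeiffer')
houlsby_config = AdapterConfig.load('houlsby')

print(pfeiffer_config)
print(houlsby_config)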

Now the question is: how does this bottleneck adapter make the fine-tuning process more efficient?

As mentioned previously, we freeze the weights of the pretrained model and only update the weights of the adapters during the fine-tuning process. This means we can speed up fine-tuning by a considerable margin, as we will see in the next section. Experiments have also shown that the performance of fine-tuning with adapters is mostly comparable to normal fine-tuning.

Also, imagine the situation where we want to use the same pretrained model for two different datasets. Instead of having two copies of the same model fine-tuned on different datasets to avoid the risk of catastrophic forgetting, we can have just one model with two different adapters fine-tuned on different datasets.

Image by author.

With this approach, we save a lot of storage space. As an illustration, a single BERT-base model is about 440 MB, which translates to 880 MB if we keep two models. Meanwhile, one BERT-base model with two adapters would only be around 450 MB, since bottleneck adapters take up only a small amount of storage. A rough sketch of this multi-adapter setup is shown below.
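
The snippet below is only a sketch, assuming the adapter-transformers API introduced later in this article; the adapter paths are hypothetical:

from transformers.adapters import BertAdapterModel

shared_model = BertAdapterModel.from_pretrained('bert-base-cased')

# Hypothetical paths to adapters trained on two different datasets
adapter_a = shared_model.load_adapter('adapters/depression_reddit_dataset')
adapter_b = shared_model.load_adapter('adapters/another_text_classification_task')

# Activate whichever adapter matches the incoming task; the shared
# BERT-base weights stay frozen and untouched either way.
shared_model.set_active_adapters(adapter_a)
# ... run inference for the first task ...

shared_model.set_active_adapters(adapter_b)
# ... run inference for the second task ...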


Bottleneck Adapter Implementation

In this section, we'll implement the Pfeiffer version of the bottleneck adapter. To do this, we only need to change the model architecture script, while the fine-tuning and data-loading scripts stay the same.

Let's define the model architecture with Pfeiffer's adapter.

from transformers import AdapterConfig
from transformers.adapters import BertAdapterModel

class BertClassifierWithAdapter(nn.Module):

    def __init__(self, model_id='bert-base-cased', adapter_id='pfeiffer', 
                task_id = 'depression_reddit_dataset', num_class=2):

        super(BertClassifierWithAdapter, self).__init__()

        self.adapter_config = AdapterConfig.load(adapter_id)

        self.bert = BertAdapterModel.from_pretrained(model_id)
        # Insert adapter according to configuration
        self.bert.add_adapter(task_id, config=self.adapter_config)
        # Freeze all BERT-base weights 
        self.bert.train_adapter(task_id)
        # Add prediction layer on top of BERT-base
        self.bert.add_classification_head(task_id, num_labels=num_class)
        # Make sure that adapters and prediction layer are used during forward pass
        self.bert.set_active_adapters(task_id)

    def forward(self, input_id, mask):
        output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        return output

As you can see, it's pretty straightforward to implement the adapter version of a model:

  • Define the adapter configuration we want to apply with AdapterConfig.load('pfeiffer'). If you want to use the Houlsby configuration, just change it to 'houlsby'.
  • Insert adapters into our BERT model with the add_adapter() method. Common practice is to name adapters after the task or dataset we want to fine-tune the model on.
  • Freeze all the weights of the pretrained model with the train_adapter() method.
  • Add a linear layer that acts as a prediction head on top of our BERT model with the add_classification_head() method. Common practice is to give the prediction head the same name as the adapters.
  • Activate the adapters and the prediction head with the set_active_adapters() method to make sure they're used in every forward pass.

Now let's check the total number of parameters and the proportion of trainable parameters after the inclusion of adapters:

# Initialize model 

# task_id is the name of our adapter. You can name it whatever you want but
# common practice is to name it according to task/dataset we will train it on.
task_name = 'depression_reddit_dataset'
model_adapter = BertClassifierWithAdapter(task_id=task_name)
# Check parameters
print_trainable_parameters(model_adapter)

'''
trainable params: 1486658 || all params: 109796930 || trainable%: 1.3540068925424418
'''

The model with adapters has more parameters than our original BERT-base model, but only 1.35% of them are trainable, since we'll only update the weights of the adapters and the prediction head.
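
If you want to double-check this, you can list which parameter names still require gradients; only adapter and classification-head weights should appear:

# Optional check: only adapter and classification-head parameters
# should still require gradients.
trainable_names = [name for name, param in model_adapter.named_parameters()
                   if param.requires_grad]
print(len(trainable_names))
print(trainable_names[:5])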

Now it's time to train the model. Since the weights of our adapters are initialized randomly, we'll use a slightly higher learning rate this time. We'll also train this model for 10 epochs. If everything goes well, you should get outputs similar to the ones below:

LR = 5e-6
EPOCHS = 10
train(model_adapter, dataset['train'][:6500], dataset['train'][6500:], LR, EPOCHS)

100%|███████████████████████████████████████████ 3250/3250 [07:19<00:00,  7.40it/s]
Epochs: 1 | Train Loss:  0.183 | Train Accuracy:  0.846  | Val Loss:  0.125  | Val Accuracy:  0.897
100%|███████████████████████████████████████████ 3250/3250 [07:24<00:00,  7.32it/s]
Epochs: 2 | Train Loss:  0.096 | Train Accuracy:  0.925  | Val Loss:  0.072  | Val Accuracy:  0.946
100%|███████████████████████████████████████████ 3250/3250 [07:23<00:00,  7.32it/s]
Epochs: 3 | Train Loss:  0.060 | Train Accuracy:  0.958  | Val Loss:  0.052  | Val Accuracy:  0.962
100%|███████████████████████████████████████████ 3250/3250 [07:21<00:00,  7.37it/s]
Epochs: 4 | Train Loss:  0.044 | Train Accuracy:  0.968  | Val Loss:  0.047  | Val Accuracy:  0.971
100%|███████████████████████████████████████████ 3250/3250 [07:25<00:00,  7.30it/s]
Epochs: 5 | Train Loss:  0.038 | Train Accuracy:  0.971  | Val Loss:  0.043  | Val Accuracy:  0.973
100%|███████████████████████████████████████████ 3250/3250 [07:25<00:00,  7.29it/s]
Epochs: 6 | Train Loss:  0.034 | Train Accuracy:  0.975  | Val Loss:  0.039  | Val Accuracy:  0.971
100%|███████████████████████████████████████████ 3250/3250 [07:25<00:00,  7.29it/s]
Epochs: 7 | Train Loss:  0.032 | Train Accuracy:  0.978  | Val Loss:  0.038  | Val Accuracy:  0.972
100%|███████████████████████████████████████████ 3250/3250 [07:25<00:00,  7.29it/s]
Epochs: 8 | Train Loss:  0.029 | Train Accuracy:  0.980  | Val Loss:  0.039  | Val Accuracy:  0.974
100%|███████████████████████████████████████████ 3250/3250 [07:25<00:00,  7.29it/s]
Epochs: 9 | Train Loss:  0.027 | Train Accuracy:  0.980  | Val Loss:  0.035  | Val Accuracy:  0.971
100%|███████████████████████████████████████████ 3250/3250 [07:19<00:00,  7.40it/s]

As you can see, the performance of our model with adapters is comparable to the fully fine-tuned version of the model. Also, each epoch completes around 4.5 minutes faster than with full fine-tuning.

Now that we have trained it, we can save the adapter.

# Save trained model with adapter
model_adapter_path = 'model/bert_adapter/' 
model_adapter.bert.save_all_adapters(model_adapter_path)

And then we can load the adapter and use it for inference as follows:

def predict(data, model):

    inputs = tokenizer(data['text'],
              padding='max_length', max_length=512, truncation=True,
              return_tensors="pt")

    mask = inputs['attention_mask'].to(device)
    input_id = inputs['input_ids'].squeeze(1).to(device)
    output = model(input_id, mask)[0].argmax(dim=1).item()
    print(output)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model_path = f'{model_adapter_path}{task_name}'

# Load trained adapter
trained_adapter_model = BertClassifierWithAdapter(task_id=task_name)
adapter_name = trained_adapter_model.bert.load_adapter(model_path)
trained_adapter_model.bert.set_active_adapters(adapter_name)
trained_adapter_model.to(device)
trained_adapter_model.eval()

# Predict test data
predict(data[6900], trained_adapter_model)

Conclusion

In this article, we've seen how bottleneck adapters can help in the fine-tuning process of a large model. With bottleneck adapters, we're able to speed up fine-tuning while still maintaining the end performance of the model. These adapters also help us avoid the risk of catastrophic forgetting commonly associated with fine-tuned models. Moreover, they don't take up much storage space.

I hope this article helps you get your hands dirty with bottleneck adapters. If you want to access all of the code implemented in this article, you can check it out via this notebook.


Dataset Reference

mrjunos/depression-reddit-cleaned, collected from Reddit and shared on the Hugging Face Hub under a CC-BY-4.0 license.