An Implementation of VGG


In this post, we look at a VGG implementation and its training on the STL10 dataset [2, 3].

We reviewed the VGG architecture in a previous post, Image Classification For Beginners. Please take a look if you are unfamiliar.

In a nutshell,

VGG stands for Visual Geometry Group, a research group at the University of Oxford. In 2014, they designed a deep convolutional neural network architecture for the image classification task and named it after themselves: VGG [1].

The VGGNet comes in a few configurations, such as VGG16 (with 16 weight layers) and VGG19 (with 19 weight layers).
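These configurations are often written as compact lists, where each number is the output channel count of a 3×3 convolution and 'M' marks a 2×2 max pool; this is, for example, the convention torchvision uses internally:

cfg_vgg16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']   # 13 convs + 3 FC layers = 16
cfg_vgg19 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']   # 16 convs + 3 FC layers = 19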

VGG16's architecture is shown below: it has 13 convolutional layers and 3 fully connected layers.

Image by author

Model Implementation

Let's implement VGG16 in PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt

class VGG16(nn.Module):
    def __init__(self, input_channel, num_classes):
        super(VGG16, self).__init__()
        self.features = nn.Sequential(
            # Block 1: two 3x3 convs at 64 channels, then 2x2 max pool (96 -> 48)
            nn.Conv2d(input_channel, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 2: two convs at 128 channels, then max pool (48 -> 24)
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 3: three convs at 256 channels, then max pool (24 -> 12)
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 4: three convs at 512 channels, then max pool (12 -> 6)
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 5: three convs at 512 channels, then max pool (6 -> 3)
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        # Three fully connected layers; the last one outputs raw class scores (logits)
        self.classifier = nn.Sequential(
            nn.Linear(512 * 3 * 3, 4096), nn.ReLU(True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(True), nn.Dropout(),
            nn.Linear(4096, num_classes)
        )

Note, the implementation is structured in terms of two attributes:

  • features: contains all the convolutional and max pool layers
  • classifier: contains the fully connected layers that map the flattened features to class scores. There is no explicit softmax layer: the network outputs logits, and the cross-entropy loss we use later applies log-softmax internally.

Also note we have passed input_channel as an input argument. This parameter is 3 if the images are colored, and 1 if the images are grayscale.
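For example:

vgg_rgb = VGG16(input_channel=3, num_classes=10)   # color (RGB) images
vgg_gray = VGG16(input_channel=1, num_classes=10)  # grayscale images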

Last but not least, the first fully connected layer is nn.Linear(512 * 3 * 3, 4096). Its input dimension is 512 × 3 × 3 because we have set it up for our 96×96 input images: each of the five max pool layers halves the spatial size, so 96 → 48 → 24 → 12 → 6 → 3, leaving a 512 × 3 × 3 feature map. If we pass images of a different size, we have to change this value. For example, for 224×224 images this layer becomes nn.Linear(512 * 7 * 7, 4096).
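Rather than doing this arithmetic by hand, you can probe the size empirically. Here is a minimal sketch (not part of the original code): push a dummy tensor through the features part and inspect what reaches the classifier.

with torch.no_grad():
    dummy = torch.zeros(1, 3, 96, 96)       # swap in (1, 3, 224, 224) for ImageNet-size inputs
    feats = VGG16(3, 10).features(dummy)    # use only the convolutional part
print(feats.shape)                  # torch.Size([1, 512, 3, 3])
print(feats.view(1, -1).size(1))    # 4608 = 512 * 3 * 3, the input size of the first Linear layer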

We then implement the forward() method:

def forward(self, x):
    # keep every intermediate output so we can inspect shapes later
    layer_outputs = []
    for layer in self.features:
        x = layer(x)
        layer_outputs.append(x)

    x = x.view(x.size(0), -1)  # flatten to (batch_size, 512 * 3 * 3)

    for layer in self.classifier:
        x = layer(x)
        layer_outputs.append(x)

    return x, layer_outputs

Now that the network is complete, let's pass a random tensor through it and see how its shape changes as it goes through various layers.

vgg_model = VGG16(3, 10)
input_tensor = torch.rand(1, 3, 96, 96)
x, layer_outputs = vgg_model(input_tensor)
for l in layer_outputs:
  print(l.shape)

And it prints the following shapes:


torch.Size([1, 64, 96, 96])
torch.Size([1, 64, 96, 96])
torch.Size([1, 64, 96, 96])
torch.Size([1, 64, 96, 96])
torch.Size([1, 64, 48, 48])
torch.Size([1, 128, 48, 48])
torch.Size([1, 128, 48, 48])
torch.Size([1, 128, 48, 48])
torch.Size([1, 128, 48, 48])
torch.Size([1, 128, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 24, 24])
torch.Size([1, 256, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 12, 12])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 6, 6])
torch.Size([1, 512, 3, 3])
torch.Size([1, 4096])
torch.Size([1, 4096])
torch.Size([1, 4096])
torch.Size([1, 4096])
torch.Size([1, 4096])
torch.Size([1, 4096])
torch.Size([1, 10])

So the final output is a 10-dimensional vector of raw class scores (logits), one per class; the highest score marks the predicted class.
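If you want actual probabilities, apply a softmax over the class dimension. A quick check, not part of the training code:

probs = F.softmax(x, dim=1)     # x is the [1, 10] output from the forward pass above
print(probs.shape)              # torch.Size([1, 10])
print(probs.sum().item())       # 1.0 -- a valid probability distribution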

Data Transformation – STL10

Now, let's train it on the STL10 dataset [2, 3], which is licensed for commercial use. Its labeled training split contains 5,000 color images in 10 categories.

Each image is 96×96 pixels, and the 10 categories are as follows:

classes = ('airplane', 'bird', 'car', 'cat', 'deer', 'dog',
           'horse', 'monkey', 'ship', 'truck')

Let's load the data and take a look at a few images:


transform = transforms.Compose([
    transforms.ToTensor()
])

trainset = torchvision.datasets.STL10(root = './data', split = 'train', download = True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=len(trainset))

classes = ('airplane', 'bird', 'car', 'cat', 'deer', 'dog',
           'horse', 'monkey', 'ship', 'truck')

images, target = next(iter(trainloader))

np_images = images.numpy() # convert to numpy

# display one image
plt.imshow(np.transpose(np_images[0], (1, 2, 0)))
plt.title(f'class: {classes[target[0]]}')
plt.axis('off')
plt.show()

# display another image
plt.imshow(np.transpose(np_images[1], (1, 2, 0)))
plt.title(f'class: {classes[target[1]]}')
plt.axis('off')
plt.show()

and it prints these images with their labels:

Image by author

Next, let's normalize the data. To normalize the data we compute the mean and std first:

# np_images (computed above) holds the entire training split
# as a (5000, 3, 96, 96) array

# calculate mean and std for each channel 
mean = np.mean(np_images, axis=(0,2,3)) 
std = np.std(np_images, axis=(0,2,3)) 

Note that in trainloader we set batch_size = len(trainset) so that we load the whole dataset at once for computing the mean and std. Later, when we train the model, we load the data in smaller batches of 128 images.

Notice from above that np_images has shape (5000, 3, 96, 96), i.e. 5,000 color images of 96×96 pixels each (the 3 channels indicate RGB color). The resulting mean and std are:

mean = [0.44671103, 0.43980882, 0.40664575]
std = [0.2603408, 0.25657743, 0.2712671]

We will use this mean and std to normalize both the train and test data. Let's define the transformations for each dataset:

# train transformation
transform_train = transforms.Compose([
    transforms.RandomCrop(96, padding = 4), # we first pad by 4 pixels on each side then crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.44671103, 0.43980882, 0.40664575), (0.2603408 , 0.25657743, 0.2712671))
])

# test transformation
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.44671103, 0.43980882, 0.40664575), (0.2603408 , 0.25657743, 0.2712671))
])

trainset = torchvision.datasets.STL10(root = './data', split = 'train', download = True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size = 128, shuffle = True, num_workers = 2)

testset = torchvision.datasets.STL10(root = './data', split = 'test', download = True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size = 256, shuffle = True, num_workers = 2)

In the transformation defined on the train data above, you can see that we augment the data by padding each side by 4 pixels, cropping a random 96×96 patch, and randomly flipping it horizontally. We augment the data to increase diversity in the training set and force the model to generalize better, as the sketch below illustrates.
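Here is a minimal sketch (not from the original post) that applies transform_train to the same image several times, so you can see the crop/flip variety; the un-normalization for display assumes the mean and std computed earlier:

raw_set = torchvision.datasets.STL10(root='./data', split='train', download=True)
img, label = raw_set[0]   # a PIL image, since no transform is applied here

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for ax in axes:
    aug = transform_train(img)   # random crop + flip + normalize
    aug = aug * torch.tensor(std).view(3, 1, 1) + torch.tensor(mean).view(3, 1, 1)  # undo normalization
    ax.imshow(np.clip(np.transpose(aug.numpy(), (1, 2, 0)), 0, 1))
    ax.axis('off')
plt.suptitle(f'Augmented views of one "{classes[label]}" image')
plt.show()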

Training the Model

We first define the hyper-parameters such as:

  • learning rate
  • learning rate scheduler
  • loss function: which is cross entropy for classification
  • optimizer

# instantiate the model
vgg_model = VGG16(input_channel=3, num_classes=10)  # 3 input channels: STL10 images are RGB
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vgg_model = vgg_model.to(device)

# define hyper-parameters: learning rate, optimizer, scheduler
lr = 0.00001
criterion = nn.CrossEntropyLoss()
vgg_optimizer = optim.SGD(vgg_model.parameters(), lr = lr, momentum=0.9, weight_decay = 5e-4)
vgg_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(vgg_optimizer, T_max = 200)

Next, we define two functions:

Method 1: train_batch: for all batches in the training data, it runs the model, computes the loss, applies backpropagation, and updates the parameters. It also accumulates the training loss and accuracy.

def train_batch(epoch, model, optimizer):
    print("epoch ", epoch)
    model.train()
    train_loss = 0
    correct = 0
    total = 0

    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs, _ = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    print(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                         % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

Method 2: validate_batch: validates the model on batches from the test loader. We call it after each epoch to measure performance on unseen data. It computes the loss on the test set (the unseen data) and DOES NOT apply any backpropagation.

def validate_batch(epoch, model):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs,_ = model(inputs)
            loss = criterion(outputs, targets)

            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    print(batch_idx, len(testloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                 % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))

Let the actual training start …

For every epoch, we train the model and check its performance on the test dataset. We call vgg_scheduler.step() to inform the scheduler to increment its internal counter and update the learning rate.

start_epoch = 0
for epoch in range(start_epoch, start_epoch+20):
    train_batch(epoch, vgg_model, vgg_optimizer)
    validate_batch(epoch, vgg_model)
    vgg_scheduler.step()
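Before looking at the results, here is a minimal side sketch (not from the original post) of what the cosine schedule does to the learning rate over its T_max horizon, using a throwaway optimizer:

opt = optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.00001)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
lrs = []
for _ in range(200):
    lrs.append(sched.get_last_lr()[0])   # learning rate used during this epoch
    opt.step()                           # step the optimizer before the scheduler
    sched.step()
plt.plot(lrs)
plt.xlabel('epoch')
plt.ylabel('learning rate')
plt.show()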

We see the following performance:

epoch  0
390 391 Loss: 5.506 | Acc: 24.864% (12432/50000)
39 40 Loss: 4.512 | Acc: 49.780% (4978/10000)
epoch  1
390 391 Loss: 5.140 | Acc: 33.226% (16613/50000)
39 40 Loss: 4.156 | Acc: 57.120% (5712/10000)
epoch  2
390 391 Loss: 4.978 | Acc: 36.594% (18297/50000)
39 40 Loss: 3.953 | Acc: 60.450% (6045/10000)
epoch  3
390 391 Loss: 4.908 | Acc: 38.498% (19249/50000)
39 40 Loss: 3.898 | Acc: 69.430% (6943/10000)
epoch  4
390 391 Loss: 4.827 | Acc: 39.982% (19991/50000)
39 40 Loss: 3.631 | Acc: 68.240% (6824/10000)
epoch  5
390 391 Loss: 4.767 | Acc: 40.876% (20438/50000)
39 40 Loss: 3.677 | Acc: 71.260% (7126/10000)
epoch  6
390 391 Loss: 4.686 | Acc: 42.356% (21178/50000)
39 40 Loss: 3.180 | Acc: 73.560% (7356/10000)
epoch  7
390 391 Loss: 4.664 | Acc: 42.606% (21303/50000)
39 40 Loss: 3.259 | Acc: 76.920% (7692/10000)
epoch  8
390 391 Loss: 4.653 | Acc: 43.014% (21507/50000)
39 40 Loss: 3.118 | Acc: 77.150% (7715/10000)
epoch  9
390 391 Loss: 4.606 | Acc: 43.762% (21881/50000)
39 40 Loss: 2.961 | Acc: 75.850% (7585/10000)
epoch  10
390 391 Loss: 4.608 | Acc: 43.802% (21901/50000)
39 40 Loss: 2.840 | Acc: 81.130% (8113/10000)
epoch  11
390 391 Loss: 4.582 | Acc: 44.156% (22078/50000)
39 40 Loss: 2.878 | Acc: 80.810% (8081/10000)
....
...
..

We see that the accuracy of the model reaches 80.8% on the test set at epoch 11.

Next, let's look at 10 images and what the model predicts as their labels:

model = vgg_model
model.eval()  # disable dropout for inference

mean = [0.44671103, 0.43980882, 0.40664575]
std = [0.2603408 , 0.25657743, 0.2712671]

# Evaluate the model on random images and display results
for _ in range(10):
    # Get a random test image
    data, target = next(iter(testloader))

    # Get the model's predictions (no gradients needed at inference time)
    with torch.no_grad():
        output, _ = model(data.to(device))
    _, predicted = torch.max(output, 1)

    # Display the image along with predicted and actual labels
    # Unnormalize the image
    display_img = data[0]
    unnormalized_image = display_img.clone()  # Create a copy to avoid modifying the original tensor
    for i in range(3):
      unnormalized_image[i] = (unnormalized_image[i] * std[i]) + mean[i]
    plt.imshow(np.transpose(unnormalized_image.numpy(), (1, 2, 0)))
    plt.title(f'Predicted: {classes[predicted[0]]}, Actual: {classes[target[0]]}')
    plt.axis('off')
    plt.show()

For example, we see the following image of a bird, which the model correctly predicts to be a bird.

Image by author

And we see an example of a wrong prediction where the image is a plane but VGG predicts it as a bird:

Image by author

This concludes our implementation of the VGG model. We see that VGG has a very deep architecture with many parameters, yet its implementation is quite straightforward thanks to the uniformity of its architecture.
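To put a rough number on "many parameters", a one-line check (not from the original post); for this 96×96 variant it should come out around 50 million:

num_params = sum(p.numel() for p in vgg_model.parameters())
print(f'{num_params:,} parameters')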

So far, we have reviewed VGG and ResNet in concept but only VGG in code. In the next post, we will look at the ResNet implementation.



If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/

References

  1. Simonyan, K. and Zisserman, A., Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
  2. https://pytorch.org/vision/main/generated/torchvision.datasets.STL10.html
  3. https://cs.stanford.edu/~acoates/stl10/
