Build a Convolutional Neural Network from Scratch using NumPy

Author:Murphy  |  View: 20302  |  Time: 2025-03-22 23:57:43
These colored windows remind me of the layers of CNNs and their filters. Image source: unsplash.com.

As Computer Vision applications are now present everywhere in our daily lives, it is fundamental for every Data Science practitioner to understand their functioning principles and familiarize themselves with them.

In a previous article, I built a Deep Neural Network without relying on popular modern deep learning libraries like TensorFlow, PyTorch, and Keras. I then classified images of handwritten digits with it. While the achieved results didn't reach state-of-the-art levels, they were nevertheless satisfactory. Now, I want to take a further step and develop a Convolutional Neural Network (CNN) using only the Python library NumPy.

Python deep learning libraries, like the ones mentioned above, are extremely powerful tools. However, as a downside, they shield Data Science practitioners from understanding the low-level functioning principles of Neural Networks. This is especially true with CNNs, as their processes are less intuitive compared with the classical fully connected networks. The only way to address this issue is to get our hands dirty and implement CNNs ourselves: this is the motivation behind this task.

This article is intended as a practical, hands-on guide rather than a comprehensive treatment of how CNNs work. As a consequence, the theoretical part is concise and mostly serves the understanding of the practical section. For the same reason, you will find a list of resources at the end of this post. I warmly invite you to check them out!

Convolutional Neural Networks

Convolutional Neural Networks use a specific architecture and operations that make them well-suited for tasks related to images, such as image classification, object localization, image segmentation, and more. Their design roughly mirrors the human visual cortex, where each biological neuron responds to only a small portion of the visual field. Moreover, higher-level neurons react to the outputs of lower-level neurons.

While classical fully connected networks can handle image-related tasks, their effectiveness degrades significantly when applied to medium or large images due to the large number of parameters they require. For instance, a 200×200 pixel image contains 40,000 pixels, and if the first layer of the network has 1,000 units, that layer alone needs 40 million weights. CNNs greatly alleviate this problem because they implement partially connected layers and weight sharing.
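The arithmetic above can be checked in a few lines. The dense-layer figures come from the text; the convolutional layer's kernel count and size are illustrative assumptions, chosen only to show the order-of-magnitude difference that weight sharing makes.

```python
# Dense layer on a flattened 200x200 grayscale image (figures from the text).
image_h, image_w = 200, 200
dense_units = 1000
dense_weights = image_h * image_w * dense_units

# Convolutional layer: kernel count and size are illustrative assumptions.
# Every kernel is reused at each image position, so the count does not
# depend on the image size at all.
kernel_num, kernel_size = 32, 3
conv_weights = kernel_num * kernel_size * kernel_size

print(f"dense layer weights: {dense_weights:,}")  # 40,000,000
print(f"conv layer weights:  {conv_weights:,}")   # 288
```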

The main components of a Convolutional Neural Network are:

  • Convolutional layers
  • Pooling layers

Convolutional Layer

A convolutional layer consists of a set of filters, also known as kernels. When applied to the input of the layer, these filters modify the original images in specific ways.

A filter can be described as a matrix whose element values define the kind of modification applied to the original image. For instance, a 3×3 kernel, like the following one, highlights the vertical edges of the image:

This kernel instead accentuates the horizontal edges:

Source: Wikipedia.
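The kernel figures referenced above are not reproduced in this text. As a stand-in, the sketch below uses the Sobel kernels, a common textbook choice for edge detection. They are an assumption and may differ from the kernels originally shown. The helper apply_kernel() is likewise a hypothetical name introduced for this example.

```python
import numpy as np

# Sobel kernels: an assumed stand-in for the figures missing from the text.
vertical_edges = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])
horizontal_edges = vertical_edges.T

def apply_kernel(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k = kernel.shape[0]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for h in range(out_h):
        for w in range(out_w):
            out[h, w] = np.sum(image[h:h + k, w:w + k] * kernel)
    return out

# 6x6 test image: dark left half, bright right half (a single vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

vertical_response = apply_kernel(image, vertical_edges)
horizontal_response = apply_kernel(image, horizontal_edges)

print(vertical_response[0])    # [0. 4. 4. 0.]
print(horizontal_response[0])  # [0. 0. 0. 0.]
```

The vertical-edge kernel responds only where the window straddles the edge, while the horizontal-edge kernel stays at zero because the image has no horizontal intensity change.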

It is important to note that the values of the elements of these kernels are not manually chosen but are parameters that the network learns during the training process.

The main function of convolutions is to isolate and highlight the different features present in the image. Later on, dense layers will use these features.

Pooling Layer

Pooling layers are simpler than convolutional layers. Their purpose is to reduce the computational load and memory usage of the network. They achieve this by downsizing the input image's dimensions. Smaller feature maps in turn reduce the number of parameters that the following layers of the CNN have to learn.

Pooling layers also employ a kernel, typically of dimension 2×2, to aggregate a section of the input image into a single value. For example, a 2×2 max pooling kernel extracts 4 pixels from the input image and outputs only the pixel with the maximum value.
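As a small worked example of the 2×2 case just described, the sketch below pools a 4×4 array. The function name max_pool_2d() and the input values are made up for illustration.

```python
import numpy as np

def max_pool_2d(image, size=2):
    """Non-overlapping max pooling on a 2D array (extra rows/cols are dropped)."""
    out_h, out_w = image.shape[0] // size, image.shape[1] // size
    trimmed = image[:out_h * size, :out_w * size]
    # Split into (out_h, size, out_w, size) blocks, then take each block's max.
    blocks = trimmed.reshape(out_h, size, out_w, size)
    return blocks.max(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 7, 2],
                  [6, 2, 3, 3]])
pooled = max_pool_2d(image)
print(pooled)
# [[4 5]
#  [6 7]]
```

Each output value is the largest of the four pixels in its 2×2 block, so the 4×4 input shrinks to 2×2.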

Python Implementation

You can find all the code shown in this section in my GitHub repository.

GitHub – andreoniriccardo/CNN-from-scratch: Convolutional Neural Network from scratch

The concept behind this implementation consists of creating Python classes that represent the convolutional and max pooling layers. Furthermore, as this CNN will be applied to the famous open-source MNIST dataset, I also create a specific class for the Softmax layer.

Within each class, I define the methods that perform the forward propagation and backpropagation steps.

As a final step, the layers are appended into a list to build the final Convolutional Neural Network.

Convolutional Layer Implementation

The code defining a convolutional layer is the following:

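import numpy as np

class ConvolutionLayer:
    def __init__(self, kernel_num, kernel_size):
        self.kernel_num = kernel_num
        self.kernel_size = kernel_size
        # Random kernels, scaled by 1/kernel_size**2 to keep initial outputs small
        self.kernels = np.random.randn(kernel_num, kernel_size, kernel_size) / (kernel_size**2)

    def patches_generator(self, image):
        """Yield each kernel_size x kernel_size patch with its top-left (h, w)."""
        image_h, image_w = image.shape
        self.image = image  # cached for use in back_prop()
        for h in range(image_h-self.kernel_size+1):
            for w in range(image_w-self.kernel_size+1):
                patch = image[h:(h+self.kernel_size), w:(w+self.kernel_size)]
                yield patch, h, w

    def forward_prop(self, image):
        """Return an array of shape (out_h, out_w, kernel_num)."""
        image_h, image_w = image.shape
        convolution_output = np.zeros((image_h-self.kernel_size+1, image_w-self.kernel_size+1, self.kernel_num))
        for patch, h, w in self.patches_generator(image):
            # patch (k, k) broadcasts against kernels (kernel_num, k, k)
            convolution_output[h,w] = np.sum(patch*self.kernels, axis=(1,2))
        return convolution_output

    def back_prop(self, dE_dY, alpha):
        """Update the kernels given dE_dY, the gradient w.r.t. this layer's output."""
        dE_dk = np.zeros(self.kernels.shape)
        for patch, h, w in self.patches_generator(self.image):
            for f in range(self.kernel_num):
                dE_dk[f] += patch * dE_dY[h, w, f]
        # Gradient descent step with learning rate alpha
        self.kernels -= alpha*dE_dk
        # Returns the kernel gradient, not the gradient w.r.t. the input image
        return dE_dk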

The constructor of the ConvolutionLayer class takes as inputs the number of kernels of the convolutional layer and their size. I assume that all kernels are square, with size kernel_size by kernel_size.

Next, I generate random filters of shape (kernel_num, kernel_size, kernel_size) and, for normalization, I divide each element by the squared kernel size.

The patches_generator() method is a generator. It yields the portions of the images on which to perform each convolution step.

The forward_prop() method carries out the convolution for each patch generated by the method above.
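The patch-by-patch loop is easy to follow but slow in pure Python. As an optional aside, the sketch below shows one way the same forward pass could be vectorized with NumPy's sliding_window_view (available from NumPy 1.20) and einsum, and checks it against a loop version. The function names are hypothetical and this is not the approach taken in the repository.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def conv_forward_loop(image, kernels):
    """Reference: the same patch-by-patch computation as forward_prop()."""
    kernel_num, k, _ = kernels.shape
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((out_h, out_w, kernel_num))
    for h in range(out_h):
        for w in range(out_w):
            patch = image[h:h + k, w:w + k]
            out[h, w] = np.sum(patch * kernels, axis=(1, 2))
    return out

def conv_forward_vectorized(image, kernels):
    """Same result without Python-level loops."""
    k = kernels.shape[1]
    windows = sliding_window_view(image, (k, k))  # (out_h, out_w, k, k)
    # Sum over the kernel axes i, j for every position (h, w) and filter f.
    return np.einsum("hwij,fij->hwf", windows, kernels)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # stand-in for one MNIST digit
kernels = rng.standard_normal((32, 3, 3)) / 9

out_loop = conv_forward_loop(image, kernels)
out_vec = conv_forward_vectorized(image, kernels)
print(out_vec.shape)                          # (26, 26, 32)
print(np.allclose(out_loop, out_vec))         # True
```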

Finally, the back_prop() method computes the gradient of the loss function with respect to each of the layer's weights and updates the weights accordingly. Note that its dE_dY argument is not the global loss of the network. Instead, it is the gradient of the loss with respect to this layer's output, passed back by the max pooling layer that follows the convolutional layer.
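One way to convince yourself that the kernel gradient is right is a finite-difference check. The sketch below is a minimal, self-contained version of that idea: it uses a toy loss whose gradient with respect to the layer output is exactly a chosen dE_dY, and compares the analytic kernel gradient with a numerical estimate. The function names, array sizes, and toy loss are assumptions made for this example.

```python
import numpy as np

def conv_forward(image, kernels):
    k = kernels.shape[1]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((out_h, out_w, kernels.shape[0]))
    for h in range(out_h):
        for w in range(out_w):
            out[h, w] = np.sum(image[h:h + k, w:w + k] * kernels, axis=(1, 2))
    return out

def kernel_gradient(image, kernels, dE_dY):
    """Analytic gradient: every patch weighted by its upstream gradient."""
    k = kernels.shape[1]
    dE_dk = np.zeros_like(kernels)
    for h in range(dE_dY.shape[0]):
        for w in range(dE_dY.shape[1]):
            patch = image[h:h + k, w:w + k]
            # (kernel_num, 1, 1) * (k, k) broadcasts to (kernel_num, k, k)
            dE_dk += dE_dY[h, w][:, None, None] * patch
    return dE_dk

rng = np.random.default_rng(1)
image = rng.random((6, 6))
kernels = rng.standard_normal((2, 3, 3))
dE_dY = rng.standard_normal((4, 4, 2))  # pretend upstream gradient

def loss(ks):
    # Toy loss whose gradient w.r.t. the layer output is exactly dE_dY.
    return np.sum(conv_forward(image, ks) * dE_dY)

analytic = kernel_gradient(image, kernels, dE_dY)

# Central finite differences, one kernel entry at a time.
numeric = np.zeros_like(kernels)
eps = 1e-6
for idx in np.ndindex(*kernels.shape):
    shifted = kernels.copy()
    shifted[idx] += eps
    plus = loss(shifted)
    shifted[idx] -= 2 * eps
    minus = loss(shifted)
    numeric[idx] = (plus - minus) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff < 1e-5)  # True
```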

To demonstrate the actual effect of this class, I created an instance of ConvolutionLayer with 32 filters, each of size 3×3. Then I applied the forward propagation method to an image, resulting in an output consisting of 32 slightly smaller images.

The initial input image has size 28×28 pixels and is depicted below:

Image by the author.

Once I applied the forward_prop() method of the convolutional layer, I obtained 32 images of size 26×26 pixels. One of them is the following:

Image by the author.

As you can see, the image has been reduced in size, and the clarity of the handwritten digit is worse. It is important to note that this operation was carried out by a filter containing random values, and therefore, it does not accurately represent the actual step performed by a trained CNN. Still, you can grasp the idea of how these convolutions yield smaller images where the distinctive features of the object are isolated.

Max Pooling Layer Implementation

I used NumPy to define the Max Pooling layer class as follows:

class MaxPoolingLayer:
    def __init__(self, kernel_size):
        self.kernel_size = kernel_size

    def patches_generator(self, image):
        """Yield non-overlapping kernel_size x kernel_size patches with output (h, w)."""
        output_h = image.shape[0] // self.kernel_size
        output_w = image.shape[1] // self.kernel_size
        self.image = image  # cached for use in back_prop()

        for h in range(output_h):
            for w in range(output_w):
                patch = image[(h*self.kernel_size):(h*self.kernel_size+self.kernel_size), (w*self.kernel_size):(w*self.kernel_size+self.kernel_size)]
                yield patch, h, w

    def forward_prop(self, image):
        image_h, image_w, num_kernels = image.shape
        max_pooling_output = np.zeros((image_h//self.kernel_size, image_w//self.kernel_size, num_kernels))
        for patch, h, w in self.patches_generator(image):
            # Maximum over the two spatial axes, separately for each channel
            max_pooling_output[h,w] = np.amax(patch, axis=(0,1))
        return max_pooling_output

    def back_prop(self, dE_dY):
        # Gradient w.r.t. the layer input; no weights to update here
        dE_dk = np.zeros(self.image.shape)
        for patch,h,w in self.patches_generator(self.image):
            image_h, image_w, num_kernels = patch.shape
            max_val = np.amax(patch, axis=(0,1))

            # Route the gradient to the position(s) holding the patch maximum
            for idx_h in range(image_h):
                for idx_w in range(image_w):
                    for idx_k in range(num_kernels):
                        if patch[idx_h,idx_w,idx_k] == max_val[idx_k]:
                            dE_dk[h*self.kernel_size+idx_h, w*self.kernel_size+idx_w, idx_k] = dE_dY[h,w,idx_k]
        # Return only after all patches have been processed
        return dE_dk

The constructor method only assigns the kernel size value. The following methods operate similarly to the ones defined for the convolutional layer, with the main difference being that the back_prop() method doesn't update any weight values. In fact, the pooling layer doesn't rely on weights to perform the aggregation.
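To make the backward pass concrete, here is a small sketch of how a max pooling layer routes gradients: each upstream value is copied to the position that held the maximum in its patch, and every other position receives zero. The function name max_pool_backward() and the numbers are invented for illustration, and the code uses a boolean mask in place of the explicit inner loops.

```python
import numpy as np

def max_pool_backward(image, dE_dY, size=2):
    """Route each upstream gradient to the max position(s) of its patch."""
    dE_dX = np.zeros(image.shape)
    out_h, out_w = image.shape[0] // size, image.shape[1] // size
    for h in range(out_h):
        for w in range(out_w):
            rows = slice(h * size, (h + 1) * size)
            cols = slice(w * size, (w + 1) * size)
            patch = image[rows, cols]                  # (size, size, channels)
            is_max = patch == patch.max(axis=(0, 1))   # tied maxima all get the gradient
            dE_dX[rows, cols] = is_max * dE_dY[h, w]
    return dE_dX

# One-channel 4x4 input and a made-up 2x2 upstream gradient.
image = np.array([[1., 3., 2., 0.],
                  [4., 2., 1., 5.],
                  [0., 1., 7., 2.],
                  [6., 2., 3., 3.]])[:, :, np.newaxis]
dE_dY = np.array([[10., 20.],
                  [30., 40.]])[:, :, np.newaxis]

dE_dX = max_pool_backward(image, dE_dY)
print(dE_dX[:, :, 0])
```

In this example the maxima 4, 5, 6, and 7 sit at positions (1, 0), (1, 3), (3, 0), and (2, 2), so those are the only entries of the result that are non-zero, holding 10, 20, 30, and 40 respectively.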

Softmax Layer Implementation

Finally, I define the Softmax layer. It flattens the output volume obtained from the final max pooling layer, passes it through a fully connected layer, and applies the softmax function. The layer outputs 10 values, which can be interpreted as the probabilities of the image corresponding to each of the digits 0 to 9.
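The number of inputs to this layer follows from the shapes produced by the earlier layers. The short calculation below walks through them for one 28×28 image. The value kernel_num = 32 mirrors the earlier demonstration and is an assumption here; the network in the repository may use a different number of kernels.

```python
# Shape walk-through for a single 28x28 MNIST image.
# kernel_num = 32 is assumed from the earlier demonstration.
image_size = 28
kernel_num, kernel_size = 32, 3
pool_size = 2
num_classes = 10

conv_out = image_size - kernel_size + 1          # 26
pool_out = conv_out // pool_size                 # 13
input_units = pool_out * pool_out * kernel_num   # 5408

print((conv_out, conv_out, kernel_num))   # (26, 26, 32)
print((pool_out, pool_out, kernel_num))   # (13, 13, 32)
print((input_units, num_classes))         # (5408, 10): weight matrix shape
```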

The implementation has the same structure as the ones seen above:

class SoftmaxLayer:
    def __init__(self, input_units, output_units):
        # Small random weights, scaled by the number of inputs
        self.weight = np.random.randn(input_units, output_units)/input_units
        self.bias = np.zeros(output_units)

    def forward_prop(self, image):
        # Cache shapes and values needed later by back_prop()
        self.original_shape = image.shape
        image_flattened = image.flatten()
        self.flattened_input = image_flattened
        # Fully connected step: Z = X @ W + b
        first_output = np.dot(image_flattened, self.weight) + self.bias
        self.output = first_output
        # Softmax turns the scores Z into probabilities that sum to 1
        softmax_output = np.exp(first_output) / np.sum(np.exp(first_output), axis=0)
        return softmax_output

    def back_prop(self, dE_dY, alpha):
        for i, gradient in enumerate(dE_dY):
            # Skip zero entries of the upstream gradient
            if gradient == 0:
                continue
            transformation_eq = np.exp(self.output)
            S_total = np.sum(transformation_eq)

            # Derivative of softmax output i w.r.t. every score Z_j
            dY_dZ = -transformation_eq[i]*transformation_eq / (S_total**2)
            dY_dZ[i] = transformation_eq[i]*(S_total - transformation_eq[i]) / (S_total**2)

            # Derivatives of Z = X @ W + b w.r.t. weights, bias, and input
            dZ_dw = self.flattened_input
            dZ_db = 1
            dZ_dX = self.weight

            # Chain rule
            dE_dZ = gradient * dY_dZ

            dE_dw = dZ_dw[np.newaxis].T @ dE_dZ[np.newaxis]
            dE_db = dE_dZ * dZ_db
            dE_dX = dZ_dX @ dE_dZ

            # Gradient descent step with learning rate alpha
            self.weight -= alpha*dE_dw
            self.bias -= alpha*dE_db

            # Returns after the first non-zero gradient entry, reshaped
            # to match the output of the max pooling layer
            return dE_dX.reshape(self.original_shape)
Image by the author.
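Two details of the Softmax layer are worth a quick check. First, exponentiating large scores can overflow, and subtracting the maximum score before calling exp() is a common remedy that leaves the result unchanged. Second, the derivative written in back_prop() equals p_i·(δ_ij − p_j) in terms of the output probabilities. The sketch below verifies both. It also assumes a cross-entropy loss, which the skipping of zero gradients in back_prop() suggests but the text does not state. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Softmax with the max subtracted first to avoid overflow in exp()."""
    shifted = np.exp(z - np.max(z))
    return shifted / np.sum(shifted)

def dY_dZ_row(z, i):
    """Derivative of output i w.r.t. every score, written as in back_prop()."""
    e = np.exp(z)
    S = np.sum(e)
    row = -e[i] * e / S**2
    row[i] = e[i] * (S - e[i]) / S**2
    return row

z = np.array([1.0, 2.0, 0.5, -1.0])
p = softmax(z)
true_label = 1  # hypothetical correct class

# The same derivative expressed with probabilities: p_i * (delta_ij - p_j).
row = dY_dZ_row(z, true_label)
compact = p[true_label] * ((np.arange(z.size) == true_label) - p)
print(np.allclose(row, compact))  # True

# Assuming cross-entropy loss E = -log(p[true_label]), the upstream gradient
# is -1/p[true_label] at the true label and 0 elsewhere, so dE/dZ = p - one_hot.
dE_dZ = (-1.0 / p[true_label]) * row
one_hot = np.eye(z.size)[true_label]
print(np.allclose(dE_dZ, p - one_hot))  # True

# The shifted version stays finite where a plain exp() would overflow.
big = softmax(np.array([1000.0, 1000.0]))
print(big)  # [0.5 0.5]
```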

Conclusions

In this post, we went through a theoretical introduction to the fundamental architectural elements of CNNs, namely convolutional and pooling layers. I am confident that the step-by-step Python implementation gives you a practical understanding of how these theoretical concepts translate into code.

I invite you to clone the GitHub repository containing the code and play with the main.py script. Of course, this network doesn't achieve state-of-the-art performance, as it was not built for that purpose, but it nevertheless reaches 96% accuracy after a few epochs.

Finally, to expand your knowledge of CNNs and computer vision, I suggest checking some of the resources listed below.



References
