Convolutional Neural Networks For Beginners

I wrote this post as part of my preparation for one of the lectures I taught at Interview Kickstart to prepare professionals to land jobs in top tech companies. If you are in the process of preparing for interviews or just strengthening your foundation, this post might help you too.
In this post, we look into the fundamentals of convolutional neural networks. We start with what a convolution operation is, then move on to what a convolutional layer is and how convolutional networks are built.
Let's get started.
Convolutional Neural Networks (CNNs) consist of several "convolutional layers". These layers run the "convolution operation." Convolution is a fundamental operation in signal and image processing. Let's first see what this operation is.
What is a convolution operation?
Convolution is the mathematical operation between a kernel (filter) and an input feature map.
The kernel is usually a small matrix, e.g. 3×3 or 5×5. The input is a feature map with height, width, and channels. The convolution operation works by sliding the kernel over the input and computing the dot product between the kernel and each local region of the input. This element-wise multiplication and summation produces a single value in the output feature map.
As the filter slides over all locations, it generates a 2D activation map called the output feature map. At each position of the filter (kernel) over the image or input feature map, we multiply the kernel element-wise with the local region and sum the products. This gives one entry in the output map:

Next, we slide the filter to the right over another local region of the input map, which produces another entry in the output map:

We keep sliding the filter to the right until it can go no further, then move it back to the left-most side and slide it down by one entry:

The last convolution operation for this example is at the bottom-right position:

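Here is a minimal NumPy sketch of the sliding-window loop described above. The function name conv2d_valid and the example values are mine for illustration; like most deep-learning libraries, it actually computes cross-correlation (the kernel is not flipped), which is what "convolution" usually means in CNNs:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 2D convolution (no padding, stride 1): slide the kernel over
    every valid position and take the sum of element-wise products."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(25).reshape(5, 5)        # a 5x5 input feature map
k = np.array([[0,  1, 0],
              [1, -4, 1],
              [0,  1, 0]])             # a 3x3 kernel
print(conv2d_valid(x, k).shape)        # (3, 3): the output map is smaller
```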
Now, let's take a look at the convolutional layer, which is the key building block of a convolutional neural network (CNN).
What is a convolutional layer?
A convolutional layer, like any other layer, consists of multiple neurons. Each neuron in the convolutional layer has a set of weights defining its filter (kernel). This filter is convolved with the input (or the output of the previous layer) to generate a 2-dimensional activation map.
During the forward pass, the input to a neuron in a convolutional layer is a 3D volume with dimensions [height, width, channels].
The filter has a small spatial extent (e.g. 3×3) but extends fully along the input depth. That is, it is convolved (element-wise multiplied and summed) with the input across all of its channels and produces a 2D activation map as output. The output dimensions for each neuron are [height, width, 1].
If we stack together the output maps of all neurons in one convolutional layer, we get an output of dimensions [height, width, channels], where the number of channels equals the number of neurons in the layer. The channel dimension is the depth of the output.
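As a quick shape check, here is a minimal sketch using PyTorch (my choice of framework; the post itself is framework-agnostic). Note that PyTorch orders tensors as [batch, channels, height, width] rather than [height, width, channels]:

```python
import torch
import torch.nn as nn

# A single convolutional layer with 16 neurons (filters).
x = torch.randn(1, 3, 32, 32)                     # one RGB input of size 32x32
conv = nn.Conv2d(in_channels=3, out_channels=16,  # 16 filters -> 16 output channels
                 kernel_size=3, padding=1)        # each filter spans all 3 input channels
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 32, 32]) -- one 2D map per filter, stacked
```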
Now, let's look at the parameters and hyper-parameters of this layer:
Parameters: A layer's parameters are the weights of its kernels. They are initialized to random values and then learned and optimized during training.
Hyper-parameters: A layer's hyper-parameters are the following:
- 1) Number of neurons in the layer
- 2) Filter size (e.g. 2×2, 3×3); usually all neurons in a layer have filters of the same size
- 3) Stride: the number of steps by which the filter moves, to the right or down, at each slide
- 4) Zero-padding: the technique where the input to a convolutional layer is padded with zeros around its border. This helps preserve spatial resolution throughout the network; otherwise spatial information can get lost very quickly as depth increases. There are two main types of zero-padding:
- 4–1) Same padding – pad so that the output size matches the input size. This requires padding of (kernel size – 1) / 2.
- 4–2) Valid padding – use no padding. The output shrinks by kernel size – 1.
So if we are using a 3×3 kernel, we zero-pad by 1 pixel on each side of the input. See the example below, where the input image is 5×5 and the kernel is 3×3. Without zero-padding, the output activation map is 3×3. As we can see, the spatial dimensions are not preserved.

But if we apply zero-padding, we preserve the spatial dimensions of the input image. As we see below, the output activation map is 5×5.

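The sizes in these two examples follow a general rule: the output spatial size is floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A tiny helper (output_size is just an illustrative name) confirms the 5×5 example above:

```python
def output_size(n_in, kernel, stride=1, padding=0):
    """Spatial size of a convolution output: floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1

print(output_size(5, 3, padding=0))  # 3 -> 'valid' padding shrinks the map
print(output_size(5, 3, padding=1))  # 5 -> 'same' padding preserves the size
```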
What is a convolutional neural network?
A Convolutional Neural Network (CNN) is a neural network that consists of multiple convolutional layers, pooling layers and fully connected layers. It is often used to process data with a grid-like topology, such as images.
The convolutional layers, as we saw above, apply a convolution operation to the input using a set of learnable filters. Together, they build a hierarchy of feature maps.
The pooling layers (max pooling or average pooling) downsample the feature maps to reduce computation.
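As a concrete illustration of pooling, here is a quick NumPy sketch of 2×2 max pooling with stride 2 on a made-up 4×4 feature map:

```python
import numpy as np

# 2x2 max pooling with stride 2: take the maximum of each 2x2 block.
fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 3, 2],
               [6, 0, 1, 4]])

# Reshape groups the 4x4 map into 2x2 blocks, then we take each block's max.
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 5]
               #  [6 4]]
```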
The fully connected layers are usually the last layers in a CNN; they connect every neuron in one layer to every neuron in the next and perform the classification.
Here we see an image of a CNN:

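To tie the pieces together, here is a hypothetical minimal CNN in PyTorch in the spirit of such a figure; the 32×32 RGB input and 10 output classes are illustrative assumptions on my part:

```python
import torch
import torch.nn as nn

# Convolution + pooling blocks followed by a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # conv layer: 16 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # downsample 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # conv layer: 32 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # downsample 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully connected classifier
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10]) -- one score per class
```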
This concludes the topic of this post. In the next posts, we will look at some famous CNN architectures for image classification tasks.
Key Takeaways
- Some terminology: a channel is also called a feature map, and a kernel is also called a filter. The receptive field is the region of the input image that a neuron looks at and extracts features from.
- All neurons in a convolutional layer have the same kernel (filter) size. This is important so that we can stack together all output activation maps.
- CNNs can easily scale to large images as convolutional filters are only applied locally.
- The number of neurons in a convolutional layer is the same as the number of filters (kernels). It determines the number of channels in the output feature map.
- Neurons in a convolutional layer have independent kernels; they do not share weights.
Summary
Convolutional Neural Networks (CNNs) are neural networks for grid-like data that consist of convolutional layers, pooling layers and fully connected layers. Each neuron in a convolutional layer corresponds to a filter (kernel). During the forward pass, each filter is convolved with the input volume to produce that filter's 2D activation map. Multiple filters produce multiple activation maps, which are stacked along the depth dimension. A pooling layer, either max pooling or average pooling, then downsamples them to a smaller size. At the end of the network, one or a few fully connected layers perform the image classification task.
If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/