Applying Hyperparameter Tuning To Neural Networks

Background
In my previous post, we discussed how neural networks predict and learn from the data. There are two processes responsible for this: the forward pass and backward pass, also known as backpropagation. You can learn more about it here:
This post will dive into how we can optimise this "learning" and "training" process to increase the performance of our model. We will cover computational improvements and hyperparameter tuning, and how to implement them in PyTorch!
But, before all that good stuff, let's quickly jog our memory about neural networks!
Quick Recap: What are Neural Networks?
Neural networks are large mathematical expressions that try to find the "right" function that can map a set of inputs to their corresponding outputs. An example of a neural network is depicted below:

Each hidden-layer neuron carries out the following computation (there is a short PyTorch sketch of this after the list below):

- Inputs: These are the features of our dataset.
- Weights: Coefficients that scale the inputs. The goal of the algorithm is to find the optimal coefficients through gradient descent.
- Linear Weighted Sum: Sum up the products of the inputs and weights and add a bias/offset term, b.
- Hidden Layer: Multiple neurons are stored here to learn patterns in the dataset. The superscript refers to the layer and the subscript to the index of the neuron within that layer.
- Arrows: These are the weights for the network from their corresponding inputs, whether those are the features or the hidden-layer outputs. I have omitted writing them explicitly on the diagram for a cleaner plot.
- ReLU Activation Function: The most popular activation function as it is computationally efficient and intuitive. See more about it here.
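To make the recap concrete, below is a minimal PyTorch sketch of a network like this. The layer sizes (two inputs, three hidden neurons, one output) are illustrative assumptions, not tied to any particular dataset:

```python
import torch
import torch.nn as nn

# A tiny feed-forward network: linear weighted sum -> ReLU -> output neuron
model = nn.Sequential(
    nn.Linear(2, 3),  # hidden layer: weights and a bias for each of the 3 neurons
    nn.ReLU(),        # ReLU activation applied to each hidden neuron's output
    nn.Linear(3, 1),  # single output neuron
)

x = torch.tensor([[0.5, -1.2]])  # one example with two input features
print(model(x))                  # forward pass through the network
```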
I have linked a fantastic video here explaining how neural networks can learn anything for some more context!
If you want a more thorough introduction, check my previous post here:
Computational Improvements
Perhaps the main optimisation technique that has made neural networks so widely accessible nowadays is parallel processing.
The sheer volume of available data is another key reason why neural networks are so effective in practice.
Each layer can be formulated as a large matrix with its associated inputs, weights, and biases. For example, consider this basic output from a neuron:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b
Here the x's are the inputs, the w's are the weights, b is the bias, and z is the final output. The above formula can be re-written in matrix form:

z = wᵀx + b
Without this vectorised implementation, the runtime for training neural networks would be enormous as we would have to rely on loops: we would multiply each weight and input and then add the products one after the other to get z. With a vectorised implementation, the whole computation is done in one step.
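As a rough illustration (not a rigorous benchmark), here is what the loop-based and vectorised versions of a single neuron's output look like in PyTorch. The feature count and values are arbitrary assumptions:

```python
import torch

n_features = 1_000
x = torch.randn(n_features)   # inputs
w = torch.randn(n_features)   # weights
b = torch.tensor(0.5)         # bias

# Loop-based: multiply and accumulate one term at a time
z_loop = b.clone()
for i in range(n_features):
    z_loop += w[i] * x[i]

# Vectorised: one dot product computes the whole weighted sum at once
z_vec = torch.dot(w, x) + b

print(torch.isclose(z_loop, z_vec))  # same result, far fewer Python-level operations
```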
There is a great article linked here and a video here that compares the runtime for a vectorised approach vs using loops.
Most deep learning frameworks like PyTorch and TensorFlow do this for you under the hood, so you don't have to worry too much about it!
Hyperparameters
Overview
The search space for neural network architectures and parameters is unimaginably huge, if not infinite. Several libraries exist to help you tune hyperparameters, such as hyperopt, optuna, or even plain scikit-learn.
There are also many more libraries out there, see here for a list.
They differ in their approach: some use a simple grid search or random search, whereas others use more sophisticated methods such as Bayesian optimisation or even evolutionary algorithms like genetic algorithms. One method is not inherently better than another; it all comes down to how you devise your search space and your compute resources.
It is important to have some prior knowledge of sensible parameter ranges so that you don't waste a significant amount of time searching over unnecessary values and can converge quickly. Let's now run through some of the main hyperparameters you ought to look at!
Number of Hidden Layers
There is something called the _universal approximation theorem_ that basically says a neural network with a single hidden layer can approximate any function, provided it has enough neurons.
However, having one layer with loads of neurons is not ideal; it is better to have several layers with fewer neurons each. The hypothesis is that each layer learns something new at a different level of granularity, whereas with one hidden layer the network is expected to learn every nuance of the dataset all at once.
In general, a couple of hidden layers are often good enough. For example, on the MNIST dataset, a model with one hidden layer and a couple of hundred neurons has 97% accuracy, but a network with two hidden layers and the same number of neurons has 98% accuracy.
The MNIST dataset contains many examples of handwritten digits.
Of course, like any hyperparameter, we should apply some tuning process to iterate over a different number of layers in combination with other hyperparameters.
Number of Neurons in Layers
The number of neurons in the input and output layers is pre-defined by the problem. The input layer size must equal the number of features in the dataset. If your dataset has 50 features, then your input layer will have 50 neurons. Similarly, the output layer needs to be appropriate for the problem. If you are predicting house prices, then the output layer will only have 1 neuron. However, if you are trying to classify single digits, like in the MNIST dataset, then you need 10 output neurons.
For the hidden layers, you can go a bit wild! The search space for the number of neurons is massive. However, it's often best to slightly overshoot how many you have and use a technique such as _early stopping_ to prevent overfitting.
Another key idea is to ensure that each layer has enough neurons to have sufficient representational power. If you are trying to model a 3D signal, a layer with only 2 neurons can only work in 2D, so some information about the signal will be lost.
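Because both the number of hidden layers and the neurons per layer are things you will want to tune, it helps to write the model so they are plain parameters. Below is a minimal sketch; the helper name and layer sizes are assumptions for illustration:

```python
import torch.nn as nn

def build_mlp(n_inputs, hidden_sizes, n_outputs):
    """Build an MLP whose depth and width are hyperparameters."""
    layers = []
    in_size = n_inputs
    for size in hidden_sizes:               # one Linear + ReLU block per hidden layer
        layers.append(nn.Linear(in_size, size))
        layers.append(nn.ReLU())
        in_size = size
    layers.append(nn.Linear(in_size, n_outputs))  # output layer sized by the problem
    return nn.Sequential(*layers)

# e.g. an MNIST-style problem: 784 input features, 10 output classes,
# two hidden layers with 128 and 64 neurons
model = build_mlp(784, [128, 64], 10)
```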
Learning Rate
The learning rate determines how quickly the algorithm can converge to the optimum, as it controls the step size used during backpropagation. It's probably one of the most important hyperparameters for training neural networks. Too high, and learning will diverge; too low, and the algorithm will take ages to converge.
I would typically tune my learning rate over a wide search space between 0.001 and 1, as this is the range that most commonly comes up in the literature.
One of the best ways to find the optimal learning rate is through learning rate scheduling. A schedule reduces the learning rate as training progresses, so the algorithm takes smaller step sizes near and around the optimum. Let's break down some of the common ones:
Time Based Decay:
The learning rate decreases at a certain rate over time:

α = α₀ / (1 + decay × epoch)

Here, α is the learning rate, α₀ is the initial learning rate, decay is the decay rate, and epoch is the number of iterations.
An epoch is one training cycle of a neural network using all of the training data.
Step Decay:
The learning rate is reduced by a certain factor after a given number of training epochs:

α = α₀ × factor^⌊epoch / step⌋

Where factor is the amount the learning rate is multiplied by at each reduction (a value below 1) and step is the number of epochs after which the learning rate is reduced.
Exponential Decay:
The learning rate will decrease exponentially at each epoch:

α = α₀ × e^(-decay × epoch)
Others:
There are also many other learning rate schedules such as performance scheduling, 1cycle scheduling, and power scheduling. It is important to remember that one scheduling method is not better than another and it's ideal to try several to determine which is the best fit for your model and data.
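If you are working in PyTorch, the schedules above map quite directly onto the built-in `torch.optim.lr_scheduler` classes. A minimal sketch, where the decay values and placeholder model are arbitrary assumptions; in practice you would attach only one scheduler to an optimiser:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Time-based decay: lr = lr0 / (1 + decay * epoch)
time_based = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 1.0 / (1.0 + 0.01 * epoch))

# Step decay: multiply the learning rate by `gamma` every `step_size` epochs
step = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Exponential decay: multiply the learning rate by `gamma` every epoch (gamma = e^-decay)
exponential = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Inside a training loop, call scheduler.step() once per epoch after the optimisation step
```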
Batch Size
When training neural networks, there are three common variations of gradient descent, which are:
- Batch Gradient Descent: Use the entire training dataset to compute the gradient of the loss function. This is the most robust but not computationally feasible for large datasets.
- Stochastic Gradient Descent: Use a single data point to compute the gradient of the loss function. This method gives the quickest updates, but the gradient estimate is noisy and the convergence path can be slow and erratic.
- Mini-Batch Gradient Descent: Use a subset of the training dataset to compute the gradient of the loss function. The size of the batches varies and depends on the size of the dataset. This gives the best of both worlds between batch and stochastic gradient descent.
The tricky part is finding the best batch size for mini-batch gradient descent. It's generally recommended to use the largest batch size that fits in your GPU's memory, as larger batches make the best use of parallel computation.
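In PyTorch, the gradient descent variant is effectively chosen through the `batch_size` of the `DataLoader`. A quick sketch with a dummy dataset; the dataset and batch sizes are arbitrary assumptions:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy dataset: 1,000 examples with 20 features each
X = torch.randn(1_000, 20)
y = torch.randn(1_000, 1)
dataset = TensorDataset(X, y)

# Batch gradient descent: every example in a single batch
batch_loader = DataLoader(dataset, batch_size=len(dataset))

# Stochastic gradient descent: one example per gradient update
sgd_loader = DataLoader(dataset, batch_size=1, shuffle=True)

# Mini-batch gradient descent: a subset per update (64 is a common starting point)
minibatch_loader = DataLoader(dataset, batch_size=64, shuffle=True)
```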
Number of Iterations
This is the number of epochs, i.e. the total number of full forward and backward passes that we carry out for the neural network. In reality, it's better to set the number of epochs arbitrarily high and use early stopping. This removes the chance of terminating learning too early.
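Below is a minimal sketch of early stopping: train for an arbitrarily high number of epochs, track the validation loss, and stop once it has not improved for a set number of epochs. The patience value and the `train_one_epoch` / `evaluate` helpers are hypothetical placeholders, assumed to be defined elsewhere:

```python
import torch

best_val_loss = float("inf")
patience = 10          # epochs to wait for an improvement before stopping
epochs_no_improve = 0

for epoch in range(1_000):                  # set arbitrarily high
    train_one_epoch(model, train_loader)    # hypothetical training helper
    val_loss = evaluate(model, val_loader)  # hypothetical validation helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_no_improve = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Stopping early at epoch {epoch}")
            break
```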
Activation Functions
Most networks use ReLU, primarily due to its computational efficiency, but there are others out there. One of my previous posts summarised the main ones along with their pros and cons:
The activation function chosen for the output layer is important to ensure your prediction is on the right scale for the problem. For example, if you are predicting a probability, then you would use a sigmoid activation function.
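As a tiny sketch of that last point, a binary classifier might end in a sigmoid so its output can be read as a probability (the layer sizes here are arbitrary assumptions):

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(20, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),  # squashes the output into [0, 1] so it can be read as a probability
)
```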
However, in my opinion, testing different activation functions in the hidden layers won't move the needle on performance as much as the other hyperparameters we have discussed above.
Other Hyperparameters
There are also other hyperparameters that can be tuned for the model:
- Optimisation algorithm (covered in the next article!)
- Regularisation methods (covered in a future article!)
- Weight initialisation
- Loss function
Python Example
Below is some boilerplate code that carries out Hyperparameter Tuning for a neural network in PyTorch using hyperopt for the MNIST dataset:
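The following condensed sketch gives the flavour of such a script; the search-space ranges, epoch counts, and architecture choices are illustrative assumptions, and the full version is linked below:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# MNIST data: 28x28 grey-scale digit images
transform = transforms.ToTensor()
train_data = datasets.MNIST("data", train=True, download=True, transform=transform)
test_data = datasets.MNIST("data", train=False, download=True, transform=transform)

def build_model(n_hidden_layers, n_neurons):
    layers, in_size = [nn.Flatten()], 28 * 28
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_size, n_neurons), nn.ReLU()]
        in_size = n_neurons
    layers.append(nn.Linear(in_size, 10))  # 10 output neurons, one per digit
    return nn.Sequential(*layers)

def objective(params):
    """Train a model with the sampled hyperparameters and return a score to minimise."""
    model = build_model(params["n_hidden_layers"], params["n_neurons"])
    optimizer = optim.Adam(model.parameters(), lr=params["lr"])
    loss_fn = nn.CrossEntropyLoss()
    train_loader = DataLoader(train_data, batch_size=params["batch_size"], shuffle=True)

    model.train()
    for _ in range(2):  # a couple of epochs per trial keeps the search quick
        for X, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(X), y).backward()
            optimizer.step()

    # Score the trial by test-set accuracy (hyperopt minimises, so negate it)
    model.eval()
    correct = 0
    with torch.no_grad():
        for X, y in DataLoader(test_data, batch_size=1000):
            correct += (model(X).argmax(dim=1) == y).sum().item()
    return {"loss": -correct / len(test_data), "status": STATUS_OK}

search_space = {
    "n_hidden_layers": hp.choice("n_hidden_layers", [1, 2, 3]),
    "n_neurons": hp.choice("n_neurons", [64, 128, 256]),
    "lr": hp.loguniform("lr", -7, 0),  # roughly 0.001 to 1
    "batch_size": hp.choice("batch_size", [32, 64, 128]),
}

best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=20, trials=Trials())
print(best)
```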
The code is available on my GitHub here:
Medium-Articles/Neural Networks/hyperparam_tune.py at main · egorhowell/Medium-Articles
Summary & Further Thoughts
Neural networks have many hyperparameters and effectively infinite architectures, which makes finding the best combination very difficult. Fortunately, packages such as optuna and hyperopt exist that carry out this process for us in a smart way. The hyperparameters that are often best to tune are the number of hidden layers, the number of neurons, and the learning rate. These often give us the most 'bang for our buck' when developing neural net models. The number of epochs is made redundant through the use of early stopping, and the activation function chosen also often has a minimal effect on performance. However, it is always important to consider what type of problem you are trying to solve when choosing the structure of your input and output layers as well as the output layer's activation function.
References & Further Reading
- Andrej Karpathy Neural Network Course
- PyTorch site
- Another example of training a neural network by hand
- Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O'Reilly Media, September 2019. ISBN: 9781492032649.
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.