Why Batch Normalization Matters for Deep Learning
Batch Normalization has become one of the most important techniques for training neural networks in recent years. It makes training considerably more efficient and stable, which is crucial for large and deep networks. It was originally introduced to address the problem of internal covariate shift.
This article will examine the problems involved in training neural networks and how batch normalization can solve them. We will describe the process in detail and show how batch normalization can be implemented in Python and integrated into existing models. We will also consider the advantages and disadvantages of this method to determine whether it makes sense to use it.
What Problems arise when training Deep Neural Networks?
When training a deep neural network, backpropagation takes place after each forward pass: the prediction error is propagated back through the network layer by layer. During this process, the weights of the individual neurons are adjusted so that the error is reduced as quickly as possible. Each of these updates implicitly assumes that all other layers remain unchanged. In practice, however, this assumption holds only to a limited extent, since all layers are updated simultaneously during backpropagation. The paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" describes this core problem in more detail.
The problem is that the statistical properties of the value distributions also change with every weight update. After each training step, the inputs that a layer receives follow a new distribution with a different mean and a different standard deviation. This problem is known as internal covariate shift and makes training harder, because the subsequent layers cannot optimally process the constantly changing distribution.
To keep training stable despite this, lower learning rates must be used to limit the fluctuations in the weight changes. However, this means that the model takes significantly longer to converge and training is slowed down overall.
What is Normalization?
The normalization of data is a process that is often used when preparing data sets for machine learning. The aim is to bring the numerical values of different attributes onto a common scale. For example, you can divide all values in a data series by the maximum value of that series to obtain a new series whose values lie in the range from 0 to 1.
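As a minimal illustration of this idea, a short Python sketch can scale a series of values to the range from 0 to 1 by dividing by its maximum; the numbers used here are purely illustrative:

# Scale a series of values to the range 0 to 1 by dividing by its maximum.
values = [120, 450, 80, 300]              # example values, purely illustrative
maximum = max(values)
normalized = [v / maximum for v in values]
print(normalized)                         # [0.266..., 1.0, 0.177..., 0.666...]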
Suppose you want to train a model that learns the effect of different marketing activities on revenue and the quantity sold. You could simply use the sum of the quantity sold and the revenue as the dependent variable. However, this can quickly lead to distorted results, for example if one product line sells many units at a relatively low unit price, while in a second product line it is the other way around: the products sell less often but have a high unit price.
A marketing campaign that leads, for example, to 100,000 additional products being sold is therefore rated worse in the product line with low unit prices than in the one with high unit prices. Similar problems arise in other fields, for example when looking at the private expenditure of individuals: for two different people, food expenditure of €200 can mean very different things when set in relation to their monthly income. Normalization therefore helps to bring the data onto a neutral and comparable basis.
What is Batch Normalization?
Batch normalization was introduced to mitigate the problem of internal covariate shift. For this purpose, batches are used, i.e. subsets of the data set with a fixed size that contain a random selection of the data for one training step. The main idea is that the activations of each layer are normalized within a batch so that they have a mean of 0 and a standard deviation of 1. This additional step reduces the shift in the activation distributions, so the model learns faster and converges better.
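The core of this computation can be sketched in a few lines of Python with NumPy. The batch shape, the example values and the small epsilon added for numerical stability are assumptions for illustration:

import numpy as np

# Activations of one layer for a mini-batch: 32 samples, 16 features (shapes are assumed).
activations = np.random.randn(32, 16) * 3.0 + 5.0

mean = activations.mean(axis=0)            # per-feature mean over the batch
var = activations.var(axis=0)              # per-feature variance over the batch
eps = 1e-5                                 # small constant for numerical stability
normalized = (activations - mean) / np.sqrt(var + eps)

print(normalized.mean(axis=0).round(6))    # approximately 0 for every feature
print(normalized.std(axis=0).round(6))     # approximately 1 for every feature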
How does Batch Normalization work?
When training neural networks, the entire data set is divided into so-called batches. Each batch contains a random selection of data points of a fixed size and is used for one training step. In most cases, a batch size of 32 or 64 is used, i.e. a batch contains 32 or 64 individual data points.
The data that arrives in the input layer of the network is already normalized during regular data preprocessing. This means that all numerical values have been brought to a uniform scale and a common distribution and are therefore comparable. In most cases, the data then has a mean value of 0 and a standard deviation of 1.
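A minimal sketch of this preprocessing step, assuming a purely synthetic NumPy data set and a batch size of 32, could look like this:

import numpy as np

# Synthetic input data: 1000 samples with 8 features (purely illustrative numbers).
X = np.random.rand(1000, 8) * 100.0

# Standardize each feature to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Shuffle and split into mini-batches of 32 samples for training.
batch_size = 32
indices = np.random.permutation(len(X))
batches = [X[indices[i:i + batch_size]] for i in range(0, len(X), batch_size)]

print(len(batches), batches[0].shape)      # 32 batches, the first of shape (32, 8)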

After the first layer, however, the values have already passed through a so-called activation function, which shifts their distribution and thus de-normalizes them again. An activation function is applied in every layer of the network, so that by the output layer the values are no longer normalized. In the technical literature, this process is also called "internal covariate shift" (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift).
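In deep learning frameworks, this repeated normalization is therefore typically inserted as a separate layer directly in front of the activation functions. The following minimal PyTorch sketch illustrates this placement; the layer sizes are illustrative assumptions and not a reference implementation from this article:

import torch
import torch.nn as nn

# A small feed-forward network in which the activations are re-normalized
# over the mini-batch before each activation function (layer sizes are illustrative).
model = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),   # normalize the layer's outputs before the ReLU
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

out = model(torch.randn(32, 8))   # one mini-batch of 32 samples with 8 features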
For this reason, the values are normalized again before each activation function during batch normalization. The following steps are carried out for this purpose.
1. Calculation of the Mean Value and the Variance per Mini-Batch
First, the mean value and the variance for the entire mini-batch are calculated for each activation in a specific layer. If