Backpropagation: Step-By-Step Derivation

In the previous article we discussed multi-layer perceptrons (MLPs), the first neural network model capable of solving problems that are not linearly separable.
For a long time it was not clear how to train these networks on a given data set. While single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution, this rule could not be extended to networks with more than one layer. The AI community struggled with this problem for decades (a period known as the "AI winter"), until in 1986 Rumelhart et al. introduced the backpropagation algorithm in their groundbreaking paper [1].
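As a quick reminder, one common formulation of that perceptron learning rule (the symbols here are chosen only for illustration: $\eta$ is the learning rate, $\mathbf{x}$ the input, $y$ the true label and $\hat{y}$ the prediction) updates the weights and bias after each misclassified example:

$$ \mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}, \qquad b \leftarrow b + \eta\,(y - \hat{y}) $$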
In this article we will discuss the backpropagation algorithm in detail and derive its mathematical formulation step by step. Since this is the main algorithm used to train neural networks of all kinds (including the deep networks we have today), I believe anyone working with neural networks will benefit from knowing the details of this algorithm.
Although you can find descriptions of this algorithm in many textbooks and online sources, in writing this article I have tried to keep the following principles in mind:
- Use clear and consistent notation.
- Explain every step of the mathematical derivation.
- Derive the algorithm for the most general case, i.e., for networks with any number of layers and any activation or loss functions.
After deriving the backpropagation equations, complete pseudocode for the algorithm is given and then illustrated with a numerical example.
Before reading the article, I recommend that you refresh your calculus knowledge, specifically derivatives (including partial derivatives and the chain rule).
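In particular, the derivation makes repeated use of the multivariate chain rule. As a quick refresher (with symbols chosen here just for illustration): if $z$ depends on intermediate variables $y_1, \dots, y_n$, each of which depends on $x$, then

$$ \frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x} $$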
Now grab a cup of coffee and let's dive in!