Diffusion Loss: Every Step Explained

We are in the midst of exciting times for image generation with diffusion models, so in this post I will focus on the mathematics and the hidden intuition behind the Stable Diffusion model.
This post aims to decipher how the loss function boils down to a rather simple squared-difference term between the true and the predicted noise in an image, as shown in the Denoising Diffusion Probabilistic Models (DDPM) paper¹.
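As a preview of where we are headed, the simplified training objective from the DDPM paper¹ can be written as below (previewing the paper's notation: ε is the true noise added to the image, ε_θ is the noise predicted by the network, and x_t is the noisy image at step t, defined in the next section):

L_simple(θ) = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t)‖² ]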
If you want to review the concepts and steps behind the Evidence Lower Bound (ELBO) in the context of variational Bayesian methods, please check my previous detailed posts on Latent Variable Models and Probabilistic PCA.
Without any delay, let's begin!
Notations & Definitions:
Let's start with a few notations that we will use repeatedly.
x_0: This denotes the image at time step 0, i.e. the original image at the start of the process. Sometimes it also refers to the image recovered in the final step of the denoising process.
x_T: This is the image at the final time step. At this point, the image is essentially isotropic Gaussian noise.
Forward Process: A multi-step process in which, at each step, the input image is corrupted with a small amount of Gaussian noise. The noisy versions x_1, x_2, …, x_T obtained at each time step are generated via a Markovian process.
q(x_t | x_{t−1}) ≡ Forward process; given the image at time step t−1, it returns the (noisier) image at time step t.
Markov Chain (MC): Refers to a stochastic process ('memoryless') describing transitions between different states, where the probability of transitioning to any particular state depends solely on the current state and the time elapsed. A probabilistic definition is given below:

P(X_n = x_n | X_{n−1} = x_{n−1}, …, X_1 = x_1) = P(X_n = x_n | X_{n−1} = x_{n−1})

For any positive integer n and possible states x_0, x_1, x_2, … of the random variables X_1, X_2, X_3, …, the Markov property states that the probability of the state at step n (x_n) depends solely on the state just before it, at step n−1.
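To make the memoryless property concrete, here is a small toy sketch (my own illustrative example, not from the references): a two-state chain in which the next state is sampled using only the current state and a fixed transition matrix, never the earlier history.

```python
import numpy as np

# Hypothetical two-state Markov chain: state 0 = "clean-ish", state 1 = "noisy-ish".
# Row i holds P(next state | current state = i); each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

rng = np.random.default_rng(seed=0)
state = 0                      # X_1
trajectory = [state]
for n in range(2, 11):         # sample X_2, ..., X_10
    # Markov property: the next state depends only on the current state.
    state = rng.choice(2, p=P[state])
    trajectory.append(state)

print(trajectory)
```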
Forward Process & MC:
Formally, given a data distribution x_0 ∼ q(x_0), the forward Markov process generates a sequence of random variables x_1, x_2, …, x_T with a transition kernel q(x_t | x_{t−1}). Using the chain rule of probability and the Markov property, we can write:

q(x_1, x_2, …, x_T | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1}),  where  q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)
Here, each β_t is a number between 0 and 1, i.e. β_t ∈ (0, 1). In the DDPM paper (ref. 1) the noise schedule was linear. This was shown to work well for high-resolution images, but in ref. 3 the authors proposed an improved noise schedule³ that also works for lower resolutions such as 32×32. In general, β_1 < β_2 < β_3 < … < β_T, and as we move through the time steps the mean of the new Gaussian gets closer and closer to 0. Thus q(x_T | x_0) ≈ N(x_T; 0, I), i.e. x_T is approximately isotropic Gaussian noise.
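To tie the pieces together, below is a minimal sketch of the forward process under a linear schedule (T = 1000 with β ranging from 1e-4 to 0.02, the values used in the DDPM paper¹; the random 32×32 "image" is just a stand-in). Each step draws x_t from N(√(1 − β_t) x_{t−1}, β_t I), and after T steps the result is close to isotropic Gaussian noise.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule: beta_1 < beta_2 < ... < beta_T

rng = np.random.default_rng(seed=0)
x = rng.random((32, 32, 3))             # x_0: a stand-in for a normalized image

for beta_t in betas:
    eps = rng.standard_normal(x.shape)  # fresh Gaussian noise for this step
    # One forward step: x_t ~ N( sqrt(1 - beta_t) * x_{t-1}, beta_t * I )
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps

# After T steps, x_T is approximately isotropic Gaussian noise:
print(round(x.mean(), 3), round(x.std(), 3))   # roughly 0 and 1
```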