Gaussian Processes from Scratch



Introduction

Gaussian Processes (GPs) are an incredible class of models. Very few Machine Learning algorithms give you an accurate measure of uncertainty for free while still being super flexible. The problem is that GPs are conceptually really difficult to understand, and most explanations rely on fairly heavy algebra and probability, which rarely helps build an intuition for how these models work.

There are also many great guides that skip the maths and give you the intuition for how these models work, but when it comes to using GPs yourself, in the right context, my personal belief is that surface knowledge won't cut it. This is why I want to walk through a bare-bones implementation, from scratch, so that you get a clearer picture of what's going on under the hood of all the libraries that implement these models for you.

I also link my GitHub repo, where you'll find the implementation of GPs using only NumPy. I've tried to abstract away from the maths as much as possible, but some of it is still required…

Data

The first step is always to have a look at the data. We are going to use the monthly atmospheric CO2 concentration over time, measured at the Mauna Loa observatory, a common dataset for GPs [1]. This is intentionally the same dataset that sklearn uses in its GP tutorial, which teaches how to use their API rather than what is going on under the hood of the model.

Monthly CO2 parts per million (ppm) at the Mauna Loa observatory. (Image by Author)

This is a very simple dataset, which will make it easier to explain the maths that will follow. The notable features are the linear upwards trend as well as the seasonal trend, with a period of one year.

What we will do is separate the seasonal and linear components of the data. To do this, we first fit a linear model to the data, as sketched below.
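A minimal sketch of this step, assuming the dataset has already been loaded into NumPy arrays `X` (the dates as decimal years) and `y` (the CO2 concentration in ppm):

```python
import numpy as np

# Assumed inputs: X = decimal years, shape (n,); y = CO2 concentration in ppm, shape (n,)
# Fit y ≈ a * X + b with ordinary least squares.
A = np.stack([X, np.ones_like(X)], axis=1)        # design matrix with an intercept column
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

linear_trend = a * X + b          # the linear component
residuals = y - linear_trend      # the seasonal component, which we will model with the GP
```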

The data overlaid with the least squares fit. (Image by Author)

We can then plot the residuals (the targets minus our model's predictions): this is the periodic information.

Residuals of the data and the least squares fit. (Image by Author)

This is the data we will model with a GP.

The Gaussian Process

The Gaussian Process is a multivariate Gaussian distribution, where each data point is a "dimension". This means that adding a data point to your dataset means adding a dimension to the distribution: your model becomes more complex as you add more data! This is different to most supervised models such as linear regression, where the complexity scales with the number of features.

Depending on how you define the correlation between the dimensions (=data points), you can have a different looking model.

Kernels

I am skipping and simplifying a lot here. An important concept to understand for GPs is the kernel. For all purposes here: a kernel produces symmetric, positive semi-definite (PSD) matrices that look like correlation matrices, but can quantify relationships in the data beyond simple linear correlations. We can use them as the covariance matrix of a multivariate Gaussian.

The great thing you can do with kernels is add and multiply them, as the kernel properties are preserved under those operations. You can give your kernel different properties based on how you construct it.

We are going to use a combination of 3 kernels for this example:

  • A periodic distance kernel: encodes periodic relationships between data points, damped by the distance between them.
  • A distance kernel: data points that are close in squared distance get a high correlation value.
  • A noise kernel applied to the diagonal: this is the noise we assume each observation to have.

In a regular GP implementation, the kernel hyperparameters (length scales, periods, amplitudes) would be optimised by some routine; here I have hand-picked them. A sketch of the combined kernel is shown below.
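A minimal NumPy sketch of these three kernels and their combination, assuming inputs in decimal years; the function names and the hand-picked hyperparameter values are illustrative, not the exact ones from the repo:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=10.0, amplitude=1.0):
    """Squared-exponential (distance) kernel: points close in squared distance correlate highly."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0, amplitude=1.0):
    """Exp-sine-squared kernel: encodes repeating structure with the given period (1 year here)."""
    dist = np.abs(x1[:, None] - x2[None, :])
    return amplitude**2 * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / length_scale**2)

def noise_kernel(x1, x2, noise=0.1):
    """White-noise kernel: adds observation noise on the diagonal only."""
    if x1.shape == x2.shape and np.allclose(x1, x2):
        return noise**2 * np.eye(len(x1))
    return np.zeros((len(x1), len(x2)))

def kernel(x1, x2):
    """Periodic component damped by distance, plus a smooth distance term, plus noise."""
    return (periodic_kernel(x1, x2, period=1.0) * rbf_kernel(x1, x2, length_scale=50.0)
            + rbf_kernel(x1, x2, length_scale=2.0, amplitude=0.5)
            + noise_kernel(x1, x2))
```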

The GP Prior

We can then draw samples using this kernel, a mean of zero (for now), and the years we are interested in; this is commonly called the "prior". Notice how we are only using input data here, not the targets.
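A sketch of drawing prior samples, reusing the `kernel` function and the `X` array assumed above:

```python
# Years we want to evaluate the GP at: the observed range plus a few years of extrapolation
X_pred = np.linspace(X.min(), X.max() + 5, 500)

prior_mean = np.zeros(len(X_pred))
prior_cov = kernel(X_pred, X_pred)

# Each sample is one function drawn from N(prior_mean, prior_cov)
rng = np.random.default_rng(0)
prior_samples = rng.multivariate_normal(prior_mean, prior_cov, size=5)
```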

Samples from the Gaussian Process Prior. (Image by Author)

We can see that the samples we have drawn look quite a lot like the residuals we had earlier. This confirms we have made an appropriate choice for our kernel function.

The GP Posterior

We can now condition the GP on the data; this is called the GP posterior. To get the posterior mean and covariance, we use the following formulas, where $X$ are the training inputs with respective targets $\mathbf{y}$, and $X_*$ are the points we want to make predictions on:

$$\mu_* = K(X_*, X)\,K(X, X)^{-1}\,\mathbf{y}$$

$$\Sigma_* = K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}\,K(X, X_*)$$

Every time we want to make new predictions, we need to recalculate the covariance kernel and the mean vector.

You can find the derivations in textbooks such as [2] (or derive them yourself, which is a good test of your linear algebra skills).
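Translated into NumPy, continuing the sketch with the assumed `kernel`, `X`, `X_pred`, and `residuals` from earlier:

```python
# Kernel blocks between training inputs X (targets = residuals) and prediction inputs X_pred
K = kernel(X, X)                  # train-train covariance (the noise term sits on its diagonal)
K_s = kernel(X_pred, X)           # prediction-train cross-covariance
K_ss = kernel(X_pred, X_pred)     # prediction-prediction covariance

# Explicit inverse to mirror the formulas; a Cholesky solve is cheaper and more stable in practice
K_inv = np.linalg.inv(K)

post_mean = K_s @ K_inv @ residuals
post_cov = K_ss - K_s @ K_inv @ K_s.T
```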

A note on computational complexity

You'll notice that to compute the posterior mean and covariance, we need to invert the training-data kernel. Matrix inversion has an O(N³) cost in the number of training points, so the computation quickly becomes expensive as you add data. This is the main problem with GPs and why they are rarely used in Big Data applications (although there are approximations that work around this [2]).


Sampling from our conditioned GP:
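A minimal sketch, reusing `rng`, `post_mean`, `post_cov`, and `X_pred` from the earlier snippets:

```python
# Draw functions from the posterior N(post_mean, post_cov)
post_samples = rng.multivariate_normal(post_mean, post_cov, size=5)
```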

Samples from the Gaussian Process Posterior. (Image by Author)

We can see that, unlike in the prior, the samples are all very similar to each other, particularly near 2022, close to the training data. Adding back the linear fit we made earlier:
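In the running sketch, this is just the least squares line evaluated at the prediction inputs:

```python
# Shift the posterior samples by the linear trend fitted at the start
full_samples = post_samples + (a * X_pred + b)
```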

Samples from the Gaussian Process Posterior added to the least squares fit. (Image by Author)

Drawing samples from the GP is useful, but what is usually done is simply taking the posterior mean as the prediction. We can then use the diagonal of the posterior covariance as the variance, taking its square root to get the standard deviation.
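Continuing the sketch with the quantities defined above:

```python
# Posterior mean is the prediction; the diagonal of the covariance gives the variance.
# Clip tiny negative values that can appear from numerical error before taking the square root.
std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

prediction = post_mean + (a * X_pred + b)   # GP mean plus the linear trend
upper = prediction + 2 * std                # roughly a 95% band
lower = prediction - 2 * std
```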

Mean and standard deviation of the Gaussian Process. (Image by Author)

As we expect, the further from the real data we get, the higher the uncertainty. Putting everything together:

Data and mean Gaussian Process extrapolation. (Image by Author)

Conclusion

GPs are a powerful addition to a Machine Learning toolbox, especially in low-data regimes. They are particularly useful when an estimate of the uncertainty is needed.

The main point I have tried to show here is the importance of choosing the correct kernel for the job. Even this simple dataset required a combination of 3 kernels; for more complex datasets, it could be many more. This requires good knowledge of the data and the ability to translate that knowledge into inductive biases.

References

[1] OpenML, mauna-loa-atmospheric-co2, https://www.openml.org/search?type=data&status=active&id=41187

[2] Carl Edward Rasmussen and Christopher K. I. Williams, Gaussian Processes for Machine Learning, https://gaussianprocess.org/gpml/

Tags: Data Science Education Gaussian Process Machine Learning
