Theoretical Deep Dive Into Linear Regression
EXPLAINABLE AI
Most aspiring data science bloggers do it: write an introductory article about linear regression – and it is a natural choice since this is one of the first models we learn when entering the field. While these articles are great for beginners, most do not go deep enough to satisfy senior data scientists.
So, let me guide you through some unsung, yet refreshing details about linear regression that will make you a better data scientist (and give you bonus points during interviews).
This article is quite math-heavy, so to follow along, it helps to have a solid foundation in probability and calculus.
The Data Generation Process
I'm a big fan of thinking about the data generation process when modeling. People who have dealt with Bayesian modeling know what I mean; for everyone else: imagine you have a dataset (X, y) consisting of samples (x, y). Given x, how do we get to a target y?
Let us assume that we have n data points and that each x has k components/features.
For a linear model with the parameters w₁, …, wₖ (coefficients), b (intercept), and σ (noise), the assumption is that the data generation process looks like this:
- Compute µ = w₁x₁ + w₂x₂ + … + wₖxₖ + b.
- Roll a random y ~ N(µ, σ²). This is independent of other randomly generated numbers. Alternatively: Roll ε ~ N(0, σ²) and output y = µ + ε.
That's it already. These two simple lines are equivalent to the most important linear regression assumptions that people like to explain at great length, namely linearity, homoscedasticity, and independence of errors.
From step 1 of the process, you can also see that we model the expectation µ with the typical linear equation w₁x₁ + w₂x₂ + … + wₖxₖ + b rather than the actual target. We know that we will not hit the target anyway, so we settle for the mean of the distribution that generates y instead.
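To make this concrete, here is a minimal sketch of the generation process in NumPy; the feature matrix, coefficients, intercept, and noise level are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 1000, 3                        # number of samples and features (illustrative)
X = rng.normal(size=(n, k))           # some feature matrix
w_true = np.array([1.5, -2.0, 0.5])   # hypothetical coefficients w_1, ..., w_k
b_true = 0.7                          # hypothetical intercept b
sigma = 0.3                           # noise standard deviation

mu = X @ w_true + b_true              # step 1: compute the expectation mu
y = rng.normal(loc=mu, scale=sigma)   # step 2: roll y ~ N(mu, sigma^2), independently per sample
```

Every model discussed below is just a different answer to how µ and the distribution around it are chosen.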
Extensions
Generalized Linear Models. We are not forced to use a normal distribution for the generation process. If we deal with a dataset that only contains non-negative integer targets (counts), it might be beneficial to assume that a Poisson distribution Poi(exp(µ)) is used instead of a normal distribution; the exponential ensures that the rate is positive. This gives you Poisson regression.
If our dataset only has the targets 0 and 1, use a Bernoulli distribution Ber(p), where p = sigmoid(µ), et voilà: you got logistic regression.
Only integer targets among 0, 1, …, n? Use a binomial distribution to get binomial regression.
The list goes on and on. Long story short:
Think about which distribution could have generated the labels you observe in the data.
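In scikit-learn, for example, each of these choices corresponds to a different estimator. The snippet below is only a sketch that reuses the X and y from the generation sketch above; y_counts and y_binary are placeholders for targets that actually follow the respective distributions.

```python
from sklearn.linear_model import LinearRegression, PoissonRegressor, LogisticRegression

# Normal distribution around mu -> ordinary linear regression
linreg = LinearRegression().fit(X, y)

# Poisson distribution with rate exp(mu) -> Poisson regression (non-negative count targets)
# poisson = PoissonRegressor().fit(X, y_counts)

# Bernoulli distribution with p = sigmoid(mu) -> logistic regression (0/1 targets)
# logreg = LogisticRegression().fit(X, y_binary)
```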
What Are We Actually Minimizing?
Ok, so we have decided on a model. How do we train it, i.e. how do we learn the parameters? Of course, you know: we minimize the (mean) squared error. But why?
The secret is that you just do a plain maximum likelihood estimation using the generation process we described before. The labels that we observe are y₁, y₂, …, yₙ, all of them independently generated via a normal distribution with means µ₁, µ₂, …, µₙ. What is the likelihood of seeing these y's? It is:
$$p(y \mid w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{\left(y_i - \mu_i\right)^2}{2\sigma^2}\right).$$
We now want to find the parameters (that are hidden in the µᵢ's) that maximize this term. This is equivalent to minimizing the mean squared error, as you can see.
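To see the equivalence explicitly, take the negative logarithm of the likelihood; everything that does not depend on the parameters is a constant:

$$-\ln p(y \mid w) = \frac{n}{2}\ln\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mu_i\right)^2.$$

Since σ is a fixed constant here, maximizing the likelihood is the same as minimizing Σᵢ (yᵢ − µᵢ)², i.e. the (mean) squared error.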
Extensions
Unequal Variances. In fact, σ does not have to be constant. You can have a different σᵢ for each observation in your dataset. Then, you would minimize
$$\sum_{i=1}^{n}\frac{\left(y_i - \mu_i\right)^2}{\sigma_i^2} = \sum_{i=1}^{n} s_i\left(y_i - \mu_i\right)^2, \qquad s_i = \frac{1}{\sigma_i^2},$$
instead, which is least squares with sample weights s. Modeling libraries typically allow you to set these weights. In scikit-learn, for example, you can set the sample_weight keyword in the fit function.
This way, you can put more emphasis on certain observations by increasing the corresponding sᵢ. This is equivalent to decreasing the variance σᵢ², i.e. you are more confident that the error for this observation is smaller. This method is also called weighted least squares.
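As a sketch of how this looks in scikit-learn: the per-observation noise levels σᵢ below are made up, and X and y are reused from the sketch above; in practice you would plug in whatever variance estimates you have.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical per-observation noise levels sigma_i
sigma_i = np.linspace(0.2, 1.0, num=len(y))

# sample weights s_i = 1 / sigma_i^2: low assumed variance -> high weight
sample_weights = 1.0 / sigma_i**2

# weighted least squares via the sample_weight keyword of fit
model = LinearRegression()
model.fit(X, y, sample_weight=sample_weights)
```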
Variances Depending on The Input. You can even say that the variance is also dependent on the input x. In this case, you get the interesting loss function that is also called variance attenuation:
$$\sum_{i=1}^{n}\left(\frac{\left(y_i - \mu_i\right)^2}{2\,\sigma(x_i)^2} + \frac{1}{2}\ln\sigma(x_i)^2\right)$$
The whole derivation process is outlined here:
Get Uncertainty Estimates in Regression Neural Networks for Free
Regularization. Instead of only maximizing the likelihood of the observed labels y₁, y₂, …, yₙ, you can take a Bayesian standpoint and maximize the a posteriori likelihood
$$p(w \mid y) = \frac{p(y \mid w)\,p(w)}{p(y)}.$$
Here, p(y | w) is the likelihood function from above. We have to decide on a probability density for p(w), a so-called prior or a prior distribution. If we say that the parameters are independently normally distributed around 0, i.e. wᵢ ~ N(0, ν²), then we end up with L2 regularization, i.e. ridge regression. For a Laplace distribution, we recover L1 regularization, i.e. LASSO.
Why is that? Let's use the normal distribution as an example. We have
$$p(w) = \prod_{j=1}^{k} \frac{1}{\sqrt{2\pi\nu^2}}\,\exp\left(-\frac{w_j^2}{2\nu^2}\right),$$
so together with our formula for p(y | w) from above, we have to maximize
$$p(y \mid w)\,p(w) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mu_i\right)^2\right)\cdot\exp\left(-\frac{1}{2\nu^2}\sum_{j=1}^{k} w_j^2\right),$$
which means that we have to minimize the mean squared error plus a regularization hyperparameter (proportional to σ²/ν²) times the squared L2 norm of w.
Note that we dropped the denominator p(y) from the Bayes formula since it does not depend on w, so we can ignore it for optimization.
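In scikit-learn, these two priors correspond to the Ridge and Lasso estimators; their alpha parameter controls the regularization strength, which for Ridge roughly plays the role of σ²/ν² from the derivation above. A sketch with arbitrary alpha values, reusing X and y from before:

```python
from sklearn.linear_model import Ridge, Lasso

# L2 regularization, i.e. a normal prior on the weights
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 regularization, i.e. a Laplace prior on the weights
lasso = Lasso(alpha=0.1).fit(X, y)
```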
You can use any other prior distribution for your parameters to create more interesting regularizations. You can even say that your parameters w are normally distributed around 0 but correlated, with some covariance matrix Σ.
Let us assume that Σ is positive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no density p(w).
If you do the math, you will find out that we then have to optimize
$$\|Xw - y\|_2^2 + \|\Gamma w\|_2^2$$
for some matrix Γ. Note: Γ is invertible and we have Σ⁻¹ = ΓᵀΓ. This is also called Tikhonov regularization.
Hint: start with the fact that
$$p(w) \propto \exp\left(-\frac{1}{2}\,w^\top \Sigma^{-1} w\right)$$
and remember that positive-definite matrices can be decomposed into a product of some invertible matrix and its transpose.
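Spelling the hint out: with the decomposition Σ⁻¹ = ΓᵀΓ, the exponent of the prior becomes

$$w^\top \Sigma^{-1} w = w^\top \Gamma^\top \Gamma\, w = \|\Gamma w\|_2^2,$$

so the log-prior contributes exactly the penalty term ‖Γw‖² from the objective above.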
Minimize The Loss Function
Great, so we defined our model and know what we want to optimize. But how can we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let's find out.
Ordinary Least Squares
Let us assume that we do not regularize and don't use sample weights. Then, the MSE can be written as
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mu_i\right)^2.$$
This is quite abstract, so let us write it differently as
$$\text{MSE}(w) = \frac{1}{n}\,\|y - Xw\|_2^2 = \frac{1}{n}\,(y - Xw)^\top (y - Xw).$$
Using matrix calculus, you can take the derivative of this function with respect to w (we assume that the bias term b is included in w, which can be done by appending a constant column of ones to X).
$$\nabla_w\,\text{MSE}(w) = \frac{2}{n}\left(X^\top X\,w - X^\top y\right).$$
If you set this gradient to zero, you end up with
$$X^\top X\,w = X^\top y.$$
If the (n × k)-matrix X has a rank of k, so does the (k × k)-matrix XᵀX, i.e. it is invertible. Why? It follows from rank(X) = rank(XᵀX).
In this case, we get the unique solution
$$w = \left(X^\top X\right)^{-1} X^\top y.$$
Note: Software packages do not compute the solution via this explicit matrix inverse; they rely on matrix decompositions (such as QR or SVD) or iterative techniques like gradient descent, which are faster and numerically more stable. Still, the formula is nice and gives us some high-level insights about the problem.
But is this really a minimum? We can find out by computing the Hessian, which is XᵀX. The matrix is positive-semidefinite since wᵀXᵀXw = |Xw|² ≥ 0 for any w. It is even strictly positive-definite since XᵀX is invertible, i.e. 0 is not an eigenvalue, so our optimal w indeed minimizes our problem.
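As a sanity check, here is a sketch that solves the normal equations directly on made-up data and compares the result with scikit-learn; a column of ones is appended to X so that the intercept b is part of w.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.3, size=500)

# append a column of ones so that the intercept is one of the learned weights
X_b = np.hstack([X, np.ones((len(X), 1))])

# solve the normal equations X^T X w = X^T y (more stable than forming the inverse)
w_closed = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

model = LinearRegression().fit(X, y)
print(w_closed)                        # coefficients followed by the intercept
print(model.coef_, model.intercept_)   # should agree up to numerical precision
```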
Perfect Multicollinearity
That was the friendly case. But what happens if X has a rank smaller than k? This might happen if we have two features in our dataset where one is a multiple of the other, e.g. we use the features height (in m) and height (in cm) in our dataset. Then we have height (in cm) = 100 * height (in m).
It can also happen if we one-hot encode categorical data and do not drop one of the columns. For example, if we have a feature color in our dataset that can be red, green, or blue, then we can one-hot encode it and end up with three columns color_red, color_green, and color_blue. For these features, we have color_red + color_green + color_blue = 1, which induces perfect multicollinearity as well.
In these cases, the rank of XᵀX is also smaller than k, so this matrix is not invertible.
End of story.
Or not? Actually, no, because it can mean two things: the system (XᵀX)w = Xᵀy has
- no solution or
- infinitely many solutions.
It turns out that in our case, we can obtain one solution using the Moore-Penrose inverse. This means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.
If we denote the Moore-Penrose inverse of A by A⁺, we can solve the linear system of equations as
$$w = \left(X^\top X\right)^{+} X^\top y.$$
To get the other infinitely many solutions, just add any vector from the null space of XᵀX to this specific solution.
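Here is a sketch of the height example with NumPy's pseudoinverse; the data is made up, and the two columns are perfectly collinear by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
height_m = rng.uniform(1.5, 2.0, size=200)

X = np.column_stack([height_m, 100 * height_m])       # height in m and in cm
y = 10 * height_m + rng.normal(scale=0.5, size=200)

# X^T X has rank 1, so it is not invertible; the Moore-Penrose inverse still gives a solution
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)   # one of infinitely many weight vectors, all with the same training MSE
```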
Minimization With Tikhonov Regularization
Recall that we could add a prior distribution to our weights. We then had to minimize
$$\|Xw - y\|_2^2 + \|\Gamma w\|_2^2$$
for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to w and setting the result to zero, the solution is
$$w = \left(X^\top X + \Gamma^\top \Gamma\right)^{-1} X^\top y.$$
The neat part:
XᵀX + ΓᵀΓ is always invertible!
Let us find out why. It suffices to show that the null space of XᵀX + ΓᵀΓ is only {0}. So, let us take a w with (XᵀX + ΓᵀΓ)w = 0. Now, our goal is to show that w = 0.
From (XᵀX + ΓᵀΓ)w = 0 it follows that
$$0 = w^\top\left(X^\top X + \Gamma^\top \Gamma\right)w = \|Xw\|_2^2 + \|\Gamma w\|_2^2 \geq \|\Gamma w\|_2^2 \geq 0,$$
which in turn implies |Γw| = 0 → Γw = 0. Since Γ is invertible, w has to be 0. Using the same calculation, we can see that the Hessian is also positive-definite.
Nice, so Tikhonov regularization automatically helps make the solution unique! Since ridge regression is a special case of Tikhonov regularization (for Γ = λIₖ, where Iₖ is the k-dimensional identity matrix), the same holds there.
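For ridge regression, this closed form is easy to verify against scikit-learn. The sketch below uses made-up data and an arbitrary alpha, and fits without an intercept, since Ridge does not penalize the intercept while the formula above treats all weights equally.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=300)

alpha = 1.0
k = X.shape[1]

# closed-form Tikhonov solution with Gamma^T Gamma = alpha * I_k
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

# Ridge minimizes ||Xw - y||^2 + alpha * ||w||^2
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(w_closed)
print(model.coef_)   # should agree up to numerical precision
```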
Adding Sample Weights
As a last point, let us also add sample weights to the Tikhonov regularization. Adding sample weights is equivalent to minimizing
$$\|S\,(Xw - y)\|_2^2 + \|\Gamma w\|_2^2$$
for some diagonal matrix S with positive diagonal entries sᵢ. Minimizing it is as straightforward as in the case of ordinary least squares. The result is
$$w = \left(X^\top S^2 X + \Gamma^\top \Gamma\right)^{-1} X^\top S^2\, y.$$
Note: The Hessian is also positive-definite.
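In code, the weighted Tikhonov solution is a single line of linear algebra. The helper below is a sketch following the formula above, with the diagonal of S passed as a vector; the function name and conventions are mine, not from any library.

```python
import numpy as np

def weighted_tikhonov(X, y, s, Gamma):
    """Minimize ||S (Xw - y)||^2 + ||Gamma w||^2 with S = diag(s)."""
    S2 = np.diag(s**2)                     # S^2 as a dense diagonal matrix
    A = X.T @ S2 @ X + Gamma.T @ Gamma
    b = X.T @ S2 @ y
    return np.linalg.solve(A, b)
```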
Homework for you
Assume that for the Tikhonov regularization, we do not impose that the weights should be centered around 0, but around some other point w₀. Show that the optimization problem becomes
$$\|S\,(Xw - y)\|_2^2 + \|\Gamma\,(w - w_0)\|_2^2$$
and that the solution is
$$w = \left(X^\top S^2 X + \Gamma^\top \Gamma\right)^{-1}\left(X^\top S^2\, y + \Gamma^\top \Gamma\, w_0\right).$$
This is the most general form of Tikhonov regularization. Some people prefer to define P := S² and Q := ΓᵀΓ, as done here.
Conclusion
In this article, I took you on a journey through several advanced aspects of linear regression. By adopting a generative view, we saw that generalized linear models differ from the normal linear model only in the type of distribution used to sample the target y.
Then we saw that minimizing the mean squared error is equivalent to maximizing the likelihood of the observed values. If we impose a normal prior distribution on the learnable parameters, we end up with Tikhonov regularization (and L2 regularization as a special case). We can also use other prior distributions such as the Laplace distribution, but then there are no closed-form solutions anymore. Still, convex optimization approaches let you find the best parameters.
As a last step, we derived direct solution formulas for each minimization problem considered. These formulas are usually not used in practice for large datasets, but they tell us when the solution is unique; in particular, Tikhonov regularization always yields a unique solution. And we also learned to do some matrix calculus along the way.