Which Regression technique should you use?
When you work in a certain field for long enough, there are some classes, concepts, lessons, and teachers that you will always remember.
For example, my mom is a teacher and she remembers the substitute teacher who made her fall in love with philosophy. My Tae Kwon Do master will always remember his first Tae Kwon Do class, when he was only a kid, and the excitement that mounted inside him.
I am a Machine Learning Engineer. Professionally speaking, Machine Learning is the thing I love the most and it's probably the subject that I know best.
A class that I will always remember is the one where my first Machine Learning professor, during my bachelor's degree, described the difference between classification and regression. An example of a classification task is identifying whether an email is spam or not given its text. An example of a regression task is predicting the price of a house based on its features (e.g. size and location).
We define a set of features as a matrix (table) X with n rows and k columns. In both the classification and regression tasks, the output is a vector y that has n entries (the same as the number of rows of X). The difference is that in classification tasks, the entries of y are integers that act as class labels. Referring to the previous example, y_1=0 means that the first email is not spam and y_1=1 means that the first email is spam. In regression tasks, the entries of y are real numbers. Referring to the house price example, y_1 = 123780 means that the price we aim to predict for house number 1 is 123780.
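To make the shapes concrete, here is a minimal sketch in NumPy. The sizes, the seed, and the price range are arbitrary assumptions chosen for illustration, and the values are random rather than taken from any real dataset.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (arbitrary choice)
rng = np.random.default_rng(0)

# n rows (samples) and k columns (features); illustrative sizes only
n, k = 5, 3

# Feature matrix X with shape (n, k)
X = rng.normal(size=(n, k))

# Classification target: integer labels, e.g. 0 = not spam, 1 = spam
y_class = rng.integers(0, 2, size=n)

# Regression target: real numbers, e.g. house prices (made-up range)
y_reg = rng.uniform(50_000, 500_000, size=n)

# Both targets have one entry per row of X
print(X.shape, y_class.shape, y_reg.shape)
print(y_class.dtype, y_reg.dtype)
```

In both cases y has exactly n entries; the only thing that changes is whether those entries are integer labels or real values.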
Now, regression tasks can be approached in multiple ways. Actually, in A LOT of ways… maybe too many to handle. When there are so many methods to solve a single problem, choosing the right one can be very hard. With the explosion of AI and clickbait titles, choosing the best method has become increasingly difficult, as many articles and papers claim to have the ultimate solution (buy my course to get the code!!!!) for every single regression problem.
The truth is that every dataset calls for its own algorithm, depending on the specific properties of the data and the specific requirements that we want to meet.
This blog post aims to be a user-friendly guide to choosing the best regression technique based on:
- The linearity (or polynomial linearity) of the dataset
- The complexity of the dataset
- The dimensionality of the dataset (number of columns)
- The need for a probabilistic output
For the sake of this study, we will only consider traditional machine learning methods (no Neural Networks) as we want to mainly focus on small synthetic datasets.
Are you ready to rock'n'roll?