Sklearn Tutorial: Module 1
After years of playing with the Python scientific stack (NumPy, Matplotlib, SciPy, Pandas, and Seaborn), I realized the obvious next step was scikit-learn, or "sklearn".

But why sklearn?
Among the ML libraries, scikit-learn is arguably the simplest and easiest framework for learning ML. It is based on the scientific stack (mostly NumPy), focuses on traditional yet powerful algorithms like linear regression, support vector machines, and dimensionality reduction, and provides lots of tools to build around those algorithms (like model evaluation and selection, hyperparameter optimization, data preprocessing, and feature selection).
But its main advantage is, without a doubt, its documentation and user guide. You can literally learn almost everything just from the scikit-learn website, with lots of examples.
Note that other popular frameworks are TensorFlow and PyTorch, but they have steeper learning curves and focus on more complex subjects like computer vision and neural networks. Since this is my first real contact with ML, I figured I'd start with sklearn.
I already started reading the documentation a few months ago but got kinda lost given its size. While the documentation is huge and very well written, I am not sure the best way to learn scikit-learn is to read the whole documentation one page after another.
The good news, and the thing that triggered my intent to learn scikit-learn further, was the start of the "official" MOOC of scikit-learn, created by the actual team of scikit-learn.
In this series, I will try to summarize what I learned from each of the 6 modules that compose the MOOC. This is an excellent exercise for me to practice my memory and summarize what I learned, and a good introduction for you if you want to get started with sklearn.
Note that the MOOC is free, so if you like what you read below, you should definitely subscribe! Note that these posts are my curated vision of the MOOC, which is itself just an introduction to scikit-learn.
Module 1: Machine Learning Concepts
This first module focuses on introducing the following notions:
- splitting data into train/test sets
- column selectors/transformers
- model, pipeline, and the estimator API with the .fit(), .transform(), .predict(), and .score() methods
- cross-validation
So our program for today is to review those concepts with mostly words and very little code. If you want to go further, I strongly recommend heading to the documentation.
Splitting data into train set and test set
One of the most important best practices in ML is to split the data into a train set and a test set. The idea is that, given fixed-size input data, we are going to train the model with a subpart of the whole data – the train set – and test its performance on the other part – the test set.
This method is very important for many reasons: the whole point of ML and models is to be able to guess the output from new, unseen data. If we use the whole data to train the model, we have no other choice than to use the same data to test its performance. Obviously, this seems like a biased exercise: of course, the model would be able to guess the outputs based on inputs it has already seen, all the more since it also had access to the corresponding outputs. This concept is also known as generalization versus memorization: we want the model to generalize (extrapolate outputs for new input data), not just memorize the data it was trained with.
Another reason (actually another way to put this) is to avoid overfitting. Overfitting is a very important concept in ML and will be studied more in another module. For now, let's just say that a model "overfits" when it learns the data it was trained with too precisely. Having a test set, different from the train set, lets us check that the performance of the model is about the same on the train set as on the test set.
To do this, a simple yet important function provided by scikit-learn is train_test_split:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data,            # the input array that contains all the input features
    target,          # the target array that contains the truth
    test_size=0.25,  # the fraction of the data assigned to the test set (defaults to 0.25)
    shuffle=True,    # whether the data should be randomly split (defaults to True)
    random_state=42, # optional parameter to make the split reproducible
)
The idea then is to use data_train and target_train to train our model, and to test its performance on unseen data using data_test and target_test.
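To make the shapes concrete, here is a minimal, self-contained sketch (the random data is made up purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 samples, 3 features
data = np.random.rand(100, 3)
target = np.random.rand(100)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.25, random_state=42
)
print(data_train.shape, data_test.shape)  # (75, 3) (25, 3)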
Column transformer/preprocessor
Often, the raw input data is not very well formatted and needs some preprocessing before it can go into a typical model. For example, if the input data contains a categorical column stored as strings, and the model only accepts numerical inputs, we need to convert this string column into a numerical column (that encodes the same information) for the model to leverage this feature.
Another typical example is when several numerical features live on very different scales and/or units. Models usually benefit from features that live on comparable scales, i.e., with roughly the same mean and the same variation around that mean.
Once those preprocessing steps are applied, the transformed data is sent to the actual model.
To achieve these preprocessing steps, scikit-learn proposes some useful tools. The first are the preprocessing transformers (implemented as classes), which rescale each feature or encode a categorical feature into a numerical format.
Once instantiated, those objects can be used to preprocess the data:
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

scaler = StandardScaler()
encoder = OneHotEncoder()

# Example of StandardScaler (it expects a 2D array, hence the reshape)
data = np.arange(100).reshape(-1, 1)
scaler.fit(data)                          # the scaler 'fits' to the data and stores the mean and variance for later use
scaled_data = scaler.transform(data)      # apply the actual transformation
scaled_data = scaler.fit_transform(data)  # do both at once
other_scaled = scaler.transform(other_data)  # note that one can apply the same fitted transformation to other data
# Example of OneHotEncoder
data = [['toto'], ['titi'], ['toto'], ['tata']]
encoder.fit_transform(data).toarray()  # fit and transform the column into one-hot encoded columns
array([[0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
# The encoder creates 3 new columns; categories are sorted alphabetically,
# so the columns correspond to 'tata', 'titi', and 'toto' respectively
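One can verify this column ordering by inspecting the fitted encoder's categories_ attribute:

encoder.categories_  # [array(['tata', 'titi', 'toto'], dtype=object)]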
Now let's review the ColumnTransformer class: it lets you map specific columns to specific preprocessors. The basic example is to map numerical columns to a standard scaler and categorical columns to a one-hot encoder. So, say we have a list of numerical columns and a list of categorical columns; we create a new preprocessor object like this:
from sklearn.compose import ColumnTransformer

categorical_cols = ['gender', 'country']
numerical_cols = ['age', 'weight', 'height']

preprocessor = ColumnTransformer(
    [  # a list of 3-tuples: (name of the preprocessor, the actual preprocessor, list of column names to apply it on)
        ("onehot_cat", OneHotEncoder(), categorical_cols),
        ("stdsc_num", StandardScaler(), numerical_cols),
    ],
    remainder=OneHotEncoder(),  # the remaining columns, not listed in the cat/num columns, will be one-hot encoded
)
This way, preprocessor is now a new preprocessor like StandardScaler, in the sense that it exposes the fit/transform API. We can use this new preprocessor as a global preprocessor for our model, applied directly to the full data matrix. Note that make_column_transformer does the same thing, without having to specify a name for each preprocessor.
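For illustration, here is a minimal sketch of the same preprocessor built with make_column_transformer (the step names are then generated automatically):

from sklearn.compose import make_column_transformer

# equivalent to the ColumnTransformer above, minus the explicit names
preprocessor = make_column_transformer(
    (OneHotEncoder(), categorical_cols),
    (StandardScaler(), numerical_cols),
    remainder=OneHotEncoder(),
)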
But what if we don't know the names of the columns in the data matrix beforehand? Or what if we don't want to list all the columns and would rather build the mapping based on their dtypes? For this, we can use the make_column_selector helper function, which basically creates filters that extract columns from the whole data matrix, to be mapped to a preprocessor in a ColumnTransformer:
from sklearn.compose import make_column_selector

num_selector = make_column_selector(dtype_exclude=object)  # suppose every non-object dtype is numerical
cat_selector = make_column_selector(dtype_include=object)  # suppose every object dtype is categorical

preprocessor = ColumnTransformer(
    [
        ("onehot_cat", OneHotEncoder(), cat_selector),
        ("stdsc_num", StandardScaler(), num_selector),
    ],
    remainder=OneHotEncoder(),  # the remaining columns, not matched by the selectors, will be one-hot encoded
)
This way, we created pretty much the same ColumnTransformer preprocessor object, but with no assumption about the columns' names.
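As a quick sanity check, here is a minimal sketch with a made-up pandas DataFrame (the column names and values are hypothetical) showing the dtype-based selectors at work:

import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'F'],  # object dtype, matched by cat_selector
    'age': [25, 32, 47],        # integer dtype, matched by num_selector
})

transformed = preprocessor.fit_transform(df)
print(transformed.shape)  # (3, 3): 2 one-hot columns for 'gender' + 1 scaled column for 'age'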
Pipeline
We now have all the tools to define a pipeline: a pipeline is basically a chain of several processing steps. Since we already have a preprocessor, we just need to add a predictive model, like a linear regression or a support vector classifier.
For this, I'd recommend using the Pipeline constructor:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipeline = Pipeline(
    [
        ('std', StandardScaler()),
        ('lin_reg', LinearRegression()),
    ]
)
The newly created pipeline object, again, exposes the estimator API (like the preprocessor we saw above, and like the LinearRegression() instance itself); since its last step is a predictor, the pipeline provides fit/predict/score methods. We can use them to train our model (the pipeline) and test its performance on the split data:
pipeline.fit(data_train, target_train)     # make the model "learn"
y_predicted = pipeline.predict(data_test)  # make predictions on unseen data

# or directly compute a score to measure the performance on the test set
pipeline.score(data_test, target_test)

# to check the training score, we can use
pipeline.score(data_train, target_train)
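As a side note, for a regressor like LinearRegression, .score() returns the coefficient of determination R², so it is equivalent to computing:

from sklearn.metrics import r2_score

# pipeline.score(data_test, target_test) is equivalent to:
r2_score(target_test, pipeline.predict(data_test))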
Cross-validation
The final subject I want to introduce for this first module is cross-validation.
Remember in the beginning when we split the input data into a train set and a test set? Well, the actual split might seem kinda arbitrary. What if, by luck or bad luck, the data was split in a particular way that would be advantageous or disadvantageous for the performance of the model?
To circumvent this risk, we can use what is called cross-validation: the idea is to split the data in several different ways and to train and test the model on each of these splits. The overall performance of the model is then given by the mean of the performances across the splits. For example, the first split could use the first 75% of the entries to train and the last 25% to test the model. Starting over, the second split would use the entries from 25% to 100% to train and the first 25% to test. And so on. For each split, the model is fitted and evaluated from scratch.
from sklearn.model_selection import cross_validate

cv_results_dict = cross_validate(
    pipeline,                 # our model
    data,                     # the whole input data
    target,                   # the whole target data
    cv=10,                    # number of splits (defaults to 5)
    return_estimator=True,    # so we can retrieve each fitted pipeline
    return_train_score=True,  # to get the train scores, in addition to the test scores
)
The output is a dictionary that contains a lot of information, including the test score and the fitted estimator for each split of the data. Note that there are several strategies for splitting the data, including random splits or the most common, KFold, which partitions the data into contiguous folds.
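For illustration, here is a minimal sketch passing an explicit splitting strategy to cv instead of a plain integer; ShuffleSplit draws random train/test partitions:

from sklearn.model_selection import ShuffleSplit

cv_results_dict = cross_validate(
    pipeline,
    data,
    target,
    cv=ShuffleSplit(n_splits=10, test_size=0.25, random_state=42),  # 10 random 75/25 splits
)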
Complete working example
Let's review all we saw in a full example using the iris toy dataset.
In the example below, we create 5 pipelines based on 5 classification models, namely logistic regression, decision tree, random forest, support vector, and K-nearest neighbors. The idea here is not to understand how each of these models works, but rather to see the overall process of creating a pipeline that includes a preprocessor and a model, and how to compute their performance in a robust way.
Here, instead of using the Pipeline constructor, which can seem a bit verbose, we use the helper function make_pipeline. Also, instead of just specifying the number of splits for the cross-validation, we explicitly specify the kind of folds we want to use: here, 10-fold randomized splits.
import numpy as np
from sklearn.model_selection import cross_validate, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Create models with StandardScaler in a Pipeline
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Decision Tree': make_pipeline(StandardScaler(), DecisionTreeClassifier()),
    'Random Forest': make_pipeline(StandardScaler(), RandomForestClassifier()),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'K-Nearest Neighbors': make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Perform 10-fold cross-validation for each model
for model_name, model in models.items():
    cv_results = cross_validate(
        model, X, y,
        cv=KFold(n_splits=10, shuffle=True, random_state=42),
        return_train_score=True,
    )
    print(f'Model: {model_name}')
    print('---------------------------')
    print(f'Test Accuracy: {np.mean(cv_results["test_score"]):.4f} ± {np.std(cv_results["test_score"]):.4f}')
    print(f'Train Accuracy: {np.mean(cv_results["train_score"]):.4f} ± {np.std(cv_results["train_score"]):.4f}')
    print()
Takeaway
In this first post, we saw:
- what it means to split the data into a training set and a test set, why we do it, and how
- how to create a column transformer/preprocessor, which is used to apply transformations to the input features
- the concept of a pipeline, which chains various steps like a preprocessor and a model, in order to create complex models from basic tools
- and finally what cross-validation is: why and how we must evaluate our model's performance in a robust way
Now, please give this post:
- 1 clap if it was just ok (meh!)
- 10 claps if you think it was clearly written (noice!)
- 50 claps if it was very clear and interesting (daaaamn!)
You might like some of my other posts, so make sure to check them out: