Mastering Logistic Regression


Logistic Regression is one of the most common machine learning algorithms. It can be used to predict the probability of an event occurring, such as whether an incoming email is spam or not, or whether a tumor is malignant or not, based on a given labeled data set.

Due to its simplicity, logistic regression is often used as a baseline against which other, more complex models are evaluated.

The model has the word "logistic" in its name, since it uses the logistic function (sigmoid) to convert a linear combination of the input features into probabilities.

It also has the word "regression" in its name, since its raw output is a continuous value between 0 and 1. In practice, however, it is typically used as a binary classifier: inputs with a probability above a chosen threshold (usually 0.5) are assigned to the positive class, and those below it to the negative class.

In this article we will discuss the logistic regression model in depth, implement it from scratch in Python, and then show its implementation in Scikit-Learn.

Background: Binary Classification Problems

Recall that in supervised machine learning problems, we are given a training set of n labeled samples: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ is an m-dimensional vector containing the features of sample i, and yᵢ is the label of that sample. Our goal is to build a model whose predictions are as close as possible to the true labels.

In classification problems, the label yᵢ can take one of k values, representing the k classes to which the samples belong. More specifically, in binary classification problems, the label yᵢ can assume only two values: 0 (representing the negative class) and 1 (representing the positive class).

In addition, we distinguish between two types of classifiers:

  1. Deterministic classifiers output a hard label for each sample, without providing probability estimates for the classes. Examples of such classifiers include perceptrons, K-nearest neighbors, and SVMs.
  2. Probabilistic classifiers output probability estimates for the classes, and then assign a label to the given sample based on these probabilities (typically the label of the class with the highest probability). Examples of such classifiers include logistic regression, naïve Bayes classifiers, and neural networks that use sigmoid or softmax in the output layer.

The Logistic Regression Model

Logistic regression is a probabilistic classifier that handles binary classification problems. Given a sample (x, y), it outputs a probability p that the sample belongs to the positive class:

If this probability is higher than some threshold value (typically chosen as 0.5), then the sample is classified as 1, otherwise it is classified as 0.

How does the model estimate the probability p?

The basic assumption in logistic regression is that the log-odds of the event that the sample belongs to the positive class is a linear combination of its features.

Log-odds (also called logit) is the logarithm of the odds ratio, which is the ratio between the probability that the sample belongs to the positive class and the probability that it belongs to the negative class:

The log-odds (logit) function

We assume here that the base of the logarithm is e (i.e., natural logarithm), although other bases can be used as well.

The graph of the logit function is shown below:

The logit function

As can be seen, the logit function maps probability values in (0, 1) into real numbers in (-∞, +∞).
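To make this concrete, here is a small numerical illustration of the logit function (a sketch added for clarity, not part of the original article):

import numpy as np

def logit(p):
    """ Compute the log-odds of a probability p in (0, 1). """
    return np.log(p / (1 - p))

# Probabilities below 0.5 map to negative log-odds, 0.5 maps to 0,
# and probabilities above 0.5 map to positive log-odds.
for p in [0.1, 0.5, 0.9, 0.99]:
    print(p, logit(p))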

In logistic regression, we assume that the log odds is a linear combination of the features, i.e.,

where w = (w₀, …, wₘ) are the parameters (or weights) of the model. The parameter w₀ is often called the intercept (or bias).

The points for which p = 0.5 (i.e., the log odds is equal to 0) define the separating hyperplane between the two classes, whose equation is:

The equation of the separating hyperplane

The weight vector w is orthogonal to this hyperplane. Every example above the hyperplane (wx > 0) is classified as a positive example, whereas every example below the hyperplane (wx < 0) is classified as a negative example:

This makes logistic regression a linear classifier, since it assumes that the boundary between the classes is a linear surface. Other linear classifiers include perceptrons and SVMs.

We can derive a direct expression for p in terms of the parameters w by exponentiating both sides of the log-odds equation:

where σ is the sigmoid function (also known as the logistic function):

The sigmoid function is used to convert the log-odds (wx) into probabilities. It has a characteristic "S"-shaped curve:

The sigmoid function

As can be seen, the function maps real numbers in (-∞, +∞) into probability values in (0, 1).

The sigmoid function has some nice mathematical properties that will be useful later:
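In particular, σ(−z) = 1 − σ(z), and the derivative of the sigmoid is σ′(z) = σ(z)(1 − σ(z)). These identities can be verified numerically (a quick sketch added for illustration, not part of the original code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 1.5, 1e-6

# sigma(-z) should equal 1 - sigma(z)
print(sigmoid(-z), 1 - sigmoid(z))

# The analytic derivative sigma(z) * (1 - sigma(z)) should match
# a central finite-difference approximation of the derivative
print(sigmoid(z) * (1 - sigmoid(z)),
      (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))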


The following diagram summarizes the computational process of logistic regression starting from the inputs until the final prediction:

The logistic regression model

Log Loss

Our goal is to find the parameters w that will make the model's predictions p = σ(wx) as close as possible to the true labels y. To that end, we need to define a loss function that will measure how far our model's predictions are from the true labels. This function needs to be differentiable, so it can be optimized using techniques such as gradient descent (for more information on loss functions in Machine Learning see this article).

The loss function used by logistic regression is called log loss (or logistic loss). It is defined as follows:

The log loss function

How did we get to this function? The derivation of this function is based on the maximum likelihood principle. More specifically, we can show that log loss is the negative log likelihood under the assumption that the labels have a Bernoulli distribution (i.e., a probability distribution of a binary random variable that takes 1 with probability p and 0 with probability 1 − p).

Mathematically, we will show that:

where P(y|p) is the probability of getting the label y given the model's prediction p (i.e., the likelihood of the data given our model).

Proof:

Given a model of the data (the labels) as a Bernoulli distribution with parameter p, the probability that a sample belongs to the positive class is simply p, i.e.,

Similarly, the probability that the sample belongs to the negative class is:

We can write these two equations more compactly as follows:

Explanation: when y = 1, pʸ = p and (1 − p)¹⁻ʸ = 1, therefore P(y|p) = p. Similarly, when y = 0, pʸ = 1 and (1 − p)¹⁻ʸ = 1 − p, therefore P(y|p) = 1 − p.

Therefore the log likelihood of the data given the model is:

The log loss is exactly the negative of this function. Therefore, maximizing the log likelihood is equivalent to minimizing the log loss.
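A small numerical check of this equivalence (an illustrative sketch, not part of the original article):

import numpy as np

# For any label y in {0, 1}, the negative log of the Bernoulli likelihood
# p**y * (1 - p)**(1 - y) equals the log loss -[y*log(p) + (1-y)*log(1-p)]
p = 0.7
for y in (0, 1):
    neg_log_likelihood = -np.log(p**y * (1 - p)**(1 - y))
    log_loss_value = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(y, neg_log_likelihood, log_loss_value)  # the two values are identical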


The following plot shows the log loss when y = 1:

The log loss equals 0 only in the case of a perfect prediction (p = 1 and y = 1, or p = 0 and y = 0), and approaches infinity as the prediction gets worse (i.e., when y = 1 and p → 0, or when y = 0 and p → 1).

The cost function calculates the average loss over the whole data set:

This function can be written in a vectorized form as follows:

where y = (y₁, …, yₙ) is a vector that contains the labels of all the training samples, and p = (p₁, …, pₙ) is a vector that contains the predicted probabilities of the model for all the training samples.
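As a quick cross-check (a sketch added here for illustration, not part of the original article), this vectorized form agrees with Scikit-Learn's log_loss on a small synthetic example:

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])          # example labels
p = np.array([0.9, 0.2, 0.6, 0.8])  # example predicted probabilities
n = len(y)

J = -(1/n) * (y @ np.log(p) + (1 - y) @ np.log(1 - p))
print(J, log_loss(y, p))  # both print the same value (~0.2656)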

This cost function is convex, i.e., it has a single global minimum. However, there is no closed-form solution for finding the optimal w* (due to the non-linearities introduced by the log function). Therefore, we need to use iterative optimization methods such as gradient descent in order to find the minimum.

Gradient Descent

Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

Gradient descent

In order to use gradient descent to find the minimum of the cost J(w), we need to compute its partial derivatives with respect to each one of the weights. The partial derivative of J(w) with respect to a given weight wⱼ is:

Proof:

Thus, the gradient vector can be written in a vectorized form as follows:

And the gradient descent update rule is:

where α is a learning rate that controls the step size (0 < α < 1).

Note that whenever you use gradient descent, you must make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).

Implementation in Python

We will now implement the logistic regression model in Python from scratch, including the cost function and gradient computation, optimizing the model using gradient descent, evaluation of the model, and plotting the final decision boundary.

For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (setosa, versicolor and virginica). We will turn it into a binary classification problem by using only the first two types of flowers (setosa and versicolor). In addition, we will use only the first two features of each flower (sepal length and sepal width).

Loading the Data Set

Let's first import the required libraries and fix the random seed in order to get reproducible results:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)

Next, we load the data set:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Take only the first two features
y = iris.target

# Take only the setosa and versicolor flowers
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]

Let's plot the data:

def plot_data(X, y):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y], 
                    palette=['r','b'], markers=('s','o'), edgecolor='k')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.legend() 
plot_data(X, y)
The Iris data set

As can be seen, the data set is linearly separable, therefore logistic regression should be able to find the boundary between the two classes.

Next, we need to add a column of ones to the feature matrix X in order to represent the bias (w₀):

# Add a column for the bias
n = X.shape[0] 
X_with_bias = np.hstack((np.ones((n, 1)), X))

We now split the data set into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)

Model Implementation

We are now ready to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:

def sigmoid(z):
    """ Compute the sigmoid of z (z can be a scalar or a vector). """
    z = np.array(z)
    return 1 / (1 + np.exp(-z))

Next, we implement the cost function that returns the cost of a logistic regression model with parameters w on a given data set (X, y), and also its gradient with respect to w.

def cost_function(X, y, w):
    """ J, grad = cost_function(X, y, w) computes the cost of a logistic regression model 
        with parameters w and the gradient of the cost w.r.t. to the parameters. """
    # Compute the cost
    p = sigmoid(X @ w)    
    J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p)) 

    # Compute the gradient    
    grad = (1/n) * X.T @ (p - y)  
    return J, grad

Note that we are using the vectorized forms of the cost and the gradient functions that have been shown previously.

To sanity check this function, let's compute the cost and gradient of the model on some random weight vector:

w = np.random.rand(X_train.shape[1])
cost, grad = cost_function(X_train, y_train, w)

print('w:', w)
print('Cost at w:', cost)
print('Gradient at w:', grad)

The output we get is:

w: [0.5488135  0.71518937 0.60276338]
Cost at w: 2.314505839067951
Gradient at w: [0.36855061 1.86634895 1.27264487]
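As an additional sanity check (not part of the original article), we can compare the analytic gradient with a finite-difference approximation of the cost:

# Numerically approximate the gradient by perturbing each weight slightly
eps = 1e-6
numeric_grad = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    J_plus, _ = cost_function(X_train, y_train, w_plus)
    J_minus, _ = cost_function(X_train, y_train, w_minus)
    numeric_grad[j] = (J_plus - J_minus) / (2 * eps)

print('Numerical gradient:', numeric_grad)  # should closely match grad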

Gradient Descent Implementation

We will now implement gradient descent in order to find the optimal w* that minimizes the cost function on a given training set. The algorithm will run at most max_iter passes over the training set (5,000 by default), and will stop early if the cost has not decreased by at least tol (0.0001 by default) since the previous iteration.

def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):
    """ Optimize the model using gradient descent.
        X, y: The training set        
        alpha: The learning rate
        max_iter: The maximum number of passes over the training set (epochs)
        tol: The stopping criterion. Training will stop when (new_cost > cost - tol)
    """
    w = np.random.rand(X.shape[1])
    cost, grad = cost_function(X, y, w)

    for i in range(max_iter):
        w = w - alpha * grad
        new_cost, grad = cost_function(X, y, w)        
        if new_cost > cost - tol:
            print(f'Converged after {i} iterations')
            return w, new_cost
        cost = new_cost

    print('Maximum number of iterations reached')
    return w, cost

Normally at this point you would have to normalize your data set, since gradient descent does not work well with features that have different scales. In our specific data set normalization is not necessary since the ranges of the two features are similar.
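If normalization were needed, it could be done along these lines (a sketch for illustration only; note that the bias column should not be scaled, and the scaler should be fit on the training set only):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled, X_test_scaled = X_train.copy(), X_test.copy()
X_train_scaled[:, 1:] = scaler.fit_transform(X_train[:, 1:])  # skip the bias column
X_test_scaled[:, 1:] = scaler.transform(X_test[:, 1:])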

Let's now call this function to optimize our model:

opt_w, cost = optimize_model(X_train, y_train)

print('opt_w:', opt_w)
print('Cost at opt_w:', cost)

The algorithm converges after 1,413 iterations and the optimal w* we get is:

Converged after 1413 iterations
opt_w: [ 0.28014029  0.80541854 -1.48367938]
Cost at opt_w: 0.28389717767222555

There are other solvers you can use for the optimization which are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on how to use these optimizers.
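For example, a minimal sketch (not part of the original article) of plugging our cost function into scipy.optimize.minimize with the TNC solver might look as follows:

from scipy.optimize import minimize

# jac=True tells the optimizer that the objective returns both
# the cost and its gradient (as our cost_function does)
res = minimize(lambda w: cost_function(X_train, y_train, w),
               x0=np.random.rand(X_train.shape[1]),
               jac=True, method='TNC')
print(res.x, res.fun)  # optimized weights and the cost at the optimum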

Using the Model for Predictions

Now that we have found the optimal parameters of the model, we can use it for predictions.

First, let's write a function that gets a matrix of new samples X and returns their probabilities of belonging to the positive class:

def predict_prob(X, w):
    """ Return the probability that samples in X belong to the positive class
        X: the feature matrix (every row in X represents one sample)
        w: the learned logistic regression parameters  
    """
    p = sigmoid(X @ w)
    return p

The function computes the predictions of the model by simply taking the sigmoid of Xw (which computes σ(wᵀx) for every row x of the matrix).

For example, let's find the probability that a sample located at (6, 2) belongs to the versicolor class:

predict_prob([[1, 6, 2]], opt_w)
array([0.89522808])

This sample has an 89.52% chance of being a versicolor flower. This makes sense, since the sample lies well within the region of the versicolor flowers, far from the border between the classes.

On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:

predict_prob([[1, 5.5, 3]], opt_w)
array([0.56436688])

This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.


Let's write another function that returns the predicted class labels instead of probabilities:

def predict(X, w):
    """ Predict whether the label is 0 or 1 for the samples in X using a threshold of 0.5
        (i.e., if sigmoid(X @ theta) >= 0.5, predict 1)
    """
    p = sigmoid(X @ w)
    y_pred = (p >= 0.5).astype(int)
    return y_pred

The function simply predicts 1 whenever the probability of belonging to the positive class is at least 0.5, and 0 otherwise.

Let's test this function with the samples from above:

predict([[1, 6, 2], [1, 5.5, 3]], opt_w)
array([1, 1])

As expected, both of the samples are classified as 1.

Evaluating the Model

Next, let's write a function to compute the accuracy of the model on a given data set:

def evaluate_model(X, y, w):
    y_pred = predict(X, w)
    accuracy = np.mean(y == y_pred)
    return accuracy

The function first finds the predicted labels of the model on the given data set X and compares them to the true labels y. The accuracy is then measured as the fraction of correct classifications:

Let's use this function to find the accuracy of our model on the training and the test sets:

train_accuracy = evaluate_model(X_train, y_train, opt_w)
print(f'Train accuracy: {train_accuracy * 100:.3f}%')
Train accuracy: 98.667%
test_accuracy = evaluate_model(X_test, y_test, opt_w)
print(f'Test accuracy: {test_accuracy * 100:.3f}%')
Test accuracy: 100.000%

As expected, the scores are very high since the data set is linearly separable.

In addition to accuracy, there are other important metrics that are used to evaluate classification models such as precision, recall and F1 score. These metrics will be discussed in a future post.
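For reference, these metrics could be computed with Scikit-Learn as follows (a sketch added here, not part of the original walkthrough):

from sklearn.metrics import precision_score, recall_score, f1_score

y_test_pred = predict(X_test, opt_w)
print('Precision:', precision_score(y_test, y_test_pred))
print('Recall:', recall_score(y_test, y_test_pred))
print('F1 score:', f1_score(y_test, y_test_pred))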

Plotting the Decision Boundary

Finally, since our data set is two-dimensional, we can plot the boundary line between the classes that was found by our model. To that end, we first need to find the equation of this line.

The boundary line is defined by the points for which the prediction of our model is exactly 0.5, i.e.,

The sigmoid function is equal to 0.5 when its input is equal to 0, therefore we can write:

After rearranging the terms we get:

That is, the slope of the boundary line is −w₁/w₂ and its intercept is −w₀/w₂. We can now write a function that plots this line:

def plot_decision_boundary(X, y, w):
    """ Plot the decision boundary between the classes """
    plot_data(X, y)

    line_x = np.array(plt.gca().get_xlim())
    line_y = -1 / w[2] * (w[1] * line_x + w[0])
    plt.plot(line_x, line_y, c='k', ls='--')
plot_decision_boundary(X, y, opt_w)
The decision boundary between the classes

We can see that only one sample was misclassified by our model. Training the model for more iterations (around 200,000) would find a line that perfectly separates the two classes. With a fixed step size, gradient descent converges very slowly near the minimum; this can be improved by using an adaptive learning rate (e.g., taking larger steps to compensate for the rapidly vanishing gradients).

The LogisticRegression Class in Scikit-Learn

Although implementing logistic regression from scratch had its own educational merits, a more practical choice would be to use the ready-made LogisticRegression class from Scikit-Learn. This class uses more efficient solvers than the plain vanilla gradient descent, and it also provides additional options such as regularization and early stopping.

The important hyperparameters of this class are listed below (a short example of setting them follows the list):

  • penalty – specifies the type of regularization to apply. Can be one of the options: None, ‘l2' (the default), ‘l1' and ‘elasticnet'. See this article for more information on regularization.
  • tol – the tolerance for the stopping criterion (defaults to 0.0001).
  • C – the inverse of the regularization coefficient (defaults to 1.0). Smaller values specify stronger regularization.
  • solver – the algorithm to use for the optimization. Can take one of the options: ‘lbfgs' (the default), ‘liblinear', ‘newton-cg', ‘newton-cholesky', ‘sag', ‘saga'. Read the documentation for more information on these optimizers.
  • max_iter – the maximum number of iterations for the solvers to converge (defaults to 100).
  • multi_class – how to handle multi-class classification problems. Can take one of the options: ‘ovr' (One vs. Rest, i.e., a binary classifier is built for each class against all the other ones), ‘multinomial' (uses multinomial logistic regression) or ‘auto' (the default).
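As an illustration (the specific values below are arbitrary choices, not recommendations), these hyperparameters can be set when constructing the model:

from sklearn.linear_model import LogisticRegression

clf_example = LogisticRegression(penalty='l2', C=0.5, solver='lbfgs', max_iter=200, tol=1e-4)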

When using the LogisticRegression class, you do not need to manually add a column of ones to the design matrix X, since this is done automatically by Scikit-Learn. Therefore, before building the model, we will split the original data (without the extra column of ones) into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Let's create an instance of LogisticRegression with its default settings, and fit it to the training set:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

Next, we evaluate the model on the training and the test sets:

train_accuracy = clf.score(X_train, y_train)
print(f'Train accuracy: {train_accuracy * 100:.3f}%')

test_accuracy = clf.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy * 100:.3f}%')
Train accuracy: 100.000%
Test accuracy: 100.000%

This time we get perfect scores on both the training and the test sets. We can also check how many iterations were needed for convergence by querying the n_iter_ attribute:

print(clf.n_iter_)
[15]

Only 15 iterations were needed for convergence! Clearly, the solver used by LogisticRegression (L-BFGS by default) is more efficient than our implementation of gradient descent.

We can plot the decision boundary found by the model as we did previously. However, this time the optimal coefficients are stored in two different attributes of the model:

  • coef_ is an array that contains all the weights except for the intercept
  • intercept_ is the intercept term (w₀)

Therefore, we need to concatenate these two attributes into one array before calling the plot_decision_boundary() function:

opt_w = np.insert(clf.coef_, 0, [clf.intercept_])
plot_decision_boundary(X, y, opt_w)
The decision boundary found by LogisticRegression

As expected, the line found by LogisticRegression perfectly separates the two classes.

Summary

Let's summarize the pros and cons of logistic regression as compared to other classification models.

Pros:

  • When the data is linearly separable, the algorithm converges toward a separating hyperplane between the classes (though, as we saw, this may take many iterations)
  • Provides class probability estimates
  • Rarely overfits (but may underfit when the true decision boundary is non-linear)
  • Highly interpretable (the weight associated with each feature represents its importance)
  • Highly scalable (requires a number of parameters linear in the number of features)
  • Can handle redundant features (by assigning them weights close to 0)
  • Has a small number of hyperparameters

Cons:

  • Can find only linear decision boundaries between classes
  • Usually outperformed by more complex models
  • Supports only binary classification, but can be extended to multi-class. The extension of logistic regression to multi-class problems (called multinomial logistic regression or softmax regression) is covered in this article.
  • Cannot deal with missing values

Final Notes

All images unless otherwise noted are by the author.

The code examples of this article can be found on my github: https://github.com/roiyeho/medium/tree/main/logistic_regression

Thanks for reading!
