How to Interpret Logistic Regression Coefficients

Do you love logistic regression, but hate interpreting anything with any form of logarithmic transformation? Well, I can't say you are in good company, but I can say that you do have me as company!
In this article, I'm going to talk all about interpreting logistic regression coefficients – here's the outline:
- Interpreting linear regression coefficients
- Why logistic regression coefficient interpretation is challenging
- How to interpret logistic regression coefficients
- Calculating mean marginal effects with the statsmodels package
- Conclusion
Interpreting linear regression coefficients
Most people with an elementary knowledge of statistics fully understand how coefficients are interpreted with linear regression. If that is you, you might consider skipping ahead to the portion of the article that discusses logistic regression coefficients.
Interpreting linear regression coefficients is simple. That simplicity is one of the reasons linear regression is still a very popular tool despite the advent of much more sophisticated algorithms.
Simple linear regression (linear regression with one input variable) takes this form:

y = _B_₀ + _B_₁x + ε
We are primarily interested in interpreting _B_₁. For linear regression, this interpretation is simple – for a one-unit change in x, we expect a _B_₁ change in y. For example, if _B_₁ = 0.16, a one-unit increase in x increases our expectation of y by 0.16. Another phrase for this relationship is the 'mean marginal effect'.
Let's look at an example of how we can interpret _B_₁ using simulation. Simulation is a great tool for testing data science tools and approaches, because we set the ground truth and then see whether our methods can recover it.
In the code below, we simulate 30,000 rows of x values by sampling from a normal distribution with parameters of our choosing (in this case, a mean of 2 and a standard deviation of 0.2). We then simulate y by multiplying x by our simulated impact of 0.16 and adding random error, also sampled from a normal distribution (mean = 0, standard deviation = 2).
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

# simulate linear regression data
def regression_simulation(sim_var, sim_error, sim_coef, size):
    '''
    Simulates data for simple linear regression.

    inputs:
        sim_var (list)     : 2-element list, first element is the mean of a random variable
                             that is being used to simulate a feature in the linear regression,
                             second is the standard deviation
        sim_error (list)   : 2-element list, first element is the mean of the random error
                             being added, second is the standard deviation
        sim_coef (float)   : impact of the random variable established by sim_var on the
                             target variable
        size (int)         : number of units to simulate

    output:
        sim_df (DataFrame) : dataframe with simulated data
    '''
    # create an empty dataframe to populate
    sim_df = pd.DataFrame()
    # create the feature for the linear regression
    sim_df['var'] = np.random.normal(sim_var[0], sim_var[1], size = size)
    # multiply the feature by the coef to get its simulated impact
    sim_df['var_impact'] = sim_df['var'] * sim_coef
    # create the random error for the linear regression
    error = np.random.normal(sim_error[0], sim_error[1], size = size)
    # create the target variable
    sim_df['target'] = sim_df['var_impact'] + error
    return sim_df

linear_regression_sim_df = regression_simulation(sim_var = [2, 0.2],
                                                 sim_error = [0, 2],
                                                 sim_coef = 0.16,
                                                 size = 30000)

lin_reg = LinearRegression()
X = np.array(linear_regression_sim_df['var']).reshape(-1, 1)
y = linear_regression_sim_df['target']
lin_reg.fit(X, y)
This is what we see when we print the coefficient that the linear regression estimated:
print(lin_reg.coef_)

Nice! That is pretty close to 0.16. If we want to make sure that our coefficient estimate is unbiased, we can simulate the dataset multiple times and look at the distribution of the estimates.
from matplotlib import pyplot as plt

# run multiple simulations
iters = 1000
coef_list = []
for i in range(iters):
    reg_sim_df = regression_simulation([2, 0.2], [0, 0.1], 0.16, 5000)
    lin_reg = LinearRegression()
    X = np.array(reg_sim_df['var']).reshape(-1, 1)
    y = reg_sim_df['target']
    lin_reg.fit(X, y)
    coef = lin_reg.coef_[0]
    coef_list.append(coef)
plt.hist(coef_list, bins = 20)
plt.show()
Looking at the histogram, we can see that the distribution is centered around 0.16, meaning that our coefficient estimate is unbiased, which is pretty cool!

This whole process is way too easy for linear regression, so let's challenge ourselves and start looking at logistic regression!
Why logistic regression coefficient interpretation is challenging
Logistic regression is a linear model like linear regression, but with a transformation applied (the logistic, or sigmoid, function) that keeps the predicted value, y, between 0 and 1. This allows logistic regression to predict the probability of a target variable belonging to a category.
This is the form of a binary logistic regression:

P(y = 1) = 1 / (1 + e^(-(_B_₀ + _B_₁x)))
While this transformation does wonders for adapting linear regression to predicting probabilities, it destroys our ability to directly interpret the coefficients as mean marginal effects!
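To see what the transformation does to the marginal effect, here's a minimal sketch (using made-up values of _B_₀ = -3 and _B_₁ = 0.16, purely for illustration) showing that the same one-unit step in x moves the predicted probability by different amounts depending on where x starts:
import numpy as np

# made-up parameters, purely for illustration
b0, b1 = -3, 0.16

def predict_prob(x):
    # the logistic (sigmoid) transformation keeps the output between 0 and 1
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

# the same one-unit step in x changes the predicted probability
# by a different amount at each starting level of x
for x in [0, 10, 20]:
    print(x, round(predict_prob(x + 1) - predict_prob(x), 4))
A single coefficient can't summarize all of those different step sizes, which is exactly the problem we'll work through below.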
Let's simulate some binary data to demonstrate. In the code below, we follow the same process as the linear regression simulation, except that after y has been simulated, we sample from a uniform distribution to convert y to a 1 or a 0. (Note: we are doubling up on the randomness here, because we manually add error through a normal distribution, and the process of converting y to binary adds some random noise of its own.)
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

# simulate binary data
def logistic_regression_simulation(sim_var, sim_error, sim_coef, size):
    '''
    Simulates data for simple logistic regression.

    inputs:
        sim_var (list)     : 2-element list, first element is the mean of a random variable
                             that is being used to simulate a feature in the logistic regression,
                             second is the standard deviation
        sim_error (list)   : 2-element list, first element is the mean of the random error
                             being added, second is the standard deviation
        sim_coef (float)   : impact of the random variable established by sim_var on the
                             target variable
        size (int)         : number of units to simulate

    output:
        sim_df (DataFrame) : dataframe with simulated data
    '''
    # create an empty dataframe to populate
    sim_df = pd.DataFrame()
    # create the feature for the logistic regression
    sim_df['var'] = np.random.normal(sim_var[0], sim_var[1], size = size)
    # multiply the feature by the coef to get its simulated impact
    sim_df['var_impact'] = sim_df['var'] * sim_coef
    # create the error term
    sim_df['sim_error'] = np.random.normal(sim_error[0], sim_error[1], size = size)
    # add the impact and error together
    sim_df['sum_vars_error'] = sim_df['var_impact'] + sim_df['sim_error']
    # create a uniform random variable used to convert sum_vars_error from continuous to binary
    sim_df['uniform_rv'] = np.random.uniform(size = len(sim_df))
    # create the binary target variable using the uniform random variable
    sim_df['binary_target'] = sim_df.apply(
        lambda x: 1 if x.sum_vars_error > x.uniform_rv else 0, axis = 1)
    return sim_df

log_reg_sim_df = logistic_regression_simulation([2.00, 0.2], [0, 0.1], 0.16, 30000)
Now that we've simulated the data, let's run a logistic regression and see what our coefficient looks like.
log_reg = LogisticRegression()
X = np.array(log_reg_sim_df['var']).reshape(-1, 1)
y = log_reg_sim_df['binary_target']
log_reg.fit(X, y)
This is the output:
print(log_reg.coef_[0])

What??? That isn't a good feeling at all. That coefficient is nowhere close to the 0.16 impact that we simulated!
But just to be sure, let's run it a thousand times and look at the distribution.

It looks like our first run was not an outlier. The distribution is centered far from our simulated impact of 0.16. Of course, this is because logistic regression coefficients can't be directly interpreted the same way as linear regression coefficients.
In the next section, we will discuss what we can do to interpret logistic regression coefficients.
How to interpret logistic regression coefficients
First of all, let's talk about the sign of logistic regression coefficients. Good news – the signs are interpretable! If the sign is positive, the probability of belonging to the corresponding category increases as x increases – and vice versa for negative signs. This can be very helpful. Imagine you are developing a model to predict if a customer will purchase a product. You can check the intuition of the model by observing whether the coefficient for price is negative (meaning a customer is less likely to purchase a product as the price goes up). While the actual value of the number may be some logarithmically transformed mumbo-jumbo, at least you can tell whether the model makes directional sense.
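As a quick sketch of that directional sanity check (reusing the log_reg model we fit on the simulated data above, where we deliberately built 'var' with a positive impact):
# directional sanity check: we simulated 'var' with a positive impact,
# so the sign of the fitted coefficient should be positive
print(np.sign(log_reg.coef_[0][0]))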
But how can we get to a mean marginal effect for the logistic coefficients? Calculus my friend, calculus.
We want to understand how y changes with a change in x. Derivatives do just that! Let's take the partial derivative of our logistic regression function with respect to x:

∂P(y = 1)/∂x = _B_₁e^(_B_₀ + _B_₁x) / (1 + e^(_B_₀ + _B_₁x))²
One of the big takeaways here is that x appears in its own partial derivative, meaning that how much y moves with a unit change in x depends on the level of x itself.
Note: The mean marginal effect for linear regression is calculated the same way. We just don't think of it like that because the partial derivative with respect to x is simply the constant _B_₁.
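If you want to check the algebra, here's a small sketch (with made-up values for _B_₀, _B_₁, and x, chosen only for illustration) comparing the analytic partial derivative above to a numerical finite-difference approximation:
from math import exp

# made-up values, purely for illustration
b0, b1, x = -0.5, 0.16, 2.0

def p(x):
    # the logistic regression prediction
    return 1 / (1 + exp(-(b0 + b1 * x)))

# analytic partial derivative from the formula above
analytic = b1 * exp(b0 + b1 * x) / (exp(b0 + b1 * x) + 1) ** 2

# numerical approximation via a small central finite difference
h = 1e-6
numerical = (p(x + h) - p(x - h)) / (2 * h)

print(round(analytic, 8), round(numerical, 8))  # the two should match closely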
So, now we have a way of understanding how a unit movement in x changes y, but x is part of the equation. How do we get to the mean marginal effect? Unfortunately, we can't get there without a reference dataset. We plug all of the x's from our reference dataset into the partial derivative and calculate the average output. This finally gets us to our mean marginal effect! If our reference dataset is representative of our population, we can say that our calculation should be an unbiased estimate of the true mean marginal effect for the logistic regression model.
Armed with this knowledge, let's run the simulation again, but this time calculate the mean marginal effect using the partial derivative we just derived.
from math import exp
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

iters = 1000
mean_marginal_impacts = []
coef_list = []
# create iters number of simulated datasets
for i in range(iters):
    # create simulated data
    log_reg_sim_df = logistic_regression_simulation([2, 0.2], [0, 0.1], 0.16, 20000)
    # run the regression and get the coefficient and intercept
    log_reg = LogisticRegression()
    X = np.array(log_reg_sim_df['var']).reshape(-1, 1)
    y = log_reg_sim_df['binary_target']
    log_reg.fit(X, y)
    coef = log_reg.coef_[0][0]
    intercept = log_reg.intercept_[0]
    # run each simulated observation through the partial derivative
    log_reg_sim_df['contribution'] = log_reg_sim_df['var'].apply(
        lambda x: coef * exp(intercept + x * coef) /
                  (exp(intercept + x * coef) + 1) ** 2)
    # calculate the mean of the derivative values
    temp_mean_marginal_impact = log_reg_sim_df['contribution'].mean()
    # save the coefficient and marginal impact for this simulation
    mean_marginal_impacts.append(temp_mean_marginal_impact)
    coef_list.append(coef)
# show the distribution of simulated mean marginal impacts
plt.hist(mean_marginal_impacts, bins = 20)
plt.show()

Beautiful! We see that elusive 0.16 value. What a relief! We now know the process for interpreting logistic regression coefficients and how to calculate mean marginal effects manually. Of course, calculating these effects by hand is not very practical, especially as we start adding more x's to our model. Luckily, the statsmodels package in Python has a built-in method for calculating the mean marginal effect. I'll share an example in the next section.
Calculating mean marginal effects with the statsmodels package
We can use the get_margeff method on a fitted statsmodels Logit model to calculate the mean marginal effect. The code is below:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels.api as sm

iters = 1000
sm_marginal_effects = []
for i in range(iters):
    # simulate data
    log_reg_sim_df = logistic_regression_simulation([2, 0.2], [0, 0.1], 0.16, 20000)
    # define the target and predictor variables
    X = np.array(log_reg_sim_df['var'])
    y = log_reg_sim_df['binary_target']
    # add a constant column - statsmodels' Logit doesn't automatically include
    # an intercept like sklearn does
    X_with_intercept = sm.add_constant(X)
    log_reg_sm = sm.Logit(y, X_with_intercept)
    result = log_reg_sm.fit(disp = False)
    # calculate the marginal effect for every observation
    marginal_effects = result.get_margeff(at = 'all', method = 'dydx')
    # save the mean marginal effect in a list
    sm_marginal_effects.append(np.mean(marginal_effects.margeff))
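One note on the call above: at = 'all' returns a marginal effect for every observation, which is why we average them with np.mean. If I'm reading the statsmodels documentation correctly, passing at = 'overall' (the default) does that averaging for you, and summary() prints a tidy table for a single fit:
# equivalent to averaging the per-observation effects ourselves
marginal_effects = result.get_margeff(at = 'overall', method = 'dydx')
print(marginal_effects.summary())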
Let's take a look at the average mean marginal effect for the 1000 simulations:

Great, it lines up with our expectations! Let's take a quick look at the distribution:

Fantastic! This looks very similar to what we got when we did the calculations manually (it will be a little different because of the randomness in the simulation process). That's a good feeling!
Conclusion
Understanding what logistic regression coefficients mean can be a little tricky. We can estimate their mean marginal effect by plugging all of the x values from a reference dataset into the partial derivative of the logistic regression equation and taking the average of the resulting marginal effects. The get_margeff method on statsmodels' fitted Logit results calculates mean marginal effects without us having to take any partial derivatives ourselves. In practice, you should use statsmodels (or any other package or software that calculates it for you), but now you know exactly what the code is doing under the hood!
Hopefully you've developed a deeper understanding of logistic regression and how we can interpret the impact of individual predictors in logistic regression.
Link to github repo: https://github.com/jaromhulet/logistic_regression_interpretation