Logistic Regression: Deceptively Flawed
This is the second part of a previous post on the conceptual understanding of logistic regression.
Last time we visualized and explained fitting the log-loss in logistic regression, and showed that this process cannot fit perfectly separated data. In other words, unlike linear regression with an ordinary least squares fit, logistic regression actually works better if the data is a little bit noisy!
In practice, does this actually matter? It depends:
- It matters if our goal is to use the output for statistical inference: for example, accurately estimating model coefficients, calculating confidence intervals, and testing hypotheses using p-values.
- It does not matter much, or at all, if our goal is to use the output of a logistic model to build a predictive classifier.
Statistical inference: statsmodels
Sample data
For this part, we will use Python's statsmodels library. Keep in mind that statsmodels and scikit-learn (used later) parametrize the probability using _β_s instead of k and x₀:
p(x) = 1 / (1 + e^(−(β₀ + β₁x)))
where the relationship between k, x₀ and β₁, β₀ is:
β₁ = k, β₀ = −k·x₀
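As a quick sanity check (a minimal sketch; the helper names and the test grid are my own, not code from the post), the two parametrizations describe exactly the same curve:

```python
import numpy as np

def sigmoid_kx0(x, k, x0):
    # k / x0 parametrization: p(x) = 1 / (1 + exp(-k * (x - x0)))
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def sigmoid_beta(x, beta0, beta1):
    # beta parametrization used by statsmodels and scikit-learn
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

k, x0 = 3.0, 2.5
beta1, beta0 = k, -k * x0   # beta1 = 3, beta0 = -7.5

x = np.linspace(-5, 10, 100)
# both parametrizations give identical probabilities everywhere
assert np.allclose(sigmoid_kx0(x, k, x0), sigmoid_beta(x, beta0, beta1))
```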
We will continue with the datasets we generated in the first part on logistic regression, starting with the "imperfect" data sample_df, using statsmodels' formula API:
import statsmodels.formula.api as smf
model = smf.logit('y ~ x', sample_df).fit()
model.summary()
#> Optimization terminated successfully.
#> Current function value: 0.230366
#> Iterations 8

Our model parameters were k = 3 and x₀ = 2.5, which translate to β₁ = 3 and β₀ = −7.5. We can compare these with the fitted parameters by reading them out from the coef column of the bottom table:

We have very few data points and the seed was intentionally chosen to showcase outliers, so the fit is a little off, but still in the right ballpark. The value reported as "Log-Likelihood" (−6.911) is simply the negative of the total log-loss.
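To connect the optimizer output with the summary table (a hand-rolled sketch with made-up data and arbitrary coefficients, not the post's sample_df): the "Current function value" printed by the optimizer is the average per-point log-loss, and the reported log-likelihood is minus that value times the number of points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = (x + rng.normal(scale=0.5, size=30) > 0).astype(float)

beta0, beta1 = 0.1, 2.0                      # arbitrary example coefficients
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# per-point log-loss (negative log-likelihood contribution)
loglosses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
avg_logloss = loglosses.mean()               # "Current function value"
log_likelihood = -loglosses.sum()            # "Log-Likelihood" in the summary

assert np.isclose(log_likelihood, -avg_logloss * len(x))
```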
Perfectly separated data
What happens when we run our perfectly separated dataset in statsmodels? Again, it depends!
perfect_sep_model = smf.logit('y ~ x', perfect_sep_data).fit()
#> ...
#> PerfectSeparationError: Perfect separation detected, results not available
In this case we got an error and no results are output. In other cases of perfect separation, we may only get a warning – for example, with the same parameters but a different random seed:
perfect2_df = create_sample_data(k=3, x0=2.5, n_points=30, seed=154)
perfect_sep2_df, perfect_sep_logloss2_df = fit_data_to_logloss(
    perfect2_df,
    k=3,
    x0=2.5
)
perfect_sep_model_2 = smf.logit('y ~ x', perfect_sep2_df).fit()
perfect_sep_model_2.summary()
#> Warning: Maximum number of iterations has been exceeded.
#> Current function value: 0.000000
#> Iterations: 35
#> /.../statsmodels/base/model.py:604: ConvergenceWarning:
#> Maximum Likelihood optimization failed to converge. Check mle_retvals
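Rather than relying on reading the warning text, you can ask the results object directly whether the optimizer converged, via the mle_retvals attribute the warning itself points at (a sketch with made-up, non-separated data; the dataset and names here are assumptions, not the post's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# noisy (not perfectly separated) toy data -- assumed, not the post's dataset
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-3 * x))).astype(int)
df = pd.DataFrame({'x': x, 'y': y})

result = smf.logit('y ~ x', df).fit(disp=0)  # disp=0 silences the printout

# check convergence before trusting coefficients and p-values
assert result.mle_retvals['converged']
```

With the perfectly separated dataset above, the same check would come back False, which is a far harder signal to ignore than a beep in the console.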

Since the second model did not converge either, one could argue that it should also have raised an error, not an innocent warning. R's logistic regression function, glm(..., family=binomial), likewise only warns. To quote R Inferno, Circle 5, Consistency:
There is a problem with warnings. No one reads them. People have to read error messages because no food pellet falls into the tray after they push the button. With a warning the machine merely beeps at them but they still get their food pellet. Never mind that it might be poison.
Therefore, be careful when doing inference using effect sizes and p-values! In this case the data is perfectly separable – we have a perfect predictor – yet the reported p-value (P>|z|) is 0.998. Ignoring or misunderstanding these warnings may cause you to miss obvious features in the data.
Alternatively, consider using a different model. There is a brave new world outside of logistic regression!