A Deeper Dive into Odds Ratios Using Logistic Regression


PART 2 OF THE DEEP DIVE INTO ODDS RATIOS SERIES


When we build a statistical model, we often focus primarily on its predictive power. However, we can also leverage it to uncover the story behind the data.


Logistic Regression is one of the simplest yet most effective models for binary classification. Beyond prediction, we can obtain the odds ratios for each variable in the fitted logistic regression model, which is invaluable for our understanding of the data.

In this article, as a continuation of the first article in the deep dive into odds ratios series, we will explore how to extract odds ratios from logistic regression. We will start by deriving the relationship between the model and odds ratios. Then, we will examine use cases where the logistic regression approach offers several advantages over the basic method of calculating odds ratios, including: calculating for categorical and numerical variables, handling multiple variables, and addressing situations where variables have interaction effects.

Let's get started!

A Quick Recap

Before exploring logistic regression, let's recap our previous discussion on odds ratios and their application to conversion analysis. If you haven't read the article, I strongly suggest doing so first, as we will continue the discussion from that article and maintain the same use case.

A Deep Dive into Odds Ratio

To summarize, odds are defined as the ratio of two probabilities: the probability of success and the probability of not succeeding. Mathematically, we can express odds as p / (1-p), where p represents the probability of success. The odds ratio itself tells us how the odds of an event occurring in the exposed group compare to the odds in the unexposed group.

The interpretation of the odds ratio can be done by comparing its value to 1. If the value is greater than 1, the event is more likely to occur in the exposed group; conversely, if the value is less than 1, it suggests the event is less likely to occur in the exposed group. We can also calculate the confidence interval for the odds ratio to determine whether the value is statistically significant.
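To make the recap concrete, here is a minimal sketch of that basic calculation in Python, using made-up counts from a 2x2 exposure/outcome table (the numbers are purely illustrative):

import numpy as np

# Hypothetical 2x2 counts (illustrative only)
a, b = 180, 70    # exposed group: converted, not converted
c, d = 120, 275   # unexposed group: converted, not converted

odds_exposed = a / b              # odds of converting when exposed
odds_unexposed = c / d            # odds of converting when unexposed
odds_ratio = odds_exposed / odds_unexposed

# 95% confidence interval via the standard error of the log odds ratio
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
ci_lower = np.exp(np.log(odds_ratio) - 1.96 * se_log_or)
ci_upper = np.exp(np.log(odds_ratio) + 1.96 * se_log_or)
print(odds_ratio, ci_lower, ci_upper)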

User Conversion Use Case

To refresh our memory, here is the case as presented in the previous article:

Let's say you are a data analyst or a product manager for a company focused on SaaS products. One of the products offers both free and paid subscriptions. Since one of the success metrics is the number of users who convert to paid membership within 7 days after signing up, you should be interested in identifying the factors that potentially drive users to convert during this time window.

One of the factors we want to examine is whether users utilize the premium feature. Using odds ratio calculations, we found that the odds ratio is 5.904, with a confidence interval ranging from 4.397 to 7.942. This result can be interpreted as follows: the odds of converting to a paid subscription are 5.904 times higher for users who utilize the premium feature.

What is Logistic Regression?

Logistic Regression is a widely known model used to examine the relationship between independent/explanatory variables and a dependent/response variable [1]. Unlike Linear Regression, which has a continuous response variable, the outcome of logistic regression is binary or dichotomous.

As we know, linear regression can be expressed as y = a + bx, where x represents the independent variable and y represents the outcome. However, since the outcome of logistic regression is binary (0 or 1), we need to estimate an equation that produces values between 0 and 1, which represent the probability of the outcome occurring.

This is where the logit function comes into play; it is the natural logarithm of the odds (p / (1 – p)) that an event occurs. The equation below describes the logit function as a linear function of the explanatory variable x.

logit(p) = ln(p / (1 - p)) = β0 + β1x

Eq 1. Logit function

With a little bit of algebra, we can derive the logistic regression function that we commonly use to describe the probability of an event (p) occurring with the predictor variable x.

p = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Eq 2. Logistic Regression equation
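To see the equation in action, here is a small sketch of the logistic function in Python; the coefficient values are arbitrary, chosen only for illustration:

import numpy as np

def logistic(x, b0, b1):
    # p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), written in its equivalent form
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

# Arbitrary coefficients for illustration
print(logistic(0, b0=-0.33, b1=1.78))  # ~0.42: probability when x = 0
print(logistic(1, b0=-0.33, b1=1.78))  # ~0.81: probability when x = 1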

From Logit to Odds Ratio

Now that we understand that the logistic regression function for p is derived from the logit function, and that the logit itself is the natural logarithm of odds, we can move forward. By exponentiating the logit function, we obtain the odds of an event occurring.

odds = p / (1 - p) = e^(β0 + β1x)

Eq 3. Odds as the exponential of the logit

So, if we substitute x = 1, we will get the odds of an event occurring for the exposed group:

odds(x = 1) = e^(β0 + β1)

Eq 4. Odds for the exposed group

Similarly, if we substitute x = 0, we find the odds of an event occurring for the unexposed group:

odds(x = 0) = e^β0

Eq 5. Odds for the unexposed group

(I hope you know where this goes)

Hence, if we divide the two equations, we derive the odds ratio:

odds ratio = e^(β0 + β1) / e^β0 = e^β1

Eq 6. Odds ratio is the exponential of β1

This relationship shows that the odds ratio is the exponential of the predictor's coefficient β1. This is a very useful connection that we will explore further with Python in the next section.

Logistic Regression in Python

Enough with the math! Now let's implement what we know in Python and see if we can validate the results.

statsmodels vs scikit-learn

Like any statistical model, logistic regression requires us to estimate the parameters of the function. In our simple function, the parameters that need to be estimated are β0 and β1.

In Python, there are at least two libraries that are commonly used to fit logistic regression models: scikit-learn and statsmodels. While scikit-learn is typically the go-to library for building predictive models due to its efficiency, in our case we need more detailed statistical information. Therefore, we will use statsmodels instead.

Unlike scikit-learn, statsmodels offers more comprehensive information about the results of model fitting. For logistic regression, we can use the logit() function from the library. After fitting the model using the fit() function, we can print the results using the summary() method from the output object.

Install and import packages

Before we get into the implementation, let's ensure we have the necessary package installed using the following command (documentation):

pip install statsmodels

In statsmodels, there are two methods to use the logit() function:

  • Standard Method: Use statsmodels.api.Logit (documentation). In this method, the function requires us to input the dependent variable (endog) and the independent variable (exog).
  • R-Style Formula Method: Use statsmodels.formula.api.logit (documentation). In this method, you provide a formula as a string, specifying the model while inputting the data. The dependent and independent variables are defined within the formula.

Many articles, including my own, favor using the second method. One reason for this preference is that the intercept is included by default in the second method, whereas in the first method, we need to add the intercept manually.
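To illustrate the difference, here is a sketch of both methods side by side on a tiny made-up dataframe (the column names here are mine, not from our use case):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Tiny illustrative dataframe
df = pd.DataFrame({'exposed': [0, 0, 1, 1, 0, 1, 1, 0],
                   'outcome': [0, 1, 1, 1, 0, 1, 0, 0]})

# Standard method: the intercept must be added manually with add_constant
model_a = sm.Logit(df['outcome'], sm.add_constant(df['exposed'])).fit()

# R-style formula method: the intercept is included by default
model_b = smf.logit('outcome ~ exposed', data=df).fit()

# Both fits estimate the same coefficients
print(model_a.params)
print(model_b.params)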

Additionally, especially for our purpose of diving deeper into odds ratios, using a formula provides flexibility in defining the relationship between the independent and dependent variables. I will elaborate on this in the upcoming sections.

So, since we will use the formula method, we can import the package as well as the holy trinity of data libraries:

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Prepare the dataset

Let's get back to our conversion case. We will use the same dataset generator function as before, which you can find at the bottom of the previous article.

Now let's generate the data and see what we have:

data = generate_dataset(n=1000)
print(data[['user_id', 'usage_of_premium_feature', 'converted_to_paid_in_7_days']])
user_id  usage_of_premium_feature  converted_to_paid_in_7_days
0          1                     False                        False
1          2                      True                         True
2          3                      True                         True
3          4                      True                         True
4          5                     False                        False
..       ...                       ...                          ...
995      996                     False                        False
996      997                      True                        False
997      998                     False                        False
998      999                      True                         True
999     1000                     False                         True

I will not re-introduce each of the fields, but note that usage_of_premium_feature and converted_to_paid_in_7_days are boolean.

Why does the type matter? The logit() function only accepts numeric values for the outcome (dependent) variable, so we need to convert the boolean outcome into 1s and 0s.

Here is the data preparation before we fit the model:

# Copy two fields that will be used from original dataset as df_input 
df_input = data[['usage_of_premium_feature', 'converted_to_paid_in_7_days']].copy()
# Map the output value to be numeric as 1 and 0
df_input['converted_to_paid_in_7_days'] = df_input['converted_to_paid_in_7_days'].map({True: 1, False:0})
print(df_input.head())
usage_of_premium_feature  converted_to_paid_in_7_days
0                    False                            0
1                     True                            1
2                     True                            1
3                     True                            1
4                    False                            0

We started by copying the required fields from the dataset into a new dataframe called df_input, then mapped the boolean outcome values (True to 1, False to 0) as required by the model.

There is a simpler way to perform this mapping using .astype(int), but I want to keep the step general in case the outcome values are not boolean.
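For reference, that shortcut is a one-liner and gives the same result for boolean columns:

# Equivalent shortcut when the outcome is already boolean
df_input['converted_to_paid_in_7_days'] = df_input['converted_to_paid_in_7_days'].astype(int)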

Fit and extract model's summary

We have prepared the data; now let's move on to model fitting. As mentioned earlier, there are two arguments for the function: the formula and the data we just created as df_input.

For the formula, we construct it in the form outcome ~ predictor. While it can be defined in more complex ways, which we will explore later, in this section we will use the simplest form. To model the prediction of converted_to_paid_in_7_days using usage_of_premium_feature as the predictor, we can write the formula as shown in the code below.

Using df_input and the formula as inputs for logit(), we then fit the model using the fit() function and store the result in a variable named model.

formula = 'converted_to_paid_in_7_days ~ usage_of_premium_feature'
model = smf.logit(formula=formula, data=df_input).fit()

Now that we have the fitted model, let's evaluate the estimated parameters using summary().

print(model.summary())
Logit Regression Results                               
=======================================================================================
Dep. Variable:     converted_to_paid_in_7_days   No. Observations:                 1000
Model:                                   Logit   Df Residuals:                      998
Method:                                    MLE   Df Model:                            1
Date:                         Wed, 02 Oct 2024   Pseudo R-squ.:                  0.1252
Time:                                 23:47:46   Log-Likelihood:                -584.25
converged:                                True   LL-Null:                       -667.85
Covariance Type:                     nonrobust   LLR p-value:                 3.043e-38
====================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
Intercept                           -0.3331      0.090     -3.684      0.000      -0.510      -0.156
usage_of_premium_feature[T.True]     1.7756      0.146     12.198      0.000       1.490       2.061
====================================================================================================

Whoa, what is this?

The summary above consists of two sections: the model fitting details and the estimated parameter results. Since our focus here is to obtain the odds ratios, we will concentrate only on the second part.

We can see that there are two rows. Referring back to the general function of logistic regression, the "Intercept" is β0, which does not have any predictors attached.

The remaining row contains the coefficient for each predictor; in our case, whether the user used the premium feature or not (β1). The square brackets show the treatment level the coefficient refers to, which here is the exposed group: usage_of_premium_feature = True.

Getting the odds ratio and its confidence interval

Although the summary of the model above is comprehensive, we still need to perform additional calculations to obtain the value of the odds ratio. As we derived earlier, we need to take the variable's coefficient and exponentiate it.

To extract only the coefficients, we can use .params:

params = model.params
print(params)
Intercept                          -0.333065
usage_of_premium_feature[T.True]    1.775640
dtype: float64

and to get the confidence intervals, we can use .conf_int():

conf_int = model.conf_int()
print(conf_int)
                                         0         1
Intercept                        -0.510275 -0.155855
usage_of_premium_feature[T.True]  1.490341  2.060940

As we can see, we are able to isolate the coefficients and their confidence intervals. The next step is to exponentiate these values to obtain the odds ratios and their confidence intervals:

odds_ratios = np.exp(params)
lower_confints = np.exp(conf_int[0])
upper_confints = np.exp(conf_int[1])

results = {
    'index' : ['Intercept', 'Use Premium Feature'],
    'odds_ratios' : odds_ratios,
    'lower_confints' : lower_confints,
    'upper_confints' : upper_confints 
}

df_result = pd.DataFrame(results).reset_index(drop=True)
print(df_result)
                 index  odds_ratios  lower_confints  upper_confints
0            Intercept     0.716724        0.600330        0.855683
1  Use Premium Feature     5.904060        4.438609        7.853346

Voilà!

Finally, we obtain the odds ratio, and the point estimate matches the initial method exactly: 5.904, with a confidence interval ranging from 4.438 to 7.853, nearly identical to the interval from the basic method. The derivation we performed earlier has been validated through our calculations.

Once again, the result indicates that the odds of converting to paid membership within 7 days after signing up are 5.904 times higher for users who utilized the premium feature compared to users who did not.
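Since we will repeat this exponentiation step for every model in the rest of the article, you may want to wrap it in a small helper. Here is a sketch; summarize_odds_ratios is my own name, not a statsmodels function:

def summarize_odds_ratios(model, labels=None):
    # Exponentiate coefficients and confidence bounds of a fitted logit model
    conf_int = model.conf_int()
    result = pd.DataFrame({
        'odds_ratios': np.exp(model.params),
        'lower_confints': np.exp(conf_int[0]),
        'upper_confints': np.exp(conf_int[1]),
    })
    if labels is not None:
        result.index = labels
    return result

print(summarize_odds_ratios(model, labels=['Intercept', 'Use Premium Feature']))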

Exploring More on Odds Ratio

It's great that we were able to find the relationship between odds ratios and the logistic regression model. However, you might be wondering what the added value is compared to the simple method discussed in the first article.

Logistic regression, especially using statsmodels, provides several advantages over the basic method of calculating odds ratios:

  1. Handling a single categorical variable
  2. Handling a single continuous variable
  3. Handling multiple variables
  4. Handling multiple variables with interaction

Let's dive deeper!

Odds ratio of a single categorical variable

We discussed this topic in the previous article to understand the relationship between the marketing channel and user conversion. However, the basic method required us to perform manual calculations one by one for each category using the same reference category.

The first advantage of using statsmodels is that we can obtain the odds ratio for every category automatically, all at once.

First, let's prepare the data:

# Copy two fields that will be used from original dataset as df_input 
df_input = data[['marketing_source', 'converted_to_paid_in_7_days']].copy()

# Map the output value to be numeric as 1 and 0
df_input['converted_to_paid_in_7_days'] = df_input['converted_to_paid_in_7_days'].map({True: 1, False:0})
print(df_input)

Then, we will define the formula for this model. Since we treat the variable as categorical, we need to establish a reference value. Remember, last time we used organic search (the "Organic" value) as the reference, meaning we compare the odds of conversion from the other channels to those of organic search.

We can define this in the formula string as follows: C(marketing_source, Treatment(reference="Organic")). We wrap the variable in C() to indicate that it is categorical, and specify "Organic" as the reference via reference="Organic". With this formula, we can fit the model and examine the summary (I will only show the second part of the summary).

formula = 'converted_to_paid_in_7_days ~ C(marketing_source, Treatment(reference="Organic"))'

model = smf.logit(formula=formula, data=df_input).fit()
print(model.summary().tables[1])
=========================================================================================================================================
                                                                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                                 0.5836      0.121      4.814      0.000       0.346       0.821
C(marketing_source, Treatment(reference="Organic"))[T.Email Campaign]    -0.3472      0.186     -1.867      0.062      -0.712       0.017
C(marketing_source, Treatment(reference="Organic"))[T.Paid Ads]          -0.7298      0.171     -4.258      0.000      -1.066      -0.394
C(marketing_source, Treatment(reference="Organic"))[T.Referral]           0.8082      0.206      3.923      0.000       0.404       1.212
=========================================================================================================================================

As promised, we can obtain all the other marketing source options along with their coefficients and confidence intervals. To understand what each row represents, we can refer to the square brackets: "Email Campaign," "Paid Ads," and "Referral."

Similarly to before, we can isolate and calculate the odds ratio for these marketing channels:

params = model.params
conf_int = model.conf_int()

odds_ratios = np.exp(params)
lower_confints = np.exp(conf_int[0])
upper_confints = np.exp(conf_int[1])

results = {
    'index' : ['Intercept', 'Email Campaign', 'Paid Ads', 'Referral'],
    'odds_ratios' : odds_ratios,
    'lower_confints' : lower_confints,
    'upper_confints' : upper_confints 
}

df_result = pd.DataFrame(results).reset_index(drop=True)
print(df_result)
            index  odds_ratios  lower_confints  upper_confints
0       Intercept     1.792453        1.413368        2.273214
1  Email Campaign     0.706667        0.490826        1.017424
2        Paid Ads     0.481991        0.344471        0.674411
3        Referral     2.243977        1.498490        3.360337

We are able to obtain the odds ratio for each marketing channel. If you'd like, you can continue to create visualizations as shown in the previous article, since all the values are already provided. However, for now, let's continue to explore the other advantages of using logistic regression.

Visualize the likelihood of marketing channels. Image by Author, taken from previous article.
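If you'd like to recreate that kind of chart here, a minimal sketch of a forest-style plot with matplotlib could look like the following, using the df_result we just built (the styling is my own, not taken from the original chart):

# Horizontal error bars: distance from the odds ratio to each confidence bound
errors = [df_result['odds_ratios'] - df_result['lower_confints'],
          df_result['upper_confints'] - df_result['odds_ratios']]

plt.errorbar(df_result['odds_ratios'], df_result['index'],
             xerr=errors, fmt='o', capsize=4)
plt.axvline(x=1, linestyle='--', color='grey')  # reference line at odds ratio = 1
plt.xlabel('Odds ratio (95% CI)')
plt.tight_layout()
plt.show()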

Odds ratio of a single continuous variable

The basic method can only calculate the odds ratio when we have a binary predictor: using vs. not using the premium feature, email campaign vs. organic search, or, in general, exposed vs. unexposed.

Another advantage of using logistic regression is that we can also analyze the relationship between a continuous predictor variable and our binary outcome. For example, in our case, we will consider another potential factor: time spent in hours. This variable describes how many hours per day the user spends using the product.

Let's get the data:

# Copy two fields that will be used from original dataset as df_input 
df_input = data[['time_spent_in_hours', 'converted_to_paid_in_7_days']].copy()

# Map the output value to be numeric as 1 and 0
df_input['converted_to_paid_in_7_days'] = df_input['converted_to_paid_in_7_days'].map({True: 1, False:0})
print(df_input)
     time_spent_in_hours  converted_to_paid_in_7_days
0                   1.12                            0
1                   6.04                            1
2                   5.16                            1
3                   4.50                            1
4                   2.91                            0
..                   ...                          ...
995                 1.44                            0
996                 5.69                            0
997                 4.09                            0
998                 3.84                            1
999                 0.12                            1

[1000 rows x 2 columns]

Before we continue, since the variable is continuous, we should examine the statistics of this variable. Using the describe() method, we find that the mean is 3.51 hours, with a minimum value of 0 hours and a maximum value of 9.03 hours. This is quite reasonable if users are using the product for work.
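For reference, the check itself is a one-liner:

print(df_input['time_spent_in_hours'].describe())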

Now, let's move on to the formula and model fitting. As usual, we will review the summary.

formula = 'converted_to_paid_in_7_days ~ time_spent_in_hours'

model = smf.logit(formula=formula, data=df_input).fit()
print(model.summary().tables[1])
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -1.3380      0.155     -8.635      0.000      -1.642      -1.034
time_spent_in_hours     0.5459      0.045     12.211      0.000       0.458       0.633
=======================================================================================

The result is quite straightforward, as we see one variable in the summary: time_spent_in_hours. Next, let's calculate the odds ratio.

params = model.params
conf_int = model.conf_int()

odds_ratios = np.exp(params)
lower_confints = np.exp(conf_int[0])
upper_confints = np.exp(conf_int[1])

results = {
    'index' : ['Intercept', 'Time Spent in Hours'],
    'odds_ratios' : odds_ratios,
    'lower_confints' : lower_confints,
    'upper_confints' : upper_confints 
}

df_result = pd.DataFrame(results).reset_index(drop=True)
print(df_result)
                 index  odds_ratios  lower_confints  upper_confints
0            Intercept     0.262373        0.193656        0.355475
1  Time Spent in Hours     1.726079        1.581286        1.884130

Nice! We have obtained the odds ratio. But wait a minute – what does this value tell us?

Since the variable is continuous, the interpretation is slightly different [2]. Here, we find that the odds ratio for time spent is 1.726. This can be interpreted as follows: for every one-unit (one-hour) increase in time spent, the odds of converting to a paid subscription increase by roughly 73%.
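Keep in mind that this odds ratio applies per one-unit increase. For a k-hour increase, the odds multiply by e^(k × β1), not by k times the odds ratio. A quick sketch using the coefficient we just extracted:

beta1 = params['time_spent_in_hours']  # 0.5459 from the summary above
print(np.exp(1 * beta1))  # ~1.73: odds multiplier for one extra hour
print(np.exp(2 * beta1))  # ~2.98: odds multiplier for two extra hours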

Odds ratio for multiple variables

So far, we've examined various cases where we want to see the relationship between a single variable (whether it's binary, categorical, or continuous) and the outcome.

In reality, this may oversimplify the situation: when we calculate each variable's effect one by one, we do not account for other variables that may also influence the outcome.

For example, in our case, rather than including only the usage of the premium feature in the equation, we should also consider the time spent in general. This can be estimated using logistic regression. Generally, the equation for the logit function will look like the equation below, where we have different coefficients for each variable.

logit(p) = β0 + β1x1 + β2x2 + … + βnxn

Eq 7. Logit function for multiple variables

Let's see how this can be done in Python. First, prepare the data with the addition of the new variable:

# Copy two fields that will be used from original dataset as df_input 
df_input = data[['usage_of_premium_feature', 'time_spent_in_hours', 'converted_to_paid_in_7_days']].copy()

# Map the output value to be numeric as 1 and 0
df_input['converted_to_paid_in_7_days'] = df_input['converted_to_paid_in_7_days'].map({True: 1, False:0})

Then, we construct the formula, adding two variables: usage_of_premium_feature and time_spent_in_hours. This means that we want to predict conversion within 7 days using both premium usage and time spent as predictors.

formula = 'converted_to_paid_in_7_days ~ usage_of_premium_feature + time_spent_in_hours'

model = smf.logit(formula=formula, data=df_input).fit()
print(model.summary().tables[1])
====================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
Intercept                           -1.1144      0.165     -6.744      0.000      -1.438      -0.791
usage_of_premium_feature[T.True]     0.7643      0.220      3.478      0.001       0.334       1.195
time_spent_in_hours                  0.3750      0.065      5.752      0.000       0.247       0.503
====================================================================================================

From this step alone, we can see a difference in the results compared to single-variable calculations. However, let's continue to calculate the odds ratios:

params = model.params
conf_int = model.conf_int()

odds_ratios = np.exp(params)
lower_confints = np.exp(conf_int[0])
upper_confints = np.exp(conf_int[1])

results = {
    'index' : ['Intercept', 'Use Premium Feature', 'Time Spent in Hours'],
    'odds_ratios' : odds_ratios,
    'lower_confints' : lower_confints,
    'upper_confints' : upper_confints 
}

df_result = pd.DataFrame(results).reset_index(drop=True)
print(df_result)
                 index  odds_ratios  lower_confints  upper_confints
0            Intercept     0.328097        0.237326        0.453586
1  Use Premium Feature     2.147432        1.395895        3.303588
2  Time Spent in Hours     1.455012        1.280466        1.653350

Here we go! The new odds ratio for premium feature usage is 2.147 (initial value: 5.904), while for time spent it is 1.455 (initial value: 1.726). The effect of premium feature usage decreases when we include the time spent factor, but what exactly does this mean, and why are the values different?

Using the basic method of calculating the odds ratio, we do not take into account other factors (often called confounding factors). As a result, the odds ratio is overestimated because the model only considers the usage of the premium feature as the influencing factor.

In this method, we add time spent in hours as a factor alongside premium feature usage. The odds ratio for premium feature usage is now calculated while also controlling for the time spent in hours variable. The result of this computation is referred to as the adjusted odds ratio.

This is why the odds ratio for premium feature usage decreases: we are also accounting for the time spent in hours factor. The same calculation applies when determining the adjusted odds ratio for time spent in hours.

This approach is generally a more accurate way to represent outcomes influenced by multiple factors in real-world scenarios.

Odds ratio for variables with an interaction effect

The model from the previous part illustrates the relationship of multiple factors to the outcome, but it does not account for interaction terms between the factors. A model with an interaction term of two variables describes how the effect of one variable depends on another variable [2].

In our case, we can hypothesize that there is an interaction between the usage of the premium feature and the number of hours spent on the product. Therefore, in addition to the main effects, we will also include the effect of the interaction. In general, this can be described as:

logit(p) = β0 + β1x1 + β2x2 + β3x1x2

Eq 8. Logit function for two variables with an interaction term

Although we have two variables, we now obtain four coefficients (including the intercept), with the last coefficient representing the interaction between the first and second variables.

Since the first variable (premium feature usage) is binary (1 or 0), the interaction term captures the additional effect of time spent that applies only when the user uses the premium feature.

In statsmodels, we can construct a formula using usage_of_premium_feature * time_spent_in_hours. This formula accounts for both the main effects of premium feature usage and time spent, as well as the interaction between them. If we want to examine the effect of the interaction term only, we can write the formula as usage_of_premium_feature : time_spent_in_hours.
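In other words, the * operator is shorthand for spelling out both main effects plus the interaction; the two formulas below fit identical models:

# Equivalent formulas under the R-style formula rules used by statsmodels
formula_star = 'converted_to_paid_in_7_days ~ usage_of_premium_feature * time_spent_in_hours'
formula_long = ('converted_to_paid_in_7_days ~ usage_of_premium_feature'
                ' + time_spent_in_hours'
                ' + usage_of_premium_feature:time_spent_in_hours')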

Let's see how it works when we fit the model using both the main effects and the interaction term (the data preparation is exactly the same as in the previous example).

formula = 'converted_to_paid_in_7_days ~ usage_of_premium_feature * time_spent_in_hours'

model = smf.logit(formula=formula, data=df_input).fit()
print(model.summary().tables[1])
========================================================================================================================
                                                           coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------
Intercept                                               -0.9199      0.220     -4.176      0.000      -1.352      -0.488
usage_of_premium_feature[T.True]                         0.2120      0.478      0.443      0.658      -0.725       1.149
time_spent_in_hours                                      0.2825      0.096      2.953      0.003       0.095       0.470
usage_of_premium_feature[T.True]:time_spent_in_hours     0.1710      0.131      1.302      0.193      -0.086       0.428
========================================================================================================================

From the summary, as indicated in the equation above, we have four coefficients, with the last one describing the interaction effect. Let's take a look at the odds ratio.

params = model.params
conf_int = model.conf_int()

odds_ratios = np.exp(params)
lower_confints = np.exp(conf_int[0])
upper_confints = np.exp(conf_int[1])

results = {
    'index' : ['Intercept', 'Use Premium Feature', 'Time Spent in Hours', 'Interaction Term'],
    'odds_ratios' : odds_ratios,
    'lower_confints' : lower_confints,
    'upper_confints' : upper_confints 
}

df_result = pd.DataFrame(results).reset_index(drop=True)
print(df_result)
                 index  odds_ratios  lower_confints  upper_confints
0            Intercept     0.398560        0.258803        0.613789
1  Use Premium Feature     1.236104        0.484229        3.155432
2  Time Spent in Hours     1.326381        1.099633        1.599886
3     Interaction Term     1.186466        0.917190        1.534799

The results show that the main effect of premium feature usage is lower than in the previous model, at 1.236, and its confidence interval now includes 1, so it is no longer statistically significant on its own. The effect of time spent, on the other hand, is statistically significant: for every one-unit increase, the odds of conversion increase by about 32%.

From the interaction term, we know that for premium feature users, every one-unit increase in time spent raises the odds by an additional 18%, on top of the main effect of 32%. For non-premium users, the interaction term contributes nothing. Note, however, that the interaction term's confidence interval includes 1, so this additional effect is not statistically significant in this sample.
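To quantify the per-hour effect for premium feature users directly, we can sum the time-spent and interaction coefficients before exponentiating. A sketch using the params extracted above:

# Per-hour odds multiplier for premium feature users: e^(beta2 + beta3)
beta_time = params['time_spent_in_hours']
beta_interaction = params['usage_of_premium_feature[T.True]:time_spent_in_hours']
print(np.exp(beta_time + beta_interaction))  # ~1.57, vs ~1.33 for non-premium users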

Conclusion

In the first article of the deep dive into the odds ratio series, we discussed step-by-step how to perform basic calculations and visualize odds ratios.

In this article, we expanded our understanding of odds ratios with the help of logistic regression models. We began by exploring the relationship between logistic regression and odds ratios, as well as how we can leverage this relationship to broaden the use of odds ratios, including handling categorical and numerical variables, accommodating multiple variables, and incorporating interaction between two variables.

As a data analyst, I find this relationship essential for revealing the story behind the data, rather than just making predictions from the model. This insight is invaluable when supporting decision-making processes within an organization.


I hope you found this series helpful, and I look forward to your feedback in the comments section!

Follow me on Medium for more content on similar topics, and let's connect on LinkedIn. I would be very happy to discuss this topic further for your analytical use cases!

Also, don't forget to leave claps (lots of them!) to support me as I continue writing about data and analytics topics!
