Prediction in Various Logistic Regression Models (Part 2)

Author:Murphy | View: 26042 | Time: 2025-03-23 18:46:48

Introduction

We have covered logistic regression models for both binary and ordinal data and demonstrated how to implement them in R. Prediction analysis using R libraries was also discussed in earlier articles. We have seen and quantified the impact of single as well as multiple predictors on the response variable, using binary and ordinal response variables to show how to deal with different types of data. In this article, we will go through four more prediction analyses for regression models, namely the Generalized Ordinal Regression model, the Partial Proportional Odds model, the Multinomial Logistic model and the Poisson Regression model.

Dataset

Our case study will use the same Adult Data Set from the UCI Machine Learning Repository. The dataset contains demographic data on more than 30,000 individuals, including each individual's race, education, job, gender, salary, number of jobs held, hours worked per week, and income earned. As a refresher, the variables under consideration are shown below.

Education: numeric and continuous. The health status of an individual can be greatly affected by education.
Marital status: binary (0 for unmarried and 1 for married). The impact of this variable will most likely be minimal; however, it has been included in the analysis.
Gender: binary (0 for female and 1 for male). There is also the possibility that it has a lesser impact, but it will be interesting to find out.
Family income: binary (0 for average or less than average and 1 for more than average). Health conditions may be affected by this.
Health status: ordinal (1 for poor, 2 for average, 3 for good and 4 for excellent)

Prediction in Generalized Ordinal Regression Model

Consider the case where we have collected data on hundreds of individuals. Among the data included is information regarding the individual's education, age, marital status, health status, gender, family income, and full-time employment status. Education, gender, marital status, and family income are to be included as predictor variables in the regression model for health status. Except for education, the predictor variables are all binary, which means they have either a 0 or a 1 value. Education is a continuous variable that indicates the number of years an individual has been educated. The following variables are considered for this regression analysis.

Education years
Marital status
Gender
Family income
Health status

Each predictor variable has a single coefficient if we fit an ordinal logistic regression under the proportional odds assumption. Suppose family income has a coefficient of 'x'. This means that for every unit increase in family income (in this case from 0 to 1), the logit, or log odds, of being in a higher category of health status increases by 'x'. As a result, we can make the following statements about this model.

The log odds of having average or better health (versus poor health) increase by 'x' if family income increases to above average status.
The log odds of having good or better health (versus average or worse health) increase by 'x' if family income increases to above average status.
The log odds of having excellent health (versus good or worse health) increase by 'x' if family income increases to above average status.
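
A minimal numeric sketch of these three statements in base R may help. The cutpoint intercepts `alpha` and the shared coefficient `x_coef` are hypothetical values chosen only for illustration; they are not estimates from the article's data.

```r
# Hypothetical intercepts for the three cutpoints:
# health > poor, health > average, health > good
alpha  <- c(2.0, 0.5, -1.5)
# Hypothetical shared coefficient 'x' for family income
x_coef <- 0.6

# Log odds of being above each cutpoint, for income = 0 and income = 1
log_odds <- cbind(income0 = alpha, income1 = alpha + x_coef)

# Under proportional odds, the shift is identical at every cutpoint
shift <- log_odds[, "income1"] - log_odds[, "income0"]
print(shift)

# Corresponding cumulative probabilities of being above each cutpoint
cum_prob <- plogis(log_odds)
print(round(cum_prob, 3))
```

The log odds move by the same amount at every cutpoint, while the cumulative probabilities change by different amounts because the logistic curve is nonlinear.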

A proportional odds model is characterized by the same change in log odds across all levels of the outcome. Real-world data frequently violate this assumption, in which case we cannot proceed with the proportional odds model. As discussed earlier, two possible solutions to this non-proportional odds issue are a generalized ordinal model or a partial proportional odds model.

Generalized ordinal regression model -> the effects of all predictors can vary across the levels of the outcome
Partial proportional odds model -> the effects of only some predictors are allowed to vary across the levels of the outcome
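
The difference between the three models can be written compactly. The notation below is an assumption introduced here for illustration, not taken from the earlier articles: j = 1, 2, 3 indexes the cutpoints between the four health categories, x holds the predictors, and in the partial model x_1 holds the predictors that satisfy the proportional odds assumption while x_2 holds those that violate it. Sign conventions differ between software packages.

```latex
\begin{align*}
\text{Proportional odds:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x})
    = \alpha_j + \mathbf{x}^\top \boldsymbol{\beta} \\
\text{Generalized ordinal:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x})
    = \alpha_j + \mathbf{x}^\top \boldsymbol{\beta}_j \\
\text{Partial proportional odds:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x})
    = \alpha_j + \mathbf{x}_1^\top \boldsymbol{\beta}
      + \mathbf{x}_2^\top \boldsymbol{\gamma}_j
\end{align*}
```

Only the subscript j on the coefficient vector changes: it is absent under proportional odds, present for every predictor in the generalized model, and present only for the violating predictors in the partial model.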

We have already implemented both the generalized approach and the partial proportional odds (PPO) approach in earlier articles.

Generalized Ordinal Regression Model in R

Partial Proportional Odd Model in R

Now we will implement the prediction procedure using these models.

Here, we can see the cumulative predicted probabilities of having different health statuses for the provided educ values. We know that our health status has four unique values.

If the individual has 15 years of education,

The cumulative probability of having average health and above is 96%
The cumulative probability of having good health and above is 77%
The cumulative probability of having excellent health is 24%

If the individual has only 5 years of education,

The cumulative probability of having average health and above is 81%
The cumulative probability of having good health and above is 41%
The cumulative probability of having excellent health is 8%

Therefore, it is evident that the number of education years plays a significant role in determining the health status of an individual. If we want to obtain only the predicted probabilities, we can execute the following command.

ggpredict(model1, terms = "educ[5,10,15]", ci = NA)

If the individual has 15 years of education,

The probability of having poor health is 4%
The probability of having average health is 20%
The probability of having good health is 52%
The probability of having excellent health is 24%

If the individual has only 5 years of education,

The probability of having poor health is 19%
The probability of having average health is 40%
The probability of having good health is 33%
The probability of having excellent health is 8%

Clearly, more years of education increase the probability of having better health. All of these values are adjusted for the mean values of marital status, gender and full-time working status.
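
The two sets of results are linked by simple arithmetic: each category probability is the difference between two consecutive cumulative probabilities. The base R sketch below applies this to the cumulative values reported earlier. The helper name `to_category` is hypothetical. For 5 years of education the differences reproduce the reported category probabilities exactly. For 15 years they give 19% and 53% instead of the reported 20% and 52%, presumably because the reported cumulative values are themselves rounded.

```r
# Cumulative probabilities P(health >= level) reported for each education value;
# P(health >= poor) is 1 by definition
cum_15 <- c(poor = 1, average = 0.96, good = 0.77, excellent = 0.24)
cum_5  <- c(poor = 1, average = 0.81, good = 0.41, excellent = 0.08)

# Category probability = P(Y >= j) - P(Y >= j + 1),
# with the probability of exceeding the top category set to 0
to_category <- function(cum) {
  p <- cum - c(cum[-1], 0)
  round(p, 2)
}

print(to_category(cum_15))
print(to_category(cum_5))
```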

Prediction in Partial Proportional Odds Model

In a partial proportional odds model, we can select the predictors whose effects are allowed to vary across the levels of the outcome. We can first determine which predictors violate the proportional odds (PO) assumption and then place those variables after the parallel = FALSE ~ argument. Here, we have specified marital status and family income as the violating predictors.

If the individual has 15 years of education,

The probability of having poor health is 4%
The probability of having average health is 20%
The probability of having good health is 52%
The probability of having excellent health is 24%

If the individual has only 5 years of education,

The probability of having poor health is 17%
The probability of having average health is 41%
The probability of having good health is 35%
The probability of having excellent health is 7%

The cumulative probabilities can also be calculated using the method described before.
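
As a cross-check, cumulative probabilities can also be recovered by hand from the category probabilities listed above, by summing each level with all higher levels. This is a small base R sketch; the helper name `to_cumulative` is hypothetical.

```r
# Category probabilities reported above for the partial proportional odds model
p_15 <- c(poor = 0.04, average = 0.20, good = 0.52, excellent = 0.24)
p_5  <- c(poor = 0.17, average = 0.41, good = 0.35, excellent = 0.07)

# P(health >= level): add up the probability of that level and all higher ones
to_cumulative <- function(p) rev(cumsum(rev(p)))

print(round(to_cumulative(p_15), 2))
print(round(to_cumulative(p_5), 2))
```

For 5 years of education this gives 83% for average health and above, 42% for good health and above, and 7% for excellent health.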

Prediction in Multinomial Regression Model

We have covered Multinomial logistic regression analysis in the following article.

Multinomial Logistic Regression in R

Multinomial regression is a statistical method for estimating the likelihood of an individual falling into a specific category relative to a baseline category, using a logit or log odds approach. Essentially, it works as an extension of binomial logistic regression to a nominal response variable with more than two outcomes. As part of multinomial regression, we are required to define a reference category, and the model estimates a separate set of binomial logit coefficients for each remaining category relative to that reference category.

In the following code, we have defined the first level of health status as the reference level, so each of the other levels is compared against this reference level in its own binomial logit model.

Our prediction approach yielded the following result.

If the individual has 15 years of education,

The probability of having poor health is 4%
The probability of having average health is 19%
The probability of having good health is 52%
The probability of having excellent health is 25%

Again, these predicted probabilities are calculated holding the other predictors at their means. In multinomial logistic regression, the response variable should be nominal. However, the response here is converted to ordinal in order to use the ggpredict() command.
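
Because the model is parameterized relative to the reference category, the predicted probabilities above imply a log odds value for each category versus poor health, and those log odds map back to the same probabilities. A small base R sketch of this round trip, using only the probabilities reported above (the variable names are hypothetical):

```r
# Predicted probabilities reported above for 15 years of education
p <- c(poor = 0.04, average = 0.19, good = 0.52, excellent = 0.25)

# Log odds of each category versus the reference category (poor health);
# the reference category itself is 0 by construction
eta <- log(p / p[["poor"]])
print(round(eta, 2))

# Inverse mapping: exponentiate and normalize to get the probabilities back
p_back <- exp(eta) / sum(exp(eta))
print(round(p_back, 2))
```

In a fitted model, each non-reference entry of `eta` would be the linear predictor of one binomial logit comparison against the reference level.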

Prediction in Poisson Regression Model

There are times when we need to deal with data that involves counting. In order to model a count response variable, such as the number of museum visits, we need Poisson regression. The number of visits to the hospital or the number of math courses taken by a particular group of students can also serve as examples. We have covered Poisson regression in the following article.

Poisson Regression in R

We are going to use the same dataset and predict the number of science museum visits from education years, gender, marital status, full-time working status and family income. The code block is shown below.

Using the same ggpredict() command, we obtain the following result for different education years as well as for different genders.

The predicted number of science museum visits is 0.44 if the individual is female (gender = 0) and has 15 years of education
The predicted number of science museum visits is 0.62 if the individual is male (gender = 1) and has 15 years of education
This implies that, at the same level of education, the model predicts fewer science museum visits for females than for males. The conclusion is adjusted for the mean values of marital status, full-time working status and family income.
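
Since Poisson regression uses a log link, the ratio of the two predicted counts corresponds to the exponentiated gender coefficient when the other predictors are held at the same values. The base R sketch below first computes that ratio from the reported predictions, then checks the identity on simulated data. The simulated data frame, its column names and the coefficients used to generate it are assumptions made for illustration and do not reproduce the article's dataset or model.

```r
# Predicted visit counts reported above at 15 years of education
pred <- c(female = 0.44, male = 0.62)

# Male-to-female ratio of predicted counts and its log
rate_ratio <- pred[["male"]] / pred[["female"]]
print(round(rate_ratio, 2))
print(round(log(rate_ratio), 2))

# Same identity checked on simulated data (all values below are hypothetical)
set.seed(1)
n <- 500
dat <- data.frame(educ   = sample(8:20, n, replace = TRUE),
                  gender = rbinom(n, 1, 0.5))
dat$visits <- rpois(n, exp(-2 + 0.08 * dat$educ + 0.3 * dat$gender))
fit <- glm(visits ~ educ + gender, family = poisson, data = dat)

# Predict for a female and a male, both with 15 years of education
newdat   <- data.frame(educ = 15, gender = c(0, 1))
sim_pred <- predict(fit, newdata = newdat, type = "response")

# The ratio of predictions equals exp(gender coefficient)
print(unname(sim_pred[2] / sim_pred[1]))
print(unname(exp(coef(fit)["gender"])))
```

The reported predictions imply that males are predicted to make roughly 1.4 times as many visits as females, subject to the rounding of the two reported values.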

Conclusion

In this article, we have covered prediction analysis for four different types of regression models. The partial proportional odds (PPO) model can be considered a special case of the generalized ordinal regression model, since the PPO model allows only some predictors to vary their effects across the levels of the outcome. The multinomial regression model is useful for nominal response variables, which have unordered categories. Lastly, the Poisson regression model is suited to predicting count variables. We have demonstrated the use of the ggpredict() function in all four regression models, along with the interpretation of the results.