Poisson Regression in R

Introduction
Regression is a vast world, and the right type of regression analysis depends on the type of data. We have covered logistic regression in detail in the previous articles. In this article, we will go through Poisson regression and implement it with an example in R.
Brief Background
Linear regression is used for numeric data and logistic regression for categorical data. For a dichotomous variable, we can perform simple binary logistic regression as well as multiple logistic regression. Depending on the requirement, we can also select a partial proportional odds model or a generalized regression model. Often, however, the response we need to model is a count. For example, the number of visits to a museum can be collected from a survey, and to model that count response variable we need Poisson regression. Other examples are the number of visits to a hospital or the number of math courses students have taken during a specific period.
Poisson distribution
The Poisson distribution of a count response variable is expressed as:

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$
Here, x is the count variable and λ is the average number of events. In a Poisson distribution, the average number of events is equal to the variance of that variable. Therefore, λ = mean(x) = variance(x).
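The formula and the mean-variance property can be checked numerically. The following is a minimal sketch in base R using simulated counts; the rate of 3, the sample size, and the seed are arbitrary choices for illustration and are not values from the case study.

```r
# Poisson probability of observing x events when the average rate is lambda
lambda <- 3   # assumed rate, for illustration only
x <- 2

# Manual evaluation of lambda^x * exp(-lambda) / x!
p_manual <- lambda^x * exp(-lambda) / factorial(x)

# Built-in Poisson probability mass function; should agree with p_manual
p_builtin <- dpois(x, lambda = lambda)

# Simulate counts and compare the sample mean with the sample variance
set.seed(1)
counts <- rpois(100000, lambda = lambda)
sample_mean <- mean(counts)
sample_var  <- var(counts)

print(c(manual = p_manual, builtin = p_builtin))
print(c(mean = sample_mean, variance = sample_var))
```

Both the sample mean and the sample variance should land close to the chosen rate. In real count data the variance is often larger than the mean, which is worth checking before relying on a Poisson model.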
Dataset
As the data source for this case study, we use the Adult Data Set, which is available in the UCI Machine Learning Repository. The dataset describes approximately 30,000 people by their demographic characteristics, such as race, education, occupation, gender, salary, hours worked per week, and monthly earnings.

We will use the following variables to model the "vissci" variable, which represents the number of visits to a science or technology museum during the last year:
- Education: numeric and continuous
- Marital status: binary (0 for unmarried and 1 for married)
- Gender: binary (0 for female and 1 for male)
- Family income: binary (0 for average or less than average and 1 for more than average)
- Full-time work: binary (0 for part-time and 1 for full-time work)
Implementation in R
The implementation procedure is very similar to fitting any other generalized linear model. Here, we use the glm() command with the poisson family. We define two models:
model1: A single-predictor model. Here we want to model vissci using a single predictor of education years.
model2: A multiple-predictor model. Here we want to model vissci using all the predictor variables.
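The two models can be sketched as follows. The column names `vissci`, `educ`, and `faminc` follow the text. The names `marital`, `gender`, and `fulltime`, the data frame name `museum`, and the simulated values are assumptions made only so that the example runs on its own. The estimates it produces will not match the ones reported in this article.

```r
# Simulated stand-in for the survey data (NOT the article's dataset)
set.seed(42)
n <- 1000
museum <- data.frame(
  educ     = sample(8:20, n, replace = TRUE),  # years of education
  marital  = rbinom(n, 1, 0.5),                # 1 = married
  gender   = rbinom(n, 1, 0.5),                # 1 = male
  faminc   = rbinom(n, 1, 0.4),                # 1 = above-average income
  fulltime = rbinom(n, 1, 0.6)                 # 1 = full-time work
)

# Generate counts from a log-linear model with arbitrary coefficients
museum$vissci <- rpois(n, lambda = exp(-2 + 0.1 * museum$educ + 0.3 * museum$gender))

# model1: single predictor (education years)
model1 <- glm(vissci ~ educ, family = poisson(link = "log"), data = museum)

# model2: all predictors
model2 <- glm(vissci ~ educ + marital + gender + faminc + fulltime,
              family = poisson(link = "log"), data = museum)

print(summary(model1))
print(summary(model2))
```

The log link is the default for the poisson family, so `family = poisson` alone would give the same fit; it is spelled out here only to make the model explicit.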
Interpretation of results
The summary from model 1 is shown below.

The summary first reports deviance residual statistics, which play a role similar to the residuals of a linear regression model: they measure how far the observations deviate from the fitted model. The coefficients of the regression model are shown next. Since educ is the only predictor here, there is only one predictor coefficient. The coefficient estimate is 0.13486, which means that for a one-unit increase in education, the log of the expected number of visits to a science museum increases by 0.13486.
A related quantity is the Incidence Rate Ratio (IRR), which measures the multiplicative change in the incidence rate for a one-unit increase in an independent variable. The IRR is obtained by exponentiating the coefficient.
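The link between a coefficient and its IRR follows from the log link of the model. As a short worked derivation, write μ(x) for the expected count at predictor value x, and β0 and β1 for the intercept and the slope (symbols introduced here for illustration):

```latex
\begin{aligned}
\log \mu(x) &= \beta_0 + \beta_1 x \\
\log \mu(x+1) - \log \mu(x) &= \beta_1 \\
\mathrm{IRR} = \frac{\mu(x+1)}{\mu(x)} &= e^{\beta_1}
\end{aligned}
```

A one-unit increase in the predictor therefore adds β1 on the log scale and multiplies the expected count by e^{β1}.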

Here, the IRR value for educ is 1.14437, which means that for a one-unit increase in the educ variable, the expected number of visits to a science museum increases by 14.437%. The associated p-value is below 0.05, which indicates that educ is a statistically significant predictor of the number of science museum visits.
A similar analysis can be done with multiple independent variables. Next, we include the remaining variables to model the expected number of visits to a science museum and determine whether they have a significant impact. The summary of model 2 is shown below.

At first glance, we can see that marital status and full-time working status are not significant predictors of the number of visits to a science museum. For gender, we coded female as 0 and male as 1. The coefficient is 0.33612, which means that for a one-unit increase in gender (i.e. female to male), the log of the expected number of visits to a science museum increases by 0.33612. For family income, the estimate is 0.57499.

Observing the IRR values, we can say that gender plays a big role in determining our dependent variable. For a one-unit increase in the gender variable (i.e. female to male), the expected number of visits to a science museum increases by 39.951%. In other words, males are expected to visit a science museum more often than females. The effect of family income is even larger. For a one-unit increase in the faminc variable (i.e. from average or below to above average), the expected number of visits to a science museum increases by 77.711%. This means that families with above-average income visit science museums more often than families with average or lower income. As mentioned before, marital status and full-time working status are not significant here.
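The percentage changes quoted here can be reproduced from the coefficient estimates. This sketch uses only the three estimates stated in the text; the vector name `coefs` and its labels are illustrative. With a fitted model object, the same values would normally come from `exp(coef(model))`.

```r
# Coefficient estimates as reported in the text (log scale)
coefs <- c(educ = 0.13486, gender = 0.33612, faminc = 0.57499)

# Incidence rate ratios: exponentiate the coefficients
irr <- exp(coefs)

# Percentage change in the expected count per one-unit increase
pct_change <- (irr - 1) * 100

print(round(irr, 4))
print(round(pct_change, 2))
```

Because the printed coefficients are rounded, the last digit may differ slightly from the IRR values shown in the model output.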
Key findings
Education years, gender, and family income status are significant predictors of the expected number of visits to a science museum.
- For a one-unit increase in education years, the expected number of visits to a science museum increases by 14.437%.
- For a one-unit increase in the gender variable (i.e. female to male), the expected number of visits to a science museum increases by 39.951%. Males visit science museums more often than females.
- When family income moves from average or below to above average, the expected number of visits to a science museum increases by 77.711%.
- Marital status and full-time working status are not significant predictors of the expected number of visits to a science museum.
Conclusion
We have covered the fundamental idea behind the Poisson distribution and implemented a Poisson regression model in R. We took a dataset from the UCI repository and showed the impact of several predictors on our dependent variable. This type of study is important for understanding subtle discrimination in our society. Poisson regression is also widely used in engineering studies.
Acknowledgement for Dataset