17.3 Count Regression Models

Count regression models apply to cases in which the dependent variable represents discrete, integer count data. Here are some examples of count outcomes:

  • What are the number of arrests for a person?
  • What determines the number of credit cards a person owns?

This section on count regression presents three models:

  1. Poisson Regression Model: The condition to use this model is the absence of overdispersion, i.e., the expected value of the dependent variable is equal to the variance.
  2. Quasi-Poisson Regression Model: Overdispersion occurs if the variance of the dependent variable is larger than its mean. In this case, the Poisson Regression Model leads to unreliable hypothesis tests regarding the coefficients. The Quasi-Poisson Model solves this issue.
  3. Negative Binomial Regression Model: The second possibility to deal with overdispersion is to use a Negative Binomial Regression Model.

The main package used is pscl. There is also an additional resource with more theoretical details on the topic: Regression Models for Count Data in R. A more up-to-date version of the document may be found with the pscl package documentation.

17.3.1 Poisson Regression Model

Recall that the Poisson distribution is written as: \[Pr(Y=k)=\frac{e^{-\lambda} \cdot \lambda^k}{k!}\] The important characteristics of the distribution is that the mean and variance are equal to \(\lambda\), i.e., \(E(Y)=\lambda\) and \(Var(Y)=\lambda\), this is also known as the equidispersion property. The mean parameter is written as \(\lambda=exp(\beta_0+\beta_1 \cdot x_1+ \dots + \beta_k \cdot x_k)\).

The first examples uses the 2017 NHTS data contained in hhpub. In a first step, the data is prepared for the regression model, i.e., missing or unknown values are eliminated and income is measured in (thousand) dollar terms:

hhpubdata = subset(hhpub,hhfaminc %in% c(1:11) & homeown %in% c(1,2) & urbrur %in% c(1,2) & hhvehcnt %in% c(0:12))
hhfaminc  = c(1:11)
income    = c(10,12.5,20,30,42.5,57.5,82.5,112.5,137.5,175,200)
income    = data.frame(hhfaminc,income)
hhpubdata = merge(hhpubdata,income)
hhpubdata$rural = hhpubdata$urbrur-1
hhpubdata$rent = hhpubdata$homeown-1

The outcome of interest is the number of vehicles based on household income, home ownership, and urban/rural household location. Before executing the Poisson regression model, calculate the mean and variance of the outcome variable.

## [1] 1.981142
## [1] 1.386027

The mean and the variance are similar and thus, estimating a Poisson Regression Model is an appropriate first step. The package AER also contains the function dispersiontest() which conducts a hypothesis test assuming no overdispersion. This test will be used after execution of the main regression. To estimate the model, the glm() function can be used by specifying family=poisson.

bhat_pois = glm(hhvehcnt~income+rent+rural,data=hhpubdata,family=poisson)
summary(bhat_pois)
## 
## Call:
## glm(formula = hhvehcnt ~ income + rent + rural, family = poisson, 
##     data = hhpubdata)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.654e-01  4.292e-03  108.43   <2e-16 ***
## income       2.986e-03  3.601e-05   82.93   <2e-16 ***
## rent        -3.733e-01  5.797e-03  -64.39   <2e-16 ***
## rural        2.224e-01  4.616e-03   48.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 86505  on 124400  degrees of freedom
## Residual deviance: 68533  on 124397  degrees of freedom
## AIC: 370161
## 
## Number of Fisher Scoring iterations: 5

All coefficients are statistically significant. The signs associated with the coefficients indicates the direction of influence on the outcome variable, i.e., the number of cars. As expected, higher income is associated with a higher number of cars and so is living in a rural environment. Renting is associated with a lower number of vehicles. Note that income and renting is possibly correlated. In general, the coefficient estimates are interpreted using \(\exp(\beta)\). That is, with every unit increase in \(X\), the predictor variable has a multiplicative effect of \(\exp(\beta)\) on the mean of \(Y\), i.e., \(\lambda\):

  • If \(\beta=0\), then \(\exp(\beta) = 1\), and the expected count \(\mu= E(y) = \exp(\alpha)\), and Y and X are not related.
  • If \(\beta>0\), then \(\exp(\beta) > 1\), and the expected count \(E(y)\) is \(\exp(\beta)\) times larger than when \(X = 0\)
  • If \(\beta<0\), then \(\exp(\beta) < 1\), and the expected count \(E(y)\) is \(\exp(\beta)\) times smaller than when \(X = 0\)

The function overdispersion tests the null hypothesis of equidispersion:

dispersiontest(bhat_pois)
## 
##  Overdispersion test
## 
## data:  bhat_pois
## z = -115.75, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.5670593

Given the p-value, the null hypothesis cannot be rejected. If the data suggests overdispersion, two alternative regression models can be used: (1) Quasi-Poisson and (2) Negative Binomial.

17.3.2 Quasi-Poisson Regression Model

The dataset blm used in this section as well as the one describing the negative binomial model is associated with the article Black Lives Matter: Evidence that Police-Caused Deaths Predict Protest Activity. Note that the paper includes a significant number of supplementary materials which allows for the replication of the results and much more.

The dependent variable is the total number of protests in a city. In a first step, the mean and variance of the variable \(totalprotests\) are calclated:

mean(blm$totprotests)
## [1] 0.4959529
var(blm$totprotests)
## [1] 6.35326

The variance is significantly higher than the mean which suggests overdispersion. In a first step, a regular Poisson model is estimated.

eq1 = "totprotests~log(pop)+log(popdensity)+percentblack+blackpovertyrate+I(blackpovertyrate^2)+percentbachelor+collegeenrollpc+demshare"
bhat1 = glm(eq1,data=blm,family=poisson)
summary(bhat1)
## 
## Call:
## glm(formula = eq1, family = poisson, data = blm)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.001e+01  6.327e-01 -31.625  < 2e-16 ***
## log(pop)               1.129e+00  4.007e-02  28.170  < 2e-16 ***
## log(popdensity)       -1.831e-01  8.654e-02  -2.116   0.0343 *  
## percentblack           1.697e-02  3.104e-03   5.467 4.59e-08 ***
## blackpovertyrate       1.461e-01  2.636e-02   5.541 3.02e-08 ***
## I(blackpovertyrate^2) -1.552e-03  3.985e-04  -3.895 9.82e-05 ***
## percentbachelor        3.893e-02  3.918e-03   9.935  < 2e-16 ***
## collegeenrollpc        9.305e-03  2.377e-03   3.914 9.06e-05 ***
## demshare               4.301e-02  5.293e-03   8.126 4.43e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3204.6  on 1225  degrees of freedom
## Residual deviance:  787.4  on 1217  degrees of freedom
##   (133 observations deleted due to missingness)
## AIC: 1242.9
## 
## Number of Fisher Scoring iterations: 6

Note that the signs and statistical significance correspond to Model 1 (Table 3) in the original paper. The coefficients are different because the paper only include negative binomial regression models. The reason for not using a Poisson regression model is the presence of overdispersion:

dispersiontest(bhat1)
## 
##  Overdispersion test
## 
## data:  bhat1
## z = 1.4052, p-value = 0.07998
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   2.212733

Note that the null hypothesis is rejected at the 10% but not the 5% significance level. The Quasi-Poisson Regression Model handles overdispersion by adjusting the stand-errors but leaving the coefficicent estimates the same.

bhat2 = glm(eq1,data=blm,family=quasipoisson)
summary(bhat2)
## 
## Call:
## glm(formula = eq1, family = quasipoisson, data = blm)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.001e+01  9.841e-01 -20.332  < 2e-16 ***
## log(pop)               1.129e+00  6.232e-02  18.111  < 2e-16 ***
## log(popdensity)       -1.831e-01  1.346e-01  -1.360 0.173942    
## percentblack           1.697e-02  4.828e-03   3.515 0.000457 ***
## blackpovertyrate       1.461e-01  4.100e-02   3.562 0.000382 ***
## I(blackpovertyrate^2) -1.552e-03  6.198e-04  -2.504 0.012403 *  
## percentbachelor        3.893e-02  6.094e-03   6.387 2.40e-10 ***
## collegeenrollpc        9.305e-03  3.697e-03   2.517 0.011975 *  
## demshare               4.301e-02  8.233e-03   5.225 2.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.419275)
## 
##     Null deviance: 3204.6  on 1225  degrees of freedom
## Residual deviance:  787.4  on 1217  degrees of freedom
##   (133 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Note that in this case, the population density is not statistically significant anymore.

17.3.3 Negative Binomial Regression Model

The Negative Binomial Regression Model can be used in the presence of count data and overdispersion. Below, the results from the article Black Lives Matter: Evidence that Police-Caused Deaths Predict Protest Activity are recreated using the negative binomial models presented in the paper. The required package is MASS

The first model is a basic model of resource mobilization and opportunity structure.

bhat3 = glm.nb(eq1,data=blm,link=log)
summary(bhat3) 
## 
## Call:
## glm.nb(formula = eq1, data = blm, link = log, init.theta = 1.559078735)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.090e+01  1.117e+00 -18.719  < 2e-16 ***
## log(pop)               1.292e+00  7.159e-02  18.047  < 2e-16 ***
## log(popdensity)       -3.130e-01  1.328e-01  -2.356  0.01848 *  
## percentblack           2.249e-02  4.709e-03   4.777 1.78e-06 ***
## blackpovertyrate       1.319e-01  3.121e-02   4.227 2.37e-05 ***
## I(blackpovertyrate^2) -1.340e-03  4.768e-04  -2.810  0.00496 ** 
## percentbachelor        4.462e-02  5.465e-03   8.163 3.26e-16 ***
## collegeenrollpc        1.051e-02  4.118e-03   2.553  0.01068 *  
## demshare               4.077e-02  7.292e-03   5.591 2.26e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.5591) family taken to be 1)
## 
##     Null deviance: 1899.84  on 1225  degrees of freedom
## Residual deviance:  501.15  on 1217  degrees of freedom
##   (133 observations deleted due to missingness)
## AIC: 1120.2
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.559 
##           Std. Err.:  0.351 
## 
##  2 x log-likelihood:  -1100.187

In a second model, black deaths are added:

bhat4 = glm.nb(totprotests~log(pop)+log(popdensity)+percentblack+blackpovertyrate+I(blackpovertyrate^2)+percentbachelor+collegeenrollpc+demshare+deathsblackpc,data=blm,link=log)
summary(bhat4) 
## 
## Call:
## glm.nb(formula = totprotests ~ log(pop) + log(popdensity) + percentblack + 
##     blackpovertyrate + I(blackpovertyrate^2) + percentbachelor + 
##     collegeenrollpc + demshare + deathsblackpc, data = blm, link = log, 
##     init.theta = 1.685551835)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.073e+01  1.101e+00 -18.824  < 2e-16 ***
## log(pop)               1.281e+00  7.046e-02  18.178  < 2e-16 ***
## log(popdensity)       -3.054e-01  1.315e-01  -2.323 0.020201 *  
## percentblack           1.801e-02  4.864e-03   3.704 0.000212 ***
## blackpovertyrate       1.283e-01  3.109e-02   4.127 3.67e-05 ***
## I(blackpovertyrate^2) -1.301e-03  4.743e-04  -2.744 0.006071 ** 
## percentbachelor        4.372e-02  5.381e-03   8.125 4.47e-16 ***
## collegeenrollpc        1.005e-02  4.073e-03   2.466 0.013657 *  
## demshare               4.069e-02  7.249e-03   5.613 1.98e-08 ***
## deathsblackpc          2.825e+00  9.312e-01   3.034 0.002414 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6856) family taken to be 1)
## 
##     Null deviance: 1943.21  on 1225  degrees of freedom
## Residual deviance:  500.18  on 1216  degrees of freedom
##   (133 observations deleted due to missingness)
## AIC: 1113.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.686 
##           Std. Err.:  0.404 
## 
##  2 x log-likelihood:  -1091.353

And in the third model, the authors use all police-caused deaths instead (victims of any race):

bhat5 = glm.nb(totprotests~log(pop)+log(popdensity)+percentblack+blackpovertyrate+I(blackpovertyrate^2)+percentbachelor+collegeenrollpc+demshare+deathspc,data=blm,link=log)
summary(bhat5) 
## 
## Call:
## glm.nb(formula = totprotests ~ log(pop) + log(popdensity) + percentblack + 
##     blackpovertyrate + I(blackpovertyrate^2) + percentbachelor + 
##     collegeenrollpc + demshare + deathspc, data = blm, link = log, 
##     init.theta = 1.621799986)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.080e+01  1.108e+00 -18.773  < 2e-16 ***
## log(pop)               1.277e+00  7.122e-02  17.928  < 2e-16 ***
## log(popdensity)       -3.121e-01  1.321e-01  -2.363  0.01812 *  
## percentblack           2.163e-02  4.705e-03   4.596 4.31e-06 ***
## blackpovertyrate       1.295e-01  3.110e-02   4.164 3.12e-05 ***
## I(blackpovertyrate^2) -1.315e-03  4.744e-04  -2.772  0.00558 ** 
## percentbachelor        4.507e-02  5.472e-03   8.237  < 2e-16 ***
## collegeenrollpc        1.040e-02  4.094e-03   2.540  0.01108 *  
## demshare               4.121e-02  7.257e-03   5.678 1.36e-08 ***
## deathspc               9.564e-01  6.330e-01   1.511  0.13085    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6218) family taken to be 1)
## 
##     Null deviance: 1921.82  on 1225  degrees of freedom
## Residual deviance:  502.82  on 1216  degrees of freedom
##   (133 observations deleted due to missingness)
## AIC: 1119.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.622 
##           Std. Err.:  0.374 
## 
##  2 x log-likelihood:  -1097.839