10.1 Chi-Square Test (\(\chi^2\)-Test)

The \(\chi^2\)-test is used to conduct a hypothesis test on qualitative variables. Before introducing the procedure, a presentation of the \(\chi^2\)-distribution is necessary.

Given values \(Z_1, Z_2, \dots , Z_k\) drawn independently from a standard normal distribution, the sum of their squares follows a \(\chi^2\)-distribution: \[Q = \sum_{i=1}^k Z_i^2 \sim \chi^2_k\] where \(k\) specifies the degrees of freedom.
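The definition can be checked informally by simulation. The following is a minimal sketch, not part of the original analysis: the seed, the choice \(k = 5\), and the number of draws are arbitrary assumptions made for illustration. It relies on the known facts that a \(\chi^2_k\)-distribution has mean \(k\) and variance \(2k\).

```r
set.seed(42)          # arbitrary seed, only for reproducibility
k <- 5                # degrees of freedom (illustrative choice)
n_draws <- 100000     # number of simulated sums (illustrative choice)

# Each row holds k independent standard normal draws
z <- matrix(rnorm(n_draws * k), ncol = k)

# Sum of squares per row: one simulated chi-square value per row
q <- rowSums(z^2)

mean(q)                                  # should be close to k
var(q)                                   # should be close to 2 * k
quantile(q, 0.95)                        # empirical 95th percentile
qchisq(0.95, df = k)                     # theoretical 95th percentile
```

The empirical mean, variance, and 95th percentile should lie close to their theoretical counterparts, with small differences due to simulation noise.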

To conduct the hypothesis test, consider the data on voting and education in gss. Using the function CrossTable associated with the package gmodels, the following table can be constructed.

library(gmodels)
CrossTable(gss$degree, gss$vote12, prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE, prop.t = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  2806 
## 
##  
##                | gss$vote12 
##     gss$degree | did not vote |   ineligible |        voted |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##       bachelor |           84 |           17 |          425 |          526 | 
## ---------------|--------------|--------------|--------------|--------------|
##       graduate |           32 |           17 |          266 |          315 | 
## ---------------|--------------|--------------|--------------|--------------|
##    high school |          447 |          122 |          870 |         1439 | 
## ---------------|--------------|--------------|--------------|--------------|
## junior college |           67 |            7 |          137 |          211 | 
## ---------------|--------------|--------------|--------------|--------------|
## lt high school |          168 |           37 |          110 |          315 | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |          798 |          200 |         1808 |         2806 | 
## ---------------|--------------|--------------|--------------|--------------|
## 
## 

This table is similar to what has been presented previously, except that all proportions have been removed and only the counts are shown. This is a 5-by-3 contingency table (the total row and total column are not considered part of the table). The variable education is coded as less than high school (0), high school (1), junior college (2), bachelor (3), and graduate (4).

Assuming independence, the expected value \(E\) for each cell is \[E=\frac{(\text{total of row})\cdot(\text{total of column})}{\text{total count}}\] For example, consider the cell “high school” and “voted”, which contains 870 observations. The expected value under the null hypothesis that voting behavior is independent of education is \[E_{\text{high school},\,\text{voted}}=\frac{1439 \cdot 1808}{2806} = 927.196\] The same calculation can be carried out for each cell.

Then the \(\chi^2\)-test statistic is calculated as follows: \[\chi^2 = \sum_{i=1}^{r \cdot c} \frac{(O_i-E_i)^2}{E_i}\] where \(r \cdot c\) is the number of rows multiplied by the number of columns, and \(O_i\) and \(E_i\) are the observed and expected counts in cell \(i\), respectively. The degrees of freedom are calculated as \((r-1)\cdot(c-1)\), which is \((5-1)\cdot(3-1)=8\) for this table. Of course, instead of doing it manually, the following command can be used:

chisq.test(gss$degree,gss$vote12)
## 
##  Pearson's Chi-squared test
## 
## data:  gss$degree and gss$vote12
## X-squared = 256.2, df = 8, p-value < 2.2e-16
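The manual calculation described above can be sketched directly from the counts in the contingency table. The matrix below is typed in by hand from the printed table (it does not use the gss data frame), and the object names `observed`, `expected`, and `chi_sq` are illustrative choices.

```r
# Observed counts, copied from the CrossTable output (rows: degree, columns: vote12)
observed <- matrix(
  c( 84,  17, 425,
     32,  17, 266,
    447, 122, 870,
     67,   7, 137,
    168,  37, 110),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("bachelor", "graduate", "high school",
               "junior college", "lt high school"),
    vote12 = c("did not vote", "ineligible", "voted")
  )
)

n <- sum(observed)

# Expected counts under independence: (row total * column total) / total count
expected <- outer(rowSums(observed), colSums(observed)) / n
dimnames(expected) <- dimnames(observed)

# Pearson chi-square statistic and degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

round(expected["high school", "voted"], 3)   # expected count for one cell
round(chi_sq, 1)                             # compare with X-squared above
df                                           # compare with df above
```

The statistic and degrees of freedom obtained this way should agree with the values reported by chisq.test.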

Because the p-value is very small, we reject the null hypothesis that voting behavior is independent of education. The \(\chi^2\)-test should only be used for qualitative data. For example, do not categorize income into quartiles in order to test whether voting depends on income.
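The decision can also be expressed through a critical value. The sketch below takes the statistic and degrees of freedom from the output above and assumes a significance level of 5 percent, which is an illustrative choice and not stated in the text.

```r
chi_sq <- 256.2   # test statistic reported by chisq.test
df     <- 8       # degrees of freedom reported by chisq.test
alpha  <- 0.05    # assumed significance level (illustrative)

# Critical value: reject independence if the statistic exceeds it
critical_value <- qchisq(1 - alpha, df = df)

# Upper-tail probability of the chi-square distribution at the statistic
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

critical_value            # roughly 15.5
chi_sq > critical_value   # TRUE means the null hypothesis is rejected
p_value < 2.2e-16         # consistent with the reported p-value
```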

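A 2-by-2 table is a special case. When both variables are binary, testing for independence amounts to asking whether a proportion differs between two groups, so a two-sample t-test on the 0/1-coded variable can be compared with the \(\chi^2\)-test. The example below uses voting and gun ownership. Both tests reject independence at the 5 percent level, with similar but not identical p-values (0.0099 and 0.0134).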

votegun   = subset(gss,
                   vote12 %in% c("voted","did not vote") &
                        owngun %in% c("yes","no"),
                   select = c("vote12","owngun"))
votegun$vote12 = ifelse(votegun$vote12=="voted",1,0)
votegun$owngun = ifelse(votegun$owngun=="yes",1,0)
t.test(votegun$vote12~votegun$owngun)
## 
##  Welch Two Sample t-test
## 
## data:  votegun$vote12 by votegun$owngun
## t = -2.583, df = 1187.1, p-value = 0.009913
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.10556727 -0.01442532
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6758193       0.7358156
chisq.test(votegun$vote12,votegun$owngun)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  votegun$vote12 and votegun$owngun
## X-squared = 6.1159, df = 1, p-value = 0.0134
CrossTable(votegun$vote12,votegun$owngun,prop.r = FALSE,prop.c = FALSE,prop.chisq = FALSE,prop.t=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  1693 
## 
##  
##                | votegun$owngun 
## votegun$vote12 |         0 |         1 | Row Total | 
## ---------------|-----------|-----------|-----------|
##              0 |       366 |       149 |       515 | 
## ---------------|-----------|-----------|-----------|
##              1 |       763 |       415 |      1178 | 
## ---------------|-----------|-----------|-----------|
##   Column Total |      1129 |       564 |      1693 | 
## ---------------|-----------|-----------|-----------|
## 
##
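The link between the 2-by-2 \(\chi^2\)-test and a comparison of two proportions can be made explicit with prop.test from base R. The sketch below types in the counts from the table above by hand. The object names are illustrative, and the continuity correction is switched off in the comparison because the equivalence is exact only in that case.

```r
# Counts copied from the CrossTable output (rows: vote12, columns: owngun)
counts <- matrix(c(366, 149,
                   763, 415),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(vote12 = c("0", "1"),
                                 owngun = c("0", "1")))

# Chi-square test with Yates' continuity correction (the default for 2-by-2)
with_yates <- chisq.test(counts)

# Chi-square test without the correction
no_yates <- chisq.test(counts, correct = FALSE)

# Two-sample test of equal proportions: share of voters among
# non-owners (owngun = 0) and owners (owngun = 1)
prop <- prop.test(x = counts["1", ], n = colSums(counts), correct = FALSE)

with_yates$statistic   # compare with X-squared = 6.1159 above
no_yates$statistic     # uncorrected statistic
prop$statistic         # same value as the uncorrected statistic
prop$estimate          # compare with the group means from t.test
```

Without the continuity correction, chisq.test and prop.test should report the same statistic, and the estimated proportions should match the group means shown in the t-test output.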