14.2 Multicollinearity
Multicollinearity describes the situation in which two or more independent variables are linearly related. Under perfect multicollinearity: \[\lambda_1 x_1 + \lambda_2 x_2 + \dots +\lambda_k x_k = 0\] where the \(\lambda_i\) are constants that are not all zero simultaneously. For example, consider \(x_1=\{8,12,15,17\}\), \(x_2=\{24,36,45,51\}\), and \(x_3=\{2,3,3.75,4.25\}\). In this case, \(\lambda_1=1\), \(\lambda_2=-1/2\), and \(\lambda_3=2\). Note that multicollinearity refers to linear relationships only: including a squared or cubed term of a regressor is not an issue of multicollinearity. It can be shown that the variance of the coefficient estimator increases in the presence of multicollinearity.
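The linear dependence in this example can be checked directly in R. The following minimal sketch uses the vectors from above and shows that, under perfect multicollinearity, lm() cannot identify all coefficients (the variable names and the outcome y are illustrative):
x1 = c(8, 12, 15, 17)
x2 = c(24, 36, 45, 51)
x3 = c(2, 3, 3.75, 4.25)
1*x1-(1/2)*x2+2*x3 # returns 0 0 0 0, confirming the linear dependence
# lm() drops the redundant regressors and reports NA for their coefficients
set.seed(1)
y = rnorm(4)
summary(lm(y~x1+x2+x3))
There are various indications that the data suffer from multicollinearity: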
- High \(R^2\) but few significant variables
- Failure to reject the individual hypotheses H\(_0\): \(\beta_i=0\) based on t-values, but rejection of the hypothesis that all slopes are simultaneously zero based on the F-test (see the simulated example after this list).
- High correlation among explanatory variables
- The set of statistically significant variables changes across model specifications.
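The first two symptoms can be reproduced with simulated data. In the sketch below (all names and parameter values are illustrative), x2 is nearly a copy of x1; the fitted model tends to have a high \(R^2\) and a significant F-statistic while the individual t-values are insignificant:
set.seed(123)
n = 50
x1 = rnorm(n)
x2 = x1+rnorm(n,sd=0.05) # x2 is almost identical to x1
y = 1+2*x1+3*x2+rnorm(n)
summary(lm(y~x1+x2)) # high R^2 and F-statistic, but inflated standard errors on x1 and x2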
14.2.1 Variance Inflation Factors (VIF)
The variance inflation factor identifies possible linear relationships among multiple independent variables, not just between two variables as in the case of a simple correlation coefficient. Consider the model: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_K x_{iK} + \epsilon_i\] If the variable \(x_k\) is uncorrelated with the remaining independent variables, the variance of the coefficient \(\beta_k\) is written as \[Var(\beta_k)^* = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2}\] Without any multicollinearity, this variance is at its minimum. If some independent variables are correlated with the variable \(x_k\), then \[Var(\beta_k) = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2} \cdot \frac{1}{1-R^2_k}\] where \(R^2_k\) is the \(R^2\) from the regression of \(x_k\) on the remaining independent variables. The VIF can be written as \[\frac{Var(\beta_k)}{Var(\beta_k)^*}=\frac{1}{1-R^2_k}\] If \(VIF=1\), then there is no linear relationship between the variable \(x_k\) and the remaining independent variables. Otherwise, \(VIF>1\). In general, the interpretation is as follows (a manual VIF calculation is sketched after the list):
- A VIF of 4 warrants attention.
- A VIF of 10 indicates a serious problem.
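The formula can be verified by hand: regress \(x_k\) on the remaining independent variables and plug the resulting \(R^2_k\) into \(1/(1-R^2_k)\). A minimal sketch, continuing the simulated example from above:
r2 = summary(lm(x2~x1))$r.squared # R^2 from regressing x2 on the remaining regressor(s)
1/(1-r2) # manual VIF for x2
car::vif(lm(y~x1+x2)) # matches the manual calculation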
14.2.2 Examples
To illustrate the concept of multicollinearity, the data set nfl is used (Berri et al. 2011). The first model includes the logarithm of total salary as the dependent variable and the following independent variables: prior season passing yards, pass attempts, experience in the league (and its square), draft round, veteran status (more than three years in the league), a change of team, Pro Bowl appearance, and facial symmetry.
# Model 1: all candidate regressors included
bhat = lm(log(total)~yards+att+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
After estimating the model, the function vif() from the package car is used:
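library(car)
vif(bhat)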
## yards att exp exp2 draft1 draft2 veteran changeteam pbowlever symm
## 32.547700 30.920282 39.889877 26.715342 1.621048 1.228091 5.253525 1.194254 1.581753 1.056661
The results indicate multicollinearity for yards, att, and the experience terms. Passing yards and pass attempts are likely highly correlated and thus, one of them (att) is dropped.
# Model 2: pass attempts (att) dropped
bhat = lm(log(total)~yards+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
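The VIF values are computed again for the reduced model:
vif(bhat)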
## yards exp exp2 draft1 draft2 veteran changeteam pbowlever symm
## 1.460849 39.339639 26.162804 1.616171 1.227479 5.253502 1.141435 1.569621 1.052906
This improves the estimation, but experience and its squared term are still problematic. The last estimation removes both experience terms, and the VIF values are now in the acceptable range.
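The final model, reconstructed here from the variables reported in the output below, drops exp and exp2:
# Model 3: experience terms dropped
bhat = lm(log(total)~yards+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
vif(bhat)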
## yards draft1 draft2 veteran changeteam pbowlever symm
## 1.406241 1.653634 1.229459 1.976506 1.101988 1.406095 1.010855
The important point is that the conclusion of the paper with regard to facial symmetry does not change across these specifications.