14.2 Multicollinearity
Multicollinearity describes the situation in which two or more independent variables are linearly related. Under perfect multicollinearity: \[\lambda_1 x_1 + \lambda_2 x_2 + \dots +\lambda_k x_k = 0\] where the \(\lambda_i\) are constants that are not all zero simultaneously. For example, consider \(x_1=\{8,12,15,17\}\), \(x_2=\{24,36,45,51\}\), and \(x_3=\{2,3,3.75,4.25\}\). In this case, \(\lambda_1=1\), \(\lambda_2=-1/2\), and \(\lambda_3=2\). Note that multicollinearity refers to linear relationships only: including a squared or cubed term of a regressor is not an issue of multicollinearity. It can be shown that the variance of the coefficient estimator increases in the presence of multicollinearity.
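The linear dependence in this example can be checked directly in R. The following minimal sketch uses the vectors from above and shows that, under perfect multicollinearity, lm() cannot identify all coefficients (the variable names and the outcome y are illustrative):
x1 = c(8, 12, 15, 17)
x2 = c(24, 36, 45, 51)
x3 = c(2, 3, 3.75, 4.25)
1*x1-(1/2)*x2+2*x3 # returns 0 0 0 0, confirming the linear dependence
# lm() drops the redundant regressors and reports NA for their coefficients
set.seed(1)
y = rnorm(4)
summary(lm(y~x1+x2+x3))
There are various indications that the data suffer from multicollinearity: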
- High \(R^2\) but few significant variables
- Failure to reject the individual hypotheses H\(_0\): \(\beta_i=0\) based on t-values, but rejection of the hypothesis that all slopes are simultaneously zero based on the F-test (see the simulated example after this list).
- High correlation among explanatory variables
- The set of statistically significant variables changes across model specifications.
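The first two symptoms can be reproduced with simulated data. In the sketch below (all names and parameter values are illustrative), x2 is nearly a copy of x1; the fitted model tends to have a high \(R^2\) and a significant F-statistic while the individual t-values are insignificant:
set.seed(123)
n = 50
x1 = rnorm(n)
x2 = x1+rnorm(n,sd=0.05) # x2 is almost identical to x1
y = 1+2*x1+3*x2+rnorm(n)
summary(lm(y~x1+x2)) # high R^2 and F-statistic, but inflated standard errors on x1 and x2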
14.2.1 Variance Inflation Factors (VIF)
The variance inflation factor identifies possible linear relationships among multiple independent variables, not just between two variables as in the case of a simple correlation coefficient. Consider the model: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_K x_{iK} + \epsilon_i\] If the variable \(x_k\) is uncorrelated with the remaining independent variables, the variance of the coefficient \(\beta_k\) is written as \[Var(\beta_k)^* = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2}\] Without any multicollinearity, this variance is at its minimum. If some independent variables are correlated with the variable \(x_k\), then \[Var(\beta_k) = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2} \cdot \frac{1}{1-R^2_k}\] where \(R^2_k\) is the \(R^2\) from the regression of \(x_k\) on the remaining independent variables. The VIF can be written as \[\frac{Var(\beta_k)}{Var(\beta_k)^*}=\frac{1}{1-R^2_k}\] If \(VIF=1\), then there is no linear relationship between the variable \(x_k\) and the remaining independent variables. Otherwise, \(VIF>1\). In general, the interpretation is as follows (a manual VIF calculation is sketched after the list):
- A VIF of 4 warrants attention.
- A VIF of 10 indicates a serious problem.
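The formula can be verified by hand: regress \(x_k\) on the remaining independent variables and plug the resulting \(R^2_k\) into \(1/(1-R^2_k)\). A minimal sketch, continuing the simulated example from above:
r2 = summary(lm(x2~x1))$r.squared # R^2 from regressing x2 on the remaining regressor(s)
1/(1-r2) # manual VIF for x2
car::vif(lm(y~x1+x2)) # matches the manual calculation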
14.2.2 Examples
To illustrate the concept of multicollinearity, the data set nfl is used (Berri et al. 2011). The first model includes the logarithm of total salary as the dependent variable and the following independent variables: prior season passing yards, pass attempts, experience in the league (and its square), draft round, veteran status (more than three years in the league), a change of team, Pro Bowl appearance, and facial symmetry.
# Model 1: all candidate regressors included
bhat = lm(log(total)~yards+att+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
After estimating the model, the function vif() from the package car is used:
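library(car)
vif(bhat)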
## yards att exp exp2 draft1 draft2 veteran changeteam pbowlever symm
## 32.547700 30.920282 39.889877 26.715342 1.621048 1.228091 5.253525 1.194254 1.581753 1.056661
The results indicate multicollinearity for yards, att, and the experience terms. Passing yards and pass attempts are likely highly correlated and thus, one of them (att) is dropped.
# Model 2: pass attempts (att) dropped
bhat = lm(log(total)~yards+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
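The VIF values are computed again for the reduced model:
vif(bhat)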
## yards exp exp2 draft1 draft2 veteran changeteam pbowlever symm
## 1.460849 39.339639 26.162804 1.616171 1.227479 5.253502 1.141435 1.569621 1.052906
This improves the estimation, but experience and its squared term are still problematic. The last estimation removes both experience terms, and the VIF values are now in the acceptable range.
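The final model, reconstructed here from the variables reported in the output below, drops exp and exp2:
# Model 3: experience terms dropped
bhat = lm(log(total)~yards+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
vif(bhat)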
## yards draft1 draft2 veteran changeteam pbowlever symm
## 1.406241 1.653634 1.229459 1.976506 1.101988 1.406095 1.010855
The important point is that the conclusion of the paper with regard to facial symmetry does not change across these specifications.