## 14.2 Multicollinearity

Multicollinearity describes the situation in which two or more independent variables are linearly related. Under perfect multicollinearity: \[\lambda_1 x_1 + \lambda_2 x_2 + \dots +\lambda_k x_k = 0\] where the constants \(\lambda_i\) are not all zero simultaneously. For example, consider \(x_1=\{8,12,15,17\}\), \(x_2=\{24,36,45,51\}\), and \(x_3=\{2,3,3.75,4.25\}\). Here \(x_2=3 x_1\) and \(x_3=x_1/4\), so one valid choice is \(\lambda_1=1\), \(\lambda_2=-1/5\), and \(\lambda_3=-8/5\). Note that multicollinearity refers to linear relationships only; including a squared or cubed term of a regressor is not an issue of multicollinearity. It can be shown that the variance of the estimators increases in the presence of multicollinearity. There are various indications that the data suffer from multicollinearity:

- High \(R^2\) but few significant variables
- Failure to reject H\(_0\): \(\beta_i=0\) for individual coefficients based on t-values, but rejection of the hypothesis that all slopes are simultaneously zero based on the F-test
- High correlation among explanatory variables
- Statistically significant variables varying between model specifications
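
The numerical example above can be verified directly in R. The sketch below uses \(\lambda=(1,1,-16)\), one of infinitely many valid weight vectors for those three series, and shows how `lm()` reacts to perfectly collinear regressors (the response `y` is arbitrary and purely illustrative):

```
# Example vectors from the text: x2 = 3*x1 and x3 = x1/4,
# so the three regressors are perfectly linearly related.
x1 <- c(8, 12, 15, 17)
x2 <- c(24, 36, 45, 51)
x3 <- c(2, 3, 3.75, 4.25)

# One valid weight vector (not all zero): lambda = (1, 1, -16)
combo <- 1 * x1 + 1 * x2 - 16 * x3
combo  # identically zero

# Under perfect multicollinearity, lm() cannot identify all
# coefficients; the redundant regressors are reported as NA.
y   <- c(1, 2, 3, 4)  # arbitrary response, for illustration only
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)
```

In practice, statistical software silently drops redundant columns, which makes perfect multicollinearity easy to spot; near-perfect multicollinearity is the harder case and is the subject of the rest of this section.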

### 14.2.1 Variance Inflation Factors (VIF)

The variance inflation factor identifies possible correlation among multiple independent variables and not just two as in the case of a simple correlation coefficient. Consider the model: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \epsilon_i\] If the regressor \(x_k\) is uncorrelated with all other regressors, the variance of the coefficient estimate for \(\beta_k\) can be written as \[Var(\beta_k)^* = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2}\] This is the smallest achievable variance. If some of the other independent variables are correlated with \(x_k\), then \[Var(\beta_k) = \frac{\sigma^2}{\sum_{i=1}^N (x_{ik}-\bar{x}_k)^2} \cdot \frac{1}{1-R^2_k}\] where \(R^2_k\) is the \(R^2\) from a regression of \(x_k\) on the remaining independent variables. The VIF can thus be written as \[VIF_k = \frac{Var(\beta_k)}{Var(\beta_k)^*}=\frac{1}{1-R^2_k}\] If \(VIF=1\), then there is no linear relationship between the variable \(x_k\) and the remaining independent variables. Otherwise, \(VIF>1\). In general, the interpretation is as follows:

- A VIF of 4 warrants attention
- A VIF of 10 indicates a serious problem
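
The formula \(VIF_k = 1/(1-R^2_k)\) can be implemented directly via the auxiliary regression. The sketch below uses simulated data rather than a real data set: `x2` is constructed to be strongly correlated with `x1`, while `x3` is independent of both.

```
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                       # unrelated to x1 and x2

# VIF for regressor k: regress x_k on the remaining regressors
# and apply VIF_k = 1 / (1 - R^2_k)
vif_manual <- function(k, X) {
  r2 <- summary(lm(X[, k] ~ X[, -k]))$r.squared
  1 / (1 - r2)
}

X    <- cbind(x1, x2, x3)
vifs <- sapply(1:ncol(X), vif_manual, X = X)
names(vifs) <- colnames(X)
round(vifs, 2)
```

Here `x1` and `x2` show clearly elevated VIF values, whereas `x3` stays near 1. The `vif()` function from the package `car`, used in the examples below, automates this calculation.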

### 14.2.2 Examples

To illustrate the concept of multicollinearity, the data set `nfl` is used (Berri et al. (2011)). The first model includes the logarithm of total salary as the dependent variable and the following independent variables: prior season passing yards, pass attempts, experience in the league (and its square), draft round picked, veteran status (more than 3 years in the league), team change, Pro Bowl appearance, and facial symmetry.

```
bhat = lm(log(total)~yards+att+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
```

After estimating the model, the function `vif()` from the package `car` is used:

```
##      yards        att        exp       exp2     draft1     draft2    veteran changeteam  pbowlever       symm
##  32.547700  30.920282  39.889877  26.715342   1.621048   1.228091   5.253525   1.194254   1.581753   1.056661
```

The results indicate multicollinearity for *yards*, *att*, and the experience terms. Passing yards and pass attempts are highly correlated and thus, one of them (*att*) is dropped.

```
bhat = lm(log(total)~yards+exp+exp2+draft1+draft2+veteran+changeteam+pbowlever+symm,data=nfl)
summary(bhat)
```

```
##      yards        exp       exp2     draft1     draft2    veteran changeteam  pbowlever       symm
##   1.460849  39.339639  26.162804   1.616171   1.227479   5.253502   1.141435   1.569621   1.052906
```

This improves the estimation, but experience and its squared term remain problematic. The last estimation removes experience and its square, and the VIF values are now in the acceptable range.

```
##      yards     draft1     draft2    veteran changeteam  pbowlever       symm
##   1.406241   1.653634   1.229459   1.976506   1.101988   1.406095   1.010855
```

The important result is that the paper's conclusion with regard to facial symmetry has not changed across these specifications.
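
The mechanics of the example can be replicated with simulated data if the `nfl` data set is not at hand. In the hypothetical setup below, `att` is constructed to track `yards` closely (mirroring the correlation between passing yards and pass attempts), so dropping `att` sharply reduces the VIF of `yards`:

```
set.seed(7)
n       <- 200
exp_yrs <- runif(n, 0, 12)                          # years of experience
yards   <- 1000 + 10 * exp_yrs + rnorm(n, sd = 100) # passing yards
att     <- yards / 12 + rnorm(n, sd = 1)            # attempts track yards

r2 <- function(fit) summary(fit)$r.squared

# VIF of yards with att in the model vs. after dropping att
vif_full    <- 1 / (1 - r2(lm(yards ~ att + exp_yrs)))
vif_reduced <- 1 / (1 - r2(lm(yards ~ exp_yrs)))
round(c(full = vif_full, reduced = vif_reduced), 2)
```

As in the `nfl` example, removing one of two nearly redundant regressors brings the VIF of the remaining variable down to an unproblematic level.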