13 ANOVA

Analysis of Variance (ANOVA) models, also known as dummy variable regression models, are regressions in which all independent variables are dummy variables. An ANOVA model with two independent variables can be written as follows: \[y_i = \beta_0 + \beta_1 \cdot d_{1,i} + \beta_2 \cdot d_{2,i} + \epsilon_i\] where \(d_{1,i}\) and \(d_{2,i}\) are dummy variables that take the value 0 or 1 and \(\epsilon_i\) is the error term.

Consider the following model using the nfl data for the year 2005: \[total_i = \beta_0 + \beta_1 \cdot draft1_i + \beta_2 \cdot veteran_i + \epsilon_i\] where draft1 and veteran are dummy variables. That is, if \(draft1=1\), then the player was selected in the first draft round. If \(veteran=1\), then the player has played multiple seasons in the NFL.

To distinguish \(j\) categories, only \(j-1\) dummy variables are needed when the model includes an intercept. Including all \(j\) dummy variables would lead to perfect multicollinearity, because the dummies would sum to one for every observation and thus duplicate the intercept. The category without a dummy variable is the base category, against which the coefficients of the other categories are measured.
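
The rule that \(j\) categories need only \(j-1\) dummy variables can be illustrated with a minimal R sketch. The data frame toy, its columns, and the three categories are made up for illustration and are not part of the nfl data.

```r
# Toy data (hypothetical, not the nfl data): three player categories
toy = data.frame(
  pay   = c(1.0, 1.2, 2.5, 2.7, 4.0, 4.4),
  group = c("rookie", "rookie", "veteran", "veteran", "star", "star")
)

# One dummy per category (j = 3 dummies in total)
toy$d_rookie  = as.numeric(toy$group == "rookie")
toy$d_veteran = as.numeric(toy$group == "veteran")
toy$d_star    = as.numeric(toy$group == "star")

# j - 1 dummies plus an intercept: "rookie" is the base category
fit_ok = lm(pay ~ d_veteran + d_star, data = toy)
coef(fit_ok)

# All j dummies plus an intercept: the three dummies sum to the intercept
# column, so lm() cannot estimate every coefficient and reports an NA
fit_trap = lm(pay ~ d_rookie + d_veteran + d_star, data = toy)
coef(fit_trap)
```

In the first model, the intercept is the mean of the base category and each coefficient is the difference between a category mean and the base category mean.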

bhat = lm(total~draft1+veteran,data=subset(nfl,year=2005))
## Warning: In subset.data.frame(nfl, year = 2005) :
##  extra argument 'year' will be disregarded
summary(bhat)
## 
## Call:
## lm(formula = total ~ draft1 + veteran, data = subset(nfl, year = 2005))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.340 -1.865 -0.702  0.792 32.429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9999     0.2534   3.945 8.63e-05 ***
## draft1        2.6422     0.4262   6.200 8.81e-10 ***
## veteran       1.6083     0.2820   5.703 1.62e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.083 on 848 degrees of freedom
##   (158 observations deleted due to missingness)
## Multiple R-squared:  0.05191,    Adjusted R-squared:  0.04968 
## F-statistic: 23.22 on 2 and 848 DF,  p-value: 1.526e-10
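
The warning at the top of this output arises because year=2005 uses a single equals sign, which R treats as a named argument rather than as a condition on the rows. A minimal sketch of the difference, using a small made-up data frame (toy is hypothetical and not the nfl data):

```r
# Toy data (hypothetical, not the nfl data)
toy = data.frame(year  = c(2004, 2005, 2005, 2006),
                 total = c(1.0, 2.0, 3.0, 4.0))

# A single "=" passes an extra named argument that subset() does not use,
# so no rows are dropped (recent R versions also issue a warning)
all_rows = suppressWarnings(subset(toy, year = 2005))
nrow(all_rows)    # 4

# "==" is a logical comparison evaluated row by row, so only 2005 is kept
only_2005 = subset(toy, year == 2005)
nrow(only_2005)   # 2
```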

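For a player who was not drafted in the first round and is not a veteran (the base category), the estimated income is close to $1 million. The coefficient on draft1 indicates that first-round picks earn about $2.64 million more than players drafted later, holding veteran status constant, and the coefficient on veteran indicates that veterans earn about $1.61 million more than non-veterans, holding draft round constant. Both dummy variables are statistically significant at the 0.1% level. The R-squared of about 0.05 is very low, however: the two dummy variables explain only about 5% of the variation in income. Note also that, according to the warning in the output, the argument year=2005 was disregarded, so these estimates are based on all years in the data rather than on 2005 alone. Restricting the sample to 2005 requires the condition year==2005.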
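
As a worked illustration, the estimated coefficients reported above imply the following predicted incomes (in millions of dollars) for the four possible combinations of the two dummy variables. The model contains no interaction term, so the two effects simply add up.

\[
\begin{aligned}
draft1=0,\ veteran=0: &\quad \widehat{total} = 0.9999 \\
draft1=1,\ veteran=0: &\quad \widehat{total} = 0.9999 + 2.6422 = 3.6421 \\
draft1=0,\ veteran=1: &\quad \widehat{total} = 0.9999 + 1.6083 = 2.6082 \\
draft1=1,\ veteran=1: &\quad \widehat{total} = 0.9999 + 2.6422 + 1.6083 = 5.2504
\end{aligned}
\]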