14.1 Heteroscedasticity
A key assumption of the OLS model is homoscedastic error terms. That is, the error variance is constant: \[Var(\epsilon_i) = \sigma^2\] With heteroscedasticity, the variance of the error term is not constant: \[Var(\epsilon_i) = \sigma_i^2\] For a bivariate regression model with heteroscedastic data, it can be shown that \[Var(\hat{\beta}_1) = \frac{\sum x_i^2 \sigma_i^2}{(\sum x_i^2)^2}\] This is different from the variance of the coefficient estimate under homoscedasticity: \[Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_i^2}\] Unbiasedness of the OLS estimator is not affected, but OLS is no longer efficient, i.e., the variance of \(\hat{\beta}_1\) is larger than that of alternative estimators. Note that the measure of \(R^2\) is unaffected by heteroscedasticity. Homoscedasticity is needed to justify the t-test, F-test, and confidence intervals. Under heteroscedasticity, the F-statistic no longer follows an F-distribution. In short, hypothesis tests on the \(\beta\)-coefficients are no longer valid.
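As an illustration, consider a small simulated (purely hypothetical) data set in which the standard deviation of the error grows with \(x\); a plot of the residuals then shows the typical funnel-shaped pattern:
set.seed(42)
x   = runif(200,0,10)
eps = rnorm(200,sd=0.5*x)          # sigma_i increases with x_i
y   = 1+2*x+eps
plot(x,residuals(lm(y~x)))         # funnel-shaped spread indicates heteroscedasticity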
If \(\sigma_i^2\) were known, the use of a Generalized Least Squares (GLS) model would be appropriate: \[y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i\] Dividing both sides by the known standard deviation \(\sigma_i\): \[\frac{y_i}{\sigma_i}=\beta_0 \cdot \frac{1}{\sigma_i}+\beta_1 \cdot \frac{x_i}{\sigma_i}+\frac{\epsilon_i}{\sigma_i}\] If \(\epsilon^*_i = \epsilon_i / \sigma_i\), then it can be shown that \(Var(\epsilon^*_i)=1\), i.e., constant. The usual OLS model minimizes \[\sum_{i=1}^N e_i^2=\sum_{i=1}^N \left(y_i-\hat{\beta}_0-\hat{\beta}_1 \cdot x_i \right)^2\] whereas the GLS model minimizes \[\sum_{i=1}^N w_i e_i^2= \sum_{i=1}^N w_i \left(y_i-\hat{\beta}_0-\hat{\beta}_1 \cdot x_i \right)^2\] with \(w_i = 1/\sigma_i^2\). That is, GLS minimizes the weighted sum of squared residuals. Since in practice \(\sigma_i^2\) is not known, other techniques have to be employed to obtain so-called heteroscedasticity-consistent (HC) standard errors. But first, two tests are introduced to detect heteroscedasticity.
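When the \(\sigma_i\) are known, GLS reduces to weighted least squares, which can be carried out with the weights argument of lm(). A minimal sketch with simulated data in which \(\sigma_i\) is known by construction:
set.seed(123)
x     = runif(100,1,10)
sigma = x                                  # error standard deviation grows with x
y     = 2+3*x+rnorm(100,sd=sigma)
wls   = lm(y~x,weights=1/sigma^2)          # minimizes sum of w_i*e_i^2 with w_i = 1/sigma_i^2
summary(wls)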
14.1.1 Detecting Heteroscedasticity
Two tests are presented to detect heteroscedasticity:
- Goldfeld-Quandt Test (1965)
- Breusch-Pagan-Godfrey Test (1979)
The steps necessary for the Goldfeld-Quandt Test are as follows:
- Sort the observations in ascending order of the independent variable suspected of causing heteroscedasticity.
- Pick C as the number of central observations to drop in the middle of the sorted data.
- Run two separate regressions, one on the "lower" part and one on the "upper" part.
- Compute \[\lambda = \frac{RSS_2/df}{RSS_1/df}\]
- Under the null hypothesis of homoscedasticity, \(\lambda\) follows an F-distribution with \((N-C)/2-k\) degrees of freedom in numerator and denominator, where \(k\) is the number of estimated parameters.
The Goldfeld-Quandt Test can be illustrated with gqtestdata. In a first step, the data is separated into two groups with \(C=6\), i.e., the six central observations are dropped. In a second step, a regression is run for each group. And lastly, \(\lambda\) is calculated.
gqtestdata1 = gqtestdata[1:22,]                         # "lower" group
gqtestdata2 = gqtestdata[29:50,]                        # "upper" group; rows 23-28 (C=6) are dropped
bhat1 = lm(price~sqft,data=gqtestdata1)
bhat2 = lm(price~sqft,data=gqtestdata2)
lambda = sum(bhat2$residuals^2)/sum(bhat1$residuals^2)  # RSS2/RSS1 (equal df in both groups)
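Since each group uses 22 observations to estimate 2 parameters, \(\lambda\) can be compared to an F-distribution with 20 and 20 degrees of freedom; this matches the p-value reported by gqtest() below:
pf(lambda,df1=20,df2=20,lower.tail=FALSE)  # p-value of the Goldfeld-Quandt test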
Of course, there is also the function gqtest() from the package lmtest, which simplifies the procedure.
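Assuming bhat is the regression of price on sqft over the full sample (as the output below suggests), the call might look as follows, where fraction drops the six central observations and order.by sorts by the regressor:
library(lmtest)
bhat = lm(price~sqft,data=gqtestdata)
gqtest(bhat,fraction=6,order.by=gqtestdata$sqft)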
##
## Goldfeld-Quandt test
##
## data: bhat
## GQ = 3.4246, df1 = 20, df2 = 20, p-value = 0.004143
## alternative hypothesis: variance increases from segment 1 to 2
In any case, the null hypothesis of homoscedasticity is rejected for gqtestdata.
The Breusch-Pagan-Godfrey Test is an alternative that does not rely on choosing C, the number of central observations to be dropped. The steps are as follows:
- Run a regular OLS model and obtain the residuals.
- Calculate \[\hat{\sigma}^2 = \frac{\sum_{i=1}^N e^2_i}{N}\]
- Construct the variable \(p_i\) as follows: \(p_i = e^2_i / \hat{\sigma}^2\)
- Regress \(p_i\) on the X's as follows: \[p_i = \alpha_0 + \alpha_1 \cdot x_{i1}+\alpha_2 \cdot x_{i2} + \dots\]
- Obtain the explained sum of squares (ESS) and define \(\Theta = 0.5 \cdot ESS\). Under the null hypothesis of homoscedasticity, \(\Theta \sim \chi^2_{m-1}\), where \(m\) is the number of parameters in the auxiliary regression.
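These steps can be carried out by hand. A minimal sketch in R, reusing bhat and gqtestdata from above (this computes the original, non-studentized version of the statistic):
e      = residuals(bhat)
sigma2 = sum(e^2)/length(e)                # ML estimate of the error variance
p      = e^2/sigma2                        # standardized squared residuals
aux    = lm(p~sqft,data=gqtestdata)        # auxiliary regression on the X's
ESS    = sum((fitted(aux)-mean(p))^2)      # explained sum of squares
theta  = 0.5*ESS                           # test statistic
pchisq(theta,df=1,lower.tail=FALSE)        # p-value with m-1 = 1 degree of freedom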
The much simpler procedure is to use the function bptest() from the package lmtest in R.
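Applied to the model from above, a single call suffices. Note that bptest() defaults to the studentized version of the test, which is why its statistic differs slightly from the manual \(\Theta\) above; setting studentized = FALSE would reproduce it.
library(lmtest)
bptest(bhat)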
##
## studentized Breusch-Pagan test
##
## data: bhat
## BP = 4.9117, df = 1, p-value = 0.02668
14.1.2 Correcting Heteroscedasticity
To correct for heteroscedasticity, robust standard errors must be obtained.
Note that there are multiple variants of calculating these standard errors (e.g., HC0 through HC3) and thus, the results may differ slightly across software packages. For the bivariate regression model, the heteroscedasticity-consistent variance estimator is \[\widehat{Var}(\hat{\beta}_1) = \frac{\sum_{i=1}^N (x_i-\bar{x})^2 e_i^2}{\left(\sum_{i=1}^N (x_i-\bar{x})^2\right)^2}\] For the multiple regression model, this generalizes to \[\widehat{Var}(\hat{\beta}_j) = \frac{\sum_{i=1}^N \hat{r}^2_{ij} e_i^2}{\left(\sum_{i=1}^N \hat{r}^2_{ij}\right)^2}\] where \(\hat{r}_{ij}\) is the \(i\)-th residual from regressing \(x_j\) on all other independent variables. The square root of this expression is called the heteroscedasticity-robust standard error. Robust standard errors can be either larger or smaller than the conventional ones. Note that in a given application, we usually do not know beforehand whether heteroscedasticity is present or not.
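In R, such robust standard errors can be obtained, for example, with the sandwich package in combination with coeftest() from lmtest. A sketch, applied to the model bhat from above; the type argument selects among the HC variants and is one source of the slight differences across packages:
library(lmtest)
library(sandwich)
coeftest(bhat,vcov=vcovHC(bhat,type="HC1"))  # t-tests with HC1 robust standard errors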