14.1 Heteroscedasticity
A key assumption of the OLS model is homoscedastic error terms. That is, the error variance is constant: \[Var(\epsilon_i) = \sigma^2\] With heteroscedasticity, the variance of the error term is not constant: \[Var(\epsilon_i) = \sigma_i^2\] For a bivariate regression model with heteroscedastic data, it can be shown that \[Var(\hat{\beta}_1) = \frac{\sum x_i^2 \sigma_i^2}{(\sum x_i^2)^2}\] This is different from the variance of the coefficient estimate under homoscedasticity: \[Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum x_i^2}\] Unbiasedness of the OLS estimator is not affected, but OLS is no longer efficient, i.e., the variance of \(\hat{\beta}_1\) is larger than that of alternative estimators. Note that the measure of \(R^2\) is unaffected by heteroscedasticity. Homoscedasticity is needed to justify the t-test, F-test, and confidence intervals. Under heteroscedasticity, the F-statistic no longer follows an F-distribution. In short, hypothesis tests on the \(\beta\)-coefficients are no longer valid.
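As an illustration, consider a small simulated (purely hypothetical) data set in which the standard deviation of the error grows with \(x\); a plot of the residuals then shows the typical funnel-shaped pattern:
set.seed(42)
x   = runif(200,0,10)
eps = rnorm(200,sd=0.5*x)          # sigma_i increases with x_i
y   = 1+2*x+eps
plot(x,residuals(lm(y~x)))         # funnel-shaped spread indicates heteroscedasticity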
If \(\sigma_i^2\) were known, the use of a Generalized Least Squares (GLS) model would be appropriate: \[y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i\] Dividing both sides by the known standard deviation \(\sigma_i\): \[\frac{y_i}{\sigma_i}=\beta_0 \cdot \frac{1}{\sigma_i}+\beta_1 \cdot \frac{x_i}{\sigma_i}+\frac{\epsilon_i}{\sigma_i}\] If \(\epsilon^*_i = \epsilon_i / \sigma_i\), then it can be shown that \(Var(\epsilon^*_i)=1\), i.e., constant. The usual OLS model minimizes \[\sum_{i=1}^N e_i^2=\sum_{i=1}^N \left(y_i-\hat{\beta}_0-\hat{\beta}_1 \cdot x_i \right)^2\] whereas the GLS model minimizes \[\sum_{i=1}^N w_i e_i^2= \sum_{i=1}^N w_i \left(y_i-\hat{\beta}_0-\hat{\beta}_1 \cdot x_i \right)^2\] with \(w_i = 1/\sigma_i^2\). That is, GLS minimizes the weighted sum of squared residuals. Since in practice \(\sigma_i^2\) is not known, other techniques have to be employed to obtain so-called heteroscedasticity-consistent (HC) standard errors. But first, two tests are introduced to detect heteroscedasticity.
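When the \(\sigma_i\) are known, GLS reduces to weighted least squares, which can be carried out with the weights argument of lm(). A minimal sketch with simulated data in which \(\sigma_i\) is known by construction:
set.seed(123)
x     = runif(100,1,10)
sigma = x                                  # error standard deviation grows with x
y     = 2+3*x+rnorm(100,sd=sigma)
wls   = lm(y~x,weights=1/sigma^2)          # minimizes sum of w_i*e_i^2 with w_i = 1/sigma_i^2
summary(wls)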
14.1.1 Detecting Heteroscedasticity
Two tests are presented to detect heteroscedasticity:
- Goldfeld-Quandt Test (1965)
- Breusch-Pagan-Godfrey Test (1979)
The steps necessary for the Goldfeld-Quandt Test are as follows:
- Sort the observations in ascending order of the independent variable suspected of causing heteroscedasticity.
- Pick C as the number of central observations to drop in the middle of the sorted data.
- Run two separate regressions, one on the "lower" part and one on the "upper" part.
- Compute \[\lambda = \frac{RSS_2/df}{RSS_1/df}\]
- Under the null hypothesis of homoscedasticity, \(\lambda\) follows an F-distribution with \((N-C)/2-k\) degrees of freedom in numerator and denominator, where \(k\) is the number of estimated parameters.
The Goldfeld-Quandt Test can be illustrated with gqtestdata. In a first step, the data is separated into two groups with \(C=6\), i.e., the six central observations are dropped. In a second step, a regression is run for each group. And lastly, \(\lambda\) is calculated.
gqtestdata1 = gqtestdata[1:22,]                         # "lower" group
gqtestdata2 = gqtestdata[29:50,]                        # "upper" group; rows 23-28 (C=6) are dropped
bhat1 = lm(price~sqft,data=gqtestdata1)
bhat2 = lm(price~sqft,data=gqtestdata2)
lambda = sum(bhat2$residuals^2)/sum(bhat1$residuals^2)  # RSS2/RSS1 (equal df in both groups)
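Since each group uses 22 observations to estimate 2 parameters, \(\lambda\) can be compared to an F-distribution with 20 and 20 degrees of freedom; this matches the p-value reported by gqtest() below:
pf(lambda,df1=20,df2=20,lower.tail=FALSE)  # p-value of the Goldfeld-Quandt test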
Of course, there is also the function gqtest() from the package lmtest, which simplifies the procedure.
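Assuming bhat is the regression of price on sqft over the full sample (as the output below suggests), the call might look as follows, where fraction drops the six central observations and order.by sorts by the regressor:
library(lmtest)
bhat = lm(price~sqft,data=gqtestdata)
gqtest(bhat,fraction=6,order.by=gqtestdata$sqft)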
##
## Goldfeld-Quandt test
##
## data: bhat
## GQ = 3.4246, df1 = 20, df2 = 20, p-value = 0.004143
## alternative hypothesis: variance increases from segment 1 to 2
In any case, the null hypothesis of homoscedasticity is rejected for gqtestdata.
The Breusch-Pagan-Godfrey Test is an alternative that does not rely on choosing C, the number of central observations to be dropped. The steps are as follows:
- Run a regular OLS model and obtain the residuals.
- Calculate \[\hat{\sigma}^2 = \frac{\sum_{i=1}^N e^2_i}{N}\]
- Construct the variable \(p_i\) as follows: \(p_i = e^2_i / \hat{\sigma}^2\)
- Regress \(p_i\) on the X's as follows: \[p_i = \alpha_0 + \alpha_1 \cdot x_{i1}+\alpha_2 \cdot x_{i2} + \dots\]
- Obtain the explained sum of squares (ESS) and define \(\Theta = 0.5 \cdot ESS\). Under the null hypothesis of homoscedasticity, \(\Theta \sim \chi^2_{m-1}\), where \(m\) is the number of parameters in the auxiliary regression.
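These steps can be carried out by hand. A minimal sketch in R, reusing bhat and gqtestdata from above (this computes the original, non-studentized version of the statistic):
e      = residuals(bhat)
sigma2 = sum(e^2)/length(e)                # ML estimate of the error variance
p      = e^2/sigma2                        # standardized squared residuals
aux    = lm(p~sqft,data=gqtestdata)        # auxiliary regression on the X's
ESS    = sum((fitted(aux)-mean(p))^2)      # explained sum of squares
theta  = 0.5*ESS                           # test statistic
pchisq(theta,df=1,lower.tail=FALSE)        # p-value with m-1 = 1 degree of freedom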
The much simpler procedure is to use the function bptest() from the package lmtest in R.
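Applied to the model from above, a single call suffices. Note that bptest() defaults to the studentized version of the test, which is why its statistic differs slightly from the manual \(\Theta\) above; setting studentized = FALSE would reproduce it.
library(lmtest)
bptest(bhat)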
##
## studentized Breusch-Pagan test
##
## data: bhat
## BP = 4.9117, df = 1, p-value = 0.02668
14.1.2 Correcting Heteroscedasticity
To correct for heteroscedasticity, robust standard errors must be obtained.
Note that there are multiple variants of calculating these standard errors (e.g., HC0 through HC3) and thus, the results may differ slightly across software packages. For the bivariate regression model, the heteroscedasticity-consistent variance estimator is \[\widehat{Var}(\hat{\beta}_1) = \frac{\sum_{i=1}^N (x_i-\bar{x})^2 e_i^2}{\left(\sum_{i=1}^N (x_i-\bar{x})^2\right)^2}\] For the multiple regression model, this generalizes to \[\widehat{Var}(\hat{\beta}_j) = \frac{\sum_{i=1}^N \hat{r}^2_{ij} e_i^2}{\left(\sum_{i=1}^N \hat{r}^2_{ij}\right)^2}\] where \(\hat{r}_{ij}\) is the \(i\)-th residual from regressing \(x_j\) on all other independent variables. The square root of this expression is called the heteroscedasticity-robust standard error. Robust standard errors can be either larger or smaller than the conventional ones. Note that in a given application, we usually do not know beforehand whether heteroscedasticity is present or not.
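In R, such robust standard errors can be obtained, for example, with the sandwich package in combination with coeftest() from lmtest. A sketch, applied to the model bhat from above; the type argument selects among the HC variants and is one source of the slight differences across packages:
library(lmtest)
library(sandwich)
coeftest(bhat,vcov=vcovHC(bhat,type="HC1"))  # t-tests with HC1 robust standard errors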