18.2 Pooled Ordinary Least Squares model
The first example uses data from the General Social Survey (GSS). Recall that the GSS is not a panel data set, since the respondents change from one survey year to the next; it is a set of pooled cross sections. The years analyzed are the even years from 1972 to 1984, with 1972 serving as the base year in the regression below. The data set accompanies the book Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge.
##
## Call:
## lm(formula = kids ~ educ + age + I(age^2) + east + northcentral +
## west + farm + otherrural + town + smallcity + y74 + y76 +
## y78 + y80 + y82 + y84, data = fertil1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1437 -1.0481 -0.1082 0.9450 5.1055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.785069 3.098715 -2.190 0.028758 *
## educ -0.129765 0.018653 -6.957 5.94e-12 ***
## age 0.499186 0.140592 3.551 0.000400 ***
## I(age^2) -0.005436 0.001589 -3.421 0.000648 ***
## east 0.060729 0.132538 0.458 0.646896
## northcentral 0.219568 0.120638 1.820 0.069019 .
## west 0.050807 0.167982 0.302 0.762360
## farm -0.109598 0.149353 -0.734 0.463217
## otherrural -0.199208 0.178270 -1.117 0.264043
## town 0.058579 0.126538 0.463 0.643502
## smallcity 0.221122 0.162964 1.357 0.175095
## y74 0.240846 0.175541 1.372 0.170334
## y76 -0.139590 0.181902 -0.767 0.443012
## y78 -0.104546 0.184622 -0.566 0.571324
## y80 -0.086997 0.185803 -0.468 0.639718
## y82 -0.414209 0.174412 -2.375 0.017723 *
## y84 -0.565326 0.177398 -3.187 0.001479 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.581 on 1112 degrees of freedom
## Multiple R-squared: 0.09941, Adjusted R-squared: 0.08645
## F-statistic: 7.671 on 16 and 1112 DF, p-value: < 2.2e-16
The evolution of fertility over time, after controlling for other observable factors, can be interpreted as follows:
- Base year: 1972
- The negative coefficients on the later year dummies indicate a drop in fertility in the early 1980s. The coefficient on \(y82\) (-0.41) indicates that, holding the other factors fixed, women had on average 0.41 fewer children in 1982 than in 1972, i.e., 100 women had about 41 fewer children.
- This drop cannot be attributed to changes in education, since education is held constant in the regression.
- More educated women have fewer children: each additional year of education is associated with about 0.13 fewer children.
- The model assumes that the effect of each explanatory variable remains constant over time.
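The role of the base year can be illustrated with a small simulation. The sketch below is not the GSS data: the data frame `sim`, its three survey years, and all numbers are made up for illustration. It shows that year dummies created by hand (leaving out the base year) give the same pooled OLS estimates as letting R build them with `factor(year)`.

```r
set.seed(1)
n <- 600

# Hypothetical pooled cross sections: different respondents in each survey year
sim <- data.frame(
  year = sample(c(72, 74, 76), n, replace = TRUE),
  educ = sample(8:18, n, replace = TRUE)
)

# Simulated outcome with a downward shift in the last year (made-up numbers)
sim$kids <- 4 - 0.1 * sim$educ - 0.3 * (sim$year == 76) + rnorm(n)

# Option 1: explicit year dummies, leaving out the base year 72
sim$y74 <- as.integer(sim$year == 74)
sim$y76 <- as.integer(sim$year == 76)
fit_dummies <- lm(kids ~ educ + y74 + y76, data = sim)

# Option 2: let R create the dummies; the first factor level (72) is the base
fit_factor <- lm(kids ~ educ + factor(year), data = sim)

coef(fit_dummies)
coef(fit_factor)
```

In both fits, each year coefficient measures the difference in the average outcome relative to the omitted base year, holding `educ` fixed.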
The next example uses cps7885 and interacts a year dummy with key explanatory variables to see whether the effect of those variables has changed over time. The following model is estimated:
\[\ln(wage)=\beta_0+\gamma_0 \cdot y85+\beta_1 \cdot educ+\gamma_1 \cdot y85 \cdot educ+\beta_2 \cdot exper+ \beta_3 \cdot exper^2+\beta_4 \cdot union+\beta_5 \cdot female+\gamma_5 \cdot y85 \cdot female\]
cps7885$y85 = ifelse(cps7885$year==85,1,0)
summary(lm(formula=log(wage)~y85+educ+y85*educ+exper+I(exper^2)+union+female+y85:female,data=cps7885))
##
## Call:
## lm(formula = log(wage) ~ y85 + educ + y85 * educ + exper + I(exper^2) +
## union + female + y85:female, data = cps7885)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56098 -0.25828 0.00864 0.26571 2.11669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.589e-01 9.345e-02 4.911 1.05e-06 ***
## y85 1.178e-01 1.238e-01 0.952 0.3415
## educ 7.472e-02 6.676e-03 11.192 < 2e-16 ***
## exper 2.958e-02 3.567e-03 8.293 3.27e-16 ***
## I(exper^2) -3.994e-04 7.754e-05 -5.151 3.08e-07 ***
## union 2.021e-01 3.029e-02 6.672 4.03e-11 ***
## female -3.167e-01 3.662e-02 -8.648 < 2e-16 ***
## y85:educ 1.846e-02 9.354e-03 1.974 0.0487 *
## y85:female 8.505e-02 5.131e-02 1.658 0.0977 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4127 on 1075 degrees of freedom
## Multiple R-squared: 0.4262, Adjusted R-squared: 0.4219
## F-statistic: 99.8 on 8 and 1075 DF, p-value: < 2.2e-16
This model can be interpreted as follows:
- \(\beta_0\) is the 1978 intercept
- \(\beta_0+\gamma_0\) is the 1985 intercept
- \(\beta_1\) is the return to education in 1978
- \(\beta_1 + \gamma_1\) is the return to education in 1985
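- \(\gamma_1\) measures how the return to education has changed over the seven-year period
- \(\gamma_5\) measures how the gender gap has changed over the same period
- 1978 return to education: approximately 7.47%
- 1985 return to education: approximately 7.47% + 1.85% = 9.32%
- 1978 gender gap: approximately 31.67% (women earn less than men, other factors equal)
- 1985 gender gap: approximately 31.67% - 8.51% = 23.16%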
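Because the dependent variable is in logs, the percentages above are log-point approximations. A minimal sketch of the exact conversion, \(100 \cdot (e^{b}-1)\), is shown below; the coefficients are copied from the regression output, and the helper `pct()` is a name introduced here for illustration only.

```r
# Coefficients copied from the regression output above
b_educ   <- 0.07472   # educ
g_educ   <- 0.01846   # y85:educ
b_female <- -0.3167   # female
g_female <- 0.08505   # y85:female

# Exact percentage change implied by a log-point coefficient
pct <- function(b) 100 * (exp(b) - 1)

returns <- c(`1978` = pct(b_educ),   `1985` = pct(b_educ + g_educ))
gaps    <- c(`1978` = pct(b_female), `1985` = pct(b_female + g_female))

round(returns, 2)
round(gaps, 2)
```

For small coefficients such as the return to education, the approximation and the exact value are close. For the larger gender gap coefficients, the exact gaps are noticeably smaller in absolute value (roughly 27% in 1978 and 21% in 1985) than the log-point figures.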
The last example regarding pooled data illustrates how misleading a regression model can be if it is specified incorrectly. The data set is called kiel and contains home values near the location of a garbage incinerator. An important aspect of the data set is that in 1978 there was no knowledge of the proposed incinerator, so 1978 serves as the period before and 1981 as the period after. In a first step, the data are separated into the two years.
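One way to perform such a split is sketched below on a toy data frame. The object `kiel_toy`, its values, and the assumption that the survey year is stored in a column called `year` are all hypothetical; the actual data set may be organized differently.

```r
# Toy stand-in for the housing data; column names and values are assumptions
kiel_toy <- data.frame(
  year    = c(1978, 1978, 1981, 1981, 1981),
  nearinc = c(0, 1, 0, 1, 0),
  rprice  = c(80000, 62000, 101000, 70000, 99000)
)

# One data frame per year
kiel1978_toy <- subset(kiel_toy, year == 1978)
kiel1981_toy <- subset(kiel_toy, year == 1981)

nrow(kiel1978_toy)
nrow(kiel1981_toy)
```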
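Next, a separate regression of the real house price (rprice) on the indicator for being near the incinerator (nearinc) is estimated for each of the two years.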
##
## Call:
## lm(formula = rprice ~ nearinc, data = kiel1981)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60678 -19832 -2997 21139 136754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101308 3093 32.754 < 2e-16 ***
## nearinc -30688 5828 -5.266 5.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31240 on 140 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1594
## F-statistic: 27.73 on 1 and 140 DF, p-value: 5.139e-07
##
## Call:
## lm(formula = rprice ~ nearinc, data = kiel1978)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56517 -16605 -3193 8683 236307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82517 2654 31.094 < 2e-16 ***
## nearinc -18824 4745 -3.968 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29430 on 177 degrees of freedom
## Multiple R-squared: 0.08167, Adjusted R-squared: 0.07648
## F-statistic: 15.74 on 1 and 177 DF, p-value: 0.0001054
The results can be used to construct a difference-in-differences estimator: -30,688 - (-18,824) = -11,864, i.e., a relative decline of 11,864 dollars. Expressed differently:
\[\hat{\delta}_1 = (\overline{price}_{81,near}-\overline{price}_{81,far})-(\overline{price}_{78,near}-\overline{price}_{78,far})\]
where \(\hat{\delta}_1\) represents the change over time in the average difference in housing prices between the two locations. To determine statistical significance, the following model must be estimated:
\[price = \beta_0 + \gamma_0 \cdot y81 + \beta_1 \cdot nearinc + \delta_1 \cdot y81 \cdot nearinc\]
##
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc, data = kiel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60678 -17693 -3031 12483 236307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82517 2727 30.260 < 2e-16 ***
## y81 18790 4050 4.640 5.12e-06 ***
## nearinc -18824 4875 -3.861 0.000137 ***
## y81:nearinc -11864 7457 -1.591 0.112595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30240 on 317 degrees of freedom
## Multiple R-squared: 0.1739, Adjusted R-squared: 0.1661
## F-statistic: 22.25 on 3 and 317 DF, p-value: 4.224e-13
The interpretation of the coefficients is as follows:
- \(\beta_0\): Average value in 1978 of homes not near the garbage incinerator
- \(\gamma_0\): Change in average housing values from 1978 to 1981 for homes not near the incinerator, i.e., the change common to all homes
- \(\beta_1\): Location effect that is not due to the incinerator, i.e., the price difference for homes near the site that already existed in 1978
- \(\delta_1\): Decline in housing values due to the incinerator
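That the interaction coefficient is exactly the difference-in-differences of the four group means can be checked on simulated data. The sketch below does not use the housing data: the data frame `toy` and all numbers in it are made up, and only the variable names mirror the model above.

```r
set.seed(42)
n <- 400

# Simulated two-period, two-location data (all values are made up)
toy <- data.frame(
  y81     = rbinom(n, 1, 0.5),
  nearinc = rbinom(n, 1, 0.3)
)
toy$rprice <- 80000 + 20000 * toy$y81 - 15000 * toy$nearinc -
  10000 * toy$y81 * toy$nearinc + rnorm(n, sd = 25000)

# Average price in each of the four year-by-location cells
m <- tapply(toy$rprice, list(y81 = toy$y81, nearinc = toy$nearinc), mean)

# (near - far) in 1981 minus (near - far) in 1978
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same quantity as the interaction coefficient of the pooled regression
fit <- lm(rprice ~ y81 * nearinc, data = toy)
did_ols <- coef(fit)[["y81:nearinc"]]

c(did_means = did_means, did_ols = did_ols)
summary(fit)$coefficients["y81:nearinc", ]
```

The two numbers coincide; the advantage of the regression is that it also delivers a standard error and t statistic for the estimate.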
Including \(age\) and \(age^2\) in the above equation, to take advantage of the additional information in the data, leads to the following result:
##
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc + age + I(age^2),
## data = kiel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79349 -14431 -1711 10069 201486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.912e+04 2.406e+03 37.039 < 2e-16 ***
## y81 2.132e+04 3.444e+03 6.191 1.86e-09 ***
## nearinc 9.398e+03 4.812e+03 1.953 0.051713 .
## age -1.494e+03 1.319e+02 -11.333 < 2e-16 ***
## I(age^2) 8.691e+00 8.481e-01 10.248 < 2e-16 ***
## y81:nearinc -2.192e+04 6.360e+03 -3.447 0.000644 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25540 on 315 degrees of freedom
## Multiple R-squared: 0.4144, Adjusted R-squared: 0.4052
## F-statistic: 44.59 on 5 and 315 DF, p-value: < 2.2e-16
Other variables such as \(intst\), \(land\), \(area\), \(rooms\), and \(baths\) can be added as well:
##
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc + age + I(age^2) +
## intst + land + area + rooms + baths, data = kiel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76721 -8885 -252 8433 136649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.381e+04 1.117e+04 1.237 0.21720
## y81 1.393e+04 2.799e+03 4.977 1.07e-06 ***
## nearinc 3.780e+03 4.453e+03 0.849 0.39661
## age -7.395e+02 1.311e+02 -5.639 3.85e-08 ***
## I(age^2) 3.453e+00 8.128e-01 4.248 2.86e-05 ***
## intst -5.386e-01 1.963e-01 -2.743 0.00643 **
## land 1.414e-01 3.108e-02 4.551 7.69e-06 ***
## area 1.809e+01 2.306e+00 7.843 7.16e-14 ***
## rooms 3.304e+03 1.661e+03 1.989 0.04758 *
## baths 6.977e+03 2.581e+03 2.703 0.00725 **
## y81:nearinc -1.418e+04 4.987e+03 -2.843 0.00477 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19620 on 310 degrees of freedom
## Multiple R-squared: 0.66, Adjusted R-squared: 0.6491
## F-statistic: 60.19 on 10 and 310 DF, p-value: < 2.2e-16
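Across the three specifications, the estimated incinerator effect changes from -11,864 (not statistically significant) to -21,920 and -14,180 (both significant at the 1% level) once house characteristics are controlled for, which shows how misleading the simplest specification can be. In a further specification (output not shown) that includes the additional independent variables and uses the natural logarithm of price as the dependent variable, homes near the incinerator are estimated to have lost about 9.3% of their value.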
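The comparison across the three level specifications can be reproduced from the reported estimates alone. The sketch below copies the interaction estimates, standard errors, and residual degrees of freedom from the outputs above and recomputes t values, p-values, and 95% confidence intervals; the data frame `did` and its labels are introduced here for illustration.

```r
# Interaction estimates, standard errors and residual df copied from the
# three regression outputs above
did <- data.frame(
  spec     = c("no controls", "age, age^2", "full controls"),
  estimate = c(-11864, -21920, -14180),
  se       = c(7457, 6360, 4987),
  df       = c(317, 315, 310)
)

# t statistic and two-sided p-value for H0: delta_1 = 0
did$t_value <- did$estimate / did$se
did$p_value <- 2 * pt(-abs(did$t_value), did$df)

# 95% confidence interval based on the t distribution
crit <- qt(0.975, did$df)
did$ci_lower <- did$estimate - crit * did$se
did$ci_upper <- did$estimate + crit * did$se

did
```

Only the interval for the specification without controls includes zero, which matches the significance pattern in the regression tables.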