18.2 Pooled Ordinary Least Squares Model

The first example uses data from the General Social Survey (GSS). Recall that the GSS is not a panel data set because the respondents change from one year to the next. The years analyzed in this example are the even years from 1972 to 1984, with 1972 serving as the base year. The data set accompanies the book Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge.

## 
## Call:
## lm(formula = kids ~ educ + age + I(age^2) + east + northcentral + 
##     west + farm + otherrural + town + smallcity + y74 + y76 + 
##     y78 + y80 + y82 + y84, data = fertil1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1437 -1.0481 -0.1082  0.9450  5.1055 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.785069   3.098715  -2.190 0.028758 *  
## educ         -0.129765   0.018653  -6.957 5.94e-12 ***
## age           0.499186   0.140592   3.551 0.000400 ***
## I(age^2)     -0.005436   0.001589  -3.421 0.000648 ***
## east          0.060729   0.132538   0.458 0.646896    
## northcentral  0.219568   0.120638   1.820 0.069019 .  
## west          0.050807   0.167982   0.302 0.762360    
## farm         -0.109598   0.149353  -0.734 0.463217    
## otherrural   -0.199208   0.178270  -1.117 0.264043    
## town          0.058579   0.126538   0.463 0.643502    
## smallcity     0.221122   0.162964   1.357 0.175095    
## y74           0.240846   0.175541   1.372 0.170334    
## y76          -0.139590   0.181902  -0.767 0.443012    
## y78          -0.104546   0.184622  -0.566 0.571324    
## y80          -0.086997   0.185803  -0.468 0.639718    
## y82          -0.414209   0.174412  -2.375 0.017723 *  
## y84          -0.565326   0.177398  -3.187 0.001479 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.581 on 1112 degrees of freedom
## Multiple R-squared:  0.09941,    Adjusted R-squared:  0.08645 
## F-statistic: 7.671 on 16 and 1112 DF,  p-value: < 2.2e-16

The evolution of fertility rates over time, after controlling for other observable factors, can be interpreted as follows:

  • Base year: 1972
  • Negative coefficients indicate a drop in fertility in the early 1980s. The coefficient of \(y82\) (-0.41) indicates that women had on average 0.41 fewer children in 1982 than in 1972, i.e., 100 women had about 41 fewer children.
  • This drop is independent of education since we are controlling for education.
  • More educated women have fewer children: each additional year of education is associated with about 0.13 fewer children.
  • The model assumes that the effect of each explanatory variable remains constant over time.
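
To make the "per 100 women" reading concrete, the year-dummy estimates can be rescaled directly. The snippet below is a small sketch that copies the estimates and p-values by hand from the regression output above; the object names are illustrative and not part of the original analysis.

```r
# Year-dummy estimates and p-values copied from the output above (base year: 1972)
year_coef <- c(y74 = 0.240846, y76 = -0.139590, y78 = -0.104546,
               y80 = -0.086997, y82 = -0.414209, y84 = -0.565326)
year_pval <- c(y74 = 0.170334, y76 = 0.443012, y78 = 0.571324,
               y80 = 0.639718, y82 = 0.017723, y84 = 0.001479)

# Difference in the expected number of children per 100 women relative to 1972,
# holding the other regressors fixed
per_100_women <- round(100 * year_coef, 1)

# Flag the year effects that differ from 1972 at the 5% level
significant_5pct <- year_pval < 0.05

print(data.frame(per_100_women, significant_5pct))
```

Only the 1982 and 1984 dummies are flagged, which is why the drop in fertility is attributed to the early 1980s.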

The next example uses the data set cps7885 and interacts a year dummy with key explanatory variables to see whether the effect of those variables has changed over time. That is, the following model is estimated: \[\ln(wage)=\beta_0+\gamma_0 \cdot y85+\beta_1 \cdot educ+\gamma_1 \cdot y85 \cdot educ+\beta_2 \cdot exper+ \beta_3 \cdot exper^2+\beta_4 \cdot union+\beta_5 \cdot female+\gamma_5 \cdot y85 \cdot female\]

cps7885$y85    = ifelse(cps7885$year==85,1,0)
summary(lm(formula=log(wage)~y85+educ+y85*educ+exper+I(exper^2)+union+female+y85:female,data=cps7885))
## 
## Call:
## lm(formula = log(wage) ~ y85 + educ + y85 * educ + exper + I(exper^2) + 
##     union + female + y85:female, data = cps7885)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.56098 -0.25828  0.00864  0.26571  2.11669 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.589e-01  9.345e-02   4.911 1.05e-06 ***
## y85          1.178e-01  1.238e-01   0.952   0.3415    
## educ         7.472e-02  6.676e-03  11.192  < 2e-16 ***
## exper        2.958e-02  3.567e-03   8.293 3.27e-16 ***
## I(exper^2)  -3.994e-04  7.754e-05  -5.151 3.08e-07 ***
## union        2.021e-01  3.029e-02   6.672 4.03e-11 ***
## female      -3.167e-01  3.662e-02  -8.648  < 2e-16 ***
## y85:educ     1.846e-02  9.354e-03   1.974   0.0487 *  
## y85:female   8.505e-02  5.131e-02   1.658   0.0977 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4127 on 1075 degrees of freedom
## Multiple R-squared:  0.4262, Adjusted R-squared:  0.4219 
## F-statistic:  99.8 on 8 and 1075 DF,  p-value: < 2.2e-16
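
The formula in the call above mixes `*` and `:`. In R's formula syntax, `y85 * educ` expands to `y85 + educ + y85:educ`, so repeating the main effects is harmless. The following sketch checks this with `terms()`, which needs no data; the object names are illustrative.

```r
# The formula as written in the lm() call above
f_mixed <- log(wage) ~ y85 + educ + y85 * educ + exper + I(exper^2) +
  union + female + y85:female

# The same model with every interaction written out using ':'
f_explicit <- log(wage) ~ y85 + educ + exper + I(exper^2) + union + female +
  y85:educ + y85:female

# terms() expands '*' and drops duplicated main effects
labels_mixed    <- attr(terms(f_mixed), "term.labels")
labels_explicit <- attr(terms(f_explicit), "term.labels")

print(labels_mixed)
print(identical(labels_mixed, labels_explicit))
```

The term labels appear in the same order as the rows of the coefficient table, with main effects first and interactions last.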

This model can be interpreted as follows:

  • \(\beta_0\) is the 1978 intercept
  • \(\beta_0+\gamma_0\) is the 1985 intercept
  • \(\beta_1\) is the return to education in 1978
  • \(\beta_1 + \gamma_1\) is the return to education in 1985
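  • \(\gamma_1\) measures how the return to education has changed over the seven-year period
  • 1978 return to education: approximately 7.47% per additional year
  • 1985 return to education: approximately 7.47% + 1.85% = 9.32% per additional year
  • 1978 gender gap: women earn approximately 31.67% less than men
  • 1985 gender gap: approximately 31.67% - 8.51% = 23.16%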
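
The percentages in this list treat coefficients in a log-wage model as proportional changes, which is an approximation. The sketch below redoes the arithmetic from the printed estimates and, as an addition not found in the original text, also reports the exact conversion \(100 \cdot (e^{b} - 1)\) for the gender gap, where the approximation is less accurate. Object names are illustrative.

```r
# Coefficient estimates copied from the regression output above
b <- c(educ = 0.07472, y85_educ = 0.01846,
       female = -0.3167, y85_female = 0.08505)

# Return to one more year of education, in percent (log approximation)
return_1978 <- 100 * b[["educ"]]
return_1985 <- 100 * (b[["educ"]] + b[["y85_educ"]])

# Gender gap: log approximation and exact percentage difference
gap_1978_approx <- 100 * b[["female"]]
gap_1985_approx <- 100 * (b[["female"]] + b[["y85_female"]])
gap_1978_exact  <- 100 * (exp(b[["female"]]) - 1)
gap_1985_exact  <- 100 * (exp(b[["female"]] + b[["y85_female"]]) - 1)

print(round(c(return_1978 = return_1978, return_1985 = return_1985), 2))
print(round(c(gap_1978_approx = gap_1978_approx, gap_1978_exact = gap_1978_exact,
              gap_1985_approx = gap_1985_approx, gap_1985_exact = gap_1985_exact), 2))
```

Under the exact conversion the gap is roughly 27% in 1978 and 21% in 1985, so the narrowing over the period shows up under either reading.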

The last example regarding pooled data illustrates how misleading a regression model can be if executed incorrectly. The data set is called kiel and contains home values near the location of a garbage incinerator. The important aspect of the data set is that there was no knowledge about the proposed incinerator in 1978. In a first step, the data is separated into the two years:

kiel1978 = subset(kiel,year==1978)
kiel1981 = subset(kiel,year==1981)

Next, a separate regression is estimated for each of the two years.

summary(lm(rprice~nearinc,data=kiel1981))
## 
## Call:
## lm(formula = rprice ~ nearinc, data = kiel1981)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60678 -19832  -2997  21139 136754 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   101308       3093  32.754  < 2e-16 ***
## nearinc       -30688       5828  -5.266 5.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31240 on 140 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1594 
## F-statistic: 27.73 on 1 and 140 DF,  p-value: 5.139e-07
summary(lm(rprice~nearinc,data=kiel1978))
## 
## Call:
## lm(formula = rprice ~ nearinc, data = kiel1978)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56517 -16605  -3193   8683 236307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    82517       2654  31.094  < 2e-16 ***
## nearinc       -18824       4745  -3.968 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29430 on 177 degrees of freedom
## Multiple R-squared:  0.08167,    Adjusted R-squared:  0.07648 
## F-statistic: 15.74 on 1 and 177 DF,  p-value: 0.0001054

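The 1981 regression alone is misleading: homes near the incinerator site were already worth $18,824 less on average in 1978, before anything was known about the incinerator. The two results can be combined into a difference-in-differences estimator that removes this pre-existing gap: -$30,688 - (-$18,824) = -$11,864. Expressed differently: \[\hat{\delta}_1 = (\overline{price}_{81,near}-\overline{price}_{81,far})-(\overline{price}_{78,near}-\overline{price}_{78,far})\] where \(\hat{\delta}_1\) represents the change over time in the average difference in housing prices between the two locations. To determine statistical significance, the following model must be estimated: \[price = \beta_0 + \gamma_0 \cdot y81 + \beta_1 \cdot nearinc + \delta_1 \cdot y81 \cdot nearinc\]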

summary(lm(rprice~y81+nearinc+y81:nearinc,data=kiel))
## 
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc, data = kiel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60678 -17693  -3031  12483 236307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    82517       2727  30.260  < 2e-16 ***
## y81            18790       4050   4.640 5.12e-06 ***
## nearinc       -18824       4875  -3.861 0.000137 ***
## y81:nearinc   -11864       7457  -1.591 0.112595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30240 on 317 degrees of freedom
## Multiple R-squared:  0.1739, Adjusted R-squared:  0.1661 
## F-statistic: 22.25 on 3 and 317 DF,  p-value: 4.224e-13
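
As a quick check on statistical significance, the t statistic, p-value, and a 95% confidence interval for the interaction term can be rebuilt from the printed estimate and standard error. This is a sketch that uses hand-copied (rounded) numbers from the output above, so the results agree with the table only up to rounding; the object names are illustrative.

```r
# Estimate, standard error, and residual degrees of freedom for y81:nearinc,
# copied from the output above
delta_hat <- -11864
se_delta  <- 7457
df_resid  <- 317

# Two-sided t test of H0: delta_1 = 0
t_stat  <- delta_hat / se_delta
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval based on the t distribution
ci_95 <- delta_hat + c(-1, 1) * qt(0.975, df = df_resid) * se_delta

print(round(c(t = t_stat, p = p_value), 3))
print(round(ci_95))
```

The interval spans zero, in line with the p-value of about 0.11: without further controls, the estimated effect of the incinerator is not statistically significant at the 5% level.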

The interpretation of the coefficients is as follows:

  • \(\beta_0\): Average value in 1978 of homes that are not near the garbage incinerator
  • \(\gamma_0\): Average change in housing values from 1978 to 1981 for all homes
  • \(\beta_1\): Location effect that is not due to the incinerator
  • \(\delta_1\): Decline in housing values due to the incinerator
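
The link between the interaction coefficient and the four group averages can be verified on simulated data. The sketch below does not use the kiel data; the data-generating values (including the assumed effect of -12000) are hypothetical and chosen only to resemble the magnitudes above.

```r
set.seed(1)
n <- 400

# Hypothetical sample: half the homes observed in each year, about 30% near the site
sim <- data.frame(
  y81     = rep(c(0, 1), each = n / 2),
  nearinc = rbinom(n, size = 1, prob = 0.3)
)
sim$rprice <- 80000 + 19000 * sim$y81 - 19000 * sim$nearinc -
  12000 * sim$y81 * sim$nearinc + rnorm(n, sd = 30000)

# Average price in each of the four location-by-year cells
cell_mean <- tapply(sim$rprice, list(near = sim$nearinc, year = sim$y81), mean)

# Difference-in-differences computed from the cell averages
did_means <- as.numeric((cell_mean["1", "1"] - cell_mean["0", "1"]) -
                        (cell_mean["1", "0"] - cell_mean["0", "0"]))

# The same quantity as the interaction coefficient of the pooled regression
fit_sim <- lm(rprice ~ y81 + nearinc + y81:nearinc, data = sim)
did_ols <- coef(fit_sim)[["y81:nearinc"]]

print(c(did_means = did_means, did_ols = did_ols))
```

Because the model contains a separate parameter for each of the four cells, the two numbers coincide exactly; the regression adds the standard error needed for inference.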

Including \(age\) and \(age^2\) in the above equation, to take advantage of the information provided in the data, leads to the following result:

summary(lm(rprice~y81+nearinc+y81:nearinc+age+I(age^2),data=kiel))
## 
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc + age + I(age^2), 
##     data = kiel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79349 -14431  -1711  10069 201486 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.912e+04  2.406e+03  37.039  < 2e-16 ***
## y81          2.132e+04  3.444e+03   6.191 1.86e-09 ***
## nearinc      9.398e+03  4.812e+03   1.953 0.051713 .  
## age         -1.494e+03  1.319e+02 -11.333  < 2e-16 ***
## I(age^2)     8.691e+00  8.481e-01  10.248  < 2e-16 ***
## y81:nearinc -2.192e+04  6.360e+03  -3.447 0.000644 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25540 on 315 degrees of freedom
## Multiple R-squared:  0.4144, Adjusted R-squared:  0.4052 
## F-statistic: 44.59 on 5 and 315 DF,  p-value: < 2.2e-16

Other variables such as \(intst\), \(land\), \(area\), \(rooms\), and \(baths\) can be added as well.

summary(lm(rprice~y81+nearinc+y81:nearinc+age+I(age^2)+intst+land+area+rooms+baths,data=kiel))
## 
## Call:
## lm(formula = rprice ~ y81 + nearinc + y81:nearinc + age + I(age^2) + 
##     intst + land + area + rooms + baths, data = kiel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76721  -8885   -252   8433 136649 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.381e+04  1.117e+04   1.237  0.21720    
## y81          1.393e+04  2.799e+03   4.977 1.07e-06 ***
## nearinc      3.780e+03  4.453e+03   0.849  0.39661    
## age         -7.395e+02  1.311e+02  -5.639 3.85e-08 ***
## I(age^2)     3.453e+00  8.128e-01   4.248 2.86e-05 ***
## intst       -5.386e-01  1.963e-01  -2.743  0.00643 ** 
## land         1.414e-01  3.108e-02   4.551 7.69e-06 ***
## area         1.809e+01  2.306e+00   7.843 7.16e-14 ***
## rooms        3.304e+03  1.661e+03   1.989  0.04758 *  
## baths        6.977e+03  2.581e+03   2.703  0.00725 ** 
## y81:nearinc -1.418e+04  4.987e+03  -2.843  0.00477 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19620 on 310 degrees of freedom
## Multiple R-squared:   0.66,  Adjusted R-squared:  0.6491 
## F-statistic: 60.19 on 10 and 310 DF,  p-value: < 2.2e-16

In general, when additional independent variables are included and the natural logarithm of price is used as the dependent variable, the results show that homes near the incinerator lost about 9.3% of their value.