8 Confidence Intervals

This chapter introduces the concept of confidence intervals and interval estimation. An interval estimate is often more useful than a point estimate of the unknown population parameter \(\theta\) because it gives a range of values within which the parameter could plausibly fall. There are slides and a YouTube video associated with this chapter.

This chapter on confidence intervals, as well as the subsequent chapter on hypothesis testing, makes significant use of the t-distribution. Interval estimation allows us to put lower and upper bounds on the parameter, i.e., \(\hat{\theta}_l < \theta < \hat{\theta}_u\). Usually a confidence level of 95% (i.e., \(\alpha = 0.05\)) is used.

\[ Pr(\hat{\theta}_l < \theta < \hat{\theta}_u) = 1 - \alpha\]

The 95% confidence interval for a parameter is an interval calculated from sample data by a method that has a 95% probability of producing an interval containing the true parameter value. Suppose a population has a known mean of 65 inches, e.g., the height of women in the United States. Taking 100 samples of women from the population and calculating a confidence interval for each sample using the methods described below, the true mean of 65 will be included (on average) in 95 of those confidence intervals. Note that the following statement is not correct:

The probability that the unknown parameter is contained within a 95% confidence interval is 95%.

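Once a particular interval has been calculated, it either contains the true parameter or it does not; the 95% describes the long-run performance of the method, not any single interval. There is also no way of knowing whether a given confidence interval actually covers the true parameter.

Remember that we have \(E(\bar{x})=\mu\) and \(Var(\bar{x})=\sigma^2/n\), so the standard deviation of the sample mean (the standard error) is \(\sigma/\sqrt{n}\). The mean \(\pm 1.96\) standard deviations includes 95% of a normal distribution. Because the sampling distribution of \(\bar{x}\) is approximately normal (recall this fact from the Central Limit Theorem), a distance of 1.96 standard errors, i.e., \(1.96 \cdot \sigma/\sqrt{n}\), is the margin of error for a 95% confidence interval. The margin of error measures how accurate the point estimate is likely to be in estimating the parameter.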
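As a small worked illustration of the margin of error, the sketch below computes a 95% interval by hand for one simulated sample. The population mean of 65 is taken from the height example above; the standard deviation of 3.5, the sample size of 100, and the seed are assumptions made only for this illustration. Because \(\sigma\) is treated as unknown here, the sample standard deviation is used in its place.

```r
set.seed(123)                          # assumed seed, for reproducibility only
n    = 100                             # assumed sample size
x    = rnorm(n, mean = 65, sd = 3.5)   # sd = 3.5 is an assumed value
xbar = mean(x)                         # point estimate of the population mean
se   = sd(x) / sqrt(n)                 # estimated standard error of xbar
z    = qnorm(0.975)                    # approx. 1.96 for a 95% interval
moe  = z * se                          # margin of error
ci   = c(lb = xbar - moe, ub = xbar + moe)
ci
```

For the same sample, t.test(x)$conf.int returns a slightly wider interval because it uses the t-distribution quantile qt(0.975, n - 1) in place of 1.96.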

This section is designed to illustrate the concept of a confidence interval with R. A population of 1 million voters is generated with 55% of those voters favoring candidate A in an upcoming election. For this exercise, 95% confidence intervals are simulated with the following script:

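voters = rbinom(1000000,1,0.55)                 # population: 1 = favors candidate A
output = data.frame(lb=numeric(),ub=numeric(),inside=numeric())
meanA  = mean(voters)                           # true share of voters favoring A
for(i in 1:100){
     poll = sample(voters,1000,replace=FALSE)   # poll of 1,000 voters
     CI   = t.test(poll)                        # 95% confidence level by default
     # conf.int[1] is the lower bound and conf.int[2] is the upper bound
     temp = data.frame(lb=CI$conf.int[1],ub=CI$conf.int[2],inside=0)
     if(CI$conf.int[1]<=meanA && CI$conf.int[2]>=meanA){temp$inside=1}
     output = rbind(output,temp)}
mean(output$inside)                             # share of intervals covering meanA
rm(voters,output,meanA,poll,CI,temp)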

In a first step, a population of 1 million voters is created. The subsequent steps take 100 samples of 1,000 voters. For each sample, the confidence interval (i.e., lower and upper bounds denoted as lb and ub, respectively) is calculated. The next step checks whether the true mean is contained in the confidence interval. If the above code is executed, the output of mean() should be around 0.95.
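Because only 100 intervals are drawn, the share reported by mean() can easily land a few percentage points away from 0.95. A minimal sketch of the same experiment with more replications is given below; the seed, the 2,000 replications, and the helper name covers() are assumptions introduced for illustration and are not part of the original script.

```r
set.seed(42)                               # assumed seed, for reproducibility only
voters = rbinom(1000000, 1, 0.55)          # same kind of population as above
meanA  = mean(voters)                      # realized share favoring candidate A

# covers() is a hypothetical helper: it draws one poll and reports whether
# the 95% confidence interval from that poll contains the population share
covers = function(n = 1000) {
     poll = sample(voters, n, replace = FALSE)
     ci   = t.test(poll)$conf.int
     ci[1] <= meanA && ci[2] >= meanA
}

inside   = replicate(2000, covers())       # one TRUE/FALSE per simulated poll
coverage = mean(inside)                    # share of intervals covering meanA
coverage
```

With more replications the estimated coverage should settle closer to 0.95, since the simulation noise shrinks as the number of simulated polls grows.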