9.5 Exercises

Milk Containers (*): A bottling machine fills one-gallon containers with 128 fluid ounces of milk. You suspect that there is some variation in the amount filled and you take measurements from 50 containers. The measurements are in the data set milk. Test the hypothesis that the machine fills the containers with more than 128 fluid ounces.
Soda Cans II (*): Consider a machine filling soda cans with a reported average of 360 milliliters (mL). The amounts filled into the cans follow a normal distribution with (unknown) mean $\mu$ and standard deviation $\sigma$. You take a sample of soda cans and measure the volume. Your data (in mL) are found in the data set soda. Test the hypothesis (at the 5% significance level) that the machine fills cans with more than 360 mL.
Paper Mill II (*): The local paper mill claims that it does not discharge more than 1000 gallons of waste water into the White River. An environmental interest group measures the discharge over one week and the data is reported to you in the data set discharge. Formulate and test the hypothesis with regard to the claims of the paper mill.
Meridian Hills II (*): The data set mh1 contains home values of 101 homes in the Meridian Hills area in Indianapolis. Test the hypothesis that the home values are greater than $500,000.
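As a reminder for the exercises above that ask whether a mean exceeds a benchmark, the right-tailed setup can be written as follows. This assumes the one-sample t-test is the intended tool; $\mu_0$ (the benchmark, e.g., 360 mL), $\bar{x}$ (sample mean), $s$ (sample standard deviation), and $n$ (sample size) are notation introduced here for illustration.

$$H_0: \mu \le \mu_0 \qquad H_a: \mu > \mu_0 \qquad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Under $H_0$ with $\mu = \mu_0$, the statistic $t$ follows a t-distribution with $n - 1$ degrees of freedom, and $H_0$ is rejected at significance level $\alpha$ if $t$ exceeds the upper $\alpha$ critical value of that distribution.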
HDI (***): The United Nations Development Programme (UNDP) creates an annual Human Development Report (HDR) including a Human Development Index (HDI). It attempts to measure the quality of life in various countries. According to UNDP: “Human development – or the human development approach – is about expanding the richness of human life, rather than simply the richness of the economy in which human beings live. It is an approach that is focused on people and their opportunities and choices.” Go to the UNDP data webpage and download the 2019 HDR tables by clicking on “Download 2019 Human Development Data All Tables and Dashboards”. For this question, you only need the data contained in sheet “Table 1”:
- The second to last column is named GNI per capita rank minus HDI rank. Interpret the meaning of the column. What does a negative/positive value mean?
- Construct a scatter plot with Gross national income (GNI) per capita on the horizontal axis and Human development index (HDI) on the vertical axis. What do you observe and what can be concluded?
- Subset the original data into two groups. The first group contains the top 10 countries in terms of income. The second group contains the countries ranked 11-20 in terms of income. You can do this separation in Excel. Is there a statistically significant difference in HDI between those two groups?
- Subset the original data into two groups. The first group contains the top 20 countries in terms of income. The second group contains the countries ranked 21-40 in terms of income. Is there a statistically significant difference in HDI between those two groups?
- Compare your answers from the previous two parts. What do you conclude?
Airlines (***): You will analyze airline delay data from the Bureau of Transportation Statistics. The data is contained in the data set airlines. Pick a random airport (except Indianapolis). Answer the following questions:
- We are going to focus on the three major carriers: American Airlines, Delta Air Lines, and United Air Lines. Note that United Air Lines is the result of a merger of United and Continental in 2011. The records for Continental Air Lines in the data set stop in that year. To make the data for United comparable over the entire time frame, add the $arr\_flights$ and $arr\_del15$ numbers for United and Continental between 2003 and 2011. That is, we are looking at the merged company over the entire time horizon.
- Create a column called $delay$ which represents the share of flights delayed by airline, month, and year. Use the columns $arr\_flights$ and $arr\_del15$ for this calculation. Graph the share of delayed arrivals (i.e., $delay$) for the three carriers over time. Is there a pattern? For example, is it trending upward or downward? Is one airline consistently worse than the others? Is an airline improving over time compared to the others?
- Using the data from January 2014 to today, create a boxplot of the $delay$ column grouped by the three airlines.
- Conduct three two-sample hypothesis tests using the $delay$ data from January 2014 to today: (1) United vs. Delta, (2) Delta vs. American, and (3) American vs. United. The null hypothesis for all three tests is that there is no difference in delays. Report and interpret your results.
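The calculations in the Airlines exercise can be sketched as follows. This is a minimal illustration in Python (standard library only) rather than a solution: all numbers are made up, the function names are hypothetical, and the Welch (unequal-variance) version of the two-sample test is an assumption, since the exercise does not say which variant to use.

```python
from math import sqrt
from statistics import mean, variance


def delay_share(arr_flights, arr_del15):
    """Share of arriving flights that were delayed (the delay column)."""
    return arr_del15 / arr_flights


def merged_share(records):
    """Delay share after adding up (arr_flights, arr_del15) pairs,
    e.g., United and Continental in the same month before the merger."""
    flights = sum(f for f, _ in records)
    delayed = sum(d for _, d in records)
    return delay_share(flights, delayed)


def welch_t(x, y):
    """Welch two-sample t statistic and approximate degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Made-up monthly (arr_flights, arr_del15) pairs for two carriers.
carrier_a = [delay_share(f, d) for f, d in [(1000, 200), (1200, 180), (900, 171), (1100, 242)]]
carrier_b = [delay_share(f, d) for f, d in [(800, 120), (950, 171), (1000, 160), (1050, 147)]]

t_stat, df = welch_t(carrier_a, carrier_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The p-value then comes from a t-distribution with the reported degrees of freedom; a statistics package computes the statistic and the p-value in one step.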
Compact Cars (**): Consider the data in compactcars. For a long time, cars with a manual transmission were more fuel efficient than cars with an automatic transmission. This has changed in recent years due to improvements for automatic transmissions. In this exercise, you will conduct two paired hypothesis tests: one for compact cars of the 1995 model year and one for the 2015 model year. The data set contains only vehicles and models of the EPA category Compact Cars for which the identical model was available with either automatic or manual transmission. Conduct a paired hypothesis test for 1995 and 2015 with the null hypothesis that there is no difference in fuel efficiency. Based on your calculations, what do you conclude? Note that you are not conducting a hypothesis test to compare the 1995 and 2015 fuel efficiency. It is fairly intuitive and clear that the fuel efficiency has improved over that time period.
Automatic vs. Manual Transmission (**): This question is based on the same motivation as the question “Compact Cars”. Consider the data in fetransmission. Pick a vehicle class of your choice as well as one year in the 1980s and one year in the 2010s. Conduct a paired hypothesis test (individually for each year) with the null hypothesis that there is no difference in fuel economy. Based on your calculations, what do you conclude?
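For the two paired-test exercises above, the test works on the within-model differences rather than on the two columns separately. The sketch below shows that idea in Python (standard library only); the fuel economy values and the function name are made up for illustration and are not taken from compactcars or fetransmission.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for matched observations."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # One difference per matched pair (same model, different transmission).
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up miles per gallon for five models, each available with both transmissions.
manual_mpg = [30, 28, 33, 27, 31]
automatic_mpg = [28, 27, 30, 26, 30]

t_stat, df = paired_t(manual_mpg, automatic_mpg)
print(f"t = {t_stat:.2f}, df = {df}")
```

Under the null hypothesis of no difference, the statistic is compared against a t-distribution with the reported degrees of freedom.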
Green Laws (**): Go to the data repository of the General Social Survey (GSS). Read through the page to familiarize yourself with the GSS. This data goes beyond the homework but could be useful to you in the future, either for work or if you are interested in a particular question about public opinion. If you are interested in a particular topic, go to Browse Variables. For this question, search for the variable GRNLAWS.
- What is the question associated with this variable and which years are covered?
- Construct the 95% confidence interval for the years covered by this question. Interpret in context. Can you conclude whether or not a majority or minority of the population would answer yes?
- How has this variable evolved over the years? Make sure to report the share of respondents in favor. Include a graph with time on the horizontal axis.
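The confidence interval for a share asked for above can be sketched as follows, assuming the usual normal approximation for a sample proportion. The code is Python (standard library only); the counts are made up and are not GSS results, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a population share."""
    p_hat = successes / n
    # Critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Made-up example: 620 of 1000 respondents answer yes.
lower, upper = proportion_ci(620, 1000)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

If the whole interval lies above 0.5, the data support a majority; if it lies entirely below 0.5, they support a minority; if it contains 0.5, neither conclusion can be drawn at that confidence level.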
Ohio Schools I (***): The data sets ohioincome and ohioscore contain information about the school districts in Ohio with regard to enrollment, overall school performance (think of that as a measure of how good a school is), and median income. First, merge the two data sets based on IRN (serving as an identifier in the two data sets) using the R command merge(). Test the hypothesis that there is no difference in performance for the top 25% and bottom 25% of schools in terms of median income. That is, you are testing the hypothesis that low median income and high median income school districts are performing equally well.
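The data preparation described above (merge on IRN, then keep the bottom and top quartile by income) can be sketched in Python with the standard library. The rows and the column names median_income and score are made up for illustration; only IRN comes from the exercise. The merge keeps only districts present in both data sets, which mirrors the default behavior of R's merge().

```python
from statistics import quantiles

# Made-up rows; IRN 9 has no matching score and is dropped by the merge.
income_rows = [{"IRN": i, "median_income": 25000 + 5000 * i} for i in range(1, 10)]
score_rows = [{"IRN": i, "score": 55 + 4 * i} for i in range(1, 9)]


def inner_merge(left, right, key):
    """Keep rows whose key appears in both lists and combine their columns."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


def split_by_quartile(rows, column):
    """Return the rows at or below the first and at or above the third quartile."""
    q1, _, q3 = quantiles([row[column] for row in rows], n=4)
    bottom = [row for row in rows if row[column] <= q1]
    top = [row for row in rows if row[column] >= q3]
    return bottom, top


merged = inner_merge(income_rows, score_rows, "IRN")
bottom, top = split_by_quartile(merged, "median_income")
print([row["IRN"] for row in bottom], [row["IRN"] for row in top])
```

The performance values of the two resulting groups are then the inputs to a two-sample hypothesis test.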
Basel (***): Consider the housing data in basel, which contains home values in some towns in the Swiss canton of Basel-Landschaft. Calculate the cost per square meter of living area in a new column. Next, add a column that indicates whether the house is located in Liestal (the canton’s capital) with a 1 and 0 otherwise. Conduct a hypothesis test that the houses in the canton’s capital are as expensive as those in the surrounding towns.
Fear and Guns (***): Consider the variables $fear$ and $owngun$ in the data set gss. For the year 2022, test the hypothesis that people who are afraid to walk alone at night within one mile of where they live are more likely to own a firearm.
Domestic Vehicle Preferences (**): Consider the data in nhtsveh, specifically the variables $census\_d$ and $make$. It is very important that you consult the codebook for the 2022 NHTS available here. Test the hypothesis that the share of domestic car manufacturers amongst households is identical in each state. Make sure to conduct those tests in a pairwise manner (but not a paired hypothesis test).
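Testing shares "in a pairwise manner" means running one two-sample test of proportions for every pair of groups, which is different from a paired test on matched observations. The sketch below illustrates this in Python (standard library only), assuming a pooled two-proportion z-test; the group labels and counts are made up and are not NHTS values.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up (domestic vehicles, all vehicles) counts for three hypothetical groups.
counts = {"Group A": (120, 200), "Group B": (90, 200), "Group C": (118, 200)}

# One test per unordered pair of groups.
results = {
    (g1, g2): two_prop_z(*counts[g1], *counts[g2])
    for g1, g2 in combinations(counts, 2)
}

for (g1, g2), (z, p) in results.items():
    print(f"{g1} vs. {g2}: z = {z:.2f}, p = {p:.4f}")
```

With k groups this produces k(k-1)/2 tests, so the number of comparisons grows quickly as groups are added.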