3.4 Data in R

The main data types which can appear in the Environment window of R are:

Vectors
- preselection = seq(1788,2016,4)
- midterm = seq(by=4,to=2018,from=1790)
Matrix
- somematrix = matrix(8,10,4)
- Only numerical values are allowed.
Data frames
- By far, the most common data type in R.
- Comparable to an Excel sheet.
- More on this later.
Lists
- Collection of objects from of various types.
- myfirstlist = list(preselection,midterm,somematrix)

3.4.1 Using R as a Calculator

Entering heights of people and storing it in a vector named height:

height = c(71,77,70,73,66,69,73,73,75,76)

Calculating the sum, product, natural log, mean, and (element-wise) squaring is done with the following commands:

sum(height)
prod(height)
log(height) # Default is the natural log
meanheight = mean(height)
heightsq = height^2

Removing (i.e., deleting) unused elements: rm(heightsq,meanheight)

3.4.2 Creating a Data Frame from Scratch

Data frames are the most commonly used tables in R/RStudio. They are similar to an Excel sheet.

Column names represent the variables and rows represent observations.
Column names must be unique and without spaces.

Suggestion: Use only lower-case variable names and objects.

studentid       = 1:10
studentnames    = c("Andrew","Linda","William","Daniel","Gina",
                    "Mick","Sonny","Wilbur","Elisabeth","James")
students        = data.frame(studentid,studentnames,height)
rm(studentid,height,studentnames)

3.4.3 In-class Exercise 2

Create a data frame called students containing the following information:

Name	Economics	English
Mindy	80.0	52.5
Gregory	60.0	60.0
Shubra	95.0	77.5
Keith	77.5	30.0
Louisa	97.5	95.0

Notes:

Use $name$ as the column header for the students’ names.
Once you have created the data frame, remove the unused vectors.

3.4.4 Indexing

Indexing refers to identifying elements in your data:

For most objects: students[row number,coloumn number]
- students[3,2] returns 95. What does students[3,] return?
If you want to select certain columns: students[c("name")]
- Other example: students[c("name","english")]
Selecting results based on certain conditions: students[which(students$economics>80),]

Referring to a particular column in a data frame is done through the dollar symbol:

students$english
You will use this functionality very often.

Creating a new column: students$average = rowMeans(students[c("economics","english")])

3.4.5 Importing Data into R

In almost all cases, the data is imported into R from an external data set. The data has to be “machine-readable” which means that the first row must contain the variable names and the actual data starts in the second row. Machine-readable data can be imported as follows:

read.csv("filename.csv"): If you have a comma separated value (.csv) file then this is the easiest and preferred way to import data.
readWorkbook(file="filename.xlsx",sheet="sheet name"): Requires the package openxlsx. Note that there are many packages reading Excel and this is one of the most reliable and user-friendly.
Importing data from other software packages (e.g., SAS, Stata, Minitab, SPSS) or .dbf (database) files can be achieved using the package foreign. The package works also for Stata data up to version 12. To important data from Stata version 13 and above, the package readstata13.

3.4.6 Sub-setting a Data Frame

To extract variables or observations based on certain criteria, the command subset() must be used. Consider the data vehicles. Extracting vehicle information only for the year 2015 is done as follows:

cars2015 = subset(vehicles,year==2015)

Note that the double equal sign conducts a logical test. This is similar to Stata. Using a single equal sign does not extract any data and simply returns the original data without (!) an error message.

To list all the distinct values in a column, the command unique() can be used. This command only makes sense in the case of categorical data in a particular column. For example, listing all EPA vehicle size classes ($VClass$) can be accomplished as follows:

unique(cars2015$VClass)

Suppose you are only interested in the variables $ghgScore$ and $VClass$ for the model year 2015.

cars2015 = subset(vehicles,year==2015,select=c("ghgScore","VClass"))

Suppose you are only interested in “Compact Cars” and “Large Cars” in the column $VClass$ for the year 2015. There the notation is a bit odd (note that the many line breaks are not necessary to include in R):

cars = subset(vehicles,
              year==2015 & vclass %in% c("Compact Cars", "Large Cars"),
              select=c("make","co2tailpipegpm","vclass"))

3.4.7 In-class Exercises 3

From the vehicles data set, extract the GHG Score and the vehicle class from the 2014 model year for the following manufacturers: Toyota, Ford, and Audi. Your new data set should contain the following columns: $ghgScore$, $make$, and $VClass$. Is the resulting data frame sensible or do you see a problem?

3.4.8 Aggregating Data

To aggregate data based on a function, e.g., sum or mean:

cars = aggregate(cars$co2tailpipegpm,FUN=mean,by=list(cars$make,cars$vclass))

3.4.9 Writing Data Frame to .csv-File

To write data to the current working directory:

write.csv(cars2014,"cars2014.csv",row.names=FALSE)

Using the option row.names=FALSE avoids an index column in the output file.

3.4.10 Reshaping Data from Long to Wide and Viceversa

R has the ability to reshape data from long to wide format and back. For this demonstration, we use the data in compactcars and the command reshape() from the package reshape2:

cars = melt(compactcars,id=c("year","make","model","displ","drive"))

Reshaping the data is generally very useful but also tricky. For detailed information see the section How can I reshape my data in R.

3.4.11 Extending the Basic `table()` Function

The required package to extend the basic table() function is called gmodels. Compare the outputs of the functions table() and CrossTable(). The commands below do the following:

Standard table() function.
The function CrossTable() includes the proportions along the two dimensions of gun ownership and gender. Note that a description of the cell content is at the top of the results page.

df        = subset(gss,year==2022,
                   select=c("owngun","sex"))
table(df$owngun,df$sex)

##    
##       1   2
##   1 424 331
##   2 623 879
##   3  27  17

CrossTable(df$owngun,df$sex,prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2301 
## 
##  
##              | df$sex 
##    df$owngun |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |       424 |       331 |       755 | 
##              |     0.562 |     0.438 |     0.328 | 
##              |     0.395 |     0.270 |           | 
##              |     0.184 |     0.144 |           | 
## -------------|-----------|-----------|-----------|
##            2 |       623 |       879 |      1502 | 
##              |     0.415 |     0.585 |     0.653 | 
##              |     0.580 |     0.716 |           | 
##              |     0.271 |     0.382 |           | 
## -------------|-----------|-----------|-----------|
##            3 |        27 |        17 |        44 | 
##              |     0.614 |     0.386 |     0.019 | 
##              |     0.025 |     0.014 |           | 
##              |     0.012 |     0.007 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1074 |      1227 |      2301 | 
##              |     0.467 |     0.533 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Note that for almost any R command, you can store the output by assigning it to an object:

somename = CrossTable(gssgun$owngun,gssgun$sex,prop.chisq=FALSE)

3.4.12 Merging Datasets

Consider two data sets from school districts in Ohio:

ohioscore contains an identifier column IRN and a score that indicates the quality of the school.
ohioincome contains the same identifier than the previous sheet in addition to median household income and enrollment.

To merge the two data frames based on the column $IRN$, the function merge() must be used:

ohioschool = merge(ohioscore,ohioincome,by=c("irn"))