3.4 Data in R
The main data types which can appear in the Environment window of R are:
- Vectors
preselection = seq(1788,2016,4)
midterm = seq(by=4,to=2018,from=1790)
- Matrix
somematrix = matrix(8,10,4)
- Only numerical values are allowed.
- Data frames
- By far, the most common data type in R.
- Comparable to an Excel sheet.
- Lists
- Collection of objects from of various types.
myfirstlist = list(preselection,midterm,somematrix)
3.4.1 Using R as a Calculator
Entering heights of people and storing it in a vector named height
:
Calculating the sum, product, natural log, mean, and (element-wise) squaring is done with the following commands:
sum(height)
prod(height)
log(height)
# Default is the natural logmeanheight = mean(height)
heightsq = height^2
Removing (i.e., deleting) unused elements: rm(heightsq,meanheight)
3.4.2 Creating a Data Frame from Scratch
Data frames are the most commonly used tables in R/RStudio. They are similar to an Excel sheet.
- Column names represent the variables and rows represent observations.
- Column names must be unique and without spaces.
Suggestion: Use only lower-case variable names and objects.
3.4.3 In-class Exercise 2
Create a data frame called students
containing the following information:
Name | Economics | English |
---|---|---|
Mindy | 80.0 | 52.5 |
Gregory | 60.0 | 60.0 |
Shubra | 95.0 | 77.5 |
Keith | 77.5 | 30.0 |
Louisa | 97.5 | 95.0 |
Notes:
- Use \(name\) as the column header for the students’ names.
- Once you have created the data frame, remove the unused vectors.
3.4.4 Indexing
Indexing refers to identifying elements in your data:
- For most objects:
students[row number,coloumn number]
students[3,2]
returns 95. What doesstudents[3,]
return?
- If you want to select certain columns:
students[c("name")]
- Other example:
students[c("name","english")]
- Other example:
- Selecting results based on certain conditions:
students[which(students$economics>80),]
Referring to a particular column in a data frame is done through the dollar symbol:
students$english
- You will use this functionality very often.
Creating a new column: students$average = rowMeans(students[c("economics","english")])
3.4.5 Importing Data into R
In almost all cases, the data is imported into R from an external data set. The data has to be “machine-readable” which means that the first row must contain the variable names and the actual data starts in the second row. Machine-readable data can be imported as follows:
read.csv("filename.csv")
: If you have a comma separated value (.csv) file then this is the easiest and preferred way to import data.readWorkbook(file="filename.xlsx",sheet="sheet name")
: Requires the package openxlsx. Note that there are many packages reading Excel and this is one of the most reliable and user-friendly.- Importing data from other software packages (e.g., SAS, Stata, Minitab, SPSS) or .dbf (database) files can be achieved using the package foreign. The package works also for Stata data up to version 12. To important data from Stata version 13 and above, the package readstata13.
3.4.6 Sub-setting a Data Frame
To extract variables or observations based on certain criteria, the command subset()
must be used. Consider the data vehicles
. Extracting vehicle information only for the year 2015 is done as follows:
cars2015 = subset(vehicles,year==2015)
Note that the double equal sign conducts a logical test. This is similar to Stata. Using a single equal sign does not extract any data and simply returns the original data without (!) an error message.
To list all the distinct values in a column, the command unique()
can be used. This command only makes sense in the case of categorical data in a particular column. For example, listing all EPA vehicle size classes (\(VClass\)) can be accomplished as follows:
unique(cars2015$VClass)
Suppose you are only interested in the variables \(ghgScore\) and \(VClass\) for the model year 2015.
cars2015 = subset(vehicles,year==2015,select=c("ghgScore","VClass"))
Suppose you are only interested in “Compact Cars” and “Large Cars” in the column \(VClass\) for the year 2015. There the notation is a bit odd (note that the many line breaks are not necessary to include in R):
3.4.7 In-class Exercises 3
From the vehicles data set, extract the GHG Score and the vehicle class from the 2014 model year for the following manufacturers: Toyota, Ford, and Audi. Your new data set should contain the following columns: \(ghgScore\), \(make\), and \(VClass\). Is the resulting data frame sensible or do you see a problem?
3.4.9 Writing Data Frame to .csv-File
To write data to the current working directory:
write.csv(cars2014,"cars2014.csv",row.names=FALSE)
Using the option row.names=FALSE
avoids an index column in the output file.
3.4.10 Reshaping Data from Long to Wide and Viceversa
R has the ability to reshape data from long to wide format and back. For this demonstration, we use the data in compactcars
and the command reshape()
from the package reshape2:
Reshaping the data is generally very useful but also tricky. For detailed information see the section How can I reshape my data in R.
3.4.11 Extending the Basic table()
Function
The required package to extend the basic table()
function is called gmodels. Compare the outputs of the functions table()
and CrossTable()
. The commands below do the following:
- Standard
table()
function. - The function
CrossTable()
includes the proportions along the two dimensions of gun ownership and gender. Note that a description of the cell content is at the top of the results page.
##
## female male
## no 741 505
## refused 20 30
## yes 291 302
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1889
##
##
## | gss$sex
## gss$owngun | female | male | Row Total |
## -------------|-----------|-----------|-----------|
## no | 741 | 505 | 1246 |
## | 0.595 | 0.405 | 0.660 |
## | 0.704 | 0.603 | |
## | 0.392 | 0.267 | |
## -------------|-----------|-----------|-----------|
## refused | 20 | 30 | 50 |
## | 0.400 | 0.600 | 0.026 |
## | 0.019 | 0.036 | |
## | 0.011 | 0.016 | |
## -------------|-----------|-----------|-----------|
## yes | 291 | 302 | 593 |
## | 0.491 | 0.509 | 0.314 |
## | 0.277 | 0.361 | |
## | 0.154 | 0.160 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1052 | 837 | 1889 |
## | 0.557 | 0.443 | |
## -------------|-----------|-----------|-----------|
##
##
Note that for almost any R command, you can store the output by assigning it to an object:
somename = CrossTable(gssgun$owngun,gssgun$sex,prop.chisq=FALSE)
3.4.12 Merging Datasets
Consider two data sets from school districts in Ohio:
ohioscore
contains an identifier column IRN and a score that indicates the quality of the school.ohioincome
contains the same identifier than the previous sheet in addition to median household income and enrollment.
To merge the two data frames based on the column \(IRN\), the function merge()
must be used: