Introduction

In this class, we will use an R notebook to illustrate the code. We will explore the use of packages, and the tidyverse in particular. The tidyverse consists of a collection of packages to do data wrangling in R. All this functionality also exists in base R, but tidyverse makes it easier, more intuitive, and sometimes also faster. A more extensive discussion of this functionality is given in Chapter 3 of the R for Data Science book.

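We will use these packages to read and write data, select, rename, and create variables, subset observations, compute summaries (overall and by group), and chain commands together.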

In addition, we will review several summary characteristics of data. We will also work with the package foreign to read and write dbf files, which we will use a lot later on.

Before we start, make sure you have set the current working directory. It should contain the files Community_Pop.csv and Community_Pop.dbf.

Working with Packages

Installing a package on your system

Install a package using the install.packages command, with the name of the package in quotes. For example, to install the tidyverse package (note, on my system this is already installed), we use the command below. As an alternative to the command line, you can use the Tools > Install Packages … command from the RStudio interface.

install.packages("tidyverse")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/tidyverse_1.2.1.tgz'
Content type 'application/x-gzip' length 88754 bytes (86 KB)
==================================================
downloaded 86 KB

The downloaded binary packages are in
    /var/folders/q_/19vkssyn1c9d1h56wzzl7s0c0000gn/T//Rtmpnbik2a/downloaded_packages

To check which packages you have installed, you can use installed.packages. On my system, this is a very long list, so I will assign it to an object (pp), and then use head to take a peek at the contents.

pp <- installed.packages()
head(pp)
             Package        LibPath                                                         
abind        "abind"        "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
acepack      "acepack"      "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
ade4         "ade4"         "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
adehabitatHR "adehabitatHR" "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
adehabitatHS "adehabitatHS" "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
adehabitatLT "adehabitatLT" "/Library/Frameworks/R.framework/Versions/3.5/Resources/library"
             Version  Priority
abind        "1.4-5"  NA      
acepack      "1.4.1"  NA      
ade4         "1.7-11" NA      
adehabitatHR "0.4.15" NA      
adehabitatHS "0.3.13" NA      
adehabitatLT "0.3.23" NA      
             Depends                                                               
abind        "R (>= 1.5.0)"                                                        
acepack      NA                                                                    
ade4         "R (>= 2.10)"                                                         
adehabitatHR "R (>= 3.0.1), sp, methods, deldir, ade4, adehabitatMA,\nadehabitatLT"
adehabitatHS "R (>= 3.0.1), sp, methods, ade4, adehabitatMA, adehabitatHR"         
adehabitatLT "R (>= 2.10.0), sp, methods, ade4, adehabitatMA, CircStats,\nstats"   
             Imports                                            LinkingTo
abind        "methods, utils"                                   NA       
acepack      NA                                                 NA       
ade4         "graphics, grDevices, methods, stats, utils, MASS" NA       
adehabitatHR "graphics, grDevices, stats"                       NA       
adehabitatHS "graphics, grDevices, stats"                       NA       
adehabitatLT "graphics, grDevices, utils"                       NA       
             Suggests                                                                                                  
abind        NA                                                                                                        
acepack      "testthat"                                                                                                
ade4         "ade4TkGUI, adegraphics, adephylo, ape, CircStats, deldir,\nlattice, pixmap, sp, spdep, splancs, waveslim"
adehabitatHR "maptools, tkrplot, MASS, rgeos"                                                                          
adehabitatHS "maptools, tkrplot, MASS, rgeos"                                                                          
adehabitatLT "maptools, tkrplot, MASS"                                                                                 
             Enhances License              License_is_FOSS License_restricts_use OS_type MD5sum
abind        NA       "LGPL (>= 2)"        NA              NA                    NA      NA    
acepack      NA       "MIT + file LICENSE" NA              NA                    NA      NA    
ade4         NA       "GPL (>= 2)"         NA              NA                    NA      NA    
adehabitatHR NA       "GPL (>= 2)"         NA              NA                    NA      NA    
adehabitatHS NA       "GPL (>= 2)"         NA              NA                    NA      NA    
adehabitatLT NA       "GPL (>= 2)"         NA              NA                    NA      NA    
             NeedsCompilation Built  
abind        "no"             "3.5.0"
acepack      "yes"            "3.5.0"
ade4         "yes"            "3.5.0"
adehabitatHR "yes"            "3.5.0"
adehabitatHS "yes"            "3.5.0"
adehabitatLT "yes"            "3.5.0"
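
Since the full matrix is wide, it can help to look at only a few of its columns, or to ask directly whether a given package is present. The following is a minimal sketch that assumes only base R; the helper is_installed is a hypothetical name introduced here for illustration and is not part of R.

```r
# installed.packages() returns a character matrix with one row per package;
# the row names are the package names.
pp <- installed.packages()

# Keep only a few columns so that the listing fits on the screen.
head(pp[, c("Package", "Version", "Built")])

# Hypothetical helper (not part of R): is a package in the installed list?
is_installed <- function(pkg) pkg %in% rownames(pp)

# Returns one TRUE/FALSE value per package name.
is_installed(c("stats", "tidyverse", "foreign"))
```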

We now also install the package foreign, in the same way as for tidyverse.

install.packages("foreign")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/foreign_0.8-71.tgz'
Content type 'application/x-gzip' length 325064 bytes (317 KB)
==================================================
downloaded 317 KB

The downloaded binary packages are in
    /var/folders/q_/19vkssyn1c9d1h56wzzl7s0c0000gn/T//Rtmpnbik2a/downloaded_packages

Invoking a package

Before we can use the functionality of tidyverse, we need to add it to our library, using the library command. It may seem a little counterintuitive that a package is called a library, but think of the packages as being out there on your system in general, and before you can use them in R, you need to add them to your own currently active library of functions.

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

After you execute this command, you see a list of the different packages that are attached, including readr and dplyr. The Conflicts part means that functions in these packages mask the base R commands of the same name. For example, the filter command in dplyr masks the filter command from stats. If you want to use the latter, you must specify it as stats::filter() (note, double colons). Typically, this will not be necessary, since you have loaded the dplyr package for a reason, i.e., to use its filter command, not the other one.
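
To see the two filter functions side by side, the sketch below applies each of them to a small made-up vector. The values and the object names (toy, big, ma) are assumptions used only for illustration.

```r
library(dplyr)

# Made-up values, for illustration only.
x <- c(1, 2, 3, 4, 5)
toy <- tibble(x = x)

# After library(dplyr), a bare filter() refers to dplyr::filter(),
# which keeps the rows that satisfy a condition.
big <- filter(toy, x > 2)
big

# The stats version remains available through the double colon notation.
# It applies a linear filter, here a three-term moving average.
ma <- stats::filter(x, rep(1 / 3, 3))
ma
```

The two functions share a name but do unrelated things, which is why the double colon form matters whenever the stats version is the one you want.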

We now also make the foreign package active using the library command.

library(foreign)

This time, there are no additional messages.

Reading and Writing Data

Reading spreadsheet-like data

We have already seen how to read a file with comma-separated values (csv) using the base R read.table family of commands, such as read.csv. We also ran into an issue with character data, which we handled with the stringsAsFactors option.

The tidyverse has an alternative function to read such files, read_csv. It deals with the stringsAsFactors issue (strings are kept in character type) and some other issues as well. The result is a so-called tibble, which is basically a data.frame with some additional characteristics.

We will again use the Community_Pop.csv file as our input. We will create a data frame (tibble) as df by passing just the file name to the read_csv function (make sure the file name is in quotes).

df <- read_csv("Community_Pop.csv")
Parsed with column specification:
cols(
  NID = col_integer(),
  CommArea = col_character(),
  POP2010 = col_integer(),
  POP2000 = col_integer()
)

The result is somewhat different from what we saw when we used read.csv. Then, nothing happened, and we had to list the contents of df to see what was entered. Here, there is a brief summary of the variables and their types.
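
If you prefer to state the variable types yourself rather than rely on the guesses made by read_csv, you can pass a column specification through the col_types argument. The sketch below is self-contained: it first writes a tiny made-up file to a temporary location (the values are hypothetical and only mimic the layout of Community_Pop.csv) and then reads it back.

```r
library(readr)

# Hypothetical toy file with the same column names as Community_Pop.csv
# (the values are made up).
tmp <- tempfile(fileext = ".csv")
writeLines(c("NID,CommArea,POP2010,POP2000",
             "1,Area A,1200,1000",
             "2,Area B,800,950"), tmp)

# Spell out the column types instead of letting read_csv guess them.
toy <- read_csv(tmp, col_types = cols(
  NID = col_integer(),
  CommArea = col_character(),
  POP2010 = col_integer(),
  POP2000 = col_integer()
))
toy
```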

Characteristics of a tibble

When we list the just created df, we no longer get the full output, but an abbreviated list, with the types of the variables listed below them. Also, the table is no longer called a data frame (even though it is one), but a tibble.

df

Writing a data frame (or tibble)

Just as we did previously, we can now write out the tibble/data frame to another csv file with the function write_csv (note the underscore instead of a period). In contrast to write.csv, we don’t have to worry about the row names (none are written as the default). For example, we can write df out to myfile2.csv (make sure to have it in quotes).

write_csv(df,"myfile2.csv")

From now on, when we read or write csv files, we will use read_csv and write_csv.
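
The difference in how row names are handled can be checked by writing the same small table with both functions and inspecting the resulting lines. This is only a sketch with made-up values; the file names are temporary files created for the illustration.

```r
library(readr)

# Made-up values, for illustration only.
toy <- data.frame(NID = 1:2, POP2010 = c(1200, 800))

f_tidy  <- tempfile(fileext = ".csv")
f_base  <- tempfile(fileext = ".csv")
f_base2 <- tempfile(fileext = ".csv")

write_csv(toy, f_tidy)                     # no row names
write.csv(toy, f_base)                     # row names in an extra first column
write.csv(toy, f_base2, row.names = FALSE) # base R without row names

# Compare the contents of the three files.
readLines(f_tidy)
readLines(f_base)
readLines(f_base2)
```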

Reading and writing dbf files

Reading a dbf file

In many of our applications, the files will not be comma separated, but in the dBase file format, as a binary file. For example, the data associated with many GIS applications will come that way. There is no functionality to read dBase files in base R or in the tidyverse. Instead, we need to use the function read.dbf from the package foreign that we just installed and activated.

It operates in much the same way as read_csv: to create a data frame, you pass the name of the file (in quotes). Note that this creates a data frame, not a tibble. For example, create the data frame dt from the file Community_Pop.dbf in your current working directory using read.dbf.

dt <- read.dbf("Community_Pop.dbf")

Check the first few lines using the head command.

head(dt)

Turning a data frame into a tibble

To turn a data frame into a tibble, you use as_tibble. For example, we turn dt into the tibble dt2.

dt2 <- as_tibble(dt)

Now, if we print it, we see the familiar data types listed under the variable names.

dt2

Sometimes, we may want to turn a tibble back into a data frame. To that effect, we can use as.data.frame. For example, the write.dbf function only works with data frames, not with tibbles.
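
The conversion in both directions can be sketched on a small made-up data frame (the names toy_df, toy_tbl and back_df are assumptions for illustration). The class of each object shows what it currently is.

```r
library(tibble)

# Made-up values, for illustration only.
toy_df <- data.frame(NID = 1:2, CommArea = c("Area A", "Area B"))

# Data frame to tibble.
toy_tbl <- as_tibble(toy_df)
class(toy_tbl)

# Tibble back to a plain data frame.
back_df <- as.data.frame(toy_tbl)
class(back_df)
```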

Writing a dbf file

We write out our data frame in dbf format using write.dbf, in the same way as before by passing the name of the data frame and the file name (in quotes).

write.dbf(dt,"myfile3.dbf")
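
When the data are held in a tibble, one way to proceed is to convert to a data frame at the moment of writing. The sketch below does a full round trip through a temporary dbf file, using made-up values; the object names are hypothetical.

```r
library(foreign)
library(tibble)

# Made-up values, for illustration only.
toy_tbl <- tibble(NID = 1:2, POP2010 = c(1200, 800))

# Convert the tibble to a plain data frame before passing it to write.dbf.
f <- tempfile(fileext = ".dbf")
write.dbf(as.data.frame(toy_tbl), f)

# Read the file back in (as a data frame) and turn it into a tibble again.
back <- as_tibble(read.dbf(f))
back
```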

Practice

Turn the data contained in the file NYC_Sub_borough_Area.dbf into a tibble called nyc. Check how many observations and variables the data set contains (note, you must turn it into a tibble, not just a data frame).

Working with Variables (Columns)

Selecting variables with select

A column or variable can be selected from a tibble in the same way as for a data frame, using the dollar ($) notation or the double brackets [[ ]]. In addition, with the tidyverse, you can also specify the variable names in a select command, without quotes around the variable name.

For example, to create a new tibble with just the population values, i.e., the variables POP2000 and POP2010, we use select and assign the result to pop. We pass the table df and then a list of variables, separated by commas.

pop <- select(df,POP2000,POP2010)

When we type pop, we see the usual variable type under the variable name.

pop
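
The three ways of getting at columns can be compared on a small made-up tibble (a hypothetical stand-in for df). The dollar and double bracket notations return a single column as a vector, whereas select returns a tibble. The sketch also shows that a minus sign in select drops a variable instead of keeping it.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  NID = 1:3,
  CommArea = c("Area A", "Area B", "Area C"),
  POP2010 = c(1200, 800, 500),
  POP2000 = c(1000, 950, 400)
)

# A single column as a vector: dollar notation or double brackets.
toy$POP2010
toy[["POP2010"]]

# Several columns as a new tibble, in the order listed.
pop <- select(toy, POP2000, POP2010)
pop

# A minus sign drops a column and keeps all the others.
no_id <- select(toy, -NID)
no_id
```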

Renaming variables with rename

Often, the variable names we get in data sources are not that informative or otherwise not exactly what we would want. The rename command lets us change a variable name. Again, we first pass the tibble, and then an expression with new name = old name. One way to remember the order is to think of it as computing a new variable, only the new variable is the same as the old variable, but with a different name.

For example, we change CommArea to Community in df and assign the result to df2 (note, again, no quotes for the variable names).

df2 <- rename(df,Community = CommArea)

Checking the contents of df2.

df2
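
As a quick check of the new name = old name order, the sketch below renames a variable in a small made-up tibble and compares the variable names before and after. The original tibble is left unchanged, since the result is assigned to a new object.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  NID = 1:2,
  CommArea = c("Area A", "Area B")
)

# New name on the left, old name on the right.
toy2 <- rename(toy, Community = CommArea)

names(toy)   # still contains CommArea
names(toy2)  # now contains Community
```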

Creating new variables with mutate

We typically want to carry out computations with the variables in our data frame and add the results to the data frame. This can be done using base R commands, but in the tidyverse it is easily accomplished by means of the mutate command. Again, we pass the data frame and the expression to be calculated as new_variable = expression. To make the addition permanent, we assign the result to a data frame (could be the original data frame).

For example, say we wanted to compute the ten year population change, i.e., the difference between POP2010 and POP2000. We use mutate with df and popdiff = POP2010 - POP2000. We assign the result back to df. We list the contents of df to check.

df <- mutate(df,popdiff = POP2010 - POP2000)
df

Now, let’s also create a logical variable that is TRUE for those community areas with a positive population growth, say popinc = popdiff > 0. Again, we add the variable to the existing data frame by assigning the result of the mutate operation back to df. We print the result to check.

df <- mutate(df,popinc = popdiff > 0)
df
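
The two steps can also be combined, since a single mutate call accepts several expressions and a later expression can use a variable created by an earlier one. A minimal sketch on made-up values (toy is a hypothetical stand-in for df):

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  CommArea = c("Area A", "Area B", "Area C", "Area D"),
  POP2010 = c(1200, 800, 500, 700),
  POP2000 = c(1000, 950, 400, 650)
)

# popinc is computed from popdiff, which is created in the same call.
toy2 <- mutate(toy,
               popdiff = POP2010 - POP2000,
               popinc = popdiff > 0)
toy2
```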

Practice

Create a subset of the nyc data set that contains only the median rent variables (rent2002, rent2005, rent2008), the borough id (code), and name (subborough). Change the variable code to id. Create a new variable with a value of TRUE when the rent is zero.

Working with Observations (Rows)

Subsetting observations with filter

In order to select specific observations that meet a given criterion, we use filter. For example, if we wanted to extract the community areas that had positive population growth, we would use filter on the data frame df with popinc == TRUE (note the double equal sign). We assign the result to a new data frame, say dfpos and print it out to check.

dfpos <- filter(df,popinc == TRUE)
dfpos

The listing reveals that there are 17 rows. Later, we will see a couple of different ways to count how many community areas saw positive population growth.
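
Since popinc is already a logical variable, it can also be passed to filter on its own, and several conditions separated by commas must all hold for a row to be kept. The sketch below illustrates both on made-up values; the object names are hypothetical.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  CommArea = c("Area A", "Area B", "Area C", "Area D"),
  POP2010 = c(1200, 800, 500, 700),
  POP2000 = c(1000, 950, 400, 650)
)
toy <- mutate(toy, popinc = POP2010 > POP2000)

# These two calls keep the same rows.
pos  <- filter(toy, popinc == TRUE)
same <- filter(toy, popinc)

# Conditions separated by a comma are combined with a logical "and".
pos_large <- filter(toy, popinc, POP2010 > 600)

# nrow gives the number of observations that were kept.
nrow(pos)
nrow(pos_large)
```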

Practice

Continue with the NYC rent data set just created and eliminate the observations with zero for the rent in any of the years.

Summaries

Summary on individual variables using summary

We have already seen the use of summary earlier. This works the same way on a tibble as on a data frame. For example, the summary statistics for our data set for the community areas are computed using summary.

summary(df)
      NID       CommArea            POP2010         POP2000          popdiff         popinc       
 Min.   : 1   Length:77          Min.   : 2876   Min.   :  3294   Min.   :-19013   Mode :logical  
 1st Qu.:20   Class :character   1st Qu.:18109   1st Qu.: 18165   1st Qu.: -6017   FALSE:60       
 Median :39   Mode  :character   Median :31028   Median : 33694   Median : -1596   TRUE :17       
 Mean   :39                      Mean   :35008   Mean   : 37611   Mean   : -2603                  
 3rd Qu.:58                      3rd Qu.:48743   3rd Qu.: 52723   3rd Qu.:  -192                  
 Max.   :77                      Max.   :98514   Max.   :117527   Max.   : 12895                  

Descriptive statistics using summarize

The summarize command computes specific descriptive statistics and assigns each of them to a new variable. All the statistics are then combined in a data frame with a single observation. For example, if we wanted the mean of the population in each of the two years, we would create two new variables, say m00 and m10, and set them equal to mean(POP2000) and mean(POP2010), respectively. The summarize command takes the name of the data frame, followed by these expressions.

summarize(df,m00 = mean(POP2000),m10 = mean(POP2010))

The values are the same as what we obtained from the summary. In addition to the mean, summarize can also yield the median, standard deviation (sd), inter-quartile range (IQR), minimum (min), maximum (max), quantile (quantile(variable, percentile)), counts of logical values (sum).

For example, to find out how many community areas have a positive population growth, we apply sum to popinc (remember that TRUE is the same as 1 and FALSE is 0, so the sum of the observations on a logical variable equals the number of TRUE).

summarize(df,c = sum(popinc))
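
The other statistics mentioned above can be requested in the same way, several at a time. The sketch below uses made-up values and hypothetical variable names (med00, sd00, and so on) to show one possible call.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  POP2010 = c(1200, 800, 500, 700),
  POP2000 = c(1000, 950, 400, 650)
)

# Each expression yields one column in a one-row result.
stats00 <- summarize(toy,
                     med00 = median(POP2000),
                     sd00 = sd(POP2000),
                     iqr00 = IQR(POP2000),
                     min00 = min(POP2000),
                     max00 = max(POP2000),
                     q90 = quantile(POP2000, 0.9),
                     ninc = sum(POP2010 > POP2000))
stats00
```

Note that the logical condition can be given directly inside sum, without first creating a separate variable.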

Summaries by subset using group_by

The main power of summarize is its use in combination with group_by, which computes the descriptive statistics for subsets of the data. For example, say we wanted to compute the mean population separately for the community areas that saw growth and those that saw decline.

We could of course use filter to create two separate data frames and repeat the calculation for each of them. Instead, we use group_by to make the grouping internal to the data frame and then apply the summarize.

For example, first we create the data frame dfgroup using group_by with popinc. We also print the result to check it.

dfgroup <- group_by(df,popinc)
dfgroup

At first sight, not much has changed: the data are the same, although the printout now also lists the grouping variable. But now, if we use summarize for the two means on the new data frame, the results give us two rows, one for popinc FALSE and one for TRUE.

summarize(dfgroup,m00 = mean(POP2000),m10 = mean(POP2010))

A very useful summary statistic is n(), which gives the count of observations by group. Here, we store it in the variable count.

summarize(dfgroup,count=n())
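
The whole sequence can be tried out on a small made-up tibble. In this sketch (all object names are hypothetical), group_vars is used to confirm which variable defines the groups, and the means and counts are computed in a single summarize call.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  POP2010 = c(1200, 800, 500, 700),
  POP2000 = c(1000, 950, 400, 650)
)
toy <- mutate(toy, popinc = POP2010 > POP2000)

# Group by the logical variable; the data themselves are unchanged.
toygroup <- group_by(toy, popinc)
group_vars(toygroup)

# One row per group, with the means and the number of observations.
bygroup <- summarize(toygroup,
                     m00 = mean(POP2000),
                     m10 = mean(POP2010),
                     count = n())
bygroup
```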

Practice

Compute the median of the median rent in 2008 for the NYC sub-boroughs grouped by whether they had above or below median rent in 2002.

Chaining commands

The real power of the various manipulations of data frames in the tidyverse comes from chaining commands using the so-called pipe. In the manipulations above, we had to create a new data frame at each step, even though those intermediate results were not necessarily of interest in and of themselves. The pipe operator, written as %>%, passes the output of one operation on as the input to the next operation.

For example, say we wanted to compute a logical variable for population growth (called popch here, and defined in the same way as popinc) and then immediately use it to group the observations and compute the mean by group. So, basically, the same as what we did before. Now we organize these operations by starting with our data frame, piping it into the mutate operation to compute the new variable, then piping that into the group_by operation, and finally into summarize.

df %>% mutate(popch = (POP2010 - POP2000) > 0) %>%
  group_by(popch) %>% summarize(m00 = mean(POP2000),m10 = mean(POP2010), count = n())

If we want to keep the results as a new data frame, we assign it to a new object.

dfch <- df %>% mutate(popch = (POP2010 - POP2000) > 0) %>%
  group_by(popch) %>% summarize(m00 = mean(POP2000),m10 = mean(POP2010),count=n())
dfch
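
To make clear what the pipe does, the sketch below writes the same computation twice on made-up values: once as nested function calls, which have to be read from the inside out, and once as a chain, which reads from left to right. The object names nested and piped are hypothetical.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  POP2010 = c(1200, 800, 500, 700),
  POP2000 = c(1000, 950, 400, 650)
)

# Without the pipe: nested calls, innermost operation first.
nested <- summarize(
  group_by(mutate(toy, popch = (POP2010 - POP2000) > 0), popch),
  m00 = mean(POP2000), m10 = mean(POP2010), count = n()
)

# With the pipe: the same steps in the order in which they are carried out.
piped <- toy %>%
  mutate(popch = (POP2010 - POP2000) > 0) %>%
  group_by(popch) %>%
  summarize(m00 = mean(POP2000), m10 = mean(POP2010), count = n())

# The two results contain the same values.
all.equal(as.data.frame(nested), as.data.frame(piped))
```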

Practice

Create a new data frame with the summary statistics for the rent in 2005 and 2008 grouped by whether the sub-boroughs were above or below the median in 2002.

