Chapter 12 Spatial Data Handling

12.1 Learning Objectives

  • Read data from an open data portal
  • Manipulate non-spatial data in R
  • Convert data from lat/lon into a simple features object
  • Project spatial data
  • Join data based on spatial attributes
  • Create a basic choropleth map

12.2 Functions Learned

  • loading packages: library()
  • reading from an API: read.socrata()
  • data exploration: head(), dim(), class(), names(), is.na(), plot()
  • selecting data: filter(), select()
  • creating a spatial object: st_as_sf(), st_crs(), read_sf(), st_geometry()
  • projecting data: st_transform()

Hint: For each new function we go over, type ? in front of it in the console to pull up the help page.

12.3 Interactive Tutorial

The notebook with a more detailed version of this workshop can be found here.

We will not cover everything in Luc’s detailed tutorial, but we will go over basic commands and common gotchas, in order to help you feel comfortable working through the tutorial.

Please make sure you have the following packages installed:

  • tidyverse
  • RSocrata
  • sf
  • tmap
  • lubridate

You can install all of these at once with the following command:

install.packages(c("tidyverse", "RSocrata", "sf", "tmap", "lubridate"))

We start by loading the packages we need for this workshop. Here, I’ve loaded the “tidyverse” package with the library() function.

library(tidyverse)

Challenge

Can you load the RSocrata, sf, tmap, and lubridate packages with the same function?

12.4 Reading data from an open data portal

We then read data about 311 calls from a URL, otherwise known as an API. This is a straightforward way to quickly get data from an open data portal, without having to download and manage the data file locally. Here’s the data documentation site, from the City of Chicago. This will take a while to run, as we’re pulling over 250,000 observations

vehicle_data <- read.socrata("https://data.cityofchicago.org/resource/suj7-cg3j.csv")

Challenge

Try calling the head(), dim(), and class() functions on the new vehicle_data R object. What does the data look like? How many observations and variables are there?

12.5 Selecting data to work with

In general, I may only be interested in a subset of the data. I’ll use the filter command to pull out only 311 calls from 2005. That year() function is pulling out the year from the date column (spreadsheet software often has this functionality as well). The %>% is known as the “pipe”, and means “take vehicle_data, and pass it as the first argument to the next function. It comes in handy when I want to perform multiple operations in the”tidyverse".

vehicle_data %>% 
  filter(year(creation_date) == 2005)
##   creation_date          status completion_date service_request_number
## 1    2005-02-10       Completed      2017-01-11            05-00213608
## 2    2005-03-30 Completed - Dup      2017-01-11            05-00464206
## 3    2005-06-16 Completed - Dup      2017-01-11            05-00932280
## 4    2005-06-16       Completed      2017-01-11            05-00932409
## 5    2005-06-16       Completed      2017-01-11            05-00933568
## 6    2005-06-17 Completed - Dup      2017-01-11            05-00941492
## 7    2005-08-01 Completed - Dup      2017-01-11            05-01221973
## 8    2005-08-26 Completed - Dup      2017-01-11            05-01396640
##       type_of_service_request license_plate       vehicle_make_model
## 1 Abandoned Vehicle Complaint       32L5657                 Chrysler
## 2 Abandoned Vehicle Complaint                                       
## 3 Abandoned Vehicle Complaint       T159050                    Honda
## 4 Abandoned Vehicle Complaint                                       
## 5 Abandoned Vehicle Complaint                                       
## 6 Abandoned Vehicle Complaint                                 Toyota
## 7 Abandoned Vehicle Complaint       5823771 Oldsmobile/Cutlass/Ciera
## 8 Abandoned Vehicle Complaint                                       
##   vehicle_color current_activity most_recent_action
## 1         Black    FVI - Outcome  Create Work Order
## 2                                                  
## 3         Black    FVI - Outcome  Create Work Order
## 4                  FVI - Outcome  Create Work Order
## 5                  FVI - Outcome  Create Work Order
## 6          Blue    FVI - Outcome  Create Work Order
## 7          Blue                                    
## 8                                                  
##   how_many_days_has_the_vehicle_been_reported_as_parked_
## 1                                                     NA
## 2                                                     NA
## 3                                                     NA
## 4                                                     NA
## 5                                                     NA
## 6                                                     NA
## 7                                                     NA
## 8                                                     NA
##         street_address zip_code x_coordinate y_coordinate ward
## 1    1030 S MENARD AVE    60644           NA           NA   29
## 2   917 W EASTWOOD AVE    60640      1169205      1931086   46
## 3   8905 S BRANDON AVE    60617      1198886      1846522   10
## 4     9127 S ELLIS AVE    60619           NA           NA    8
## 5    1855 S CLINTON ST    60616           NA           NA   25
## 6    4829 W WOLFRAM ST    60641      1143697      1918515   31
## 7     2507 W CERMAK RD       NA      1160049      1889322   28
## 8 7321 S PRINCETON AVE    60621      1175560      1856415    6
##   police_district community_area ssa latitude longitude
## 1              15             25  NA 41.86824 -87.76946
## 2              23              3  34 41.96628 -87.65349
## 3               4             46  NA 41.73358 -87.54686
## 4               4             47  NA 41.72860 -87.59964
## 5              12             31  NA 41.85588 -87.64027
## 6              25             19  NA 41.93232 -87.74782
## 7              10             30  NA 41.85190 -87.68826
## 8               7             69  NA 41.76110 -87.63196
##                                       location location_address
## 1 POINT (41.86823989759004 -87.76946043478485)             <NA>
## 2 POINT (41.96628368822802 -87.65349140566876)             <NA>
## 3  POINT (41.73357538693858 -87.5468638464287)             <NA>
## 4 POINT (41.72859955416313 -87.59963819790387)             <NA>
## 5 POINT (41.85588399821447 -87.64026997646823)             <NA>
## 6 POINT (41.93231981463456 -87.74781719104732)             <NA>
## 7  POINT (41.85189851760397 -87.6882562971527)             <NA>
## 8 POINT (41.76110443455052 -87.63196391292958)             <NA>
##   location_city location_state location_zip
## 1          <NA>           <NA>         <NA>
## 2          <NA>           <NA>         <NA>
## 3          <NA>           <NA>         <NA>
## 4          <NA>           <NA>         <NA>
## 5          <NA>           <NA>         <NA>
## 6          <NA>           <NA>         <NA>
## 7          <NA>           <NA>         <NA>
## 8          <NA>           <NA>         <NA>
# equivalent to:
# filter(vehicle_data, year(creation_data = 2005))

Challenge

Filter the data so that you only have observations from September 2016 (how do you filter on multiple criteria at once? Look at the documentation!). Save it to a new dataframe called vehicle_sept16. Check that you’ve done it correctly with head() and dim().

Sometimes, I have more columns in my data than I need. I can choose a few columns, and assign it to a new dataset. First, I get the variable names:

names(vehicle_sept16)
##  [1] "creation_date"                                         
##  [2] "status"                                                
##  [3] "completion_date"                                       
##  [4] "service_request_number"                                
##  [5] "type_of_service_request"                               
##  [6] "license_plate"                                         
##  [7] "vehicle_make_model"                                    
##  [8] "vehicle_color"                                         
##  [9] "current_activity"                                      
## [10] "most_recent_action"                                    
## [11] "how_many_days_has_the_vehicle_been_reported_as_parked_"
## [12] "street_address"                                        
## [13] "zip_code"                                              
## [14] "x_coordinate"                                          
## [15] "y_coordinate"                                          
## [16] "ward"                                                  
## [17] "police_district"                                       
## [18] "community_area"                                        
## [19] "ssa"                                                   
## [20] "latitude"                                              
## [21] "longitude"                                             
## [22] "location"                                              
## [23] "location_address"                                      
## [24] "location_city"                                         
## [25] "location_state"                                        
## [26] "location_zip"

Then, I extend the code I wrote above:

vehicle_final <- vehicle_sept16 %>% 
  select(location_address, zip_code)

Or, I can even get rid of the intermediate vehicle_sept16 object!

vehicle_final <- vehicle_data %>% 
  filter(year(creation_date) == 2016,
         month(creation_date) == 9) %>% 
  select(location_address, zip_code)

The columns I selected aren’t going to be that useful in terms of performing spatial analysis. Why? Because they’re human understandings of where something is. In order for a computer to understand how to map something. I need something a bit more specific. If that’s all I had, I’d need to geocode my addresses, but thankfully I already have two columns in there with the information I need.

Challenge

What columns am I interested in? Replace the column names with the proper ones.

I can also rename my columns as I select them, for easier typing in the future.

vehicle_final <- vehicle_data %>% 
  filter(year(creation_date) == 2016,
         month(creation_date) == 9) %>% 
  select(comm = community_area,
         lon = longitude,
         lat = latitude)

Great, now that I’ve narrowed down my dataset, I can convert it into a spatial format accepted by R! (Note: it’s quite normal to need to clean and prepare your data before using it for spatial analysis. That’s a big part of the data analysis workflow.)

12.6 Spatial data time - my favorite part!

The workhouse of the modern R spatial toolkit is the sf package. I love it a lot.

To convert a table/CSV with latitude and longitude into an sf format, we use the st_as_sf() function, which has a few arguments.

vehicle_points <- st_as_sf(vehicle_final, 
                          coords = c("lon", "lat"), 
                          crs = 4326, 
                          agr = "constant")

Uh oh, what happened? I got the following error: Error in st_as_sf.data.frame(vehicle_final, coords = c("lon", "lat"), : missing values in coordinates not allowed.

Challenge

I can’t have missing values in my longitude or latitude values! Can you write a filter() statement with the is.na() function to remove the missing lon and lat points from vehicle_final? Save it to vehicle_coord, and check your work with dim()

Let’s try again…

vehicle_points <- st_as_sf(vehicle_coord, 
                          coords = c("lon", "lat"), 
                          crs = 4326, 
                          agr = "constant")

Challenge

Check the class() of your new vehicle_points object. Call the plot() function on it! Also call the st_crs() function on it.

12.7 On your own

Challenge

Can you work through the tutorial here to import spatial data that’s not in data frame format? Stop once you’ve plotted the areas.

12.8 Very important concepts: projection, and spatial join

Two of the most important spatial concepts are projections, and spatial joins (not to be confused with attribute joins). You can read Luc’s notes to understand what’s going on, but here are two graphics to explain what’s going on.

The two key functions you need to know are:

  • st_transform()
  • st_join()

Good luck with the rest of the tutorial!