Chapter 12 Spatial Data Handling
12.1 Learning Objectives
- Read data from an open data portal
- Manipulate non-spatial data in R
- Convert data from lat/lon into a simple features object
- Project spatial data
- Join data based on spatial attributes
- Create a basic choropleth map
12.2 Functions Learned
- loading packages:
library()
- reading from an API:
read.socrata()
- data exploration:
head()
,dim()
,class()
,names()
,is.na()
,plot()
- selecting data:
filter()
,select()
- creating a spatial object:
st_as_sf()
,st_crs()
,read_sf()
,st_geometry()
- projecting data:
st_transform()
Hint: For each new function we go over, type ?
in front of it in the console to pull up the help page.
12.3 Interactive Tutorial
The notebook with a more detailed version of this workshop can be found here.
We will not cover everything in Luc’s detailed tutorial, but we will go over basic commands and common gotchas, in order to help you feel comfortable working through the tutorial.
Please make sure you have the following packages installed:
tidyverse
RSocrata
sf
tmap
lubridate
You can install all of these at once with the following command:
install.packages(c("tidyverse", "RSocrata", "sf", "tmap", "lubridate"))
We start by loading the packages we need for this workshop. Here, I’ve loaded the “tidyverse” package with the library()
function.
library(tidyverse)
Challenge
RSocrata
, sf
, tmap
, and lubridate
packages with the same function?
12.4 Reading data from an open data portal
We then read data about 311 calls from a URL, otherwise known as an API. This is a straightforward way to quickly get data from an open data portal, without having to download and manage the data file locally. Here’s the data documentation site, from the City of Chicago. This will take a while to run, as we’re pulling over 250,000 observations
vehicle_data <- read.socrata("https://data.cityofchicago.org/resource/suj7-cg3j.csv")
Challenge
head()
, dim()
, and class()
functions on the new vehicle_data
R object. What does the data look like? How many observations and variables are there?
12.5 Selecting data to work with
In general, I may only be interested in a subset of the data. I’ll use the filter command to pull out only 311 calls from 2005. That year()
function is pulling out the year from the date column (spreadsheet software often has this functionality as well). The %>%
is known as the “pipe”, and means “take vehicle_data
, and pass it as the first argument to the next function. It comes in handy when I want to perform multiple operations in the”tidyverse".
vehicle_data %>%
filter(year(creation_date) == 2005)
## creation_date status completion_date service_request_number
## 1 2005-02-10 Completed 2017-01-11 05-00213608
## 2 2005-03-30 Completed - Dup 2017-01-11 05-00464206
## 3 2005-06-16 Completed - Dup 2017-01-11 05-00932280
## 4 2005-06-16 Completed 2017-01-11 05-00932409
## 5 2005-06-16 Completed 2017-01-11 05-00933568
## 6 2005-06-17 Completed - Dup 2017-01-11 05-00941492
## 7 2005-08-01 Completed - Dup 2017-01-11 05-01221973
## 8 2005-08-26 Completed - Dup 2017-01-11 05-01396640
## type_of_service_request license_plate vehicle_make_model
## 1 Abandoned Vehicle Complaint 32L5657 Chrysler
## 2 Abandoned Vehicle Complaint
## 3 Abandoned Vehicle Complaint T159050 Honda
## 4 Abandoned Vehicle Complaint
## 5 Abandoned Vehicle Complaint
## 6 Abandoned Vehicle Complaint Toyota
## 7 Abandoned Vehicle Complaint 5823771 Oldsmobile/Cutlass/Ciera
## 8 Abandoned Vehicle Complaint
## vehicle_color current_activity most_recent_action
## 1 Black FVI - Outcome Create Work Order
## 2
## 3 Black FVI - Outcome Create Work Order
## 4 FVI - Outcome Create Work Order
## 5 FVI - Outcome Create Work Order
## 6 Blue FVI - Outcome Create Work Order
## 7 Blue
## 8
## how_many_days_has_the_vehicle_been_reported_as_parked_
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## street_address zip_code x_coordinate y_coordinate ward
## 1 1030 S MENARD AVE 60644 NA NA 29
## 2 917 W EASTWOOD AVE 60640 1169205 1931086 46
## 3 8905 S BRANDON AVE 60617 1198886 1846522 10
## 4 9127 S ELLIS AVE 60619 NA NA 8
## 5 1855 S CLINTON ST 60616 NA NA 25
## 6 4829 W WOLFRAM ST 60641 1143697 1918515 31
## 7 2507 W CERMAK RD NA 1160049 1889322 28
## 8 7321 S PRINCETON AVE 60621 1175560 1856415 6
## police_district community_area ssa latitude longitude
## 1 15 25 NA 41.86824 -87.76946
## 2 23 3 34 41.96628 -87.65349
## 3 4 46 NA 41.73358 -87.54686
## 4 4 47 NA 41.72860 -87.59964
## 5 12 31 NA 41.85588 -87.64027
## 6 25 19 NA 41.93232 -87.74782
## 7 10 30 NA 41.85190 -87.68826
## 8 7 69 NA 41.76110 -87.63196
## location location_address
## 1 POINT (41.86823989759004 -87.76946043478485) <NA>
## 2 POINT (41.96628368822802 -87.65349140566876) <NA>
## 3 POINT (41.73357538693858 -87.5468638464287) <NA>
## 4 POINT (41.72859955416313 -87.59963819790387) <NA>
## 5 POINT (41.85588399821447 -87.64026997646823) <NA>
## 6 POINT (41.93231981463456 -87.74781719104732) <NA>
## 7 POINT (41.85189851760397 -87.6882562971527) <NA>
## 8 POINT (41.76110443455052 -87.63196391292958) <NA>
## location_city location_state location_zip
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## 7 <NA> <NA> <NA>
## 8 <NA> <NA> <NA>
# equivalent to:
# filter(vehicle_data, year(creation_data = 2005))
Challenge
vehicle_sept16
. Check that you’ve done it correctly with head()
and dim()
.
Sometimes, I have more columns in my data than I need. I can choose a few columns, and assign it to a new dataset. First, I get the variable names:
names(vehicle_sept16)
## [1] "creation_date"
## [2] "status"
## [3] "completion_date"
## [4] "service_request_number"
## [5] "type_of_service_request"
## [6] "license_plate"
## [7] "vehicle_make_model"
## [8] "vehicle_color"
## [9] "current_activity"
## [10] "most_recent_action"
## [11] "how_many_days_has_the_vehicle_been_reported_as_parked_"
## [12] "street_address"
## [13] "zip_code"
## [14] "x_coordinate"
## [15] "y_coordinate"
## [16] "ward"
## [17] "police_district"
## [18] "community_area"
## [19] "ssa"
## [20] "latitude"
## [21] "longitude"
## [22] "location"
## [23] "location_address"
## [24] "location_city"
## [25] "location_state"
## [26] "location_zip"
Then, I extend the code I wrote above:
vehicle_final <- vehicle_sept16 %>%
select(location_address, zip_code)
Or, I can even get rid of the intermediate vehicle_sept16
object!
vehicle_final <- vehicle_data %>%
filter(year(creation_date) == 2016,
month(creation_date) == 9) %>%
select(location_address, zip_code)
The columns I selected aren’t going to be that useful in terms of performing spatial analysis. Why? Because they’re human understandings of where something is. In order for a computer to understand how to map something. I need something a bit more specific. If that’s all I had, I’d need to geocode my addresses, but thankfully I already have two columns in there with the information I need.
Challenge
I can also rename my columns as I select them, for easier typing in the future.
vehicle_final <- vehicle_data %>%
filter(year(creation_date) == 2016,
month(creation_date) == 9) %>%
select(comm = community_area,
lon = longitude,
lat = latitude)
Great, now that I’ve narrowed down my dataset, I can convert it into a spatial format accepted by R! (Note: it’s quite normal to need to clean and prepare your data before using it for spatial analysis. That’s a big part of the data analysis workflow.)
12.6 Spatial data time - my favorite part!
The workhouse of the modern R spatial toolkit is the sf
package. I love it a lot.
To convert a table/CSV with latitude and longitude into an sf
format, we use the st_as_sf()
function, which has a few arguments.
vehicle_points <- st_as_sf(vehicle_final,
coords = c("lon", "lat"),
crs = 4326,
agr = "constant")
Uh oh, what happened? I got the following error:
Error in st_as_sf.data.frame(vehicle_final, coords = c("lon", "lat"), : missing values in coordinates not allowed
.
Challenge
filter()
statement with the is.na()
function to remove the missing lon
and lat
points from vehicle_final
? Save it to vehicle_coord
, and check your work with dim()
Let’s try again…
vehicle_points <- st_as_sf(vehicle_coord,
coords = c("lon", "lat"),
crs = 4326,
agr = "constant")
Challenge
class()
of your new vehicle_points
object. Call the plot()
function on it! Also call the st_crs()
function on it.
12.7 On your own
Challenge
12.8 Very important concepts: projection, and spatial join
Two of the most important spatial concepts are projections, and spatial joins (not to be confused with attribute joins). You can read Luc’s notes to understand what’s going on, but here are two graphics to explain what’s going on.
The two key functions you need to know are:
st_transform()
st_join()
Good luck with the rest of the tutorial!