Preliminaries

Make sure to move to the proper working directory and that the file NYC_Sub_borough_Area.dbf is in that directory.

Also, activate the tidyverse and foreign packages using the library command (the tidyverse package includes ggplot2):

library(tidyverse)
library(foreign)

We will continue with the data for the NYC sub-boroughs. If you did not save the data frame from the previous lab, then read it in from NYC_Sub_borough_Area.dbf and add the manbronx variable if you don’t have it. If you saved the data frame using saveRDS, then you can load it directly using readRDS.

Below follow the commands to start from scratch (check the previous lab for explanations):

nyc.data <- read.dbf("NYC_Sub_borough_Area.dbf")
nyc.data <- as_tibble(nyc.data)
nyc.data <- nyc.data %>% mutate(manbronx = if_else((code > 300 & code < 311) | (code > 100 & code < 111),"Select","Rest"))
nyc.data

Spiffing up the Graph

So far, we have used the defaults for the various descriptive aspects of the graph, such as axis labels, title, etc. All these can be specified (in great detail). Only the very basics will be covered here, but the options and combinations are virtually endless.

To illustrate these features, we will continue to use the scatter plot of pubast00 (vertical axis) against kids2000 (horizontal axis), with manbronx as a categorical variable to define subgroups.

Axis labels

The default setting for the axis labels is to use the variable name. Sometimes, this is not very informative. To set the labels explicitly, we use the xlab and ylab functions. These are added in the same way as actual layers, using the + notation and with the label text passed directly in parentheses (do not use an = sign).

For example, with our default scatter plot, we can set xlab("Percent HH with Children") and ylab("Percent Public Assistance") (note, the font and font size can be specified as well, but we don’t go that far).

As in the previous lab, we assign the main ggplot command to the object g to save us some typing.

g <- ggplot(data=nyc.data,aes(x=kids2000, y=pubast00))
g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance")

Title

A title is added to the graph by means of the ggtitle command. Again, enter the desired title in parentheses and enclosed by quotes. For example, we can add ggtitle("Example Scatter Plot") (we also keep the axis labels):

g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot")

The default is to have the title left-aligned. Often, one may want it centered above the graph. This can be customized by overriding the basic settings in the theme command. Here, we adjust plot.title (of course, you need to know what everything is called) by setting the horizontal justification (hjust) of its element_text to 0.5, as in theme(plot.title = element_text(hjust = 0.5)). This centers the title. The number of other refinements is near infinite and beyond our scope at this point.

g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5))
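
Since the same theme adjustment is repeated in several of the plots that follow, one option is to store it in an object and add it like any other layer. The sketch below is only an illustration: the data frame toy and the object name centered_title are made up here and are not part of the lab data.

```r
library(ggplot2)

# Hypothetical toy data standing in for nyc.data
toy <- data.frame(kids   = c(10, 20, 30, 40, 50),
                  pubast = c(5, 9, 14, 22, 25))

# Store the theme adjustment once (the name is an arbitrary choice) ...
centered_title <- theme(plot.title = element_text(hjust = 0.5))

# ... and add it to a plot with + like a regular layer
p <- ggplot(data = toy, aes(x = kids, y = pubast)) +
  geom_point() +
  ggtitle("Example Scatter Plot") +
  centered_title
p
```

Any plot that should have a centered title can then end with + centered_title instead of spelling out the full theme command.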

Legend

The default legend title is the variable name; in our previous examples, that was manbronx. To set the legend title explicitly, we need the labs command. For example, we set aes(color=manbronx) for the points and the linear smoother (method="lm"), and set the legend title with labs(color="Selection").

g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color="Selection")
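
The labs command is not limited to the legend: it also accepts x, y and title arguments, so all the labels can be set in a single call. The following is a minimal sketch with a made-up data frame (toy and its columns are hypothetical), not a replacement for the commands above.

```r
library(ggplot2)

# Hypothetical toy data with a two-category grouping variable
toy <- data.frame(kids   = c(10, 20, 30, 40, 50, 60),
                  pubast = c(5, 9, 14, 22, 25, 31),
                  group  = rep(c("Select", "Rest"), times = 3))

# One labs() call sets the axis labels, the title and the legend title
p <- ggplot(data = toy, aes(x = kids, y = pubast)) +
  geom_point(aes(color = group)) +
  labs(x = "Percent HH with Children",
       y = "Percent Public Assistance",
       title = "Example Scatter Plot",
       color = "Selection")
p
```

Whether to use xlab, ylab and ggtitle separately or a single labs call is a matter of taste; the resulting graph is the same.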

Theme

Every graph has a theme, which sets the main parameters for its appearance. The default theme with the grey grids, separated by white lines is theme_grey( ). If we want to change this, we can specify one of the other themes. For example, a classic graph a la base R plot, without background shading or grid lines is theme_classic( ). In order to obtain this specialized look, we set the associated theme command.

g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_classic()

There are several built-in themes (for example, theme_bw( ), theme_light( ), theme_dark( ) and theme_void( )) as well as a number of contributed ones. Another built-in example is theme_minimal( ), shown next.

g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_minimal()

In addition, the package ggthemes contains several additional themes that look extremely professional.

You will have to install the package first, if you don’t have it. Once installed, invoke it with the library command.

library(ggthemes)

Then, for example, using theme_tufte( ):

g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_tufte()

Each of the themes has specific options for further customization, so that the possibilities are nearly limitless.

Conditional Plots

Conditional plots are a major feature of the functionality of ggplot, where they are referred to as facetting, or small multiples. This is implemented in the facet_wrap and facet_grid functions. The difference between the two is that facet_wrap is based on a single conditioning variable (so, essentially one-dimensional), whereas facet_grid typically has two conditioning variables (as a grid of sub-plots).

In ggplot, the conditioning is based on a categorical variable that needs to be available in the data set. The facetting formula does not evaluate functions, so the conditioning categories need to be computed beforehand. In our case, we already have the manbronx variable.

Facet Wrap

The facet_wrap function creates multiple graphs for subsets of the data as determined by a conditioning variable. There are several options, such as explicitly setting the number of rows and columns, but for now we stick with the bare defaults to illustrate the principle.

For example, to condition on manbronx, we use facet_wrap(~ manbronx).

g +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~manbronx)

Rearranging the facet graphs

The facet_wrap function has several options to explicitly set the number of rows or columns in which the subgraphs should be displayed. For example, using nrow = 2 arranges the graphs vertically.

g +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~manbronx,nrow=2)

Facet Grid

The main difference between the two facetting approaches is that facet_grid is explicitly two-dimensional. This requires two variables to set the conditions, one for the vertical dimension (y) and one for the horizontal dimension (x).

As mentioned earlier, in ggplot, the conditioning is based on a categorical variable that needs to be available in the data set. In our example, we only have one, so we will need to create a second one.

There are three so-called helper functions to make this easy: cut_interval, cut_width, and cut_number. For example, with cut_number we pass the variable, e.g., hhsiz00, and the number of categories, say n = 2. This creates the new variable as an R factor, giving the intervals that resulted from the cut. Note that the variable needs to be accessed in the usual way, using [[ ]] or $ and assigned back to the data frame (this is not a mutate command, although you could create a factor that way as well).

For example, we create a new variable cut.hhsiz using a quantile classification with two categories (the variable will be split on the median value), by setting n=2. We need to use the $ notation to ensure that the new variable is added to the relevant data set. Since we only have 55 observations, we can easily list the full set of values to verify. Internally, they are stored as factors (hence, the summary of the Levels at the end of the listing).

nyc.data$cut.hhsiz <- cut_number(nyc.data$hhsiz00,n=2)
nyc.data$cut.hhsiz
 [1] (2.72,3.2]  [1.57,2.72] (2.72,3.2]  [1.57,2.72] [1.57,2.72] (2.72,3.2]  (2.72,3.2]  [1.57,2.72] [1.57,2.72]
[10] [1.57,2.72] [1.57,2.72] (2.72,3.2]  (2.72,3.2]  [1.57,2.72] (2.72,3.2]  (2.72,3.2]  [1.57,2.72] [1.57,2.72]
[19] [1.57,2.72] [1.57,2.72] [1.57,2.72] [1.57,2.72] [1.57,2.72] [1.57,2.72] [1.57,2.72] [1.57,2.72] (2.72,3.2] 
[28] (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  [1.57,2.72] (2.72,3.2]  [1.57,2.72] [1.57,2.72]
[37] (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  [1.57,2.72] [1.57,2.72] [1.57,2.72] (2.72,3.2]  (2.72,3.2] 
[46] (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  (2.72,3.2]  [1.57,2.72] (2.72,3.2]  (2.72,3.2]  [1.57,2.72] [1.57,2.72]
[55] [1.57,2.72]
Levels: [1.57,2.72] (2.72,3.2]
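
To make the difference between the three helper functions concrete, the sketch below applies each of them to a small made-up vector (x and the toy tibble are hypothetical, not part of the NYC data). It also shows the mutate alternative mentioned above.

```r
library(ggplot2)  # provides cut_number(), cut_interval() and cut_width()
library(dplyr)    # for the pipe, tibble() and mutate()

# Hypothetical skewed values
x <- c(1, 2, 3, 4, 10, 20)

by_number   <- cut_number(x, n = 2)      # groups with (about) equal counts
by_interval <- cut_interval(x, n = 2)    # groups with equal range
by_width    <- cut_width(x, width = 10)  # groups of fixed width 10

table(by_number)    # three observations in each group
table(by_interval)  # five observations in the lower half of the range, one in the upper
table(by_width)

# The same factor created inside a pipe with mutate()
toy <- tibble(hhsiz = x) %>%
  mutate(cut.hhsiz = cut_number(hhsiz, n = 2))
levels(toy$cut.hhsiz)
```

With n = 2, cut_number splits at the median, which is why it is the natural choice for the quantile classification used in this lab.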

We are now ready to create a 2 by 2 grid of sub-plots using the cut.hhsiz variable to condition on the vertical axis (resulting in different rows of graphs) and manbronx to condition on the horizontal axis (resulting in different columns of graphs). We set facet_grid(cut.hhsiz ~ manbronx) after the commands for the scatter plot and linear smooth.

Note that you need to spell out the full ggplot command, since the object g does not contain the new variable we created. Alternatively, you can re-define g.

You can change the labels, add a title, etc., but we will skip that from now on to just concentrate on the graphs.

ggplot(data=nyc.data,aes(x=kids2000, y=pubast00)) +
  geom_point() +
  geom_smooth(method="lm",se=FALSE) +
  facet_grid(cut.hhsiz ~ manbronx)

Histogram

Default Histogram

We start with the simple histogram command for the kids2009 variable.

The geom for a histogram is geom_histogram. In contrast to most plots in ggplot, only one variable needs to be passed. The general setup for ggplot is to think of the graph as a two-dimensional representation, with the x variable for the x axis and the y variable for the y-axis. In a histogram, the vertical axis is by default taken to be the count of the observations in each bin.

The three pieces we need to create the plot are the data set (data), nyc.data, the aesthetic (aes), kids2009 (only x by default, with y as the count), and the geom, geom_histogram. The command is as follows, with all the other settings left to their default:

ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram()

Adjusting the number of bins

The graph gives a warning that the default number of 30 bins is inappropriate. The standard way in ggplot is to adjust the number of bins indirectly, by means of the binwidth option, i.e., the range of values that make up a bin, in the units of the variable under consideration. Instead, I prefer to use the option bins, which sets the number of bins directly (the bin width is then obtained by dividing the range by the number of bins).

For example, to set the number of bins to 7, we set bins=7 as an option to geom_histogram.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(bins=7)

Frequency on vertical axis

As mentioned, the default is to give the number of observations in each bin (the count) as the value on the vertical axis. In order to obtain the density on the vertical axis (the relative frequency, scaled so that the areas of the bars sum to one), the y variable needs to be set to ..density.., as in aes(x = kids2009, y = ..density..). In all other respects, the histogram is the same. Here, we illustrate the use of binwidth by passing binwidth=5 as an option to geom_histogram.

ggplot(data=nyc.data,aes(kids2009,y=..density..)) +
  geom_histogram(binwidth=5)
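
In more recent versions of ggplot2 (3.3.0 and later), the same computed variable can be written as after_stat(density), and the ..density.. notation is flagged as deprecated from version 3.4.0 on. The sketch below uses a made-up data frame (toy is hypothetical) and checks that the bar areas indeed add up to one.

```r
library(ggplot2)

# Hypothetical toy data standing in for kids2009
toy <- data.frame(kids = c(12, 18, 21, 24, 26, 29, 31, 33, 38, 44))

# after_stat(density) refers to the density computed by the binning statistic
p <- ggplot(data = toy, aes(x = kids, y = after_stat(density))) +
  geom_histogram(binwidth = 5)
p

# Extract the computed bins and verify that height times width sums to one
d <- layer_data(p)
total_area <- sum(d$density * (d$xmax - d$xmin))
total_area
```

If your installed version of ggplot2 predates 3.3.0, the ..density.. form shown above is the one to use.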

By group

As before, we can now create a distinction between the distribution for the two subsets in the data determined by the manbronx variable. First, we set the aesthetic color equal to this variable as aes(color=manbronx) in the geom_histogram command.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(color=manbronx),bins=7)

The result is not exactly what we had in mind. The reason is that color determines the outline of the graph, but not the internal color, or fill. Instead, with geom_histogram(aes(fill=manbronx)), we get the bars stacked on top of each other with a different color for each subset.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(fill=manbronx),bins=7)

Still not that great at showing the distinction between the two subsets. Now, we resort to facet_wrap to create a separate histogram for each subset. We highlight the distinction even more by setting the fill color to manbronx, as we just did. The facet_wrap( ~ manbronx) will yield two histograms, in a different color, and with a legend.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(fill=manbronx),bins=7) +
  facet_wrap(~ manbronx)
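
The stacking seen earlier is the default position adjustment for histograms. An alternative to facetting, sketched below under the assumption of a made-up data frame (toy and its columns are hypothetical), is to set position="identity" so that both histograms start from zero, with some transparency so that the overlap remains visible.

```r
library(ggplot2)

# Hypothetical toy data with two subsets
toy <- data.frame(kids  = c(8, 12, 15, 18, 22, 25, 28, 33, 37, 42),
                  group = rep(c("Select", "Rest"), each = 5))

# position = "identity" overlays the two histograms instead of stacking them
p <- ggplot(data = toy, aes(kids)) +
  geom_histogram(aes(fill = group), bins = 7,
                 position = "identity", alpha = 0.5)
p

# With the identity position every bar starts at zero
d <- layer_data(p)
all(d$ymin == 0)
```

Overlaid histograms work best with only a few groups; with more categories the facetted version tends to be easier to read.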

You can now spiff up these graphs with axis labels, a title, a label for the legend, and even a custom theme.

Frequency Polygon

Another common plot to depict a univariate distribution is the frequency polygon, obtained with geom_freqpoly. It is essentially the same as a histogram, but uses a set of connected points instead of the bars of the histogram. Again, we have to specify a binwidth. A bare bones example is given below, using the same commands as for the histogram, but with geom_freqpoly.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_freqpoly(binwidth=5)

With the frequency polygon it is very easy to show the distribution for different subgroups in the data, for example, by setting aes(color=manbronx) (instead of using facet_wrap). We use bins=7 instead of binwidth to highlight the contrast with a histogram with the same options.

ggplot(data=nyc.data,aes(kids2009)) +
  geom_freqpoly(aes(color=manbronx),bins=7)

Box Plot

The box plot, also referred to as Tukey’s box and whisker plot, is an alternative way to visualize the distribution of a single variable, with a focus on descriptive statistics such as quartiles and the median. The corresponding geom is geom_boxplot. We continue our example using the kids2009 variable. We first consider the default option, then move on to illustrate a few optional settings.

Default Settings

The minimal arguments to create a boxplot are the data set and the x and y variables passed to aes. As mentioned above, the logic behind the graphs in ggplot is two-dimensional, so both x and y need to be specified. The x variable is used to create separate box plots for different subsets of the data. In our simple example, we don’t need this feature, so we set the x variable to empty, i.e., "". The y variable is the actual variable of interest, kids2009. The resulting graph is shown below.

ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_boxplot()

Note the outlier shown as a dot (point) at the bottom of the graph. The box shows the first quartile, the median, and the third quartile. The lines connect the upper and lower fences, and any outliers are shown as distinct points.
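
The quantities behind a box plot can be computed by hand with base R. The sketch below uses a small made-up vector (x is hypothetical) and the usual 1.5 times the interquartile range rule; in geom_boxplot the whiskers extend to the most extreme observations that still fall within those cut-off values.

```r
# Hypothetical values with one low outlier
x <- c(2, 20, 22, 24, 25, 27, 30, 31, 33)

# First quartile, median and third quartile
q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Cut-off values: 1.5 times the interquartile range beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Observations outside the cut-offs are drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers end at the most extreme observations inside the cut-offs
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

q         # 22 25 30
outliers  # 2
whiskers  # 20 33
```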

By group

The design of the box plot in ggplot is really geared to show the distribution for different subgroups. Each subgroup gets a box plot, arranged side by side for easy comparison.

For example, we now set aes(x=manbronx,y=kids2009) to obtain two box plots.

ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_boxplot()

Note how the two categories are listed as labels on the x-axis. Compare this to using facet_wrap( ~ manbronx) with x="" in the main ggplot command.

ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_boxplot() +
  facet_wrap( ~ manbronx)

It is the same graph, but now the x-axis label is x (as in the generic box plot), and the categories are listed at the top of each graph.

Bells and whistles

As is, the default box plot is pretty rudimentary. We will illustrate the power of ggplot by adding a number of features to the plot in order to provide further information. First, add a title using ggtitle. As we did for the scatter plot, we will center the title over the graph. We keep the x-variable set to manbronx to obtain two box plots and set the label for the y-axis to “Percent HH with Children”.

ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_boxplot() +
  ylab("Percent HH with Children") +
  ggtitle("Example Box Plot") +
  theme(plot.title = element_text(hjust=0.5))

Next, we want to give the box plot a color, following the variable manbronx. This is accomplished with the color (for the outlines) and fill (for the inside of the box) options to aes for geom_boxplot. We add the legend label as Selection as we did before. However, because we now have two features by subset, we need to specify the legend for both, as labs(color="Selection",fill="Selection").

Also, the box plots do not show the fences as separate lines, the way they do in many other software packages, e.g., as in GeoDa. This can be remedied, but not quite in the same way as in GeoDa. In ggplot, the fences are drawn at the location of the extreme values, the ymin and ymax, and not at the location of the fence cut-off values, as in GeoDa. The fences are obtained from the stat_boxplot function, by passing the geom as errorbar.

One final refinement. In GeoDa, the box plot also shows the locations of the actual observations as points on the central axis. We obtain the same effect by adding geom_point with color blue. We draw the points first, and the box plot on top of it, using the layers logic. However, we want to make sure that the central box doesn’t mask the points, which it does when the transparency is kept as the default. To accomplish this, we set the alpha level for both points and box plot at 0.5.

The result is now a quite fancy graph that provides a lot of information on the relative distribution of the percentage households with children in the two subsets of the data. This level of refinement comes at the expense of having to specify eight lines of code.

ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_point(color="blue",alpha=0.5) +
  geom_boxplot(aes(color=manbronx,fill=manbronx),alpha=0.5) +
  stat_boxplot(geom="errorbar") +
  ylab("Percent HH with Children") +
  labs(color="Selection",fill="Selection") +
  ggtitle("Example Box Plot") +
  theme(plot.title = element_text(hjust=0.5))

Violin Plot

A violin plot is a hybrid of a box plot and a density plot, showing regions of the data where more observations are found. The basic setup is the same as for a box plot (i.e., setting x = "" for a single plot), but now using the geom geom_violin.

ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_violin()

Now, again, with some bells and whistles. We show the violin plot separately for each subset by specifying aes(x=manbronx,y=kids2009) in the main ggplot command. Then we plot the individual points using geom_point with color="blue" and alpha=0.5. We set the color and fill for the geom_violin to manbronx, just as we did for the box plot. Finally, we finish with setting the y-axis label, the legend label and the title.

ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_point(color="blue",alpha=0.5) +
  geom_violin(aes(color=manbronx,fill=manbronx),alpha=0.5) +
  ylab("Percent HH with Children") +
  labs(color="Selection",fill="Selection") +
  ggtitle("Example Violin Plot") +
  theme(plot.title = element_text(hjust=0.5))

Tidy Data and Graphs

Often, the data are not in a format that allows us to create the graphs that we want. For example, what if we wanted to show, for each neighborhood, how the stability of the neighborhood (the years households resided in the neighborhood) changed over the years? We have this information in three variables, i.e., yrhom02, yrhom05, yrhom08. But how could we show these for each neighborhood, when the aesthetics want us to define one x and one y variable? The reason this doesn’t work is that the data set is not in a tidy format, where each row pertains to a single data point. In our case, that means we need both the neighborhood and the year for the variable yrhom, something that is called a panel data format (both space and time).

I will briefly illustrate some of the features of the tidyverse, specifically the operation to gather observations so that different columns are turned into rows.

To keep things simple, we first create a new data frame stable, which is a subset of the variables: code, name, yrhom02, yrhom05, yrhom08, and manbronx. We use the pipe and the select command. To check the result, we print the contents of the new data frame.

stable <- nyc.data %>% select(code,name,yrhom02,yrhom05,yrhom08,manbronx)
stable

The new data frame only contains information on yrhom, the median number of years households lived in the neighborhood. So, really, the different yrhomxx variables pertain to different years of the same thing. To simplify matters, and to illustrate how to deal with non-standard variable names in the tidyverse, we will use rename to turn the three variable names into the respective years. Because the years are non-standard variable names, we need to enclose them in backticks, as `2002`, etc. We use the pipe to rename the three variables, and print the data frame to check the result.

stable <- stable %>% rename(`2002`=yrhom02,`2005`=yrhom05,`2008`=yrhom08)
stable

Now, we illustrate the very powerful (but not always easy to understand) gather function. What we want to do is to create a new variable to designate the observations for different years. Those years are 2002, 2005, and 2008. Right now, these are separate columns with the information on neighborhood stability for each neighborhood in the respective year. What we want is a data frame where each observation has both a neighborhood and a year, and a value for the stability (i.e., the median number of years households lived there). To do this by hand, we would have to copy the neighborhood code, name and manbronx classification, and then manually enter the year and the corresponding value for yrhom. For each neighborhood, we would have three rows, one for each year. In addition to the neighborhood code, name and manbronx classification, we would have two extra columns: one with the year, and one with the value for the stability (years lived in neighborhood).

In a tidy data frame, the name for the column with the years (i.e., the observations we are gathering) is called a key. We have to give this a name that makes sense, e.g., key=year. The values for the observations on year will be the column headings of the variables we want to gather, i.e., the three years 2002, 2005 and 2008. The values that we are interested in are the number of years resided in the neighborhood, but these don’t currently have a variable name. They are contained under the 2002, 2005 and 2008 variable headings. The name for this new variable is called a value. Again, we have to specify a name for it, e.g., value=stable. So now, we have two pieces: we have the name for the new variable that will label the extra observations by year (the key), and we have a name for the variable that will contain the relevant values (the value). Now, we still need to specify where those values will come from. These are the three columns 2002, 2005, and 2008.

The command is gather(key=year,value=stable,`2002`,`2005`,`2008`), with the column names in backticks so that they are not mistaken for column positions. What this does is create a new column for year and a new column for stable, assign the values 2002, 2005 and 2008 to year, and assign the matching values to the corresponding row of the column stable.

Let’s again use the pipe and assign the result to a new data frame stneigh.

stneigh <- stable %>% gather(key=year,value=stable,`2002`,`2005`,`2008`)
stneigh
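
In tidyr 1.0.0 and later, gather is described as superseded by pivot_longer, which expresses the same operation with the arguments cols, names_to and values_to. The sketch below compares the two on a tiny made-up data frame (wide and its values are hypothetical); it assumes a tidyr version that includes pivot_longer.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per neighborhood, one column per year
wide <- tibble(name   = c("A", "B"),
               `2002` = c(10, 12),
               `2005` = c(11, 13),
               `2008` = c(9, 14))

# The gather() call used in this lab
long_gather <- wide %>%
  gather(key = year, value = stable, `2002`, `2005`, `2008`)

# The equivalent pivot_longer() call
long_pivot <- wide %>%
  pivot_longer(cols = c(`2002`, `2005`, `2008`),
               names_to = "year", values_to = "stable")

# Same content; only the order of the rows differs, so sort before comparing
same <- all.equal(as.data.frame(arrange(long_gather, name, year)),
                  as.data.frame(arrange(long_pivot, name, year)))
same
```

Both results have one row for each combination of neighborhood and year, here two neighborhoods times three years.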

We can now create a plot of the value of stable for each of the three years in each of the neighborhoods. One way (but not a very pretty one) is to do this as a scatter plot using geom_point with x=stable and y=name. For example, using theme_classic() and giving the points a different color for each year.

ggplot(data=stneigh,aes(x=stable,y=name)) +
  geom_point(aes(color=year)) +
  theme_classic()

This can be much improved upon, but I leave that as a so-called exercise.

Alternatively, it is now very easy to create a box plot for each year next to each other, by setting x=year and y=stable for the box plot command.

ggplot(data=stneigh,aes(x=year,y=stable)) +
  geom_boxplot()

This is just a simple example of how gather can be used to turn a so-called wide data set into a narrow form. Similarly, the command spread can be used to do the reverse. That is beyond our current scope, but check the documentation for details.
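
As a brief illustration of the reverse operation, the sketch below starts from a small made-up long data frame (long and its values are hypothetical) and spreads it back into wide form. The pivot_wider call is the counterpart in tidyr 1.0.0 and later; its availability depends on the installed version.

```r
library(tidyr)
library(dplyr)

# Hypothetical long data: one row per neighborhood and year
long <- tibble(name   = rep(c("A", "B"), each = 3),
               year   = rep(c("2002", "2005", "2008"), times = 2),
               stable = c(10, 11, 9, 12, 13, 14))

# spread() turns the values of the key column into separate columns
wide_spread <- long %>% spread(key = year, value = stable)

# The counterpart in more recent versions of tidyr
wide_pivot <- long %>% pivot_wider(names_from = year, values_from = stable)

wide_spread
```

The result has one row per neighborhood again, with the three years as column headings.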

---
title: "Statistical Graphs (2)"
author: "Luc Anselin"
date: "11/04/2018"
output: html_notebook
---

## Preliminaries

Make sure to move to the proper working directory and that the file 
**NYC_Sub_borough_Area.dbf** is in that directory.

Also, activate the **tidyverse** and **foreign** packages using the `library` command
(the **tidyverse** package includes **ggplot2**):

```{r}
library(tidyverse)
library(foreign)
```

We will continue with the data for the NYC sub-boroughs. If you did not save the data frame from the previous
lab, then read it in from **NYC_Sub_borough_Area.dbf**  and  add the **manbronx** variable if you don't have it.
If you saved the data frame using `saveRDS`, then you can load it directly using `readRDS`.

Below follow the commands to start from scratch (check the previous lab for explanations):

```{r}
nyc.data <- read.dbf("NYC_Sub_borough_Area.dbf")
nyc.data <- as_tibble(nyc.data)
nyc.data <- nyc.data %>% mutate(manbronx = if_else((code > 300 & code < 311) | (code > 100 & code < 111),"Select","Rest"))
nyc.data
```

## Spiffing up the Graph

So far, we have used the defaults for the various descriptive aspects of the graph,
such as axis labels, title, etc. All these can be specified (in great detail). Only
the very basics will be covered here, but the options and combinations are virtually
endless.

To illustrate these features, we will continue to use the scatter plot of **pubast00** (vertical axis) against
**kids2000** (horizontal axis), with **manbronx** as a categorical variable to define subgroups.

### Axis labels
The default setting for the axis labels is to use the variable name. Sometimes, this is
not very informative. To set the labels explicitly, we use the `xlab` and `ylab` functions.
These are added in the same way as actual layers, using the + notation and with
the label text passed directly in parentheses (do not use an = sign).

For example, with our default scatter plot, we can set `xlab("Percent HH with Children")`
and `ylab("Percent Public Assistance")` (note, the font and font size can be specified
as well, but we don't go that far).

As in the previous lab, we assign the main `ggplot` command to the object **g** to save
us some typing.

```{r}
g <- ggplot(data=nyc.data,aes(x=kids2000, y=pubast00))
```

```{r}
g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance")
```

### Title

A title is added to the graph by means of the `ggtitle` command. Again, enter the desired
title in parentheses and enclosed by quotes. For example, we can add `ggtitle("Example Scatter Plot")`
(we also keep the axis labels):

```{r}
g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot")
```

The default is to have the title left-aligned. Often, one may want it centered above the graph.
This can be customized by overriding the basic settings in the `theme` command.
Here, we adjust `plot.title` (of course, you need to know
what everything is called) by setting the horizontal justification (`hjust`) of its
`element_text` to 0.5, as in `theme(plot.title = element_text(hjust = 0.5))`.
This centers the title. The number
of other refinements is near infinite and beyond our scope at this point.

```{r}
g + geom_point() +
  geom_smooth() +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5))
```


### Legend

The default legend title is the variable name; in our previous examples, that was **manbronx**. To
set the legend title explicitly, we need the `labs` command. For example, we set `aes(color=manbronx)`
for the points and the linear smoother (`method="lm"`), and set the legend title with `labs(color="Selection")`.

```{r}
g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color="Selection")
```

### Theme

Every graph has a theme, which sets the main parameters for its appearance.
The default theme with the grey grids, separated by white lines is
`theme_grey( )`. If we want to change this, we can specify one of the other
themes. For example, a classic graph a la base R plot, without background
shading or grid lines is `theme_classic( )`. In order to obtain this specialized *look*, we set the associated `theme` command.

```{r}
g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_classic()
```

There are several built-in themes (for example, `theme_bw( )`, `theme_light( )`, `theme_dark( )`
and `theme_void( )`) as well as a number of contributed
ones. Another built-in example is `theme_minimal( )`, shown next.

```{r}
g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_minimal()
```

In addition, the package **ggthemes** contains several additional themes that look
extremely professional. 

You will have to install the package first, if you don't have it. Once installed, invoke
it with the `library` command.

```{r}
library(ggthemes)
```

Then, for example, using `theme_tufte( )`:

```{r}
g + geom_point(aes(color=manbronx)) +
  geom_smooth(aes(color=manbronx),method="lm",se=FALSE) +
  xlab("Percent HH with Children") +
  ylab("Percent Public Assistance") +
  ggtitle("Example Scatter Plot") +
  labs(color="Selection") +
  theme_tufte()
```

Each of the themes has specific options for further customization, so that the possibilities
are nearly limitless.

## Conditional Plots

Conditional plots are a major feature of the functionality of **ggplot**, where they are referred to as *facetting*, or *small multiples*. This is implemented in the `facet_wrap` and `facet_grid` functions. The difference between the two is that `facet_wrap` is based on a single conditioning variable (so, essentially one-dimensional), whereas
`facet_grid` typically has two conditioning variables (as a grid of sub-plots).

In **ggplot**, the conditioning is based on a categorical variable that needs to be available in the data set. The facetting formula does not evaluate functions, so the conditioning categories need to be computed beforehand.
In our case, we already have the **manbronx** variable.

### Facet Wrap

The `facet_wrap` function creates multiple graphs for subsets of the data as determined by
a conditioning variable. There are several options, such as explicitly setting the number of
rows and columns, but for now we stick with the bare defaults to illustrate the principle.

For example, to condition on **manbronx**, we use `facet_wrap(~ manbronx)`.

```{r}
g +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~manbronx)
```

#### Sidebar -- R formulas

An important feature of R is the so-called *formula*, which is a way to express a relationship
among variables. For example, for a linear relationship such as y = a + bx (the linear smoother),
the formula would be `y ~ x`. The `~` takes the place of the equal sign, the constant is assumed
by default, and the parameters are
not listed for a linear function (in other situations they may be). The way the conditioning variables are 
specified for the facet graphs is similar: the vertical conditioning variable comes first, the horizontal
conditioning variable next. If there is only one, as in `facet_wrap`, 
it goes on the right hand side of the `~` sign.

So, in our example, where we condition on **manbronx**, the *formula* is ` ~ manbronx ` (nothing on the 
left side of `~`).
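
To make this concrete, the short sketch below creates a few formula objects in base R and inspects them. The object names (`f.lm`, `f.wrap`, `f.grid`) are made up for illustration only. A formula is stored unevaluated, so the variables do not need to exist at this point.

```{r}
# A formula is an ordinary R object that stores a relationship unevaluated
f.lm   <- y ~ x                  # two-sided: response on the left, predictor on the right
f.wrap <- ~ manbronx             # one-sided: nothing on the left of ~
f.grid <- cut.hhsiz ~ manbronx   # rows ~ columns, the order used by facet_grid

class(f.lm)        # "formula"
length(f.wrap)     # 2 (the ~ and one side)
length(f.grid)     # 3 (the ~ and two sides)
all.vars(f.grid)   # "cut.hhsiz" "manbronx"
```

The one-sided form is what we passed to `facet_wrap`, the two-sided form is what `facet_grid` expects.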


#### Rearranging the facet graphs

The `facet_wrap` function has several options to explicitly set the number of rows or columns
in which the subgraphs should be displayed. For example, using `nrow = 2` arranges the graphs
vertically.

```{r}
g +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(~manbronx,nrow=2)
```

### Facet Grid
The main difference between the two facetting approaches is that `facet_grid` is explicitly two-dimensional. This
requires two variables to set the conditions, one for the vertical dimension (y) and one for the horizontal
dimension (x).

As mentioned earlier, in **ggplot**, the conditioning is based on a categorical variable that needs to be available in the data set. In our example, we only have one, so we will need to create a second one.

There are three so-called helper functions to make this easy: `cut_interval`,
`cut_width`, and `cut_number`. For example, with `cut_number` we pass the variable, e.g., **hhsiz00**, and the number of categories, say `n = 2`. This creates the new variable as an R `factor`, giving the intervals that resulted from the cut. Note that the variable needs to be accessed in the usual way, using `[[ ]]` or `$`, and assigned back
to the data frame (this is not a `mutate` command, although you could create a factor that way as well).

For example, we create a new variable **cut.hhsiz** using a quantile classification with two categories (the variable will be split on the median
value), by setting `n=2`. We need to use the `$` notation to ensure that the new variable is added to the relevant data set. Since we only have 55 observations, we can easily list the
full set of values to verify. Internally, they are stored as *factors* (hence,
the summary of the `Levels` at the end of the listing).

```{r}
nyc.data$cut.hhsiz <- cut_number(nyc.data$hhsiz00,n=2)
nyc.data$cut.hhsiz
```
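
To see how the three helper functions differ, the sketch below applies each of them to a small made-up vector (`toy.x` is hypothetical, not part of the NYC data). One large value is included on purpose, so that equal counts and equal ranges give visibly different results.

```{r}
library(ggplot2)  # provides cut_number, cut_interval and cut_width

# Hypothetical toy values, with one large value to make the contrast visible
toy.x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# cut_number: n groups with (approximately) equal numbers of observations
table(cut_number(toy.x, n = 2))

# cut_interval: n groups of equal range
table(cut_interval(toy.x, n = 2))

# cut_width: groups of a given width, as many as are needed to cover the data
table(cut_width(toy.x, width = 50))
```

With `cut_number`, each group holds five observations, whereas with `cut_interval` nine of the ten values end up in the first group.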

We are now ready to create a 2 by 2 grid of sub-plots using the **cut.hhsiz** variable to condition
on the vertical axis (resulting in different rows of graphs) and **manbronx** to condition on the
horizontal axis (resulting in different columns of graphs). We set `facet_grid(cut.hhsiz ~ manbronx)` after
the commands for the scatter plot and linear smooth.

Note that you need to spell out the full `ggplot` command, since the object **g** does not contain
the new variable we created. Alternatively, you can re-define **g**.

You can change the labels, add a title, etc., but we will skip that from now on to just
concentrate on the graphs.

```{r}
ggplot(data=nyc.data,aes(x=kids2000, y=pubast00)) +
  geom_point() +
  geom_smooth(method="lm",se=FALSE) +
  facet_grid(cut.hhsiz ~ manbronx)
```
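
With only 55 observations spread over four panels, it can be useful to check how many observations end up in each cell of the grid before reading too much into a fitted line. A minimal sketch, using a made-up data frame (`toy.grid` and its columns are hypothetical stand-ins for **nyc.data**, **manbronx** and **hhsiz00**):

```{r}
library(ggplot2)  # for cut_number

# Hypothetical toy data: a group label and a household size variable
toy.grid <- data.frame(
  group = rep(c("Rest", "Select"), times = c(6, 4)),
  size  = c(2.1, 2.4, 2.6, 2.9, 3.1, 3.3, 1.8, 2.0, 2.7, 3.0)
)

# Split the size variable at the median, as we did for cut.hhsiz
toy.grid$cut.size <- cut_number(toy.grid$size, n = 2)

# Cross-tabulation: rows and columns match the panels of facet_grid(cut.size ~ group)
panel.counts <- table(toy.grid$cut.size, toy.grid$group)
panel.counts
```

The same `table` call with `nyc.data$cut.hhsiz` and `nyc.data$manbronx` gives the panel counts for our actual example.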



## Histogram

### Default Histogram

We start with the simple histogram command for the **kids2009** variable.

The `geom` for a histogram is `geom_histogram`. In contrast to most plots in
**ggplot**, only one variable needs to be passed. The general setup for **ggplot** is to think of the graph as a two-dimensional representation, with the x variable for the x axis and the y variable for the y-axis. In a histogram, the vertical axis is by default taken to be the **count** of the observations in each bin.

The three pieces we need to create the plot are the data set (`data`), **nyc.data**, the
aesthetic (`aes`), **kids2009** (only x by default, with y as the count), and the geom, `geom_histogram`. The command is as follows,
with all the other settings left to their default:

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram()
```

### Adjusting the number of bins

The graph generates a message that the default of 30 bins was used and suggests
picking a better value with `binwidth`.
The standard way in **ggplot** is to adjust the number of bins indirectly, by
means of the `binwidth` option, i.e., the range of values that make up a bin,
in the units of the variable under consideration. Instead, I prefer to use the option `bins`, 
which sets the number of bins directly (the bin width is then derived from the range
of the data and the number of bins).

For example, to set the number of bins to 7, we set `bins=7` as an option to `geom_histogram`.

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(bins=7)
```
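
To compare the two ways of specifying the bins, we can look at the data frame that **ggplot** computes behind the scenes, using `layer_data`. The sketch below uses a small made-up variable (`toy` is hypothetical and stands in for **kids2009**).

```{r}
library(ggplot2)

# Hypothetical toy variable, standing in for kids2009
toy <- data.frame(x = c(12, 18, 21, 24, 27, 29, 31, 33, 36, 38, 41, 47))

p.bins  <- ggplot(toy, aes(x)) + geom_histogram(bins = 7)
p.width <- ggplot(toy, aes(x)) + geom_histogram(binwidth = 5)

# layer_data() returns the data frame computed for a layer of the plot
d.bins  <- layer_data(p.bins)
d.width <- layer_data(p.width)

# Bin limits and counts for each specification
d.bins[, c("xmin", "xmax", "count")]
d.width[, c("xmin", "xmax", "count")]
```

With `binwidth=5`, every bin is exactly five units wide and the number of bins follows from the data. With `bins=7`, it is the other way around.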

### Frequency on vertical axis

As mentioned, the default is to give the number of observations in each bin (the **count**) as
the value on the vertical axis. To show the density instead (the relative frequency divided by the
bin width, so that the areas of the bars sum to one), the y variable needs to be set to `..density..`,
as in `aes(x= kids2009, y = ..density..)`. In recent versions of **ggplot2**, this notation is deprecated
in favor of `after_stat(density)`, but it still works. In all other respects, the histogram is the same.
Here, we illustrate the use of `binwidth` by passing `binwidth=5` as an option to `geom_histogram`.


```{r}
ggplot(data=nyc.data,aes(kids2009,y=..density..)) +
  geom_histogram(binwidth=5)
```
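
The relation between count, relative frequency and density can be checked from the computed layer data. The sketch below again uses a made-up variable (`toy` is hypothetical). The `count` and `density` columns are what `layer_data` returns for a histogram layer.

```{r}
library(ggplot2)

# Hypothetical toy variable, standing in for kids2009
toy <- data.frame(x = c(12, 18, 21, 24, 27, 29, 31, 33, 36, 38, 41, 47))

d <- layer_data(ggplot(toy, aes(x)) + geom_histogram(binwidth = 5))

bin.width <- d$xmax - d$xmin          # 5 for every bin
rel.freq  <- d$count / sum(d$count)   # share of observations in each bin

# Density is the relative frequency divided by the bin width
all.equal(d$density, rel.freq / bin.width)

# The areas of the bars (density times width) add up to one
sum(d$density * bin.width)
```

So the shape of the histogram is the same in both cases, only the scale of the vertical axis changes.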



### By group

As before, we can now create a distinction between the distribution for the two subsets
in the data determined by the **manbronx** variable. First, we set the aesthetic `color` equal
to this variable as `aes(color=manbronx)` in the `geom_histogram` command.

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(color=manbronx),bins=7)
```

The result is not exactly what we had in mind. The reason is that color determines the outline
of the graph, but not the internal color, or `fill`. Instead, with `geom_histogram(aes(fill=manbronx))`,
we get the bars stacked on top of each other with a different color for each subset.

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(fill=manbronx),bins=7)
```

Still not that great at showing the distinction between the two subsets. Now, we resort to `facet_wrap`
to create a separate histogram for each subset. We highlight the distinction even more by setting
the fill color to **manbronx**, as we just did. The `facet_wrap( ~ manbronx)` will yield two 
histograms, in a different color, and with a legend.


```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_histogram(aes(fill=manbronx),bins=7) +
  facet_wrap(~ manbronx)
```

You can now spiff up these graphs with axis labels, a title, a label for the legend, and
even a custom theme.

## Frequency Polygon

Another common plot to depict a univariate distribution is the *frequency polygon*,
obtained with `geom_freqpoly`. It is
essentially the same as a histogram, but uses a set of connected points instead of
the bars for the histogram. Again, we have to specify a `binwidth`. A bare bones
example is given below, using the same commands as for the histogram, but with
`geom_freqpoly`.

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_freqpoly(binwidth=5)
```
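
To verify that the frequency polygon is based on the same binning as the histogram, we can compare the computed layer data for both. This is only a sketch with a made-up variable (`toy` is hypothetical). The polygon may carry an extra empty bin at each end, so that the line returns to zero.

```{r}
library(ggplot2)

# Hypothetical toy variable, standing in for kids2009
toy <- data.frame(x = c(12, 18, 21, 24, 27, 29, 31, 33, 36, 38, 41, 47))

d.hist <- layer_data(ggplot(toy, aes(x)) + geom_histogram(binwidth = 5))
d.poly <- layer_data(ggplot(toy, aes(x)) + geom_freqpoly(binwidth = 5))

# Bin midpoints and counts for the bars and for the connected points
d.hist[, c("x", "count")]
d.poly[, c("x", "count")]

# Both layers account for the same observations
sum(d.hist$count) == sum(d.poly$count)
```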

With the frequency polygon it is very easy to show the distribution for different 
subgroups  in the data, for example, by  setting `aes(color=manbronx)` (instead of using `facet_wrap`).
We use `bins=7` instead of `binwidth` to highlight the contrast with a histogram with
the same options.

```{r}
ggplot(data=nyc.data,aes(kids2009)) +
  geom_freqpoly(aes(color=manbronx),bins=7)
```

## Box Plot

The box plot, also referred to as Tukey's box and whisker plot, is an alternative way to visualize the distribution of a single variable, with a focus on descriptive statistics such as quartiles and the median. The 
corresponding geom is `geom_boxplot`. We continue our example using the **kids2009** variable. We first consider the default option, then move on to illustrate a few optional settings.

### Default Settings
The minimal arguments to create a boxplot are the `data` set and the x and y
variables passed to `aes`. As mentioned above, the logic behind the graphs in **ggplot** is two-dimensional, so both x and y need to be specified. The x variable is used to create separate box plots for different subsets of the data. In our simple example, we don't need this feature, so we set the x variable to an empty string, i.e., `""`.
The y variable is the actual variable of interest, **kids2009**. The resulting graph is shown below.


```{r}
ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_boxplot()
```

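Note the outlier shown as a dot (point) at the bottom of the graph. The box shows the first quartile,
the median, and the third quartile. The lines (whiskers) extend from the box to the most extreme
observations that still fall within the fences (1.5 times the interquartile range beyond the quartiles),
and any outliers are shown as distinct points.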
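
The elements of the box plot can be reproduced by hand in base R. The sketch below assumes the usual rule of 1.5 times the interquartile range and R's default `quantile` method, which, as far as we can tell, is also what `geom_boxplot` uses by default. The values in `toy.y` are made up.

```{r}
# Hypothetical toy values with one low outlier
toy.y <- c(10, 22, 25, 27, 30, 31, 33, 35, 38, 40)

# The box: first quartile, median, third quartile
q <- quantile(toy.y, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])

# The fences: cut-off values for classifying outliers
lower.fence <- unname(q[1]) - 1.5 * iqr
upper.fence <- unname(q[3]) + 1.5 * iqr

# Outliers fall outside the fences
is.out <- toy.y < lower.fence | toy.y > upper.fence
outliers <- toy.y[is.out]

# The whiskers end at the most extreme observations inside the fences
whiskers <- range(toy.y[!is.out])

q
c(lower.fence, upper.fence)   # 12 48
outliers                      # 10
whiskers                      # 22 40
```

Note that the whiskers end at actual data values (22 and 40), not at the fence values themselves (12 and 48).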

### By group

The design of the box plot in **ggplot** is really geared to show the distribution for
different subgroups. Each subgroup gets a box plot, arranged side by side for easy comparison.

For example, we now set `aes(x=manbronx,y=kids2009)` to obtain two box plots.

```{r}
ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_boxplot()
```

Note how the two categories are listed as labels on the x-axis. Compare this to
using `facet_wrap( ~ manbronx)` with `x=""` in the main `ggplot` command.



```{r}
ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_boxplot() +
  facet_wrap( ~ manbronx)
```

It is the same graph, but now the x-axis label is **x** (as in the generic box plot), and the
categories are listed at the top of each graph.

### Bells and whistles

As is, the default box plot is pretty rudimentary. We will illustrate the power of **ggplot** by adding a number of features to the plot in order to provide further information. First, add a title using `ggtitle`. As we did for the histogram, we will center the title over the graph. We keep the x-variable set to **manbronx** to obtain two
box plots and set the label for the y-axis to "Percent HH with Children".

```{r}
ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_boxplot() +
  ylab("Percent HH with Children") +
  ggtitle("Example Box Plot") +
  theme(plot.title = element_text(hjust=0.5))
```

Next, we want to give the box plot a color, following the variable **manbronx**. This is accomplished with the 
`color` (for the outlines) and `fill` (for the inside of the box) options to `aes` for `geom_boxplot`.
We add the legend label as **Selection** as we did before. However, because we now have two features
by subset, we need to specify the legend for both, as `labs(color="Selection",fill="Selection")`.

Also, the box plots do not show the fences as separate lines, the way they do in many
other software packages, e.g., as in GeoDa. This can
be remedied, but not quite in the same way as in GeoDa. In **ggplot**, the fences are
drawn at the location of the extreme values, the **ymin** and **ymax**,
and not at the location of the fence cut-off values, as in GeoDa. The fences are
obtained from the `stat_boxplot` function, by passing the `geom` as `errorbar`.

One final refinement. In GeoDa, the box plot also shows the locations of the actual observations as points on the central axis. We obtain the same effect by adding
`geom_point` with `color` **blue**. We draw the points first, and the box plot on
top of it, using the layers logic. However, we want to make sure that the central box
doesn't mask the points, which it does when the *transparency* is kept as the default.
To accomplish this, we set the `alpha` level for both points and box plot at **0.5**.

The result is now a quite fancy graph that provides a lot of information on the relative
distribution of the percentage households with children in the two subsets of the data.
This level of refinement comes at the expense of having to specify eight lines of code.

```{r}
ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_point(color="blue",alpha=0.5) +
  geom_boxplot(aes(color=manbronx,fill=manbronx),alpha=0.5) +
  stat_boxplot(geom="errorbar") +
  ylab("Percent HH with Children") +
  labs(color="Selection",fill="Selection") +
  ggtitle("Example Box Plot") +
  theme(plot.title = element_text(hjust=0.5))
```

## Violin Plot

A violin plot is a hybrid of a box plot and a density plot, showing regions of the
data where more observations are found. The basic setup is the same as for a box plot
(i.e., setting `x = ""` for a single plot), but now using the geom `geom_violin`.

```{r}
ggplot(data=nyc.data,aes(x="",y=kids2009)) +
  geom_violin()
```

Now, again, with some bells and whistles. We show the violin plot separately for each subset
by specifying `aes(x=manbronx,y=kids2009)` in the main `ggplot` command. Then we plot the
individual points using `geom_point` with `color="blue"` and `alpha=0.5`. We set the
`color` and `fill` for the `geom_violin` to **manbronx**, just as we did for the box plot.
Finally, we finish with setting the y-axis label, the legend label and the title.


```{r}
ggplot(data=nyc.data,aes(x=manbronx,y=kids2009)) +
  geom_point(color="blue",alpha=0.5) +
  geom_violin(aes(color=manbronx,fill=manbronx),alpha=0.5) +
  ylab("Percent HH with Children") +
  labs(color="Selection",fill="Selection") +
  ggtitle("Example Violin Plot") +
  theme(plot.title = element_text(hjust=0.5))
```

## Tidy Data and Graphs

Often, the data are not in a format that allows us to create the graphs that we want.
For example, what if we wanted to show, for each neighborhood, how the stability of the
neighborhood (the years households resided in the neighborhood) changed over the years?
We have this information in three variables, i.e., **yrhom02**, **yrhom05**, **yrhom08**.
But how could we show these for each neighborhood, when the aesthetics want us to 
define one x and one y variable? The reason this doesn't work is that the data set is
not in a *tidy* format, where each row pertains to a single data point. In our case,
that means we need both the neighborhood and the year for the variable yrhom, something
that is called a panel data format (both space and time).

I will briefly illustrate some of the features of the tidyverse, specifically the operation
to *gather* observations such that different columns are turned into rows.

To keep things simple, we first create a new data frame **stable**, which is a subset
of the variables: **code**, **name**, **yrhom02**, **yrhom05**, **yrhom08**, and **manbronx**.
We use the pipe and the `select` command. 
To check the result, we print the contents of the new data frame.

```{r}
stable <- nyc.data %>% select(code,name,yrhom02,yrhom05,yrhom08,manbronx)
stable
```

The new data frame only contains information on **yrhom**, the median number of years households lived in the
neighborhood. So, really, the different **yrhomxx** variables pertain to different years of the same thing.
To simplify matters, and to illustrate how to deal with non-standard variables in the *tidyverse*, we will
use `rename` to turn the three variable names into the respective years. Because the years are non-standard
variable names, we need to enclose them in backticks, as \`2002\`, etc. We use the pipe to rename the three variables,
and print the data frame to check the result.

```{r}
stable <- stable %>% rename(`2002`=yrhom02,`2005`=yrhom05,`2008`=yrhom08)
stable
```
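
The reason for the backticks is that a name starting with a digit is not a *syntactic* name in R. A small base R sketch (the data frame `toy.yr` is made up for illustration):

```{r}
# make.names() shows what R would turn "2002" into to obtain a syntactic name
make.names("2002")   # "X2002"

# Hypothetical toy data frame that keeps the year as a column name;
# check.names = FALSE prevents the automatic conversion to X2002
toy.yr <- data.frame(name = c("A", "B"), `2002` = c(5, 7), check.names = FALSE)
names(toy.yr)        # "name" "2002"

# Backticks are needed to refer to such a column by name
toy.yr$`2002`
```

Without the backticks, R would read `2002` as a number rather than as a variable name.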

Now, we illustrate the very powerful (but not always easy to understand) `gather` function. What we want to
do is to create a new variable to designate the observations for different years. Those years are `2002`, `2005`, and
`2008`. Right now, these are separate columns with the information on neighborhood stability for each neighborhood
in the respective year. What we want is a data frame where each *observation* has both a neighborhood and a year, and 
a value for the stability (i.e., the median number of years households lived there). To do this by hand, we would have to copy
the neighborhood code, name and manbronx classification, and then manually enter the year and the corresponding value
for **yrhom**. For each neighborhood, we would have three rows, one for each year. In addition to the neighborhood
code, name and manbronx classification, we would have two extra columns: one with the year, and one with the 
value for the *stability* (years lived in neighborhood). 

In a *tidy* data frame, the name for the column with the years (i.e., the observations we are *gathering*) is called a `key`.
We have to give this a name that makes sense, e.g., `key=year`. The values for the observations on `year` will be
the column headings of the variables we want to *gather*, i.e., the three years `2002`, `2005` and `2008`.
The values that we are interested in are the number of years resided in the neighborhood, but these don't
currently have a variable name. They are contained under the `2002`, `2005` and `2008` variable headings.
The name for this new variable is called a `value`. Again, we have to specify a name for it, e.g.,
`value=stable`. So now, we have two pieces: we have the name for the new variable that will label the extra
observations by year (the `key`), and we have a name for the variable that will contain the relevant
values (the `value`). Now, we still need to specify where those values will come from. These are the 
three columns `2002`, `2005`, and `2008`.

The command is `` gather(key=year,value=stable,`2002`,`2005`,`2008`) ``. What this does is create a new column
for `year`, a new column for `stable`, it assigns the values of `2002`, `2005` and `2008` to `year`, and
takes the matching values and assigns them to the corresponding row for the column `stable`.

Let's again use the pipe and assign the result to a new data frame **stneigh**.

```{r}
stneigh <- stable %>% gather(key=year,value=stable,`2002`,`2005`,`2008`)
stneigh
```
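
Since the full listing is rather long, it may help to see the same operation on a tiny made-up example, where every value can be traced. The data frames `toy.wide` and `toy.long` are hypothetical, with two neighborhoods and three years.

```{r}
library(tidyr)

# Hypothetical wide data: one row per neighborhood, one column per year
toy.wide <- data.frame(name = c("A", "B"),
                       `2002` = c(5, 7), `2005` = c(6, 8), `2008` = c(9, 10),
                       check.names = FALSE)
toy.wide

# Gather the three year columns into a key column (year) and a value column (stable)
toy.long <- gather(toy.wide, key = year, value = stable, `2002`, `2005`, `2008`)
toy.long

# Two neighborhoods times three years gives six rows
nrow(toy.long)   # 6
```

Each value from the original three year columns now sits in its own row, labeled by both the neighborhood name and the year.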

We can now create a plot of the value of **stable** for each of the three years in each of the neighborhoods.
One way (but not a very pretty one) is to do this as a scatter plot using `geom_point` with `x=stable` and
`y=name`. For example, using `theme_classic()` and giving the points a different color for each year.

```{r}
ggplot(data=stneigh,aes(x=stable,y=name)) +
  geom_point(aes(color=year)) +
  theme_classic()
```

This can be much improved upon, but I leave that as a so-called *exercise*.

Alternatively, it is now very easy to create a box plot for each year next to each other,
by setting `x=year` and `y=stable` for the box plot command.

```{r}
ggplot(data=stneigh,aes(x=year,y=stable)) +
  geom_boxplot()
```

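This is just a simple example of how `gather` can be used to turn a so-called *wide* data set into a
*narrow* form. Similarly, the command `spread` can be used to do the reverse. That is beyond our current
scope, but check the documentation for details. Note that in recent versions of **tidyr**, `gather` and
`spread` are superseded by `pivot_longer` and `pivot_wider`, although the older functions continue to work.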
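
For readers who want a quick look anyway, the sketch below applies `spread` to a tiny made-up data frame in narrow form (`toy.narrow` is hypothetical, with two neighborhoods and three years).

```{r}
library(tidyr)

# Hypothetical narrow data: one row per neighborhood and year
toy.narrow <- data.frame(name = rep(c("A", "B"), times = 3),
                         year = rep(c("2002", "2005", "2008"), each = 2),
                         stable = c(5, 7, 6, 8, 9, 10))

# Spread the values of year into separate columns, filled with stable
toy.spread <- spread(toy.narrow, key = year, value = stable)
toy.spread

names(toy.spread)   # "name" "2002" "2005" "2008"
```

The result has one row per neighborhood again, with a separate column for each year.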









