The Grammar of Graphics

The Grammar of Graphics is built on three components -data, mappings, and geometries. The ggplot2 package incorporates this approach explicitly in creating graphics.

First we’ll load the packages we will need first.

Let’s take a built-in dataset, diamonds, and get an understanding what the data look like.

We can also look at the data dictionary for this dataset using ?diamonds in the Console. The least transparent of the variables are the x, y, and z which in the data dictionary clarify that they refer to the length, width, and depth in mm.

We also see that we have some categorical variables cut and clarity as well as some continuous variables carat and price, for example.

Knowing the informational value of our variables is key to creating an appropriate visualization. In the Grammar of Graphics we are going to take some data, in this case diamonds, then map some variables to a variable space with the aes() function, and then decide how to display these variables in a geometric space with the set of geom_x functions. Let’s see an example. First we can set up the data.

At this point there is data, but no mappings nor geometries to be represented. Let’s add a mapping of the color variable to the x-axis and plot that as a bar plot.

Since this is a categorical variable, the y-axis is the count. We can augment this graphic by adding a fill aesthetic mapped to the cut variable.

We can also add a third categorical dimension to this plot with a facet. The facet_wrap() function can be added to the graphic represent a new plot for each level of the categorical variable.

Now let’s look at the continuous variable price. A histogram is a good way to visualize a single continuous variable.

If we would like to see a combination of a categorical and continuous variable, a box plot is a often a good choice.

A box plot, as you see, displays boxes for each of the grouping values in ‘color’. The boxes contain a lot of distributional information. Namely, we get a visual display of much of the information we get when we run ‘summary()’ on a continuous variable. Min and max values are marked with the vertical line, the top and bottom of the box correspond to the 1st and 3rd quantile, and the bold line is the median. Dots on the vertical line are marked as outliers. It’s a pretty powerful visualization!

We could also add a fill aesthetic to boxplots too, just as we did with bar plots earlier.

And we can add another categorical variable through faceting.

Another type of common visualization situation is when we want to compare to continuous variables. This is where the scatterplot comes in. Let’s compare the carat and the price. The geom_point() function will map points to the x and y values.

We can also add a trend line by adding another layer with the geom_smooth() function. But to do this we need to pass the same mappings to both the geom_point() and geom_smooth() geometries. We can do this redundantly for each geometry, or we can take a different approach to adding the data and aesthetic mappings than we have so far. The alternative is seen below.

To add a linear trendline, we can add a specific method to the geom_smooth() function.

As you might imagine, we can add a color aesthetic to indicate those points that are of certain clarity. Note: after a %>% or + you can add a line break if you find that it helps make your code more legible.

That’s a quick rundown of the basics of the Grammar of Graphics and the ggplot2 package. There is so much more ggplot2 can do and surely more that you want to know how to do. A good place to start is the Ggplot2 package website. I also recommend the “R Graphics Cookbook” by Winston Chang.