Intro to ggplot2

Elliot Shannon

2023-04-13

What is ggplot2

ggplot2 is a versatile and elegant R package for visualizing data. It is a member of the tidyverse family of packages.

What is ggplot2

You can make all sorts of plots with ggplot2

Introduction

R for Data Science is a great resource which is freely available online. We will be following the material from Chapter 3.

Introduction

  • ggplot2 implements the grammar of graphics to describe and build figures and graphs
  • By learning one system and applying it in many situations, we can do more, faster

Introduction

  • Often, when we first work with a new dataset, we use data visualization to better understand the data and look for any potential patterns

  • ggplot2 fits right into our tidyverse workflow, and will be our tool for the job

Motivating Dataset

Recall the FEF_trees.csv dataset.

library(tidyverse)
trees <- read_csv("./data/FEF_trees.csv")
glimpse(trees)
Rows: 88
Columns: 18
$ watershed         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot              <dbl> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species           <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in            <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft         <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg     <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg      <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg   <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg       <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg        <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg   <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg   <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg   <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg     <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…

First Steps

  • Question: Do taller trees have greater DBH (diameter at breast height) than shorter trees?
  • What does this relationship look like? Is it positive? Negative? Linear? Nonlinear?
  • In our trees tibble, the two relevant columns are dbh_in (DBH in inches) and height_ft (height in feet)

First Steps

# Create a scatterplot of dbh_in and height_ft
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft))

First Steps

  • We begin our plot with the ggplot() function
  • ggplot() creates a coordinate system that we can add layers to
  • ggplot() takes a dataset as its first argument (here, data = trees)
# Create an empty graph
ggplot(data = trees)

First Steps

  • Next, we add one or more layers to ggplot()
  • geom_point() adds a layer of points
  • There are many different geom layers that can be added to a ggplot()
# Add geom_point() layer
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft))

First Steps

  • Each geom function takes a mapping argument
  • This argument defines how variables in data are mapped to visual properties
  • The mapping argument is always paired with aes(), whose x and y arguments specify which variables go on the x and y axes
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft))

First Steps

  • We can turn this into a general graphing template
  • We will frequently use this structure
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
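
As a rough illustration, the template can be filled in with a different geom. This sketch assumes a small stand-in data frame, toy_trees (a hypothetical name), built from the first eight dbh_in and height_ft values visible in the glimpse() output. It swaps in geom_line() purely to show that only the <GEOM_FUNCTION> slot changes.

```r
library(ggplot2)

# Stand-in data: the first eight dbh_in / height_ft values shown in the
# glimpse() output (the real tibble has 88 rows)
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0)
)

# <DATA> = toy_trees, <GEOM_FUNCTION> = geom_line, <MAPPINGS> = x and y
p_line <- ggplot(data = toy_trees) +
  geom_line(mapping = aes(x = dbh_in, y = height_ft))

p_line
```

Storing the plot in a variable (here p_line) is optional; typing its name displays it.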

Aesthetic Mappings

  • We may be interested in digging even deeper into this graph
  • Are there any species-specific patterns?

Aesthetic Mappings

  • Recall that trees contains a column called species
  • We can add this third variable to a two dimensional scatterplot by mapping it to an aesthetic
  • Aesthetics include things like size, shape, and color of your points.

Aesthetic Mappings

  • We can use the following code to color each point by species
  • We see that the largest and tallest trees are Prunus serotina
# Color points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species))

Aesthetic Mappings

  • Here we added the color argument to the aes() function in the mapping for our points
  • We set color = species, where species is a column in our trees tibble
# Color points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species))

Aesthetic Mappings

  • ggplot2 uses scaling to automatically assign a unique level of the aesthetic to each unique value of the variable
  • In this case, each species is automatically assigned a unique color
# Color points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species))

Aesthetic Mappings

  • We could just as easily map species to the size aesthetic instead of color
  • However, this is not advised. Why?
# Size points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, size = species))

Aesthetic Mappings

  • It would make more sense to map size to something like allwoody_dry_kg
  • What does this plot show?
# Size points by dry woody biomass
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, size = allwoody_dry_kg))

Aesthetic Mappings

  • Another common aesthetic is shape
# Shape points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, shape = species))

Aesthetic Mappings

  • Another is alpha, which controls the transparency of the points
# Set transparency of points by dry woody biomass
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, alpha = allwoody_dry_kg))

Aesthetic Mappings

  • For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display
  • Note that x and y are themselves aesthetics!
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft))

Aesthetic Mappings

  • Once you map an aesthetic, ggplot2 takes care of selecting a reasonable scale and constructing a legend
  • You can also set aesthetic properties manually
# Manually set point color to red
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft), color = "red")

Aesthetic Mappings

  • Note that the color argument in this case goes outside of the aes() function, since we are not mapping a variable to an aesthetic
# Manually set point color to red
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft), color = "red")
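
The difference between setting and mapping can be checked directly. The sketch below is an illustration only: toy_trees is a hypothetical stand-in built from the first eight rows visible in the glimpse() output, and layer_data() is used to inspect the colors ggplot2 actually assigns to the points.

```r
library(ggplot2)

# Stand-in data taken from the first eight rows of the glimpse() output
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0)
)

# Setting: color is outside aes(), so every point is drawn in red
p_set <- ggplot(data = toy_trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft), color = "red")

# Mapping by mistake: inside aes(), "red" is treated as a one-level
# variable, so ggplot2 picks a color from its default scale instead
p_map <- ggplot(data = toy_trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = "red"))

unique(layer_data(p_set)$colour)  # the color we set
unique(layer_data(p_map)$colour)  # a default scale color, not "red"
```

The second plot also gains a legend with a single entry labelled "red", which is usually the first visible sign of this mistake.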

Aesthetic Mappings

  • We can set colors with color strings (e.g. "red")
  • We can set point sizes in mm
  • We can set point shapes as a number (e.g. 17 for a filled triangle)
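
A minimal sketch of setting all three properties at once follows. The data frame toy_trees is a hypothetical stand-in built from the first eight rows visible in the glimpse() output, and the particular values (dark green, 3 mm, shape 17) are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in data taken from the first eight rows of the glimpse() output
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0)
)

# color, size, and shape are all set outside aes(), so they apply
# uniformly to every point and no legend is drawn
p_manual <- ggplot(data = toy_trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft),
             color = "darkgreen", size = 3, shape = 17)

p_manual
```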

Common Problems

  • A common problem with ggplot2 is the placement of the +
  • The + has to come at the end of a line, not the start
# WRONG
ggplot(data = trees)
  + geom_point(mapping = aes(x = dbh_in, y = height_ft))

# RIGHT
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft))

Facets

  • Facets are a particularly useful way to display categorical variables
  • Recall our scatterplot colored by species
# Color points by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species))

Facets

  • It may be useful to instead split this plot into facets, with different subplots for each species
  • Here we use facet_wrap(), whose first argument is a formula created with ~ followed by the name of a categorical variable

Facets

# Create subplots by species
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft)) +
  facet_wrap(~ species)

Facets

  • We can facet our plot on the combination of two variables
  • Recall the watershed column in the trees tibble
  • We will use facet_grid() to accomplish this

Facets

# Create subplots by both species and watershed
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft)) +
  facet_grid(watershed ~ species)
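
The layout of the panels can also be controlled. The sketch below is a hedged example: toy_trees is a hypothetical stand-in whose numeric columns come from the first eight rows of the glimpse() output, while the species labels "species_a" and "species_b" are made-up placeholders, not values from the real dataset. It uses the nrow argument of facet_wrap() to place all panels in a single row.

```r
library(ggplot2)

# Stand-in data; the species labels are placeholders for illustration
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0),
  species   = rep(c("species_a", "species_b"), each = 4)
)

# One panel per species, all arranged in a single row
p_facet <- ggplot(data = toy_trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft)) +
  facet_wrap(~ species, nrow = 1)

p_facet
```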

Geometric Objects

  • We can represent our data using different geometries
  • For example, we don't have to use points; we can use a smooth line
# Plot dbh_in vs height_ft using a smooth line
ggplot(data = trees) +
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft))

Geometric Objects

  • Here, geom_point() was replaced with geom_smooth()
  • The same mapping arguments were given, specifying x and y in the aes() function
# Plot dbh_in vs height_ft using a smooth line
ggplot(data = trees) +
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft))

Geometric Objects

  • Every geom function takes a mapping argument
  • However, not every aesthetic works with every geom
  • For instance, you can set the shape of a point, but you can’t set the shape of a line
  • However, you could set the linetype of a line

Geometric Objects

# Plot dbh_in vs height_ft using a different line for each species
ggplot(data = trees) +
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft, linetype = species))

Geometric Objects

  • We can also make a graph like this, where both points and lines are displayed with data colored by species.

Geometric Objects

  • The code to make the previous figure is shown below.
  • Notice that we are including two geoms!
# Plot data as both points and lines colored by species
ggplot(data = trees) + 
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) + 
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft, color = species))

Geometric Objects

  • We can remove the crowded standard error bands using se = FALSE
# Plot data as both points and lines colored by species
ggplot(data = trees) + 
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) + 
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft, color = species), 
              se = FALSE)

Geometric Objects

  • We can avoid repetition by passing the shared mappings to ggplot(); every geom then inherits them
# Original
ggplot(data = trees) + 
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) + 
  geom_smooth(mapping = aes(x = dbh_in, y = height_ft, color = species), 
              se = FALSE)

# Less repetition
ggplot(data = trees, mapping = aes(x = dbh_in, y = height_ft)) + 
  geom_point(mapping = aes(color = species)) + 
  geom_smooth(mapping = aes(color = species), se = FALSE)
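
When every layer uses the same color mapping, it can move into ggplot() as well. The sketch below rests on several assumptions: toy_trees is a hypothetical stand-in, its species labels are made-up placeholders, and geom_smooth() is given method = "lm" (a straight-line fit rather than the default smoother) only because the toy data has too few rows for the default to behave well.

```r
library(ggplot2)

# Stand-in data; the species labels are placeholders for illustration
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0),
  species   = rep(c("species_a", "species_b"), each = 4)
)

# x, y, and color are all global, so neither geom needs its own mapping
p_global <- ggplot(data = toy_trees,
                   mapping = aes(x = dbh_in, y = height_ft, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_global
```

A mapping given inside an individual geom still applies to that layer only, so global and local mappings can be mixed as needed.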

Statistical Transformations

  • The last graph we’ll cover today is the histogram

Statistical Transformations

  • To make a histogram, we use geom_histogram() and give it an x variable to bin our data by
ggplot(data = trees) +
  geom_histogram(mapping = aes(x = dbh_in))

Statistical Transformations

  • Here, we give only an x aesthetic; ggplot2 computes the count for each bin and plots it on the y axis
ggplot(data = trees) +
  geom_histogram(mapping = aes(x = dbh_in))
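
By default geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen. A minimal sketch of choosing the bin width explicitly is shown below; toy_trees is a hypothetical stand-in built from the first eight dbh_in values in the glimpse() output, and the 1-inch bin width is an arbitrary choice. layer_data() exposes the counts that ggplot2 computed.

```r
library(ggplot2)

# Stand-in data taken from the first eight rows of the glimpse() output
toy_trees <- data.frame(
  dbh_in = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1)
)

# Bins are 1 inch wide instead of the default 30 bins
p_hist <- ggplot(data = toy_trees) +
  geom_histogram(mapping = aes(x = dbh_in), binwidth = 1)

# The computed bin counts; they add up to the number of rows
bin_counts <- layer_data(p_hist)$count
sum(bin_counts)
```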

Statistical Transformations

  • We can add other aesthetics such as fill to reveal more patterns in the data
ggplot(data = trees) +
  geom_histogram(mapping = aes(x = dbh_in, fill = species))

Statistical Transformations

  • Or we can use facets as before to make individual sub-plots for each species
# Create subplots by species
ggplot(data = trees, mapping = aes(fill = species)) +
  geom_histogram(mapping = aes(x = dbh_in)) +
  facet_wrap(~ species)

Main and Axis Titles

  • We often want to add a main title to our graphs, as well as change the axis titles
  • We can use the following functions:
    • ggtitle()
    • xlab()
    • ylab()
  • All we have to do is give each function a string argument

Main and Axis Titles

# Create plot with main and axis titles
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) +
  ggtitle("Species-specific DBH vs. Height") +
  xlab("DBH (in)") +
  ylab("Height (ft)")

Main and Axis Titles

To center the title, we add theme() and set plot.title to element_text(hjust = 0.5), where hjust = 0.5 means halfway across the plot

# Create plot with centered main title and axis titles
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) +
  ggtitle("Species-specific DBH vs. Height") +
  xlab("DBH (in)") +
  ylab("Height (ft)") +
  theme(plot.title = element_text(hjust = 0.5))

Main and Axis Titles

We can then use element_text() arguments such as family, face, color, and size to customize our labels

# Create plot with fun main and axis titles
ggplot(data = trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft, color = species)) +
  ggtitle("Species-specific DBH vs. Height") +
  xlab("DBH (in)") +
  ylab("Height (ft)") +
  theme(plot.title = element_text(color = "purple", size = 40, hjust = 0.5),
        axis.title.x = element_text(color = "darkgreen", face = "bold"),
        axis.title.y = element_text(color = "red", size = 25, family = "Times"))
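
As an alternative sketch, the three title functions can be replaced by a single call to labs(), which accepts title, x, and y arguments. The data frame toy_trees is a hypothetical stand-in built from the first eight rows of the glimpse() output, and the title text is shortened because the stand-in has no species column.

```r
library(ggplot2)

# Stand-in data taken from the first eight rows of the glimpse() output
toy_trees <- data.frame(
  dbh_in    = c(6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1),
  height_ft = c(48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0)
)

# labs() sets the main title and both axis titles in one call
p_labs <- ggplot(data = toy_trees) +
  geom_point(mapping = aes(x = dbh_in, y = height_ft)) +
  labs(title = "DBH vs. Height", x = "DBH (in)", y = "Height (ft)") +
  theme(plot.title = element_text(hjust = 0.5))

p_labs
```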
