14  Visualization

This chapter covers the basics of {ggplot2}, the tidyverse plotting library, which is based on Hadley Wickam’s The Grammar of Graphics.

Wickham, Navarro, and Lin Pedersen have written an entire open source ebook about ggplot, the ggplot2 book, that is far more detailed than this chapter. I recommend you start with the The Grammar chapter.

14.1 Why plot data?

Summary statistics aren’t enough! These datasets all have nearly identical summary statistics, but very different underlying data.

# M and SD
datasaurus_dozen |>
  group_by(dataset) |>
  summarize(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y)) |>
  mutate_if(is.numeric, round_half_up, digits = 2) |>
  kable(align = 'r')|>
  kable_classic(full_width = FALSE)
dataset mean_x sd_x mean_y sd_y
away 54.27 16.77 47.83 26.94
bullseye 54.27 16.77 47.83 26.94
circle 54.27 16.76 47.84 26.93
dino 54.26 16.77 47.83 26.94
dots 54.26 16.77 47.84 26.93
h_lines 54.26 16.77 47.83 26.94
high_lines 54.27 16.77 47.84 26.94
slant_down 54.27 16.77 47.84 26.94
slant_up 54.27 16.77 47.83 26.94
star 54.27 16.77 47.84 26.93
v_lines 54.27 16.77 47.84 26.94
wide_lines 54.27 16.77 47.83 26.94
x_shape 54.26 16.77 47.84 26.93
# correlation
datasaurus_dozen |>
  group_by(dataset) |>
  summarize(correlation = cor(x, y)) |>
  mutate_if(is.numeric, round_half_up, digits = 2) |>
  kable(align = 'r') |>
  kable_classic(full_width = FALSE)
dataset correlation
away -0.06
bullseye -0.07
circle -0.07
dino -0.06
dots -0.06
h_lines -0.06
high_lines -0.07
slant_down -0.07
slant_up -0.07
star -0.06
v_lines -0.07
wide_lines -0.07
x_shape -0.07

Always plot your data! Even if is only exploratory or used as a sanity check, and only presented in your supplementary materials, this should be part of data analysis process.

ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset) +
  theme_minimal()

14.2 Structure of a ggplot

14.2.1 Aesthetics

aes() is for ‘aesthetics’ mappings, which describe how variables in the data are mapped to visual properties (aesthetics) of geoms.

In this case, the x axis is set to the column self_esteem, and the y axis is set to the column depression.

Because no geoms have been defined yet, nothing is displayed other than the base plot and its axes.

p <- ggplot(data = dat_likert, aes(x = self_esteem, y = depression))
p

14.2.2 Geoms

Whereas data processing steps are combined with the pipes (%>% and |>), ggplot layers are added with +.

We can add the geom geom_point() to the plot and print it again. geom_point() includes data points for each X-Y coordinate in the data. The alpha argument sets the opacity/transparency of the data points: setting it to .5 gives us some indication of overlapping data points.

p <- p + geom_point(alpha = 0.5) # make data points partially transparent using alpha < 1
p

Note that, behind the scenes, geom_point() is a wrapper function for layer():

p + layer(
  geom = "point", 
  stat = "identity",
  position = "identity"
)

ggplot layers geoms on top of one another.

We can add a second layer with another geom, geom_smooth(). When method = "lm", geom_smooth() adds a linear regression line and its Standard Error.

p <- p + geom_smooth(method = "lm") # linear regression line using lm()
p

14.2.3 Axes

Axis names, scales, labels, and breaks can each be separately controlled using functions such as scale_x_continuous() and scale_y_continuous().

p <- p + 
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression")

p

14.2.4 Themes

Theme can be customised, or a number of built-in theme can be used such as theme_minimal() and theme_linedraw().

p <- p + theme_linedraw()
p

14.2.5 Combined

We can put all these layers together in one chunks:

ggplot(data = dat_likert, aes(x = self_esteem, y = depression)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression") +
  theme_linedraw() 

14.3 Colors

14.3.1 Individual colors

Colors can be chosen directly, including many named colors in R. You can also use hex codes via many hex code picker tools that you can google for.

For example:

ggplot(data = dat_likert, 
       aes(x = self_esteem, 
           y = depression)) +
  geom_point(alpha = .9, color = "seagreen4") +
  geom_smooth(method = "lm", color = "maroon4") +
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression") +
  theme_linedraw() 

14.3.2 Color palettes

Many plots use multiple colors, e.g., one for each group. Entire palettes of colors exist, including those in the popular RColroBrewer package:

library(RColorBrewer)

display.brewer.all()

Example using color brewer:

ggplot(data = dat_likert, 
       aes(x = self_esteem, 
           y = depression, 
           color = gender)) + # add gender to the aesthetics call for color
  geom_point(alpha = .9) +
  geom_smooth(method = "lm", alpha = 0.2) + 
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression") +
  theme_linedraw() +
  scale_color_brewer(palette = "Set2")  # use brewer for the `color` aesthetic, where the mapped variable is discrete/categorical

14.3.3 Accessibility

Vision scientists have put a lot of thought into developing and testing various color palettes such as viridis, which I particularly recommend.

Viridis is multi color-blind friendly, as well as having the property of “monotonic luminosity”, which means that if you print it out in black and white the color gradient still makes sense.

Example using continuous variable:

ggplot(data = dat_likert, 
       aes(x = self_esteem, 
           y = depression, 
           color = depression)) + # add depression to the aesthetics call for color
  geom_point(alpha = .9) +
  geom_smooth(method = "lm", color = "black") + # you can override the global color aesthetic for this geom
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression") +
  theme_linedraw() +
  scale_color_viridis_c(direction = -1) # use viridis_c for the `color` aesthetic, where the mapped variable is continuous

Example using categorical variable:

ggplot(data = dat_likert, 
       aes(x = self_esteem, 
           y = depression, 
           color = gender)) + # add gender to the aesthetics call for color
  geom_point(alpha = .9) +
  geom_smooth(method = "lm", alpha = 0.2) +
  scale_x_continuous(breaks = breaks_pretty(n = 7), name = "Self-esteem") + 
  scale_y_continuous(breaks = breaks_pretty(n = 7), name = "Depression") +
  theme_linedraw() +
  scale_color_viridis_d() # use viridis_d for the `color` aesthetic, where the mapped variable is discrete/categorical

Viridis has multiple component palettes to choose from.

library(viridis)
library(unikn)

n <- 10  # number of colors

seecol(list(magma = magma(n),
            inferno = inferno(n),
            plasma = plasma(n),
            viridis = viridis(n),
            cividis = cividis(n),
            rocket = rocket(n),
            mako = mako(n)), 
       col_brd = "white", lwd_brd = 4, 
       main = "Viridis color palettes (n = 10 from each)")

14.4 Overriding aesthetics and data

Sometimes we want to override the aesthetics we set for the whole plot in the ggplot() call for a specific geom. For example, imagine we want to add a vertical line at the mean AMP score:

ggplot(data = data_after_exclusions,
         aes(x = amp_score)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(aes(xintercept = mean(amp_score)), # aes() call that applies to just this geom
             linetype = "dotted", 
             color = "magenta", 
             size = .9) + 
  theme_linedraw() 

The same applies to data: specific geoms can receive data from other sources via the data = argument.

14.5 Plot types and geoms

14.5.1 Histogram using geom_histogram()

14.5.1.1 Simple plot for self-reports

ggplot(data = data_after_exclusions,
       aes(x = mean_self_report)) + 
  geom_histogram(binwidth = 1)

14.5.1.2 Slightly better plot for self-reports

ggplot(data = data_after_exclusions,
       aes(x = mean_self_report)) +
  # more intelligent choices for the binwidth and boundary
  geom_histogram(binwidth = 1, boundary = 0.5) +
  # labeling of the axis points
  scale_x_continuous(breaks = scales::breaks_pretty(n = 7),
                     limits = c(0.5, 7.5)) +
  scale_y_continuous(breaks = seq(0, 60, 10)) +
  theme_linedraw()

14.5.1.3 Change binwidth, centre, boundary

With continuous variables, changing the boundary (where the bins start) and the binwidth (or number of bins) can dramatically change the apparent distribution. No one setting is correct as all are technically accurate, but some are more informative or misleading than others.

Boundary:

ggplot(data = data_after_exclusions,
       aes(x = amp_score)) +
  geom_histogram(binwidth = 0.1, boundary = 0) +
  theme_linedraw()

ggplot(data = data_after_exclusions,
       aes(x = amp_score)) +
  geom_histogram(binwidth = 0.1, boundary = 0.05) +
  theme_linedraw()

Binwidth:

ggplot(data = data_after_exclusions,
       aes(x = amp_score)) +
  geom_histogram(binwidth = 0.01, boundary = 0) +
  theme_linedraw()

ggplot(data = data_after_exclusions,
       aes(x = amp_score)) +
  geom_histogram(binwidth = 0.1, boundary = 0) +
  theme_linedraw()

ggplot(data = data_after_exclusions,
       aes(x = amp_score)) +
  geom_histogram(binwidth = 0.25, boundary = 0) +
  theme_linedraw()

14.5.1.4 Exercise

Plot for age.

Create a similar plot for the age variable in data_processed (ie before exclusions).

14.5.2 Density plot using geom_density()

14.5.2.1 Simple plot for self-reports

ggplot(data = data_after_exclusions,
       aes(x = mean_self_report)) +
  geom_density(adjust = 1, # the degree of smoothing can be adjusted here 
               color = "#FF69B4",
               fill = "darkblue", 
               alpha = 0.3) 

14.5.2.2 Exercise

Plot for AMP

Make a similar density plot for the AMP.

  • Add a theme.
  • Change the colors.
  • Make the X axis breaks prettier.
  • Name both axis names more clearly.

14.5.3 Bar plot using geom_col()

Bar plots are bad and usually shouldn’t be used. But they are sometimes unavoidable, so here’s how to make them.

14.5.3.1 Pre-calculation of the summary statistics

library(plotrix)

# create the summary values to be plotted
summary_amp <- data_after_exclusions %>%
  group_by(gender) %>%
  summarize(amp_mean = mean(amp_score),
            amp_se = plotrix::std.error(amp_score))

# plot these values
ggplot(data = summary_amp, 
       aes(x = gender, 
           y = amp_mean)) +
  geom_col() + # equivalent to `geom_bar(stat = "identity")`
  geom_errorbar(aes(ymin = amp_mean - amp_se*1.96,     # mean ± se*1.96 = 95% confidence interval
                    ymax = amp_mean + amp_se*1.96),
                width = 0.2) 

14.5.3.2 Calculation of summary stats inside the ggplot call

ggplot(data = data_after_exclusions, 
       aes(x = gender, 
           y = amp_score)) +
  stat_summary(fun = mean,
               geom = "bar") +
  stat_summary(fun.data = mean_cl_normal,  # confident intervals
               geom = "errorbar",
               width = 0.2) +
  labs(title = "Means and their 95% CIs",
       x = "Gender",
       y = "Mean AMP score") +
  theme_linedraw() 

14.5.3.3 Exercise

Plot for self-reports

Make a similar plot for the self-reports.

  • Use coord_flip() to swap the X and Y axes.
  • Add a theme.
  • Change the size and shapes of the points.
  • Name both axis names more clearly.

14.5.3.4 Exercise

Where and when to wrangle data? Ahead of time in the data processing, or in the ggplot call itself? Both are possible, sometimes one is easier than the other.

How to capitalize ‘Male’ and ‘Female’? You could wrangle before plotting, or change the labels in the guides(). Do both below.

14.5.4 Raincloud plots

Now that you’ve learned how to make bar plots, don’t use them. Use a much more informative plot, such as a raincloud plot. These contain distributional information, raw data, and estimates (e.g., means) and their 95% CI or a box plot.

The choice of geom for your needs is not merely an aesthetic or normative preference; there are many published articles on how some types of plots convey information (e.g., about central tendency) better than others, most of which suggest that bar plots are not ideal.

library(ggrain)
library(forcats)
library(scales)

data_empathy_plotting <- read_csv("../data/processed/data_empathy_plotting.csv") |>
  filter(scale == "IRI_total") |>
  mutate(timepoint = case_when(timepoint == 1 ~ "Pre",
                               timepoint == 2 ~ "Post"),
         timepoint = fct_rev(timepoint)) # reverse factor order

ggplot(data_empathy_plotting, 
       aes(timepoint, score, fill = timepoint)) +
  geom_rain(rain.side   = 'l',
            id.long.var = "unique_id",
            line.args = list(alpha = .1),
            point.args  = list(alpha = .5),
            violin.args = list(alpha = .5),
            # make the boxplot invisible if you don't want it
            # completely hide boxplot
            boxplot.args = list(outlier.shape = NA,
                                colour = NA,
                                fill   = NA,
                        alpha  = 0,
                        linewidth = 0)) +
  stat_summary(fun = mean,
               geom = "point",
               size = 2.5,
               position = position_nudge(x = .15)) +
  stat_summary(fun.data = mean_cl_normal,
               geom = "linerange",
               position = position_nudge(x = .15)) +
  scale_x_discrete(name = "Time point") +
  scale_y_continuous(breaks = scales::breaks_pretty(n = 6), name = "Score") +
  scale_fill_manual(values = c("dodgerblue", "darkorange")) + # custom fill colors
  guides(fill = 'none') + # suppress legend
  theme_linedraw() +
  ggtitle("Interpersonal Reactivity Index") 

14.6 Example of a moderately complex plot

plot_1 <- 
  # data and aesthetics calls
  ggplot(data = data_after_exclusions,
         aes(x = mean_self_report,
             y = amp_score,
             color = gender,
             shape = gender)) +
  # draw lines at the neutral points for both measures
  geom_vline(xintercept = 4, linetype = "dotted") +
  geom_hline(yintercept = 0.5, linetype = "dotted") +
  # draw geoms using the aesthetics (x, y, color and shape)
  ## points
  geom_point() +
  ## fit curves, in this case a linear model
  geom_smooth(method = "lm", alpha = 0.3) +
  # adjust axis labels and ranges
  scale_x_continuous(name = "Explicit evaluation\n(Self-report)",
                     breaks = scales::breaks_pretty(n = 7)) +
  scale_y_continuous(name = "Implicit evaluation\n(Affect Misattribution Procedure)") +
  # apply a theme
  theme_linedraw() + 
  # adjust elements of the theme
  labs(title = "Scatter plot with linear regression lines",
       color = "Gender",
       shape = "Gender") +
  # adjust the colors 
  scale_color_manual(values = c("female" = "#FF69B4",
                                "male" = "#6495ED"),
                     labels = c("female" = "Female",
                                "male" = "Male")) +
  # adjust the shapes
  scale_shape_manual(values = c("female" = 16, 
                                "male" = 17),
                     labels = c("female" = "Female",
                                "male" = "Male")) +
  # display specific x and y coordinates without dropping data points (nb using `limits` drops data points, coord_cartesian does not) 
  coord_cartesian(xlim = c(1, 7),
                  ylim = c(0, 1))

# display plot below chunk
plot_1

# save plot to disk as pdf
ggsave(plot = plot_1,
       filename = "plots/plot_1.pdf", 
       width = 6,
       height = 5)

Note that you can add additional function calls to objects later, e.g., overriding the previous theme_ call with a new one:

# different theme
plot_1 + theme_barbie()

# split the plot across two facets
plot_1 + facet_wrap(~ gender)

14.7 Visualizing subsets and comparisons

Many experiments contain multiple conditions or subsets that you want to compare and contrast in your plot. There are several ways to do this.

Let’s start with a basic dot and whisker plot:

ggplot(data = data_after_exclusions, 
       aes(x = gender, 
           y = amp_score)) +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun.data = mean_cl_normal,  
               geom = "linerange") +
  labs(title = "Means and their 95% CIs",
       x = "Gender",
       y = "Mean AMP score") +
  theme_linedraw()

14.7.1 group or color aesthetics and dodging

If we want to contrast not only male vs female participants, but also the two age groups, we could add age_group as the color aesthetic. However, because these values share X axis values, they aren’t displayed clearly:

ggplot(data = data_after_exclusions, 
       aes(x = gender, 
           y = amp_score, 
           color = age_group)) +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun.data = mean_cl_normal,  
               geom = "linerange") +
  labs(title = "Means and their 95% CIs",
       x = "Gender",
       y = "Mean AMP score") +
  theme_linedraw()

One solution is to change the position of both the dots and the whiskers using position_dodge(). This offsets the groups by the width.

ggplot(data = data_after_exclusions, 
       aes(x = gender, 
           y = amp_score, 
           color = age_group)) +
  stat_summary(fun = mean,
               geom = "point",
               position = position_dodge(width = 0.4)) +
  stat_summary(fun.data = mean_cl_normal,  
               geom = "linerange",
               position = position_dodge(width = 0.4)) +
  labs(title = "Means and their 95% CIs",
       x = "Gender",
       y = "Mean AMP score") +
  theme_linedraw()

14.7.2 Faceting

Another option is to create separate subplots, or facets, for each group. This can be a better choice when including it all in one plot would be ‘over plotting’, i.e., where it makes the plot hard to understand. This isn’t the case here, but let’s compare the method anyway:

ggplot(data = data_after_exclusions, 
       aes(x = gender, 
           y = amp_score)) +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun.data = mean_cl_normal,  
               geom = "linerange") +
  labs(title = "Means and their 95% CIs",
       x = "Gender",
       y = "Mean AMP score") +
  theme_linedraw() +
  facet_wrap(~ age_group)

Note that if you have two categorical variables you want to facet by, one on the X axis of facets and one on the Y, you can use facet_grid(groupsA ~ groupsB).

14.8 Common errors

Limits vs coord_cartesian().

Note that any use of limits removes data before plotting, like using filter(). This can change regression lines or other summaries, which are based only on the plotted data points. In the below, notice how the slope of the regression line changes when using limits. If you want to avoid this, use coord_cartesian() instead. This zooms in or out to the specified coordinates to only plot that area, without actually filtering the underlying datapoints. Notice how the slope is unaffected when using this method.

ggplot(data = data_after_exclusions,
         aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm")

ggplot(data = data_after_exclusions,
         aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ylim(0, .5)

ggplot(data = data_after_exclusions,
         aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(0, .5))

14.9 Combining plots

In some cases, you don’t just want to facet plots but combine multiple plots, created with separate ggplot calls, into one plot. This helps avoid having to combine plots later in photoshop or other image editing software, which reduces reproducibilty. The patchwork library is extremely useful here.

plot_all <- data_after_exclusions |>
  ggplot(aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("All")

plot_women <- data_after_exclusions |>
  filter(gender == "female") |>
  ggplot(aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Women")

plot_men <- data_after_exclusions |>
  filter(gender == "male") |>
  ggplot(aes(x = mean_self_report,
             y = amp_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Men")

# combine these plots with different arrangements
plot_women + plot_men

plot_women + plot_men + plot_layout(ncol = 1)

plot_all / (plot_women + plot_men)

14.10 Exercise

Using the data_empathy_plotting dataset, create the following plots.

  • Plot 1: Create a density plot of the IRI scores, using different colors for the two timepoints. Clean up the axes, theme, colors, etc.
  • Plot 2: Wrangle the data to a wider format. Create a scatter plot with a linear regression line that predicts IRI scores at post based on IRI scores at pre. Clean up the axes, theme, colors, etc.
  • Combine both plots into one using the patchwork library. The density plot should be above the scatter plot, and it should be only half the height of the scatter plot.
  • Save the plot to disk as both a pdf and a png using ggsave(). Set the height and width of the saved plots, and the plots below the chunks, so that their proportions are reasonable.