6  Data transformation I

The next few chapters cover various {tidyverse} functions - sometimes called verbs - that are used for data transformation.

This chapter covers functions that are “non-aggregating”: they allow you to change which columns or rows are present in the dataset, to create new columns or split up existing ones, to change their contents, etc. rename(), which you learned in the previous chapter, is one of these functions.

6.1 Keeping or dropping columns with dplyr::select()

Not all columns in a dataset are useful to us. When processing data, we often wish to keep only some columns and drop others. We can do this with dplyr::select().

We’ll start where the last chapter ended, by loading dependencies, reading in the ‘data_demographics_raw_messy.csv’ dataset, and renaming the columns to make them easier to work with.

library(readr) # for read_csv()
library(janitor) # for clean_names()
library(dplyr) # for %>%
library(knitr) # for kable()
library(kableExtra) # for kable_extra()

dat_demographics_renamed <- 
  read_csv(file = "../data/raw/data_demographics_raw_messy.csv",
           skip = 2) %>%
  clean_names() %>%
  rename(id = subject_code, 
         block_trial = block_code_and_trial_number, 
         question = trial_code, 
         response = key_response_use_this, 
         rt = x0_ms_onset_rt)

dat_demographics_renamed %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1372
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
23.06.2022 548957868 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD 1 3946
23.06.2022 504546409 06.06.2000 demographics_3 prolific ID ad;oi98732paisdh 1 3724
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2576
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 2576
23.06.2022 994692692 06.06.2000 demographics_3 prolific ID 1023984cao982738 1 3050
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD 1 6887
23.06.2022 246532124 06.06.2000 demographics_2 prolific ID 2982879q8w7r 1 6887

If our data processing goal is to extract and tidy responses on the demographics questionnaire, we only need some of these columns: id, question, and response. We can use select() and the pipe to retain only these.

dat_demographics_selected <- dat_demographics_renamed %>%
  select(id, 
         question, 
         response)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id question response
548957868 age 23
548957868 gender female
548957868 psychiatric diagnosis schizophrenia
548957868 prolific ID asldkjaao87809
504546409 age 48
504546409 gender yes
504546409 psychiatric diagnosis OCD
504546409 prolific ID ad;oi98732paisdh
994692692 age 21
994692692 gender female
994692692 psychiatric diagnosis schizophrenia
994692692 prolific ID 1023984cao982738
246532124 age 999
246532124 gender Male
246532124 psychiatric diagnosis GAD
246532124 prolific ID 2982879q8w7r

What code would you write to select the ‘date’ and ‘correct’ columns?

dat_demographics_selected <- dat_demographics_renamed %>%
  select(date, 
         correct)

What code would you write to print all the columns present in ‘dat_demographics_renamed’ in a way you can easily paste into a select() call? This was covered in the previous chapter.

dat_demographics_renamed %>%
  colnames() %>%
  dput()
c("date", "id", "build", "block_trial", "question", "response", 
"correct", "rt")

6.1.1 Positive vs. negative selections

The above example tells select() the columns to retain. You could instead tell it which ones to drop. This is a negative selection: you preface the column name with -.

dat_demographics_selected <- dat_demographics_renamed %>%
  select(-date, 
         -build, 
         -block_trial, 
         -correct, 
         -rt)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id question response
548957868 age 23
548957868 gender female
548957868 psychiatric diagnosis schizophrenia
548957868 prolific ID asldkjaao87809
504546409 age 48
504546409 gender yes
504546409 psychiatric diagnosis OCD
504546409 prolific ID ad;oi98732paisdh
994692692 age 21
994692692 gender female
994692692 psychiatric diagnosis schizophrenia
994692692 prolific ID 1023984cao982738
246532124 age 999
246532124 gender Male
246532124 psychiatric diagnosis GAD
246532124 prolific ID 2982879q8w7r

It’s tempting to think in terms of negative selections, i.e., getting rid of columns you don’t want. However, you should in general use positive selections and avoid negative ones. If the underlying data changes in some way, e.g., in a prior processing step, it can contain variables your negative selection doesn’t account for. For example, if ‘dat_demographics_renamed’ now included another column, “completed_study” (TRUE/FALSE), the positive selection would not include this variable as it wasn’t listed, but the negative selection would as it was not specifically excluded. This can produce code that is more ‘fragile’ to changes in the data.

In general, tell select what you do want, not what you don’t want.

6.1.1.1 Exercise

Check your understanding of positive and negative selections:

  1. What would happen if you mix positive and negative selections? i.e., What columns will be present in the output when your run the code below?
dat_demographics_selected <- dat_demographics_renamed %>%
  select(id, 
         question, 
         response,
         -correct)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)
  1. Why are those columns present and not others?

Once positive selections are present in the select call, the negative ones are redundant. You’ve already told select() what to include, which implies what not to include.

6.1.2 {tidyselect} helper functions

Often, we have large datasets with many columns. select() calls can be made simpler with the help of another {tidyverse} package, {tidyselect}.

To illustrate this, we’ll use a simulated dataset called ‘dat_many_columns’ that contains, you guessed it, many columns.

Let’s start by printing the column names:

dat_many_columns %>%
  colnames() %>%
  dput()
c("id", "age", "gender", "condition", "depression_scale1_item1", 
"depression_scale1_item2", "depression_scale1_item3", "depression_scale1_item4", 
"depression_scale1_item5", "depression_scale1_item6", "depression_scale1_item7", 
"depression_scale1_item8", "depression_scale1_item9", "depression_scale1_item10", 
"depression_scale2_item1", "depression_scale2_item2", "depression_scale2_item3", 
"depression_scale2_item4", "depression_scale2_item5", "depression_scale2_item6", 
"depression_scale2_item7", "depression_scale3_item1", "depression_scale3_item2", 
"depression_scale3_item3", "depression_scale3_item4", "depression_scale3_item5", 
"depression_scale3_item6", "depression_scale3_item7", "depression_scale3_item8", 
"depression_scale3_item9", "depression_scale3_item10", "depression_scale3_item11", 
"depression_scale3_item12", "depression_scale3_item13", "depression_scale3_item14", 
"depression_scale3_item15")

6.1.2.1 Exercise

Write code to:

  1. Create a new data frame called ‘dat_fewer_columns’
  2. select() all the columns beginning with “depression”, using only positive selections
  3. Print the column names to confirm that these are the only ones present
dat_fewer_columns <- dat_many_columns %>%
  select(depression_scale1_item1, 
         depression_scale1_item2, 
         depression_scale1_item3, 
         depression_scale1_item4, 
         depression_scale1_item5, 
         depression_scale1_item6, 
         depression_scale1_item7, 
         depression_scale1_item8, 
         depression_scale1_item9, 
         depression_scale1_item10, 
         depression_scale2_item1, 
         depression_scale2_item2, 
         depression_scale2_item3, 
         depression_scale2_item4, 
         depression_scale2_item5, 
         depression_scale2_item6, 
         depression_scale2_item7, 
         depression_scale3_item1, 
         depression_scale3_item2, 
         depression_scale3_item3, 
         depression_scale3_item4, 
         depression_scale3_item5, 
         depression_scale3_item6, 
         depression_scale3_item7, 
         depression_scale3_item8, 
         depression_scale3_item9, 
         depression_scale3_item10, 
         depression_scale3_item11, 
         depression_scale3_item12, 
         depression_scale3_item13, 
         depression_scale3_item14, 
         depression_scale3_item15)

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("depression_scale1_item1", "depression_scale1_item2", "depression_scale1_item3", 
"depression_scale1_item4", "depression_scale1_item5", "depression_scale1_item6", 
"depression_scale1_item7", "depression_scale1_item8", "depression_scale1_item9", 
"depression_scale1_item10", "depression_scale2_item1", "depression_scale2_item2", 
"depression_scale2_item3", "depression_scale2_item4", "depression_scale2_item5", 
"depression_scale2_item6", "depression_scale2_item7", "depression_scale3_item1", 
"depression_scale3_item2", "depression_scale3_item3", "depression_scale3_item4", 
"depression_scale3_item5", "depression_scale3_item6", "depression_scale3_item7", 
"depression_scale3_item8", "depression_scale3_item9", "depression_scale3_item10", 
"depression_scale3_item11", "depression_scale3_item12", "depression_scale3_item13", 
"depression_scale3_item14", "depression_scale3_item15")

This can be done more easily with tidyselect::starts_with():

library(tidyselect)

dat_fewer_columns <- dat_many_columns %>%
  select(starts_with("depression"))

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("depression_scale1_item1", "depression_scale1_item2", "depression_scale1_item3", 
"depression_scale1_item4", "depression_scale1_item5", "depression_scale1_item6", 
"depression_scale1_item7", "depression_scale1_item8", "depression_scale1_item9", 
"depression_scale1_item10", "depression_scale2_item1", "depression_scale2_item2", 
"depression_scale2_item3", "depression_scale2_item4", "depression_scale2_item5", 
"depression_scale2_item6", "depression_scale2_item7", "depression_scale3_item1", 
"depression_scale3_item2", "depression_scale3_item3", "depression_scale3_item4", 
"depression_scale3_item5", "depression_scale3_item6", "depression_scale3_item7", 
"depression_scale3_item8", "depression_scale3_item9", "depression_scale3_item10", 
"depression_scale3_item11", "depression_scale3_item12", "depression_scale3_item13", 
"depression_scale3_item14", "depression_scale3_item15")

There are many other {tidyselect} helper functions including:

  • starts_with() : Selects columns ending in a character string
  • ends_with() : Selects columns ending in a character string
  • contains() : Selects columns containing a character string, not necessarily at the start or end
  • all_of() : Selects all of the column names in a vector e.g., all_of(c("a", "b"))
  • any_of() : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
  • where() : Select columns that meet a criterion, e.g., where(is.numeric) selects only columns containing numeric data.
  • everything() : Selects all columns. This will seem redundant right now, but will become useful later on when we cover pivot_ functions

6.1.2.2 Exercise

Write code to select all columns belonging to scale 2 from the ‘dat_many_columns’ data frame.

dat_fewer_columns <- dat_many_columns %>%
  select(contains("scale2"))

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("depression_scale2_item1", "depression_scale2_item2", "depression_scale2_item3", 
"depression_scale2_item4", "depression_scale2_item5", "depression_scale2_item6", 
"depression_scale2_item7")

6.1.2.3 Exercise

Write code to select from the ‘dat_many_columns’ data frame the “id”, “age” and “gender” columns AND all columns related to item3.

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, contains("item3"))

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("id", "age", "gender", "depression_scale1_item3", "depression_scale2_item3", 
"depression_scale3_item3")

6.1.2.4 Exercise

Write code to select the “id”, “age” and “gender” columns AND all columns related to item1.

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, ends_with("item1"))

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("id", "age", "gender", "depression_scale1_item1", "depression_scale2_item1", 
"depression_scale3_item1")

6.1.3 Selection with set operations

Negative selection using - is selection using a “set operation” or a set of logical rules that determine which column names are in the selected set. There are others:

  • Set operations:
    • - negative selection
    • c() : combine selections
    • ! invert selection
    • & must meet both selection rules
    • | can meet either selection criteria

For example: select columns that refer to scale 3 AND items >=10:

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, 
         # items containing scale3 AND item1 but NOT item 1
         c(contains("scale3") & contains("item1") & !ends_with("item1")))

dat_fewer_columns %>%
  colnames() %>%
  dput()
c("id", "age", "gender", "depression_scale3_item10", "depression_scale3_item11", 
"depression_scale3_item12", "depression_scale3_item13", "depression_scale3_item14", 
"depression_scale3_item15")

6.1.3.1 Exercise

To practice {tidyselect} helpers and selection set operations, write code in a creative and unnecessarily complex way to select the columns “id”, “gender”, and “age”, and the items from scale 1 and scale 3 that whose items are <= 10.

Note: no solution provided here as many different ones are possible.

6.1.4 Practicing select and the pipe in a longer chunk

6.1.4.1 Exercise

The first chunk in this chapter read in the ‘data_demographics_raw_messy.csv’ file and renamed columns.

  • Rewrite that chunk here - as much as possible from memory, without simply copy pasting.
  • Add another pipe and selecting the “id”, “question” and “response” columns.
  • Print a table of the resulting data frame using kable() and kable_classic()

Note: no solution provided here to reduce the temptation to peek or copy-paste.

6.2 Moving columns with dplyr::relocate()

You can reorder to columns within a data frame with relocate().

Unlike rename(), select(), and other functions, only one column can be relocated with each relocate() call.

You can relocate columns before or after others, or to locations such as being the first or last column. This code moves the “response” column to be before the “question” column, and the “correct” column to be the last one.

dat_demographics_renamed %>%
  relocate(response, .before = "question") %>%
  relocate(correct, .after = last_col()) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial response question rt correct
23.06.2022 548957868 06.06.2000 demographics_2 23 age 1372 1
23.06.2022 548957868 06.06.2000 demographics_3 female gender 619 1
23.06.2022 548957868 06.06.2000 demographics_2 schizophrenia psychiatric diagnosis 1372 1
23.06.2022 548957868 06.06.2000 demographics_3 asldkjaao87809 prolific ID 619 1
23.06.2022 504546409 06.06.2000 demographics_2 48 age 3946 1
23.06.2022 504546409 06.06.2000 demographics_3 yes gender 3724 1

6.2.0.1 Exercise

What code is needed to relocate the columns of ‘dat_demographics_renamed’ so that they are in the order “id”, “question”, “response”, and then the remaining columns.

dat_demographics_reordered <- dat_demographics_renamed %>%
  relocate(id, .before = 1) %>% 
  relocate(question, .after = "id") %>%
  relocate(response, .after = "question")

dat_demographics_reordered %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id question response date build block_trial correct rt
548957868 age 23 23.06.2022 06.06.2000 demographics_2 1 1372
548957868 gender female 23.06.2022 06.06.2000 demographics_3 1 619
548957868 psychiatric diagnosis schizophrenia 23.06.2022 06.06.2000 demographics_2 1 1372
548957868 prolific ID asldkjaao87809 23.06.2022 06.06.2000 demographics_3 1 619
504546409 age 48 23.06.2022 06.06.2000 demographics_2 1 3946
504546409 gender yes 23.06.2022 06.06.2000 demographics_3 1 3724

6.2.0.2 Exercise

All three of the solutions below accomplish the goal of making the “id” column the first column:

# move "id" to be the first column by placing it before "date"
dat_demographics_renamed %>%
  relocate(id, .before = "date") %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id date build block_trial question response correct rt
548957868 23.06.2022 06.06.2000 demographics_2 age 23 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 gender female 1 619
548957868 23.06.2022 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619
504546409 23.06.2022 06.06.2000 demographics_2 age 48 1 3946
504546409 23.06.2022 06.06.2000 demographics_3 gender yes 1 3724
# move "date" to be the second column by placing it after "id", producing the same result
dat_demographics_renamed %>%
  relocate(date, .after = "id") %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id date build block_trial question response correct rt
548957868 23.06.2022 06.06.2000 demographics_2 age 23 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 gender female 1 619
548957868 23.06.2022 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619
504546409 23.06.2022 06.06.2000 demographics_2 age 48 1 3946
504546409 23.06.2022 06.06.2000 demographics_3 gender yes 1 3724
# move "id" to be the 1st column by placing it before the first column by location, producing the same result
dat_demographics_renamed %>%
  relocate(id, .before = 1) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id date build block_trial question response correct rt
548957868 23.06.2022 06.06.2000 demographics_2 age 23 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 gender female 1 619
548957868 23.06.2022 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
548957868 23.06.2022 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619
504546409 23.06.2022 06.06.2000 demographics_2 age 48 1 3946
504546409 23.06.2022 06.06.2000 demographics_3 gender yes 1 3724


Which of the above solutions is the least ‘fragile’ solution, i.e., that is robust to changes in the data or elsewhere in your data processing code?

relocate(id, .before = 1) does not rely on the name and location of another column and is therefore the least fragile solution.

6.3 Renaming, relocating, and selecting all-in-one with select()

Renaming, relocating, and selecting columns are so common as a set of three tasks that you can do it all within select().

Whatever order columns appear in select() they will be relocated to. You can rename variables inside select just as you do in rename() using new_name = old_name.

dat_demographics_renamed %>%
  select(id, date, question, answer = response) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id date question answer
548957868 23.06.2022 age 23
548957868 23.06.2022 gender female
548957868 23.06.2022 psychiatric diagnosis schizophrenia
548957868 23.06.2022 prolific ID asldkjaao87809
504546409 23.06.2022 age 48
504546409 23.06.2022 gender yes

6.4 Keeping or dropping rows with dplyr::filter()

Not all rows in a dataset are useful to us. When processing data, we often wish to keep only some rows and drop others. We can do this with dplyr::filter().

It is important to learn the difference between select() and filter(), it is initially confusing.

  • select() keeps or drops columns
  • filter() keeps or drops rows

For example, filter the ‘question’ column to contain only “age” items:

dat_filtered_age <- dat_demographics_renamed %>%
  filter(question == "age")

dat_filtered_age %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1372
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3946
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2576
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4311

6.4.0.1 Exercise

Write code to:

  • Take the ‘dat_demographics_renamed’ data frame.
  • Create two data frames called ‘dat_filtered_age’ and ‘dat_filtered_gender’.
  • Retain only the relevant rows of ‘question’ for each.
  • Rename the ‘response’ columns to what variable it now contains.
  • Drop the ‘question’ columns since they’re now irrelevant.
  • Print a table of each data frame.
dat_filtered_age <- dat_demographics_renamed %>%
  filter(question == "age") %>%
  rename(age = response) %>%
  select(-question)

dat_filtered_age %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial age correct rt
23.06.2022 548957868 06.06.2000 demographics_2 23 1 1372
23.06.2022 504546409 06.06.2000 demographics_2 48 1 3946
23.06.2022 994692692 06.06.2000 demographics_2 21 1 2576
23.06.2022 246532124 06.06.2000 demographics_2 999 1 4311
dat_filtered_gender <- dat_demographics_renamed %>%
  filter(question == "gender") %>%
  rename(gender = response) %>%
  select(-question)

dat_filtered_gender %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial gender correct rt
23.06.2022 548957868 06.06.2000 demographics_3 female 1 619
23.06.2022 504546409 06.06.2000 demographics_3 yes 1 3724
23.06.2022 994692692 06.06.2000 demographics_3 female 1 3050
23.06.2022 246532124 06.06.2000 demographics_3 Male 1 2793

6.4.1 Filters with set operations

You can specify the logical test for filtering in many ways, including:

  • == : ‘is exactly equal to this one value’
  • != : ’is not exactly equal to
  • %in% : ‘is exactly equal to one of the following values’

Note that equivalence is two equals signs (==) whereas setting arguments in functions is one (=).

# '==' exactly equal to
dat_filtered_gender %>%
  filter(gender == "female") %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial gender correct rt
23.06.2022 548957868 06.06.2000 demographics_3 female 1 619
23.06.2022 994692692 06.06.2000 demographics_3 female 1 3050
# '!=' is not exactly equal to
dat_filtered_gender %>%
  filter(gender != "female") %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial gender correct rt
23.06.2022 504546409 06.06.2000 demographics_3 yes 1 3724
23.06.2022 246532124 06.06.2000 demographics_3 Male 1 2793
# '%in%' is exactly equal to one of the following
dat_filtered_gender %>%
  filter(gender %in% c("female", "male")) %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial gender correct rt
23.06.2022 548957868 06.06.2000 demographics_3 female 1 619
23.06.2022 994692692 06.06.2000 demographics_3 female 1 3050

Weirdly, there is no single function that provides the combination of %in% and !=, i.e., ‘is not exactly equal to any of the following values’. This can be achieved with the following, although it is not intuitive. Note that it still returns rows with “Male” because we only told it to exclude rows with “male” (lower case).

# '!' and '%in%'
dat_filtered_gender %>%
  filter(!gender %in% c("female", "male")) %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial gender correct rt
23.06.2022 504546409 06.06.2000 demographics_3 yes 1 3724
23.06.2022 246532124 06.06.2000 demographics_3 Male 1 2793

You can also combine multiple tests with various operators:

  • & : AND, i.e., meeting both criterion.
  • | : OR, i.e., meeting either criterion.
# & criterion X and Y
dat_demographics_renamed %>%
  filter(question == "gender" & response == "female")  %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050
# | criterion X or Y
dat_demographics_renamed %>%
  filter(question == "gender" | response == "female")  %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3724
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2793

6.4.1.1 Exercise

Write code to:

  • Take the ‘dat_demographics_renamed’ data frame.
  • Create the data frame ‘dat_filtered_rt’.
  • Retain only rows where where reaction time (‘re’) is more than 500 ms.
  • Print a table of each data frame.
dat_filtered_rt <- dat_demographics_renamed %>%
  filter(rt > 500) 

dat_filtered_rt %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1372
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
23.06.2022 548957868 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD 1 3946
23.06.2022 504546409 06.06.2000 demographics_3 prolific ID ad;oi98732paisdh 1 3724
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2576
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 2576
23.06.2022 994692692 06.06.2000 demographics_3 prolific ID 1023984cao982738 1 3050
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD 1 6887
23.06.2022 246532124 06.06.2000 demographics_2 prolific ID 2982879q8w7r 1 6887

6.4.1.2 Exercise

Write code to:

  • Take the ‘dat_demographics_renamed’ data frame.
  • Create the data frame ‘dat_filtered_rt’.
  • Retain only rows where where both reaction time (‘rt’) is more than 500 ms and gender is either ‘male’, ‘female’, or ‘non-binary’.
  • Print a table of each data frame.
dat_filtered_rt <- dat_demographics_renamed %>%
  filter(rt > 500 & question == 'gender' & response %in% c('male', 'female', 'non-binary')) 

dat_filtered_rt %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050

6.4.1.3 Exercise

In the above answer, why might it be a good idea to include question == 'gender'?

Because the requirement was “…and gender is either ‘male’, ‘female’, or ‘non-binary’”.

If participants responded to another question (e.g., ‘age’) with ‘male’, ‘female’, or ‘non-binary’, their rows would also be returned.

Being specific to the requirements makes your code less ‘brittle’ and more likely to catch weird cases in larger datasets.

6.4.2 Positive vs. negative filters

As with select(), it is often better to define what you do want (== and %>%) rather than what you do not want (!= and !+%in%). This is even more important with filter() as unexpected row values are much more common than unexpected column names.

For example, if you specify a negative filter dat %>% filter(gender != "non-binary)" because you wish to filter to return only participants who identify as ‘male’ or ‘female’, you have in fact only excluded participants who specified ‘non-binary’. This code would also return participants with other response options such as ‘non binary’, ‘Non-Binary’, ‘NB’, etc.

However, there can be cases where you specifically do not want some rows or want to retain anything other than a given value. Protecting participant privacy is one of them.

Some rows of the ‘dat_demographics_renamed’ data frame contain “prolific ID” values, which could be linked back to the individual’s Prolific.com account where they completed the study, and therefore risk exposing identifying information. If this data was shared with other researchers or publicly, this could violate the informed consent agreement and/or local data privacy laws. (Note that the prolific IDs included in this data frame are made up.) This is discussed in more detail in the chapter on Data Privacy.

In this case, you would likely need to remove all rows referring to Prolific ID with a negative filter:

dat_demographics_anonymized <- dat_demographics_renamed %>%
  filter(question != "prolific ID")
  
dat_demographics_anonymized %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1372
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD 1 3946
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2576
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 2576
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD 1 6887

Note that the original ‘data_demographics_raw_messy.csv’ data file we originally read in to R still contains the Prolific ID codes! Again, see the Data Privacy chapter for how you would deal with this more thoroughly in your own projects.

6.5 Creating and modifying columns with dplyr::mutate()

mutate() is used to create new columns or to change the contents of existing ones.

For example, we can create the ‘rt_seconds’ column by dividing the ‘rt’ column, which is in milliseconds, by 1000.

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(rt_seconds = rt/1000)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt rt_seconds
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1372 1.372
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 619 0.619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1372 1.372
23.06.2022 548957868 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 619 0.619
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3946 3.946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3724 3.724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD 1 3946 3.946
23.06.2022 504546409 06.06.2000 demographics_3 prolific ID ad;oi98732paisdh 1 3724 3.724
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2576 2.576
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3050 3.050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 2576 2.576
23.06.2022 994692692 06.06.2000 demographics_3 prolific ID 1023984cao982738 1 3050 3.050
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4311 4.311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2793 2.793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD 1 6887 6.887
23.06.2022 246532124 06.06.2000 demographics_2 prolific ID 2982879q8w7r 1 6887 6.887

We can also alter existing columns through self-assignment, i.e., mutate()-ing a column based on itself. For example, modifying ‘rt’ to be in seconds by overwriting the existing ‘rt’ column:

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(rt = rt/1000)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 1 1.372
23.06.2022 548957868 06.06.2000 demographics_3 gender female 1 0.619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 1.372
23.06.2022 548957868 06.06.2000 demographics_3 prolific ID asldkjaao87809 1 0.619
23.06.2022 504546409 06.06.2000 demographics_2 age 48 1 3.946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes 1 3.724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD 1 3.946
23.06.2022 504546409 06.06.2000 demographics_3 prolific ID ad;oi98732paisdh 1 3.724
23.06.2022 994692692 06.06.2000 demographics_2 age 21 1 2.576
23.06.2022 994692692 06.06.2000 demographics_3 gender female 1 3.050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia 1 2.576
23.06.2022 994692692 06.06.2000 demographics_3 prolific ID 1023984cao982738 1 3.050
23.06.2022 246532124 06.06.2000 demographics_2 age 999 1 4.311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male 1 2.793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD 1 6.887
23.06.2022 246532124 06.06.2000 demographics_2 prolific ID 2982879q8w7r 1 6.887

We can also change types, for example, converting the ‘correct’ column from numeric (0/1) to logical (TRUE/FALSE).

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(correct = as.logical(correct))

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)
date id build block_trial question response correct rt
23.06.2022 548957868 06.06.2000 demographics_2 age 23 TRUE 1372
23.06.2022 548957868 06.06.2000 demographics_3 gender female TRUE 619
23.06.2022 548957868 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia TRUE 1372
23.06.2022 548957868 06.06.2000 demographics_3 prolific ID asldkjaao87809 TRUE 619
23.06.2022 504546409 06.06.2000 demographics_2 age 48 TRUE 3946
23.06.2022 504546409 06.06.2000 demographics_3 gender yes TRUE 3724
23.06.2022 504546409 06.06.2000 demographics_2 psychiatric diagnosis OCD TRUE 3946
23.06.2022 504546409 06.06.2000 demographics_3 prolific ID ad;oi98732paisdh TRUE 3724
23.06.2022 994692692 06.06.2000 demographics_2 age 21 TRUE 2576
23.06.2022 994692692 06.06.2000 demographics_3 gender female TRUE 3050
23.06.2022 994692692 06.06.2000 demographics_2 psychiatric diagnosis schizophrenia TRUE 2576
23.06.2022 994692692 06.06.2000 demographics_3 prolific ID 1023984cao982738 TRUE 3050
23.06.2022 246532124 06.06.2000 demographics_2 age 999 TRUE 4311
23.06.2022 246532124 06.06.2000 demographics_3 gender Male TRUE 2793
23.06.2022 246532124 06.06.2000 demographics_2 psychiatric diagnosis GAD TRUE 6887
23.06.2022 246532124 06.06.2000 demographics_2 prolific ID 2982879q8w7r TRUE 6887

6.5.1 Longer pipes

A single mutate call can contain multiple mutates. The code from the last chunk could be written more simply like this:

dat_demographics_mutated <- dat_demographics_renamed %>%
  # select only columns of interest
  select(id, question, response, rt, correct) %>%
  # filter only rows of interest
  filter(question == "age") %>%
  # rename columns to make them more intuitive
  rename(age = response) %>%
  # mutate variables
  mutate(rt_ms = rt/1000,
         correct = as.logical(correct)) %>%
  # select needed columns again
  select(id, age, rt_ms, correct)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id age rt_ms correct
548957868 23 1.372 TRUE
504546409 48 3.946 TRUE
994692692 21 2.576 TRUE
246532124 999 4.311 TRUE

6.5.1.1 Exercise

Write code to:

  • Take the ‘dat_demographics_renamed’.
  • Create a data frame named ‘dat_demographics_mutated’.
  • Retain only the gender question rows.
  • Rename ‘response’ to ‘gender’.
  • Use the stringr::str_to_lower() function to convert ‘gender’ to lower case strings’.
  • Retain only the id and gender columns.
  • Print the data frame.
library(stringr)

dat_demographics_mutated <- dat_demographics_renamed %>%
  filter(question == 'gender') %>%
  rename(gender = response) %>%
  mutate(gender = stringr::str_to_lower(gender)) %>%
  select(id, gender)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)
id gender
548957868 female
504546409 yes
994692692 female
246532124 male

6.5.2 More complex mutates

Most research reports (theses, articles, etc.) report the percentage of the sample that was male, female, etc. For example, the ‘dat_demographics_messy’ contains messy age and gender data for 500 participants.

We’ll use count() to show every unique response across participants. We’ll return to learn to use count() properly in the next chapter. For the moment, we’re more interested in the output than the code.

dat_demographics_messy %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)
gender n
female 297
male 123
nonbinary 15
Female 8
FEMALE 5
femail 4
25 3
27 3
30 3
43 3
45 3
fem ale 3
femle 3
malee 3
18 2
26 2
38 2
40 2
MalE 2
femlae 2
mal 2
19 1
22 1
23 1
31 1
36 1
41 1
42 1
FeMale 1
ma le 1
non binary 1

Note how the data currently contains many misspellings and use of different cases that cause things like ‘female’ to appear as a different category to ‘Female’, etc.

6.5.2.1 mutate() and working with character strings

Working with character strings using {stringr} and ‘regular expressions’ (regex) is a large separate topic in itself.

For the moment, know that you can:

  • Use functions like str_to_lower() to convert all text to lower case (as in the above example).
  • Use str_remove_all() to remove everything except letters (str_remove_all(x, "[^A-Za-z]")) or everything except numbers (str_remove_all(x, "[^0-9\\.]")).

Above, you used {stringr}’s str_to_lower() function to convert all letters to lower case. Let’s apply that here to quickly improve things:

dat_demographics_messy %>%
  # convert to lower case
  mutate(gender = str_to_lower(gender)) %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)
gender n
female 311
male 125
nonbinary 15
femail 4
25 3
27 3
30 3
43 3
45 3
fem ale 3
femle 3
malee 3
18 2
26 2
38 2
40 2
femlae 2
mal 2
19 1
22 1
23 1
31 1
36 1
41 1
42 1
ma le 1
non binary 1

Next we’ll use str_remove_all(x, "[^A-Za-z]") to remove everything except letters:

dat_demographics_messy %>%
  mutate(
    # convert to lower case
    gender = str_to_lower(gender),
    # remove all non-letters
    gender = str_remove_all(gender, "[^A-Za-z]")
  ) %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)
gender n
female 314
male 126
30
nonbinary 16
femail 4
femle 3
malee 3
femlae 2
mal 2

6.5.2.2 mutate() and conditionals rules

6.5.2.2.1 Using dplyr::if_else()

If we were being somewhat lazy, we could keep just the three most common self-reported responses to the gender, i.e. ‘male’, ‘female’ and ‘non-binary’, and change the other responses to NA.

Many forms of data cleaning or ‘wrangling’ involve applying conditional rules like this to clean up irregular responses, i.e., “if gender is not ‘male’, ‘female’ or ‘non-binary’, then set it to NA”.

The simplest of these is dplyr::if_else(), which applies just one if-else. Note that there is a similar function in Base-R, ifelse(), but for technical reasons this function is much less safe and can cause weird errors. Try to always use if_else().

dat_gender_tidier <- dat_demographics_messy %>%
  # convert to lower case and remove non-letters
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[^A-Za-z]")) %>%
  # tidy up cases
  mutate(gender = if_else(condition = gender %in% c("female", "male", "nonbinary"), # the logical test applied: is 'gender' one of the following
                          true = gender, # what to do if the test is passed: keep the original gender response
                          false = NA_character_)) # what to do if the test is failed: set it to NA

dat_gender_tidier %>%
  count(gender) %>%
  arrange(desc(n)) %>%
  kable() %>%
  kable_classic(full_width = FALSE)
gender n
female 314
male 126
NA 44
nonbinary 16

This somewhat quick and dirty solution gives us just four gender categories, ‘male’, ‘female’, ‘non-binary’ and missing (NA).

Of course, these responses were (mostly) not really missing, we removed them to make the responses simpler. This isn’t very good practice, as we’re throwing away data.

6.5.2.2.2 Using dplyr::case_when()

case_when() allows you to check against multiple logical tests. It works like this:

dat %>%
  # create the variable variable_to_create based on:
  mutate(variable_to_create = case_when(
    # if logical_test_1 is met, set variable_to_create to value_1
    logical_test_1 ~ value_1, 
    # if logical_test_2 is met, set variable_to_create to value_2
    logical_test_2 ~ value_2, 
    # if logical_test_3 is met, set variable_to_create to value_3
    logical_test_3 ~ value_3, 
    # 'if none of the above are met, set to the default value value_4'
    TRUE ~ value_4
  ))

Note that the logical tests are checked in order, so you often have to consider the order in which they are listed and run.

We can apply it like this:

dat_gender_tidier <- dat_demographics_messy %>%
  # convert to lower case and remove non-letters
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[^A-Za-z]")) %>%
  # tidy up cases
  mutate(gender = case_when(gender %in% c("female", "male", "nonbinary") ~ gender, 
                            gender %in% c("femail", "fem ale", "femle", "femlae") ~ "female",
                            gender %in% c("malee", "mal", "ma le") ~ "male",
                            gender %in% c("non binary", "nonbinary") ~ "non-binary",
                            gender == "" ~ NA_character_, # if an empty character string, change to NA_character_
                            TRUE ~ gender)) # if none of the above apply, keep original value

dat_gender_tidier %>%
  count(gender) %>%
  arrange(desc(n)) %>%
  kable() %>%
  kable_classic(full_width = FALSE)
gender n
female 323
male 131
NA 30
nonbinary 16

Note that the contents of the case_when() conditions have to be constructed iteratively. Use distinct() to inspect what cases are present, then write code to merge them as desired.

6.6 Exercises

6.6.1 select() vs. filter()

What is the difference between select and filter?

select() is for columns, filter() is for rows.

6.6.2 {tidyselect} helper functions

Name some helper functions from {tidyselect}.

  • starts_with()
  • ends_with()
  • contains()
  • all_of()
  • any_of()
  • where()
  • everything()

Explain what each of them do.

  • starts_with() : Selects columns ending in a character string
  • ends_with() : Selects columns ending in a character string
  • contains() : Selects columns containing a character string, not necessarily at the start or end
  • all_of() : Selects all of the column names in a vector e.g., all_of(c("a", "b"))
  • any_of() : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
  • where() : Select columns that meet a criterion, e.g., where(is.numeric) selects only columns containing numeric data.
  • everything() : Selects all columns.

6.6.3 Positive vs. negative selections and filters

Explain the difference between positive and negative selections in select() and filter() and how they are done in each.

  • In select(), positive selections mean keep these columns (e.g., select(mpg, cyl)), while negative selections mean drop these columns (e.g., select(-mpg, -cyl)).
  • In filter(), conditions act positively by default (e.g., filter(mpg > 20) keeps rows with mpg > 20). To “negate” a condition, you use ! or inequality operators (e.g., filter(!(mpg > 20)) or filter(mpg <= 20)).

6.6.4 Set operators

What set operators can be used with select() and how?

select() supports set operations on columns using tidyselect helpers:

  • - : negative selection (drop columns)
  • c() : combine multiple selections (union)
  • ! : invert a selection rule
  • & : must meet both selection rules (intersection)
  • | : can meet either selection rule (union)

Example:

select(dat, contains("scale3") & contains("item1") & !ends_with("item1"))

What set operators can be used with filter() and how?

With filter(), set operations are expressed with logical operators applied to rows:

  • & : keeps rows that satisfy all conditions (intersection)
  • | : keeps rows that satisfy any condition (union)
  • ! : excludes rows that meet a condition (negation/difference)

Example:

dat %>%
  filter(gender == "female" & age > 30 & !is.na(income))

6.6.5 Interactive exercises

Complete the interactive exercises for:

6.6.6 Practice data transformation using multiple functions

In your local version of this .qmd file:

  • Use the ‘dat_demographics_messy’ to create a new data frame called ‘dat_age_tidier’
  • Use disinct() to understand what unique responses are present in the ‘age’ column.
  • Use the other functions you learned in this chapter to clean the age column so that it contains only numbers.
  • Convert the age variable to a numeric variable with as.numeric().
  • Print the unique responses present in the processed ‘age’ column as a table.