6 Data transformation I

The next few chapters cover various {tidyverse} functions - sometimes called verbs - that are used for data transformation.

This chapter covers functions that are “non-aggregating”: they allow you to change which columns or rows are present in the dataset, to create new columns or split up existing ones, to change their contents, etc. rename(), which you learned in the previous chapter, is one of these functions.

6.1 Keeping or dropping columns with `dplyr::select()`

Not all columns in a dataset are useful to us. When processing data, we often wish to keep only some columns and drop others. We can do this with dplyr::select().

We’ll start where the last chapter ended, by loading dependencies, reading in the ‘data_demographics_raw_messy.csv’ dataset, and renaming the columns to make them easier to work with.

library(readr) # for read_csv()
library(janitor) # for clean_names()
library(dplyr) # for %>%
library(knitr) # for kable()
library(kableExtra) # for kable_extra()

dat_demographics_renamed <- 
  read_csv(file = "../data/raw/data_demographics_raw_messy.csv",
           skip = 2) %>%
  clean_names() %>%
  rename(id = subject_code, 
         block_trial = block_code_and_trial_number, 
         question = trial_code, 
         response = key_response_use_this, 
         rt = x0_ms_onset_rt)

dat_demographics_renamed %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
23.06.2022	548957868	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	1	3946
23.06.2022	504546409	06.06.2000	demographics_3	prolific ID	ad;oi98732paisdh	1	3724
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	2576
23.06.2022	994692692	06.06.2000	demographics_3	prolific ID	1023984cao982738	1	3050
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	1	6887
23.06.2022	246532124	06.06.2000	demographics_2	prolific ID	2982879q8w7r	1	6887

If our data processing goal is to extract and tidy responses on the demographics questionnaire, we only need some of these columns: id, question, and response. We can use select() and the pipe to retain only these.

dat_demographics_selected <- dat_demographics_renamed %>%
  select(id, 
         question, 
         response)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	question	response
548957868	age	23
548957868	gender	female
548957868	psychiatric diagnosis	schizophrenia
548957868	prolific ID	asldkjaao87809
504546409	age	48
504546409	gender	yes
504546409	psychiatric diagnosis	OCD
504546409	prolific ID	ad;oi98732paisdh
994692692	age	21
994692692	gender	female
994692692	psychiatric diagnosis	schizophrenia
994692692	prolific ID	1023984cao982738
246532124	age	999
246532124	gender	Male
246532124	psychiatric diagnosis	GAD
246532124	prolific ID	2982879q8w7r

What code would you write to select the ‘date’ and ‘correct’ columns?

Click to show answer

dat_demographics_selected <- dat_demographics_renamed %>%
  select(date, 
         correct)

What code would you write to print all the columns present in ‘dat_demographics_renamed’ in a way you can easily paste into a select() call? This was covered in the previous chapter.

Click to show answer

dat_demographics_renamed %>%
  colnames() %>%
  dput()

c("date", "id", "build", "block_trial", "question", "response", 
"correct", "rt")

6.1.1 Positive vs. negative selections

The above example tells select() the columns to retain. You could instead tell it which ones to drop. This is a negative selection: you preface the column name with -.

dat_demographics_selected <- dat_demographics_renamed %>%
  select(-date, 
         -build, 
         -block_trial, 
         -correct, 
         -rt)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	question	response
548957868	age	23
548957868	gender	female
548957868	psychiatric diagnosis	schizophrenia
548957868	prolific ID	asldkjaao87809
504546409	age	48
504546409	gender	yes
504546409	psychiatric diagnosis	OCD
504546409	prolific ID	ad;oi98732paisdh
994692692	age	21
994692692	gender	female
994692692	psychiatric diagnosis	schizophrenia
994692692	prolific ID	1023984cao982738
246532124	age	999
246532124	gender	Male
246532124	psychiatric diagnosis	GAD
246532124	prolific ID	2982879q8w7r

It’s tempting to think in terms of negative selections, i.e., getting rid of columns you don’t want. However, you should in general use positive selections and avoid negative ones. If the underlying data changes in some way, e.g., in a prior processing step, it can contain variables your negative selection doesn’t account for. For example, if ‘dat_demographics_renamed’ now included another column, “completed_study” (TRUE/FALSE), the positive selection would not include this variable as it wasn’t listed, but the negative selection would as it was not specifically excluded. This can produce code that is more ‘fragile’ to changes in the data.

In general, tell select what you do want, not what you don’t want.

6.1.1.1 Exercise

Check your understanding of positive and negative selections:

What would happen if you mix positive and negative selections? i.e., What columns will be present in the output when your run the code below?

dat_demographics_selected <- dat_demographics_renamed %>%
  select(id, 
         question, 
         response,
         -correct)

dat_demographics_selected %>%
  kable() %>%
  kable_classic(full_width = FALSE)

Click to show answer

Why are those columns present and not others?

Click to show answer

Once positive selections are present in the select call, the negative ones are redundant. You’ve already told select() what to include, which implies what not to include.

6.1.2 {tidyselect} helper functions

Often, we have large datasets with many columns. select() calls can be made simpler with the help of another {tidyverse} package, {tidyselect}.

To illustrate this, we’ll use a simulated dataset called ‘dat_many_columns’ that contains, you guessed it, many columns.

Let’s start by printing the column names:

dat_many_columns %>%
  colnames() %>%
  dput()

c("id", "age", "gender", "condition", "depression_scale1_item1", 
"depression_scale1_item2", "depression_scale1_item3", "depression_scale1_item4", 
"depression_scale1_item5", "depression_scale1_item6", "depression_scale1_item7", 
"depression_scale1_item8", "depression_scale1_item9", "depression_scale1_item10", 
"depression_scale2_item1", "depression_scale2_item2", "depression_scale2_item3", 
"depression_scale2_item4", "depression_scale2_item5", "depression_scale2_item6", 
"depression_scale2_item7", "depression_scale3_item1", "depression_scale3_item2", 
"depression_scale3_item3", "depression_scale3_item4", "depression_scale3_item5", 
"depression_scale3_item6", "depression_scale3_item7", "depression_scale3_item8", 
"depression_scale3_item9", "depression_scale3_item10", "depression_scale3_item11", 
"depression_scale3_item12", "depression_scale3_item13", "depression_scale3_item14", 
"depression_scale3_item15")

6.1.2.1 Exercise

Write code to:

Create a new data frame called ‘dat_fewer_columns’
select() all the columns beginning with “depression”, using only positive selections
Print the column names to confirm that these are the only ones present

Click to show answer

dat_fewer_columns <- dat_many_columns %>%
  select(depression_scale1_item1, 
         depression_scale1_item2, 
         depression_scale1_item3, 
         depression_scale1_item4, 
         depression_scale1_item5, 
         depression_scale1_item6, 
         depression_scale1_item7, 
         depression_scale1_item8, 
         depression_scale1_item9, 
         depression_scale1_item10, 
         depression_scale2_item1, 
         depression_scale2_item2, 
         depression_scale2_item3, 
         depression_scale2_item4, 
         depression_scale2_item5, 
         depression_scale2_item6, 
         depression_scale2_item7, 
         depression_scale3_item1, 
         depression_scale3_item2, 
         depression_scale3_item3, 
         depression_scale3_item4, 
         depression_scale3_item5, 
         depression_scale3_item6, 
         depression_scale3_item7, 
         depression_scale3_item8, 
         depression_scale3_item9, 
         depression_scale3_item10, 
         depression_scale3_item11, 
         depression_scale3_item12, 
         depression_scale3_item13, 
         depression_scale3_item14, 
         depression_scale3_item15)

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("depression_scale1_item1", "depression_scale1_item2", "depression_scale1_item3", 
"depression_scale1_item4", "depression_scale1_item5", "depression_scale1_item6", 
"depression_scale1_item7", "depression_scale1_item8", "depression_scale1_item9", 
"depression_scale1_item10", "depression_scale2_item1", "depression_scale2_item2", 
"depression_scale2_item3", "depression_scale2_item4", "depression_scale2_item5", 
"depression_scale2_item6", "depression_scale2_item7", "depression_scale3_item1", 
"depression_scale3_item2", "depression_scale3_item3", "depression_scale3_item4", 
"depression_scale3_item5", "depression_scale3_item6", "depression_scale3_item7", 
"depression_scale3_item8", "depression_scale3_item9", "depression_scale3_item10", 
"depression_scale3_item11", "depression_scale3_item12", "depression_scale3_item13", 
"depression_scale3_item14", "depression_scale3_item15")

This can be done more easily with tidyselect::starts_with():

library(tidyselect)

dat_fewer_columns <- dat_many_columns %>%
  select(starts_with("depression"))

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("depression_scale1_item1", "depression_scale1_item2", "depression_scale1_item3", 
"depression_scale1_item4", "depression_scale1_item5", "depression_scale1_item6", 
"depression_scale1_item7", "depression_scale1_item8", "depression_scale1_item9", 
"depression_scale1_item10", "depression_scale2_item1", "depression_scale2_item2", 
"depression_scale2_item3", "depression_scale2_item4", "depression_scale2_item5", 
"depression_scale2_item6", "depression_scale2_item7", "depression_scale3_item1", 
"depression_scale3_item2", "depression_scale3_item3", "depression_scale3_item4", 
"depression_scale3_item5", "depression_scale3_item6", "depression_scale3_item7", 
"depression_scale3_item8", "depression_scale3_item9", "depression_scale3_item10", 
"depression_scale3_item11", "depression_scale3_item12", "depression_scale3_item13", 
"depression_scale3_item14", "depression_scale3_item15")

There are many other {tidyselect} helper functions including:

starts_with() : Selects columns ending in a character string
ends_with() : Selects columns ending in a character string
contains() : Selects columns containing a character string, not necessarily at the start or end
all_of() : Selects all of the column names in a vector e.g., all_of(c("a", "b"))
any_of() : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
where() : Select columns that meet a criterion, e.g., where(is.numeric) selects only columns containing numeric data.
everything() : Selects all columns. This will seem redundant right now, but will become useful later on when we cover pivot_ functions

6.1.2.2 Exercise

Write code to select all columns belonging to scale 2 from the ‘dat_many_columns’ data frame.

Click to show answer

dat_fewer_columns <- dat_many_columns %>%
  select(contains("scale2"))

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("depression_scale2_item1", "depression_scale2_item2", "depression_scale2_item3", 
"depression_scale2_item4", "depression_scale2_item5", "depression_scale2_item6", 
"depression_scale2_item7")

6.1.2.3 Exercise

Write code to select from the ‘dat_many_columns’ data frame the “id”, “age” and “gender” columns AND all columns related to item3.

Click to show answer

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, contains("item3"))

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("id", "age", "gender", "depression_scale1_item3", "depression_scale2_item3", 
"depression_scale3_item3")

6.1.2.4 Exercise

Write code to select the “id”, “age” and “gender” columns AND all columns related to item1.

Click to show answer

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, ends_with("item1"))

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("id", "age", "gender", "depression_scale1_item1", "depression_scale2_item1", 
"depression_scale3_item1")

6.1.3 Selection with set operations

Negative selection using - is selection using a “set operation” or a set of logical rules that determine which column names are in the selected set. There are others:

Set operations:
- - negative selection
- c() : combine selections
- ! invert selection
- & must meet both selection rules
- | can meet either selection criteria

For example: select columns that refer to scale 3 AND items >=10:

dat_fewer_columns <- dat_many_columns %>%
  select(id, age, gender, 
         # items containing scale3 AND item1 but NOT item 1
         c(contains("scale3") & contains("item1") & !ends_with("item1")))

dat_fewer_columns %>%
  colnames() %>%
  dput()

c("id", "age", "gender", "depression_scale3_item10", "depression_scale3_item11", 
"depression_scale3_item12", "depression_scale3_item13", "depression_scale3_item14", 
"depression_scale3_item15")

6.1.3.1 Exercise

To practice {tidyselect} helpers and selection set operations, write code in a creative and unnecessarily complex way to select the columns “id”, “gender”, and “age”, and the items from scale 1 and scale 3 that whose items are <= 10.

Note: no solution provided here as many different ones are possible.

6.1.4 Practicing select and the pipe in a longer chunk

6.1.4.1 Exercise

The first chunk in this chapter read in the ‘data_demographics_raw_messy.csv’ file and renamed columns.

Rewrite that chunk here - as much as possible from memory, without simply copy pasting.
Add another pipe and selecting the “id”, “question” and “response” columns.
Print a table of the resulting data frame using kable() and kable_classic()

Note: no solution provided here to reduce the temptation to peek or copy-paste.

6.2 Moving columns with `dplyr::relocate()`

You can reorder to columns within a data frame with relocate().

Unlike rename(), select(), and other functions, only one column can be relocated with each relocate() call.

You can relocate columns before or after others, or to locations such as being the first or last column. This code moves the “response” column to be before the “question” column, and the “correct” column to be the last one.

dat_demographics_renamed %>%
  relocate(response, .before = "question") %>%
  relocate(correct, .after = last_col()) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	response	question	rt	correct
23.06.2022	548957868	06.06.2000	demographics_2	23	age	1372	1
23.06.2022	548957868	06.06.2000	demographics_3	female	gender	619	1
23.06.2022	548957868	06.06.2000	demographics_2	schizophrenia	psychiatric diagnosis	1372	1
23.06.2022	548957868	06.06.2000	demographics_3	asldkjaao87809	prolific ID	619	1
23.06.2022	504546409	06.06.2000	demographics_2	48	age	3946	1
23.06.2022	504546409	06.06.2000	demographics_3	yes	gender	3724	1

6.2.0.1 Exercise

What code is needed to relocate the columns of ‘dat_demographics_renamed’ so that they are in the order “id”, “question”, “response”, and then the remaining columns.

Click to show answer

dat_demographics_reordered <- dat_demographics_renamed %>%
  relocate(id, .before = 1) %>% 
  relocate(question, .after = "id") %>%
  relocate(response, .after = "question")

dat_demographics_reordered %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	question	response	date	build	block_trial	correct	rt
548957868	age	23	23.06.2022	06.06.2000	demographics_2	1	1372
548957868	gender	female	23.06.2022	06.06.2000	demographics_3	1	619
548957868	psychiatric diagnosis	schizophrenia	23.06.2022	06.06.2000	demographics_2	1	1372
548957868	prolific ID	asldkjaao87809	23.06.2022	06.06.2000	demographics_3	1	619
504546409	age	48	23.06.2022	06.06.2000	demographics_2	1	3946
504546409	gender	yes	23.06.2022	06.06.2000	demographics_3	1	3724

6.2.0.2 Exercise

All three of the solutions below accomplish the goal of making the “id” column the first column:

# move "id" to be the first column by placing it before "date"
dat_demographics_renamed %>%
  relocate(id, .before = "date") %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	date	build	block_trial	question	response	correct	rt
548957868	23.06.2022	06.06.2000	demographics_2	age	23	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	gender	female	1	619
548957868	23.06.2022	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619
504546409	23.06.2022	06.06.2000	demographics_2	age	48	1	3946
504546409	23.06.2022	06.06.2000	demographics_3	gender	yes	1	3724

# move "date" to be the second column by placing it after "id", producing the same result
dat_demographics_renamed %>%
  relocate(date, .after = "id") %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	date	build	block_trial	question	response	correct	rt
548957868	23.06.2022	06.06.2000	demographics_2	age	23	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	gender	female	1	619
548957868	23.06.2022	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619
504546409	23.06.2022	06.06.2000	demographics_2	age	48	1	3946
504546409	23.06.2022	06.06.2000	demographics_3	gender	yes	1	3724

# move "id" to be the 1st column by placing it before the first column by location, producing the same result
dat_demographics_renamed %>%
  relocate(id, .before = 1) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	date	build	block_trial	question	response	correct	rt
548957868	23.06.2022	06.06.2000	demographics_2	age	23	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	gender	female	1	619
548957868	23.06.2022	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
548957868	23.06.2022	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619
504546409	23.06.2022	06.06.2000	demographics_2	age	48	1	3946
504546409	23.06.2022	06.06.2000	demographics_3	gender	yes	1	3724

Which of the above solutions is the least ‘fragile’ solution, i.e., that is robust to changes in the data or elsewhere in your data processing code?

Click to show answer

relocate(id, .before = 1) does not rely on the name and location of another column and is therefore the least fragile solution.

6.3 Renaming, relocating, and selecting all-in-one with `select()`

Renaming, relocating, and selecting columns are so common as a set of three tasks that you can do it all within select().

Whatever order columns appear in select() they will be relocated to. You can rename variables inside select just as you do in rename() using new_name = old_name.

dat_demographics_renamed %>%
  select(id, date, question, answer = response) %>%
  head() %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	date	question	answer
548957868	23.06.2022	age	23
548957868	23.06.2022	gender	female
548957868	23.06.2022	psychiatric diagnosis	schizophrenia
548957868	23.06.2022	prolific ID	asldkjaao87809
504546409	23.06.2022	age	48
504546409	23.06.2022	gender	yes

6.4 Keeping or dropping rows with `dplyr::filter()`

Not all rows in a dataset are useful to us. When processing data, we often wish to keep only some rows and drop others. We can do this with dplyr::filter().

It is important to learn the difference between select() and filter(), it is initially confusing.

select() keeps or drops columns
filter() keeps or drops rows

For example, filter the ‘question’ column to contain only “age” items:

dat_filtered_age <- dat_demographics_renamed %>%
  filter(question == "age")

dat_filtered_age %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1372
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3946
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2576
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4311

6.4.0.1 Exercise

Write code to:

Take the ‘dat_demographics_renamed’ data frame.
Create two data frames called ‘dat_filtered_age’ and ‘dat_filtered_gender’.
Retain only the relevant rows of ‘question’ for each.
Rename the ‘response’ columns to what variable it now contains.
Drop the ‘question’ columns since they’re now irrelevant.
Print a table of each data frame.

Click to show answer

dat_filtered_age <- dat_demographics_renamed %>%
  filter(question == "age") %>%
  rename(age = response) %>%
  select(-question)

dat_filtered_age %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	age	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	23	1	1372
23.06.2022	504546409	06.06.2000	demographics_2	48	1	3946
23.06.2022	994692692	06.06.2000	demographics_2	21	1	2576
23.06.2022	246532124	06.06.2000	demographics_2	999	1	4311

dat_filtered_gender <- dat_demographics_renamed %>%
  filter(question == "gender") %>%
  rename(gender = response) %>%
  select(-question)

dat_filtered_gender %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	gender	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	female	1	619
23.06.2022	504546409	06.06.2000	demographics_3	yes	1	3724
23.06.2022	994692692	06.06.2000	demographics_3	female	1	3050
23.06.2022	246532124	06.06.2000	demographics_3	Male	1	2793

6.4.1 Filters with set operations

You can specify the logical test for filtering in many ways, including:

== : ‘is exactly equal to this one value’
!= : ’is not exactly equal to
%in% : ‘is exactly equal to one of the following values’

Note that equivalence is two equals signs (==) whereas setting arguments in functions is one (=).

# '==' exactly equal to
dat_filtered_gender %>%
  filter(gender == "female") %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	gender	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	female	1	619
23.06.2022	994692692	06.06.2000	demographics_3	female	1	3050

# '!=' is not exactly equal to
dat_filtered_gender %>%
  filter(gender != "female") %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	gender	correct	rt
23.06.2022	504546409	06.06.2000	demographics_3	yes	1	3724
23.06.2022	246532124	06.06.2000	demographics_3	Male	1	2793

# '%in%' is exactly equal to one of the following
dat_filtered_gender %>%
  filter(gender %in% c("female", "male")) %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	gender	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	female	1	619
23.06.2022	994692692	06.06.2000	demographics_3	female	1	3050

Weirdly, there is no single function that provides the combination of %in% and !=, i.e., ‘is not exactly equal to any of the following values’. This can be achieved with the following, although it is not intuitive. Note that it still returns rows with “Male” because we only told it to exclude rows with “male” (lower case).

# '!' and '%in%'
dat_filtered_gender %>%
  filter(!gender %in% c("female", "male")) %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	gender	correct	rt
23.06.2022	504546409	06.06.2000	demographics_3	yes	1	3724
23.06.2022	246532124	06.06.2000	demographics_3	Male	1	2793

You can also combine multiple tests with various operators:

& : AND, i.e., meeting both criterion.
| : OR, i.e., meeting either criterion.

# & criterion X and Y
dat_demographics_renamed %>%
  filter(question == "gender" & response == "female")  %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050

# | criterion X or Y
dat_demographics_renamed %>%
  filter(question == "gender" | response == "female")  %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3724
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2793

6.4.1.1 Exercise

Write code to:

Take the ‘dat_demographics_renamed’ data frame.
Create the data frame ‘dat_filtered_rt’.
Retain only rows where where reaction time (‘re’) is more than 500 ms.
Print a table of each data frame.

Click to show answer

dat_filtered_rt <- dat_demographics_renamed %>%
  filter(rt > 500) 

dat_filtered_rt %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
23.06.2022	548957868	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	1	3946
23.06.2022	504546409	06.06.2000	demographics_3	prolific ID	ad;oi98732paisdh	1	3724
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	2576
23.06.2022	994692692	06.06.2000	demographics_3	prolific ID	1023984cao982738	1	3050
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	1	6887
23.06.2022	246532124	06.06.2000	demographics_2	prolific ID	2982879q8w7r	1	6887

6.4.1.2 Exercise

Write code to:

Take the ‘dat_demographics_renamed’ data frame.
Create the data frame ‘dat_filtered_rt’.
Retain only rows where where both reaction time (‘rt’) is more than 500 ms and gender is either ‘male’, ‘female’, or ‘non-binary’.
Print a table of each data frame.

Click to show answer

dat_filtered_rt <- dat_demographics_renamed %>%
  filter(rt > 500 & question == 'gender' & response %in% c('male', 'female', 'non-binary')) 

dat_filtered_rt %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050

6.4.1.3 Exercise

In the above answer, why might it be a good idea to include question == 'gender'?

Click to show answer

Because the requirement was “…and gender is either ‘male’, ‘female’, or ‘non-binary’”.

If participants responded to another question (e.g., ‘age’) with ‘male’, ‘female’, or ‘non-binary’, their rows would also be returned.

Being specific to the requirements makes your code less ‘brittle’ and more likely to catch weird cases in larger datasets.

6.4.2 Positive vs. negative filters

As with select(), it is often better to define what you do want (== and %>%) rather than what you do not want (!= and !+%in%). This is even more important with filter() as unexpected row values are much more common than unexpected column names.

For example, if you specify a negative filter dat %>% filter(gender != "non-binary)" because you wish to filter to return only participants who identify as ‘male’ or ‘female’, you have in fact only excluded participants who specified ‘non-binary’. This code would also return participants with other response options such as ‘non binary’, ‘Non-Binary’, ‘NB’, etc.

However, there can be cases where you specifically do not want some rows or want to retain anything other than a given value. Protecting participant privacy is one of them.

Some rows of the ‘dat_demographics_renamed’ data frame contain “prolific ID” values, which could be linked back to the individual’s Prolific.com account where they completed the study, and therefore risk exposing identifying information. If this data was shared with other researchers or publicly, this could violate the informed consent agreement and/or local data privacy laws. (Note that the prolific IDs included in this data frame are made up.) This is discussed in more detail in the chapter on Data Privacy.

In this case, you would likely need to remove all rows referring to Prolific ID with a negative filter:

dat_demographics_anonymized <- dat_demographics_renamed %>%
  filter(question != "prolific ID")
  
dat_demographics_anonymized %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	1	3946
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	2576
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	1	6887

Note that the original ‘data_demographics_raw_messy.csv’ data file we originally read in to R still contains the Prolific ID codes! Again, see the Data Privacy chapter for how you would deal with this more thoroughly in your own projects.

6.5 Creating and modifying columns with `dplyr::mutate()`

mutate() is used to create new columns or to change the contents of existing ones.

For example, we can create the ‘rt_seconds’ column by dividing the ‘rt’ column, which is in milliseconds, by 1000.

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(rt_seconds = rt/1000)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt	rt_seconds
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1372	1.372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	619	0.619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1372	1.372
23.06.2022	548957868	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	619	0.619
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3946	3.946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3724	3.724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	1	3946	3.946
23.06.2022	504546409	06.06.2000	demographics_3	prolific ID	ad;oi98732paisdh	1	3724	3.724
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2576	2.576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3050	3.050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	2576	2.576
23.06.2022	994692692	06.06.2000	demographics_3	prolific ID	1023984cao982738	1	3050	3.050
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4311	4.311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2793	2.793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	1	6887	6.887
23.06.2022	246532124	06.06.2000	demographics_2	prolific ID	2982879q8w7r	1	6887	6.887

We can also alter existing columns through self-assignment, i.e., mutate()-ing a column based on itself. For example, modifying ‘rt’ to be in seconds by overwriting the existing ‘rt’ column:

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(rt = rt/1000)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	1	1.372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	1	0.619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	1.372
23.06.2022	548957868	06.06.2000	demographics_3	prolific ID	asldkjaao87809	1	0.619
23.06.2022	504546409	06.06.2000	demographics_2	age	48	1	3.946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	1	3.724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	1	3.946
23.06.2022	504546409	06.06.2000	demographics_3	prolific ID	ad;oi98732paisdh	1	3.724
23.06.2022	994692692	06.06.2000	demographics_2	age	21	1	2.576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	1	3.050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	1	2.576
23.06.2022	994692692	06.06.2000	demographics_3	prolific ID	1023984cao982738	1	3.050
23.06.2022	246532124	06.06.2000	demographics_2	age	999	1	4.311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	1	2.793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	1	6.887
23.06.2022	246532124	06.06.2000	demographics_2	prolific ID	2982879q8w7r	1	6.887

We can also change types, for example, converting the ‘correct’ column from numeric (0/1) to logical (TRUE/FALSE).

dat_demographics_mutated <- dat_demographics_renamed %>%
  mutate(correct = as.logical(correct))

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)

date	id	build	block_trial	question	response	correct	rt
23.06.2022	548957868	06.06.2000	demographics_2	age	23	TRUE	1372
23.06.2022	548957868	06.06.2000	demographics_3	gender	female	TRUE	619
23.06.2022	548957868	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	TRUE	1372
23.06.2022	548957868	06.06.2000	demographics_3	prolific ID	asldkjaao87809	TRUE	619
23.06.2022	504546409	06.06.2000	demographics_2	age	48	TRUE	3946
23.06.2022	504546409	06.06.2000	demographics_3	gender	yes	TRUE	3724
23.06.2022	504546409	06.06.2000	demographics_2	psychiatric diagnosis	OCD	TRUE	3946
23.06.2022	504546409	06.06.2000	demographics_3	prolific ID	ad;oi98732paisdh	TRUE	3724
23.06.2022	994692692	06.06.2000	demographics_2	age	21	TRUE	2576
23.06.2022	994692692	06.06.2000	demographics_3	gender	female	TRUE	3050
23.06.2022	994692692	06.06.2000	demographics_2	psychiatric diagnosis	schizophrenia	TRUE	2576
23.06.2022	994692692	06.06.2000	demographics_3	prolific ID	1023984cao982738	TRUE	3050
23.06.2022	246532124	06.06.2000	demographics_2	age	999	TRUE	4311
23.06.2022	246532124	06.06.2000	demographics_3	gender	Male	TRUE	2793
23.06.2022	246532124	06.06.2000	demographics_2	psychiatric diagnosis	GAD	TRUE	6887
23.06.2022	246532124	06.06.2000	demographics_2	prolific ID	2982879q8w7r	TRUE	6887

6.5.1 Longer pipes

A single mutate call can contain multiple mutates. The code from the last chunk could be written more simply like this:

dat_demographics_mutated <- dat_demographics_renamed %>%
  # select only columns of interest
  select(id, question, response, rt, correct) %>%
  # filter only rows of interest
  filter(question == "age") %>%
  # rename columns to make them more intuitive
  rename(age = response) %>%
  # mutate variables
  mutate(rt_ms = rt/1000,
         correct = as.logical(correct)) %>%
  # select needed columns again
  select(id, age, rt_ms, correct)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	age	rt_ms	correct
548957868	23	1.372	TRUE
504546409	48	3.946	TRUE
994692692	21	2.576	TRUE
246532124	999	4.311	TRUE

6.5.1.1 Exercise

Write code to:

Take the ‘dat_demographics_renamed’.
Create a data frame named ‘dat_demographics_mutated’.
Retain only the gender question rows.
Rename ‘response’ to ‘gender’.
Use the stringr::str_to_lower() function to convert ‘gender’ to lower case strings’.
Retain only the id and gender columns.
Print the data frame.

Click to show answer

library(stringr)

dat_demographics_mutated <- dat_demographics_renamed %>%
  filter(question == 'gender') %>%
  rename(gender = response) %>%
  mutate(gender = stringr::str_to_lower(gender)) %>%
  select(id, gender)

dat_demographics_mutated %>%
  kable() %>%
  kable_classic(full_width = FALSE)

id	gender
548957868	female
504546409	yes
994692692	female
246532124	male

6.5.2 More complex mutates

Most research reports (theses, articles, etc.) report the percentage of the sample that was male, female, etc. For example, the ‘dat_demographics_messy’ contains messy age and gender data for 500 participants.

We’ll use count() to show every unique response across participants. We’ll return to learn to use count() properly in the next chapter. For the moment, we’re more interested in the output than the code.

dat_demographics_messy %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)

gender	n
female	297
male	123
nonbinary	15
Female	8
FEMALE	5
femail	4
25	3
27	3
30	3
43	3
45	3
fem ale	3
femle	3
malee	3
18	2
26	2
38	2
40	2
MalE	2
femlae	2
mal	2
19	1
22	1
23	1
31	1
36	1
41	1
42	1
FeMale	1
ma le	1
non binary	1

Note how the data currently contains many misspellings and use of different cases that cause things like ‘female’ to appear as a different category to ‘Female’, etc.

6.5.2.1 `mutate()` and working with character strings

Working with character strings using {stringr} and ‘regular expressions’ (regex) is a large separate topic in itself.

For the moment, know that you can:

Use functions like str_to_lower() to convert all text to lower case (as in the above example).
Use str_remove_all() to remove everything except letters (str_remove_all(x, "[^A-Za-z]")) or everything except numbers (str_remove_all(x, "[^0-9\\.]")).

Above, you used {stringr}’s str_to_lower() function to convert all letters to lower case. Let’s apply that here to quickly improve things:

dat_demographics_messy %>%
  # convert to lower case
  mutate(gender = str_to_lower(gender)) %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)

gender	n
female	311
male	125
nonbinary	15
femail	4
25	3
27	3
30	3
43	3
45	3
fem ale	3
femle	3
malee	3
18	2
26	2
38	2
40	2
femlae	2
mal	2
19	1
22	1
23	1
31	1
36	1
41	1
42	1
ma le	1
non binary	1

Next we’ll use str_remove_all(x, "[^A-Za-z]") to remove everything except letters:

dat_demographics_messy %>%
  mutate(
    # convert to lower case
    gender = str_to_lower(gender),
    # remove all non-letters
    gender = str_remove_all(gender, "[^A-Za-z]")
  ) %>%
  # find unique values of 'gender' and the frequency of each
  count(gender) %>%
  # arrange these unique cases by descending frequency
  arrange(desc(n)) %>%
  # print table
  kable() %>%
  kable_classic(full_width = FALSE)

gender	n
female	314
male	126
	30
nonbinary	16
femail	4
femle	3
malee	3
femlae	2
mal	2

6.5.2.2 `mutate()` and conditionals rules

6.5.2.2.1 Using `dplyr::if_else()`

If we were being somewhat lazy, we could keep just the three most common self-reported responses to the gender, i.e. ‘male’, ‘female’ and ‘non-binary’, and change the other responses to NA.

Many forms of data cleaning or ‘wrangling’ involve applying conditional rules like this to clean up irregular responses, i.e., “if gender is not ‘male’, ‘female’ or ‘non-binary’, then set it to NA”.

The simplest of these is dplyr::if_else(), which applies just one if-else. Note that there is a similar function in Base-R, ifelse(), but for technical reasons this function is much less safe and can cause weird errors. Try to always use if_else().

dat_gender_tidier <- dat_demographics_messy %>%
  # convert to lower case and remove non-letters
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[^A-Za-z]")) %>%
  # tidy up cases
  mutate(gender = if_else(condition = gender %in% c("female", "male", "nonbinary"), # the logical test applied: is 'gender' one of the following
                          true = gender, # what to do if the test is passed: keep the original gender response
                          false = NA_character_)) # what to do if the test is failed: set it to NA

dat_gender_tidier %>%
  count(gender) %>%
  arrange(desc(n)) %>%
  kable() %>%
  kable_classic(full_width = FALSE)

gender	n
female	314
male	126
NA	44
nonbinary	16

This somewhat quick and dirty solution gives us just four gender categories, ‘male’, ‘female’, ‘non-binary’ and missing (NA).

Of course, these responses were (mostly) not really missing, we removed them to make the responses simpler. This isn’t very good practice, as we’re throwing away data.

6.5.2.2.2 Using `dplyr::case_when()`

case_when() allows you to check against multiple logical tests. It works like this:

dat %>%
  # create the variable variable_to_create based on:
  mutate(variable_to_create = case_when(
    # if logical_test_1 is met, set variable_to_create to value_1
    logical_test_1 ~ value_1, 
    # if logical_test_2 is met, set variable_to_create to value_2
    logical_test_2 ~ value_2, 
    # if logical_test_3 is met, set variable_to_create to value_3
    logical_test_3 ~ value_3, 
    # 'if none of the above are met, set to the default value value_4'
    TRUE ~ value_4
  ))

Note that the logical tests are checked in order, so you often have to consider the order in which they are listed and run.

We can apply it like this:

dat_gender_tidier <- dat_demographics_messy %>%
  # convert to lower case and remove non-letters
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[^A-Za-z]")) %>%
  # tidy up cases
  mutate(gender = case_when(gender %in% c("female", "male", "nonbinary") ~ gender, 
                            gender %in% c("femail", "fem ale", "femle", "femlae") ~ "female",
                            gender %in% c("malee", "mal", "ma le") ~ "male",
                            gender %in% c("non binary", "nonbinary") ~ "non-binary",
                            gender == "" ~ NA_character_, # if an empty character string, change to NA_character_
                            TRUE ~ gender)) # if none of the above apply, keep original value

dat_gender_tidier %>%
  count(gender) %>%
  arrange(desc(n)) %>%
  kable() %>%
  kable_classic(full_width = FALSE)

gender	n
female	323
male	131
NA	30
nonbinary	16

Note that the contents of the case_when() conditions have to be constructed iteratively. Use distinct() to inspect what cases are present, then write code to merge them as desired.

6.6 Exercises

6.6.1 `select()` vs. `filter()`

What is the difference between select and filter?

Click to show answer

select() is for columns, filter() is for rows.

6.6.2 {tidyselect} helper functions

Name some helper functions from {tidyselect}.

Click to show answer

starts_with()
ends_with()
contains()
all_of()
any_of()
where()
everything()

Explain what each of them do.

Click to show answer

starts_with() : Selects columns ending in a character string
ends_with() : Selects columns ending in a character string
contains() : Selects columns containing a character string, not necessarily at the start or end
all_of() : Selects all of the column names in a vector e.g., all_of(c("a", "b"))
any_of() : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
where() : Select columns that meet a criterion, e.g., where(is.numeric) selects only columns containing numeric data.
everything() : Selects all columns.

6.6.3 Positive vs. negative selections and filters

Explain the difference between positive and negative selections in select() and filter() and how they are done in each.

Click to show answer

In select(), positive selections mean keep these columns (e.g., select(mpg, cyl)), while negative selections mean drop these columns (e.g., select(-mpg, -cyl)).
In filter(), conditions act positively by default (e.g., filter(mpg > 20) keeps rows with mpg > 20). To “negate” a condition, you use ! or inequality operators (e.g., filter(!(mpg > 20)) or filter(mpg <= 20)).

6.6.4 Set operators

What set operators can be used with select() and how?

Click to show answer

select() supports set operations on columns using tidyselect helpers:

- : negative selection (drop columns)
c() : combine multiple selections (union)
! : invert a selection rule
& : must meet both selection rules (intersection)
| : can meet either selection rule (union)

Example:

select(dat, contains("scale3") & contains("item1") & !ends_with("item1"))

What set operators can be used with filter() and how?

Click to show answer

With filter(), set operations are expressed with logical operators applied to rows:

& : keeps rows that satisfy all conditions (intersection)
| : keeps rows that satisfy any condition (union)
! : excludes rows that meet a condition (negation/difference)

Example:

dat %>%
  filter(gender == "female" & age > 30 & !is.na(income))

6.6.5 Interactive exercises

Complete the interactive exercises for:

6.6.6 Practice data transformation using multiple functions

In your local version of this .qmd file:

Use the ‘dat_demographics_messy’ to create a new data frame called ‘dat_age_tidier’
Use disinct() to understand what unique responses are present in the ‘age’ column.
Use the other functions you learned in this chapter to clean the age column so that it contains only numbers.
Convert the age variable to a numeric variable with as.numeric().
Print the unique responses present in the processed ‘age’ column as a table.

6.1 Keeping or dropping columns with dplyr::select()

6.1.1 Positive vs. negative selections

6.1.1.1 Exercise

6.1.2 {tidyselect} helper functions

6.1.2.1 Exercise

6.1.2.2 Exercise

6.1.2.3 Exercise

6.1.2.4 Exercise

6.1.3 Selection with set operations

6.1.3.1 Exercise

6.1.4 Practicing select and the pipe in a longer chunk

6.1.4.1 Exercise

6.2 Moving columns with dplyr::relocate()

6.2.0.1 Exercise

6.2.0.2 Exercise

6.3 Renaming, relocating, and selecting all-in-one with select()

6.4 Keeping or dropping rows with dplyr::filter()

6.4.0.1 Exercise

6.4.1 Filters with set operations

6.4.1.1 Exercise

6.4.1.2 Exercise

6.4.1.3 Exercise

6.4.2 Positive vs. negative filters

6.5 Creating and modifying columns with dplyr::mutate()

6.5.1 Longer pipes

6.5.1.1 Exercise

6.5.2 More complex mutates

6.5.2.1 mutate() and working with character strings

6.5.2.2 mutate() and conditionals rules

6.5.2.2.1 Using dplyr::if_else()

6.5.2.2.2 Using dplyr::case_when()

6.6 Exercises

6.6.1 select() vs. filter()

6.6.2 {tidyselect} helper functions

6.6.3 Positive vs. negative selections and filters

6.6.4 Set operators

6.6.5 Interactive exercises

6.6.6 Practice data transformation using multiple functions

6.1 Keeping or dropping columns with `dplyr::select()`

6.2 Moving columns with `dplyr::relocate()`

6.3 Renaming, relocating, and selecting all-in-one with `select()`

6.4 Keeping or dropping rows with `dplyr::filter()`

6.5 Creating and modifying columns with `dplyr::mutate()`

6.5.2.1 `mutate()` and working with character strings

6.5.2.2 `mutate()` and conditionals rules

6.5.2.2.1 Using `dplyr::if_else()`

6.5.2.2.2 Using `dplyr::case_when()`

6.6.1 `select()` vs. `filter()`