# Data transformation I
```{r}
#| include: false
# settings, placed in a chunk that will not show in the .html file (because include=FALSE)
# disables scientific notation so that small numbers appear as eg "0.00001" rather than "1e-05"
options(scipen = 999)
```
The next few chapters cover various {tidyverse} functions - sometimes called verbs - that are used for data transformation.
This chapter covers functions that are "*non-aggregating*": they allow you to change which columns or rows are present in the dataset, to create new columns or split up existing ones, to change their contents, etc. `rename()`, which you learned in the previous chapter, is one of these functions.
## Keeping or dropping columns with `dplyr::select()`
Not all columns in a dataset are useful to us. When processing data, we often wish to keep only some columns and drop others. We can do this with `dplyr::select()`.
We'll start where the last chapter ended, by loading dependencies, reading in the 'data_demographics_raw_messy.csv' dataset, and renaming the columns to make them easier to work with.
```{r}
library(readr) # for read_csv()
library(janitor) # for clean_names()
library(dplyr) # for %>%
library(knitr) # for kable()
library(kableExtra) # for kable_extra()
dat_demographics_renamed <-
read_csv(file = "../data/raw/data_demographics_raw_messy.csv",
skip = 2) %>%
clean_names() %>%
rename(id = subject_code,
block_trial = block_code_and_trial_number,
question = trial_code,
response = key_response_use_this,
rt = x0_ms_onset_rt)
dat_demographics_renamed %>%
kable() %>%
kable_classic(full_width = FALSE)
```
If our data processing goal is to extract and tidy responses on the demographics questionnaire, we only need some of these columns: id, question, and response. We can use `select()` and the pipe to retain only these.
```{r}
dat_demographics_selected <- dat_demographics_renamed %>%
select(id,
question,
response)
dat_demographics_selected %>%
kable() %>%
kable_classic(full_width = FALSE)
```
What code would you write to select the 'date' and 'correct' columns?
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_demographics_selected <- dat_demographics_renamed %>%
select(date,
correct)
```
:::
What code would you write to print all the columns present in 'dat_demographics_renamed' in a way you can easily paste into a `select()` call? This was covered in [the previous chapter](the_pipe_and_renaming.qmd).
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_demographics_renamed %>%
colnames() %>%
dput()
```
:::
### Positive vs. negative selections
The above example tells `select()` the columns to retain. You could instead tell it which ones to drop. This is a negative selection: you preface the column name with `-`.
```{r}
dat_demographics_selected <- dat_demographics_renamed %>%
select(-date,
-build,
-block_trial,
-correct,
-rt)
dat_demographics_selected %>%
kable() %>%
kable_classic(full_width = FALSE)
```
It's tempting to think in terms of negative selections, i.e., getting rid of columns you don't want. However, you should in general use positive selections and avoid negative ones. If the underlying data changes in some way, e.g., in a prior processing step, it can contain variables your negative selection doesn't account for. For example, if 'dat_demographics_renamed' now included another column, "completed_study" (TRUE/FALSE), the positive selection would not include this variable as it wasn't listed, but the negative selection would as it was not specifically excluded. This can produce code that is more 'fragile' to changes in the data.
In general, tell select what you *do want*, not what you *don't want*.
#### Exercise
Check your understanding of positive and negative selections:
1. What would happen if you mix positive and negative selections? i.e., What columns will be present in the output when your run the code below?
```{r}
#| eval: false
dat_demographics_selected <- dat_demographics_renamed %>%
select(id,
question,
response,
-correct)
dat_demographics_selected %>%
kable() %>%
kable_classic(full_width = FALSE)
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
#| eval: true
#| include: false
dat_demographics_selected <- dat_demographics_renamed %>%
select(id,
question,
response,
-correct)
dat_demographics_selected %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
2. Why are those columns present and not others?
::: {.callout-note collapse="true" title="Click to show answer"}
Once positive selections are present in the select call, the negative ones are redundant. You've already told `select()` what to include, which implies what not to include.
:::
### {tidyselect} helper functions
Often, we have large datasets with many columns. `select()` calls can be made simpler with the help of another {tidyverse} package, {tidyselect}.
To illustrate this, we'll use a simulated dataset called 'dat_many_columns' that contains, you guessed it, many columns.
```{r}
#| include: false # don't show code in the ebook
# this block simulates a dataset with many columns using the {truffle} package
# install the packages devtools and truffles if not already installed
if (!requireNamespace("stringr", quietly = TRUE)) {
install.packages("stringr")
}
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
if (!requireNamespace("truffle", quietly = TRUE)) {
devtools::install_github("ianhussey/truffle")
}
library(truffle)
library(stringr)
dat_many_columns <-
truffle_likert(n_per_condition = 5,
factors = c("X1_latent", "X2_latent", "X3_latent"),
prefixes = c("X1_item", "X2_item", "X3_item"),
alpha = c(.70, .75, .80),
n_items = c(10, 7, 15),
n_levels = 7,
r_among_outcomes = 0.50,
approx_d_between_groups = c(0.50, 0.20, 1.00),
seed = 42) %>%
truffle_demographics() %>%
rename_with(~ str_replace_all(.x, "X", "depression_scale"))
```
Let's start by printing the column names:
```{r}
dat_many_columns %>%
colnames() %>%
dput()
```
#### Exercise
Write code to:
1. Create a new data frame called 'dat_fewer_columns'
2. `select()` all the columns beginning with "depression", using only positive selections
3. Print the column names to confirm that these are the only ones present
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_fewer_columns <- dat_many_columns %>%
select(depression_scale1_item1,
depression_scale1_item2,
depression_scale1_item3,
depression_scale1_item4,
depression_scale1_item5,
depression_scale1_item6,
depression_scale1_item7,
depression_scale1_item8,
depression_scale1_item9,
depression_scale1_item10,
depression_scale2_item1,
depression_scale2_item2,
depression_scale2_item3,
depression_scale2_item4,
depression_scale2_item5,
depression_scale2_item6,
depression_scale2_item7,
depression_scale3_item1,
depression_scale3_item2,
depression_scale3_item3,
depression_scale3_item4,
depression_scale3_item5,
depression_scale3_item6,
depression_scale3_item7,
depression_scale3_item8,
depression_scale3_item9,
depression_scale3_item10,
depression_scale3_item11,
depression_scale3_item12,
depression_scale3_item13,
depression_scale3_item14,
depression_scale3_item15)
dat_fewer_columns %>%
colnames() %>%
dput()
```
:::
This can be done more easily with `tidyselect::starts_with()`:
```{r}
library(tidyselect)
dat_fewer_columns <- dat_many_columns %>%
select(starts_with("depression"))
dat_fewer_columns %>%
colnames() %>%
dput()
```
There are many other {tidyselect} helper functions including:
- `starts_with()` : Selects columns ending in a character string
- `ends_with()` : Selects columns ending in a character string
- `contains()` : Selects columns containing a character string, not necessarily at the start or end
- `all_of()` : Selects all of the column names in a vector e.g., `all_of(c("a", "b"))`
- `any_of()` : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
- `where()` : Select columns that meet a criterion, e.g., `where(is.numeric)` selects only columns containing numeric data.
- `everything()` : Selects all columns. This will seem redundant right now, but will become useful later on when we cover `pivot_` functions
#### Exercise
Write code to select all columns belonging to scale 2 from the 'dat_many_columns' data frame.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_fewer_columns <- dat_many_columns %>%
select(contains("scale2"))
dat_fewer_columns %>%
colnames() %>%
dput()
```
:::
#### Exercise
Write code to select from the 'dat_many_columns' data frame the "id", "age" and "gender" columns AND all columns related to item3.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_fewer_columns <- dat_many_columns %>%
select(id, age, gender, contains("item3"))
dat_fewer_columns %>%
colnames() %>%
dput()
```
:::
#### Exercise
Write code to select the "id", "age" and "gender" columns AND all columns related to item1.
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_fewer_columns <- dat_many_columns %>%
select(id, age, gender, ends_with("item1"))
dat_fewer_columns %>%
colnames() %>%
dput()
```
:::
### Selection with set operations
Negative selection using `-` is selection using a "set operation" or a set of logical rules that determine which column names are in the selected set. There are others:
- Set operations:
- `-` negative selection
- `c()` : combine selections
- `!` invert selection
- `&` must meet both selection rules
- `|` can meet either selection criteria
For example: select columns that refer to scale 3 AND items \>=10:
```{r}
dat_fewer_columns <- dat_many_columns %>%
select(id, age, gender,
# items containing scale3 AND item1 but NOT item 1
c(contains("scale3") & contains("item1") & !ends_with("item1")))
dat_fewer_columns %>%
colnames() %>%
dput()
```
#### Exercise
To practice {tidyselect} helpers and selection set operations, write code in a creative and unnecessarily complex way to select the columns "id", "gender", and "age", and the items from scale 1 and scale 3 that whose items are \<= 10.
Note: no solution provided here as many different ones are possible.
```{r}
#| include: false
dat_many_columns %>%
select(
c(starts_with("i") & ends_with("d") & !ends_with("z")),
c(contains("g") & contains("e") & !starts_with("h")),
c(
(starts_with("depression") & contains("item")) & (contains("scale1") | contains("scale3")) &
!(ends_with("11") | ends_with("12") | ends_with("13") | ends_with("14") | ends_with("15"))
)
) |>
colnames()
dat_many_columns %>%
select(id, age, gender,
# Scale 1 & 3 with Items from 1-10
(contains("scale1") | contains("scale3")) & !ends_with(c("11", "12", "13", "14", "15"))) %>%
colnames()
dat_many_columns %>%
select(id, age, gender,
# Scale 1 & 3 with Items from 1-10
contains("scale1") | contains("scale3") & !ends_with(c("11", "12", "13", "14", "15"))) %>%
colnames()
dat_many_columns %>%
select(id, age, gender,
# Scale 1 & 3 with Items from 1-10
contains("scale1") & contains("scale3") & ends_with(c("1", "2", "3"))) %>%
colnames()
#& ends_with(x <= 10)))
```
### Practicing select and the pipe in a longer chunk
#### Exercise
The first chunk in this chapter read in the 'data_demographics_raw_messy.csv' file and renamed columns.
- Rewrite that chunk here - as much as possible from memory, without simply copy pasting.
- Add another pipe and selecting the "id", "question" and "response" columns.
- Print a table of the resulting data frame using `kable()` and `kable_classic()`
Note: no solution provided here to reduce the temptation to peek or copy-paste.
```{r}
#| include: false
```
## Moving columns with `dplyr::relocate()`
You can reorder to columns within a data frame with `relocate()`.
Unlike `rename()`, `select()`, and other functions, only one column can be relocated with each `relocate()` call.
You can relocate columns before or after others, or to locations such as being the first or last column. This code moves the "response" column to be before the "question" column, and the "correct" column to be the last one.
```{r}
dat_demographics_renamed %>%
relocate(response, .before = "question") %>%
relocate(correct, .after = last_col()) %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
```
#### Exercise
What code is needed to relocate the columns of 'dat_demographics_renamed' so that they are in the order "id", "question", "response", and then the remaining columns.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_demographics_reordered <- dat_demographics_renamed %>%
relocate(id, .before = 1) %>%
relocate(question, .after = "id") %>%
relocate(response, .after = "question")
dat_demographics_reordered %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
#### Exercise
All three of the solutions below accomplish the goal of making the "id" column the first column:
```{r}
# move "id" to be the first column by placing it before "date"
dat_demographics_renamed %>%
relocate(id, .before = "date") %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
# move "date" to be the second column by placing it after "id", producing the same result
dat_demographics_renamed %>%
relocate(date, .after = "id") %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
# move "id" to be the 1st column by placing it before the first column by location, producing the same result
dat_demographics_renamed %>%
relocate(id, .before = 1) %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
```
<br> Which of the above solutions is the least 'fragile' solution, i.e., that is robust to changes in the data or elsewhere in your data processing code?
::: {.callout-note collapse="true" title="Click to show answer"}
`relocate(id, .before = 1)` does not rely on the name and location of another column and is therefore the least fragile solution.
:::
## Renaming, relocating, and selecting all-in-one with `select()`
Renaming, relocating, and selecting columns are so common as a set of three tasks that you can do it all within `select()`.
Whatever order columns appear in `select()` they will be relocated to. You can rename variables inside select just as you do in `rename()` using `new_name = old_name`.
```{r}
dat_demographics_renamed %>%
select(id, date, question, answer = response) %>%
head() %>%
kable() %>%
kable_classic(full_width = FALSE)
```
## Keeping or dropping rows with `dplyr::filter()`
Not all rows in a dataset are useful to us. When processing data, we often wish to keep only some rows and drop others. We can do this with `dplyr::filter()`.
It is important to learn the difference between `select()` and `filter()`, it is initially confusing.
- `select()` keeps or drops **columns**
- `filter()` keeps or drops **rows**
For example, filter the 'question' column to contain only "age" items:
```{r}
dat_filtered_age <- dat_demographics_renamed %>%
filter(question == "age")
dat_filtered_age %>%
kable() %>%
kable_classic(full_width = FALSE)
```
#### Exercise
Write code to:
- Take the 'dat_demographics_renamed' data frame.
- Create two data frames called 'dat_filtered_age' and 'dat_filtered_gender'.
- Retain only the relevant rows of 'question' for each.
- Rename the 'response' columns to what variable it now contains.
- Drop the 'question' columns since they're now irrelevant.
- Print a table of each data frame.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_filtered_age <- dat_demographics_renamed %>%
filter(question == "age") %>%
rename(age = response) %>%
select(-question)
dat_filtered_age %>%
kable() %>%
kable_classic(full_width = FALSE)
dat_filtered_gender <- dat_demographics_renamed %>%
filter(question == "gender") %>%
rename(gender = response) %>%
select(-question)
dat_filtered_gender %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
### Filters with set operations
You can specify the logical test for filtering in many ways, including:
- `==` : 'is exactly equal to this one value'
- `!=` : 'is not exactly equal to
- `%in%` : 'is exactly equal to one of the following values'
Note that equivalence is two equals signs (`==`) whereas setting arguments in functions is one (`=`).
```{r}
# '==' exactly equal to
dat_filtered_gender %>%
filter(gender == "female") %>%
kable() %>%
kable_classic(full_width = FALSE)
# '!=' is not exactly equal to
dat_filtered_gender %>%
filter(gender != "female") %>%
kable() %>%
kable_classic(full_width = FALSE)
# '%in%' is exactly equal to one of the following
dat_filtered_gender %>%
filter(gender %in% c("female", "male")) %>%
kable() %>%
kable_classic(full_width = FALSE)
```
Weirdly, there is no single function that provides the combination of `%in%` and `!=`, i.e., 'is not exactly equal to any of the following values'. This can be achieved with the following, although it is not intuitive. Note that it still returns rows with "Male" because we only told it to exclude rows with "male" (lower case).
```{r}
# '!' and '%in%'
dat_filtered_gender %>%
filter(!gender %in% c("female", "male")) %>%
kable() %>%
kable_classic(full_width = FALSE)
```
You can also combine multiple tests with various operators:
- `&` : AND, i.e., meeting both criterion.
- `|` : OR, i.e., meeting either criterion.
```{r}
# & criterion X and Y
dat_demographics_renamed %>%
filter(question == "gender" & response == "female") %>%
kable() %>%
kable_classic(full_width = FALSE)
# | criterion X or Y
dat_demographics_renamed %>%
filter(question == "gender" | response == "female") %>%
kable() %>%
kable_classic(full_width = FALSE)
```
#### Exercise
Write code to:
- Take the 'dat_demographics_renamed' data frame.
- Create the data frame 'dat_filtered_rt'.
- Retain only rows where where reaction time ('re') is more than 500 ms.
- Print a table of each data frame.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_filtered_rt <- dat_demographics_renamed %>%
filter(rt > 500)
dat_filtered_rt %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
#### Exercise
Write code to:
- Take the 'dat_demographics_renamed' data frame.
- Create the data frame 'dat_filtered_rt'.
- Retain only rows where where both reaction time ('rt') is more than 500 ms and gender is either 'male', 'female', or 'non-binary'.
- Print a table of each data frame.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
dat_filtered_rt <- dat_demographics_renamed %>%
filter(rt > 500 & question == 'gender' & response %in% c('male', 'female', 'non-binary'))
dat_filtered_rt %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
#### Exercise
In the above answer, why might it be a good idea to include `question == 'gender'`?
::: {.callout-note collapse="true" title="Click to show answer"}
Because the requirement was "...and gender is either 'male', 'female', or 'non-binary'".
If participants responded to another question (e.g., 'age') with 'male', 'female', or 'non-binary', their rows would also be returned.
Being specific to the requirements makes your code less 'brittle' and more likely to catch weird cases in larger datasets.
:::
### Positive vs. negative filters
As with `select()`, it is often better to define what you *do* want (`==` and `%>%`) rather than what you *do not* want (`!=` and `!`+`%in%`). This is even more important with `filter()` as unexpected row values are much more common than unexpected column names.
For example, if you specify a negative filter `dat %>% filter(gender != "non-binary)"` because you wish to filter to return only participants who identify as 'male' or 'female', you have in fact only excluded participants who specified 'non-binary'. This code would also return participants with other response options such as 'non binary', 'Non-Binary', 'NB', etc.
However, there can be cases where you specifically do not want some rows or want to retain anything other than a given value. Protecting participant privacy is one of them.
Some rows of the 'dat_demographics_renamed' data frame contain "prolific ID" values, which could be linked back to the individual's Prolific.com account where they completed the study, and therefore risk exposing identifying information. If this data was shared with other researchers or publicly, this could violate the informed consent agreement and/or local data privacy laws. (Note that the prolific IDs included in this data frame are made up.) This is discussed in more detail in the chapter on [Data Privacy](privacy.qmd).
In this case, you would likely need to remove all rows referring to Prolific ID with a negative filter:
```{r}
dat_demographics_anonymized <- dat_demographics_renamed %>%
filter(question != "prolific ID")
dat_demographics_anonymized %>%
kable() %>%
kable_classic(full_width = FALSE)
```
Note that the original 'data_demographics_raw_messy.csv' data file we originally read in to R still contains the Prolific ID codes! Again, see the [Data Privacy](privacy.qmd) chapter for how you would deal with this more thoroughly in your own projects.
## Creating and modifying columns with `dplyr::mutate()`
`mutate()` is used to create new columns or to change the contents of existing ones.
For example, we can create the 'rt_seconds' column by dividing the 'rt' column, which is in milliseconds, by 1000.
```{r}
dat_demographics_mutated <- dat_demographics_renamed %>%
mutate(rt_seconds = rt/1000)
dat_demographics_mutated %>%
kable() %>%
kable_classic(full_width = FALSE)
```
We can also alter existing columns through self-assignment, i.e., `mutate()`-ing a column based on itself. For example, modifying 'rt' to be in seconds by overwriting the existing 'rt' column:
```{r}
dat_demographics_mutated <- dat_demographics_renamed %>%
mutate(rt = rt/1000)
dat_demographics_mutated %>%
kable() %>%
kable_classic(full_width = FALSE)
```
We can also change types, for example, converting the 'correct' column from numeric (0/1) to logical (TRUE/FALSE).
```{r}
dat_demographics_mutated <- dat_demographics_renamed %>%
mutate(correct = as.logical(correct))
dat_demographics_mutated %>%
kable() %>%
kable_classic(full_width = FALSE)
```
### Longer pipes
A single mutate call can contain multiple mutates. The code from the last chunk could be written more simply like this:
```{r}
dat_demographics_mutated <- dat_demographics_renamed %>%
# select only columns of interest
select(id, question, response, rt, correct) %>%
# filter only rows of interest
filter(question == "age") %>%
# rename columns to make them more intuitive
rename(age = response) %>%
# mutate variables
mutate(rt_ms = rt/1000,
correct = as.logical(correct)) %>%
# select needed columns again
select(id, age, rt_ms, correct)
dat_demographics_mutated %>%
kable() %>%
kable_classic(full_width = FALSE)
```
#### Exercise
Write code to:
- Take the 'dat_demographics_renamed'.
- Create a data frame named 'dat_demographics_mutated'.
- Retain only the gender question rows.
- Rename 'response' to 'gender'.
- Use the `stringr::str_to_lower()` function to convert 'gender' to lower case strings'.
- Retain only the id and gender columns.
- Print the data frame.
```{r}
#| include: false
```
::: {.callout-note collapse="true" title="Click to show answer"}
```{r}
library(stringr)
dat_demographics_mutated <- dat_demographics_renamed %>%
filter(question == 'gender') %>%
rename(gender = response) %>%
mutate(gender = stringr::str_to_lower(gender)) %>%
select(id, gender)
dat_demographics_mutated %>%
kable() %>%
kable_classic(full_width = FALSE)
```
:::
### More complex mutates
```{r}
#| include: false
# generate some messy demographics data with a larger N
# devtools::install_github("ianhussey/truffle")
library(truffle)
set.seed(42)
dat_demographics_messy <- data.frame(id = 1:500) %>%
truffle::truffle_demographics() %>%
truffle::dirt_demographics()
```
Most research reports (theses, articles, etc.) report the percentage of the sample that was male, female, etc. For example, the 'dat_demographics_messy' contains messy age and gender data for 500 participants.
We'll use `count()` to show every unique response across participants. We'll return to learn to use `count()` properly in the next chapter. For the moment, we're more interested in the output than the code.
```{r}
dat_demographics_messy %>%
# find unique values of 'gender' and the frequency of each
count(gender) %>%
# arrange these unique cases by descending frequency
arrange(desc(n)) %>%
# print table
kable() %>%
kable_classic(full_width = FALSE)
```
Note how the data currently contains many misspellings and use of different cases that cause things like 'female' to appear as a different category to 'Female', etc.
#### `mutate()` and working with character strings
Working with character strings using {stringr} and 'regular expressions' (regex) is a large separate topic in itself.
For the moment, know that you can:
- Use functions like `str_to_lower()` to convert all text to lower case (as in the above example).
- Use `str_remove_all()` to remove everything except letters (`str_remove_all(x, "[^A-Za-z]")`) or everything except numbers (`str_remove_all(x, "[^0-9\\.]")`).
Above, you used {stringr}'s `str_to_lower()` function to convert all letters to lower case. Let's apply that here to quickly improve things:
```{r}
dat_demographics_messy %>%
# convert to lower case
mutate(gender = str_to_lower(gender)) %>%
# find unique values of 'gender' and the frequency of each
count(gender) %>%
# arrange these unique cases by descending frequency
arrange(desc(n)) %>%
# print table
kable() %>%
kable_classic(full_width = FALSE)
```
Next we'll use `str_remove_all(x, "[^A-Za-z]")` to remove everything except letters:
```{r}
dat_demographics_messy %>%
mutate(
# convert to lower case
gender = str_to_lower(gender),
# remove all non-letters
gender = str_remove_all(gender, "[^A-Za-z]")
) %>%
# find unique values of 'gender' and the frequency of each
count(gender) %>%
# arrange these unique cases by descending frequency
arrange(desc(n)) %>%
# print table
kable() %>%
kable_classic(full_width = FALSE)
```
#### `mutate()` and conditionals rules
##### Using `dplyr::if_else()`
If we were being somewhat lazy, we could keep just the three most common self-reported responses to the gender, i.e. 'male', 'female' and 'non-binary', and change the other responses to `NA`.
Many forms of data cleaning or 'wrangling' involve applying conditional rules like this to clean up irregular responses, i.e., "if gender is not 'male', 'female' or 'non-binary', then set it to `NA`".
The simplest of these is `dplyr::if_else()`, which applies just one if-else. Note that there is a similar function in Base-R, `ifelse()`, but for technical reasons this function is much less safe and can cause weird errors. Try to always use `if_else()`.
```{r}
dat_gender_tidier <- dat_demographics_messy %>%
# convert to lower case and remove non-letters
mutate(gender = str_to_lower(gender),
gender = str_remove_all(gender, "[^A-Za-z]")) %>%
# tidy up cases
mutate(gender = if_else(condition = gender %in% c("female", "male", "nonbinary"), # the logical test applied: is 'gender' one of the following
true = gender, # what to do if the test is passed: keep the original gender response
false = NA_character_)) # what to do if the test is failed: set it to NA
dat_gender_tidier %>%
count(gender) %>%
arrange(desc(n)) %>%
kable() %>%
kable_classic(full_width = FALSE)
```
This somewhat quick and dirty solution gives us just four gender categories, 'male', 'female', 'non-binary' and missing (`NA`).
Of course, these responses were (mostly) not *really* missing, we removed them to make the responses simpler. This isn't very good practice, as we're throwing away data.
##### Using `dplyr::case_when()`
`case_when()` allows you to check against multiple logical tests. It works like this:
```{r}
#| eval: false
#| include: true
dat %>%
# create the variable variable_to_create based on:
mutate(variable_to_create = case_when(
# if logical_test_1 is met, set variable_to_create to value_1
logical_test_1 ~ value_1,
# if logical_test_2 is met, set variable_to_create to value_2
logical_test_2 ~ value_2,
# if logical_test_3 is met, set variable_to_create to value_3
logical_test_3 ~ value_3,
# 'if none of the above are met, set to the default value value_4'
TRUE ~ value_4
))
```
Note that the logical tests are checked in order, so you often have to consider the order in which they are listed and run.
We can apply it like this:
```{r}
dat_gender_tidier <- dat_demographics_messy %>%
# convert to lower case and remove non-letters
mutate(gender = str_to_lower(gender),
gender = str_remove_all(gender, "[^A-Za-z]")) %>%
# tidy up cases
mutate(gender = case_when(gender %in% c("female", "male", "nonbinary") ~ gender,
gender %in% c("femail", "fem ale", "femle", "femlae") ~ "female",
gender %in% c("malee", "mal", "ma le") ~ "male",
gender %in% c("non binary", "nonbinary") ~ "non-binary",
gender == "" ~ NA_character_, # if an empty character string, change to NA_character_
TRUE ~ gender)) # if none of the above apply, keep original value
dat_gender_tidier %>%
count(gender) %>%
arrange(desc(n)) %>%
kable() %>%
kable_classic(full_width = FALSE)
```
Note that the contents of the `case_when()` conditions have to be constructed iteratively. Use `distinct()` to inspect what cases are present, then write code to merge them as desired.
## Exercises
### `select()` vs. `filter()`
What is the difference between select and filter?
::: {.callout-note collapse="true" title="Click to show answer"}
`select()` is for columns, `filter()` is for rows.
:::
### {tidyselect} helper functions
Name some helper functions from {tidyselect}.
::: {.callout-note collapse="true" title="Click to show answer"}
- `starts_with()`
- `ends_with()`
- `contains()`
- `all_of()`
- `any_of()`
- `where()`
- `everything()`
:::
Explain what each of them do.
::: {.callout-note collapse="true" title="Click to show answer"}
- `starts_with()` : Selects columns ending in a character string
- `ends_with()` : Selects columns ending in a character string
- `contains()` : Selects columns containing a character string, not necessarily at the start or end
- `all_of()` : Selects all of the column names in a vector e.g., `all_of(c("a", "b"))`
- `any_of()` : Selects any of the column names in a vector that are present, silently dropping missing ones. This can be very useful when columns may or may not be present, but has the risk of creating silent errors.
- `where()` : Select columns that meet a criterion, e.g., `where(is.numeric)` selects only columns containing numeric data.
- `everything()` : Selects all columns.
:::
### Positive vs. negative selections and filters
Explain the difference between positive and negative selections in `select()` and `filter()` and how they are done in each.
::: {.callout-note collapse="true" title="Click to show answer"}
- In `select()`, positive selections mean *keep* these columns (e.g., `select(mpg, cyl)`), while negative selections mean *drop* these columns (e.g., `select(-mpg, -cyl)`).
- In `filter()`, conditions act positively by default (e.g., `filter(mpg > 20)` keeps rows with mpg \> 20). To "negate" a condition, you use `!` or inequality operators (e.g., `filter(!(mpg > 20))` or `filter(mpg <= 20)`).
:::
### Set operators
What set operators can be used with `select()` and how?
::: {.callout-note collapse="true" title="Click to show answer"}
`select()` supports set operations on columns using tidyselect helpers:
- `-` : negative selection (drop columns)\
- `c()` : combine multiple selections (union)\
- `!` : invert a selection rule\
- `&` : must meet *both* selection rules (intersection)\
- `|` : can meet *either* selection rule (union)
Example:
``` r
select(dat, contains("scale3") & contains("item1") & !ends_with("item1"))
```
:::
What set operators can be used with `filter()` and how?
::: {.callout-note collapse="true" title="Click to show answer"}
With filter(), set operations are expressed with logical operators applied to rows:
- `&` : keeps rows that satisfy all conditions (intersection)
- `|` : keeps rows that satisfy any condition (union)
- `!` : excludes rows that meet a condition (negation/difference)
Example:
```{r}
#| eval: false
#| include: true
dat %>%
filter(gender == "female" & age > 30 & !is.na(income))
```
:::
### Interactive exercises
Complete the interactive exercises for:
- [`select()`](https://errors.shinyapps.io/dplyr-learnr/#section-dplyrselect)
- [`relocate()`](https://errors.shinyapps.io/dplyr-learnr/#section-dplyrrelocate)
- [`filter()`](https://errors.shinyapps.io/dplyr-learnr/#section-dplyrfilter)
- [`mutate()`](https://errors.shinyapps.io/dplyr-learnr/#section-dplyrmutate)
- [`case_when()`](https://errors.shinyapps.io/dplyr-learnr/#section-dplyrcase_when)
### Practice data transformation using multiple functions
In your local version of this .qmd file:
- Use the 'dat_demographics_messy' to create a new data frame called 'dat_age_tidier'
- Use `disinct()` to understand what unique responses are present in the 'age' column.
- Use the other functions you learned in this chapter to clean the age column so that it contains only numbers.
- Convert the age variable to a numeric variable with `as.numeric()`.
- Print the unique responses present in the processed 'age' column as a table.
```{r}
#| include: false
```