10 Data types

10.1 Data types

R's basic data types: integers, numerics (floats), factors, strings (characters), booleans (logicals), and the missing-value marker NA.

10.2 Working with numeric/floats

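Floating point math is weird: most decimal fractions cannot be stored exactly in binary, so tiny representation errors creep into calculations.

near() helps when comparing such values.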

10.2.1 `near()`

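Testing equivalence with floats is weird. sqrt(2)^2 prints as 2, but the stored value is not exactly 2, so sqrt(2)^2 == 2 returns FALSE, whereas near(sqrt(2)^2, 2) compares with a small tolerance and returns TRUE.

2

[1] 2

sqrt(2)^2

[1] 2

2 == sqrt(2)^2

[1] FALSE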
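A minimal sketch of the comparison described above, assuming {dplyr} is installed. The all.equal() line is a base R alternative added here for illustration and is not part of the original notes.

```r
library(dplyr)

x <- sqrt(2)^2

# Exact comparison fails because x is stored as a value very slightly above 2
x == 2                    # FALSE

# near() compares within a small tolerance
near(x, 2)                # TRUE

# Base R alternative: all.equal() also allows a small tolerance
isTRUE(all.equal(x, 2))   # TRUE
```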

10.2.2 `between()`

TODO

10.2.3 Rounding: `round()` probably doesn’t do what you think

It is extremely common to round statistical results before including them in text and tables.

However, did you know that R doesn’t use the rounding method most of us are taught in school, where .5 is always rounded up to the next integer? Instead it uses “banker’s rounding” (round half to even), which reduces bias when you round a very large number of values, but is worse for reporting the results of specific analyses.

This is easier to show than explain. The round() function rounds each of the numbers passed to it. What do you expect the output to be?

round(c(0.5, 
        1.5, 
        2.5, 
        3.5, 
        4.5, 
        5.5), digits = 0)

[1] 0 2 2 4 4 6

Why is this? Because R’s round() function uses “banker’s rounding”, which rounds a trailing 5 to whichever neighbor is even: 0.5 and 2.5 round down to 0 and 2, while 1.5 and 3.5 round up to 2 and 4. This is a good thing in many contexts like accounting, but it’s usually not what we want or expect when rounding specific statistical results for inclusion in a report or manuscript.

In most of your R scripts, you should instead use the {roundwork} package’s round_up(), written by Lukas Jung, which produces the round-.5-upwards behavior most of us expect.

library(roundwork) 

roundwork::round_up(c(0.5, 
                      1.5, 
                      2.5, 
                      3.5, 
                      4.5, 
                      5.5))

[1] 1 2 3 4 5 6
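To make the difference between the two rules concrete, here is a rough base R sketch of round-half-up. The helper name round_half_up_sketch() is hypothetical and this is not how round_up() is implemented; it ignores floating point edge cases and always rounds halves towards positive infinity, so it is only meant to illustrate the idea.

```r
# Hypothetical helper for illustration only (not the {roundwork} implementation):
# shift by half a unit at the target precision, then take the floor
round_half_up_sketch <- function(x, digits = 0) {
  scale <- 10^digits
  floor(x * scale + 0.5) / scale
}

x <- c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5)

round(x)                 # 0 2 2 4 4 6  (round half to even)
round_half_up_sketch(x)  # 1 2 3 4 5 6  (round half up)
```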

Rounding functions like round_up() will typically be used inside a pipe workflow, for example to round several columns with mutate():

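library(dplyr)
library(knitr)
library(kableExtra)

# dat_regression_betas holds made-up estimates, CI bounds, and p values
# round the estimate and its confidence interval to two decimal places
dat_regression_betas_rounded <- dat_regression_betas %>%
  mutate(beta_estimate = round_up(beta_estimate, 2),
         beta_ci_lower = round_up(beta_ci_lower, 2),
         beta_ci_upper = round_up(beta_ci_upper, 2)) 

# print as a formatted table
dat_regression_betas_rounded %>%
  kable() %>%
  kable_classic(full_width = FALSE)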

beta_estimate	beta_ci_lower	beta_ci_upper	p
0.37	0.17	0.57	0.0009180
0.30	0.10	0.50	0.0000014
0.12	-0.08	0.32	0.0082030
0.29	0.09	0.49	0.0014797
0.18	-0.02	0.38	0.0043528

10.2.4 ‘Rounding’ of p-values using APA style

The one thing that psychologists don’t usually round using the round-half-up rule is p-values. These are instead reported following the APA style guide’s conventions, so that p-values smaller than .001 are reported as “< .001” rather than as a rounded number.

# install.packages("devtools"); devtools::install_github("ianhussey/truffle")
library(truffle)

dat_regression_betas_rounded <- dat_regression_betas %>%
  mutate(beta_estimate = round_up(beta_estimate, 2),
         beta_ci_lower = round_up(beta_ci_lower, 2),
         beta_ci_upper = round_up(beta_ci_upper, 2),
         p = round_p_value(p)) 

dat_regression_betas_rounded %>%
  kable(align = 'r') %>%
  kable_classic(full_width = FALSE)

beta_estimate	beta_ci_lower	beta_ci_upper	p
0.37	0.17	0.57	< .001
0.30	0.10	0.50	< .001
0.12	-0.08	0.32	.008
0.29	0.09	0.49	.001
0.18	-0.02	0.38	.004
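For readers who want to see the logic behind the formatted column, here is a rough base R sketch of the convention. The helper name format_p_apa_sketch() is hypothetical and this is not the {truffle} implementation of round_p_value(); it simply reports values below .001 as "< .001" and otherwise prints three decimals without a leading zero, using formatC()'s default rounding.

```r
# Hypothetical helper for illustration only (not the {truffle} implementation)
format_p_apa_sketch <- function(p) {
  # format to three decimals, then drop the leading zero
  formatted <- sub("^0", "", formatC(p, format = "f", digits = 3))
  # report very small p-values as an inequality
  ifelse(p < 0.001, "< .001", formatted)
}

p_formatted <- format_p_apa_sketch(c(0.0000014, 0.0082030, 0.0043528))
p_formatted
# "< .001" ".008" ".004"
```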

10.3 Working with strings

10.3.1 Case conversion

str_to_lower(), str_to_upper(), str_to_sentence(), str_to_title()
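A short sketch of the four case conversion helpers, assuming {stringr} is installed. The example string is made up for illustration.

```r
library(stringr)

x <- "the quick BROWN fox"

str_to_lower(x)     # "the quick brown fox"
str_to_upper(x)     # "THE QUICK BROWN FOX"
str_to_sentence(x)  # "The quick brown fox"
str_to_title(x)     # "The Quick Brown Fox"
```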

10.3.2 Substring searches

str_detect() (plus ignoring case), str_starts(), str_ends()

str_locate() str_locate_all()
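A brief sketch of these search functions, assuming {stringr} is installed. The vector of fruit names is made up for illustration, and case-insensitive matching is done here with regex(ignore_case = TRUE), which is one common way to do it.

```r
library(stringr)

fruit <- c("Apple", "banana", "pineapple")

# Does each string contain the pattern? Matching is case sensitive by default
str_detect(fruit, "apple")                             # FALSE FALSE TRUE
str_detect(fruit, regex("apple", ignore_case = TRUE))  # TRUE FALSE TRUE

# Does each string start or end with the pattern?
str_starts(fruit, "b")  # FALSE TRUE FALSE
str_ends(fruit, "e")    # TRUE FALSE TRUE

# Where does the pattern occur? First match vs. all matches
first_match <- str_locate("banana", "an")       # start 2, end 3
all_matches <- str_locate_all("banana", "an")   # list with one matrix: rows (2, 3) and (4, 5)
```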

10.3.3 Removal

str_remove() str_remove_all()

str_squish(): removes leading and trailing whitespace and collapses repeated internal whitespace into single spaces
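A minimal sketch of the removal helpers, assuming {stringr} is installed. The example strings are made up for illustration.

```r
library(stringr)

# Remove the first match vs. every match
str_remove("a-b-c", "-")      # "ab-c"
str_remove_all("a-b-c", "-")  # "abc"

# Trim the ends and collapse repeated internal whitespace
str_squish("  too   many   spaces  ")  # "too many spaces"
```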

10.3.4 Replacement

str_replace(), str_replace_all()
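The replacement functions follow the same first-match vs. all-matches pattern. A minimal sketch, assuming {stringr} is installed:

```r
library(stringr)

# Replace only the first match
str_replace("a-b-c", "-", "_")      # "a_b-c"

# Replace every match
str_replace_all("a-b-c", "-", "_")  # "a_b_c"
```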

10.3.5 Separation

str_split(), relationship with separate()
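One way to see the relationship is that str_split() works on a character vector and returns a list, while tidyr's separate() splits a data frame column into new columns. A small sketch, assuming {stringr} and {tidyr} are installed; the ids and the column names condition and block are made up for illustration.

```r
library(stringr)
library(tidyr)

ids <- c("control_1", "treatment_2")

# str_split() returns a list with one character vector per input string
pieces <- str_split(ids, "_")
pieces[[1]]  # "control" "1"

# separate() does the same kind of split on a data frame column,
# returning one new column per piece
dat <- data.frame(id = ids)
dat_separated <- separate(dat, col = id, into = c("condition", "block"), sep = "_")
dat_separated$condition  # "control" "treatment"
```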

10.3.6 Extraction

word()
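A minimal sketch of word(), assuming {stringr} is installed. It extracts whole words by position, with negative positions counting from the end.

```r
library(stringr)

x <- "the quick brown fox"

word(x, 2)      # "quick"
word(x, 2, 3)   # "quick brown"
word(x, -1)     # "fox"
```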

10.3.7 Regex

TODO

10.4 Working with factors

forcats with plot examples, regression examples

converting between numeric and factor via character (as.numeric() on a factor returns the internal integer codes, not the original values)
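One reading of this note is the well-known gotcha when converting a factor of number-like labels back to numeric. A base R sketch with made-up values:

```r
f <- factor(c("10", "20", "10"))

# as.numeric() on a factor returns the internal level codes
as.numeric(f)                # 1 2 1

# Going via character recovers the original values
as.numeric(as.character(f))  # 10 20 10
```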

fct_rev

fct_reorder() (contrast with arrange()), fct_relevel()

fct_drop

fct_lump
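A short sketch of the level-manipulation functions listed in this section, assuming {forcats} is installed. The factors are made up for illustration; fct_reorder() is included on the assumption that it is the reordering function the notes have in mind, and fct_lump_n() is used as the current variant of fct_lump().

```r
library(forcats)

f <- factor(c("low", "high", "medium", "low"),
            levels = c("low", "medium", "high", "unused"))

# Reverse the level order
levels(fct_rev(f))              # "unused" "high" "medium" "low"

# Move one level to the front
levels(fct_relevel(f, "high"))  # "high" "low" "medium" "unused"

# Drop levels that do not occur in the data
levels(fct_drop(f))             # "low" "medium" "high"

# Reorder levels by another variable (by its median, the default)
group <- factor(c("a", "b", "c"))
value <- c(3, 1, 2)
levels(fct_reorder(group, value))  # "b" "c" "a"

# Lump all but the most frequent level into "Other"
g <- factor(c("a", "a", "a", "b", "c"))
levels(fct_lump_n(g, n = 1))    # "a" "Other"
```

Unlike arrange(), which sorts the rows of a data frame, these functions change only the order of the factor levels, which is what plots and model contrasts use.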
