Can you help me interpret this code? I am specifically confused about the three arguments inside if_else: runif(n()) < 0.1, NA_character_, and as.character(cut).
diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))
source: R for Data Science
CodePudding user response:
I'll assume you understand everything outside of the contents of the mutate call. As others have suggested in the comments, you can find documentation for any of these functions using the ?function syntax.
dplyr::mutate() is being used here to add a new column, "cut", to the diamonds dataframe, which will replace the old "cut" column:
cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))
ifelse()
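ifelse() is a function that takes three arguments: the first is a condition ("test"), the second is the value to return where the condition is true ("yes"), and the third is the value to return where the condition is false ("no"). Its main advantage over a standard 'if statement' is that it is vectorised: the condition is evaluated element by element. The code in the question uses dplyr::if_else(), a stricter variant that works the same way but names its arguments "condition", "true" and "false", and checks that the true and false values have compatible types. For example, using base ifelse():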
ifelse(test = c(1,2,3) < 3, yes = "less than three", no = "more than two")
# [1] "less than three" "less than three" "more than two"
runif()
stats::runif() is a function that generates random numbers from a uniform distribution, by default between 0 and 1. "runif" is short for "random uniform (number)". Its first argument, "n", is the number of values to generate. For example:
## set random seed for reproducible results
set.seed(1)
## generate 5 random numbers
runif(5)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
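To connect this to the code in the question, here is a minimal sketch of what comparing runif() output against 0.1 produces. The variable names u and is_missing and the sample size of 10000 are illustrative assumptions, not part of the original code.

```r
## set random seed for reproducible results
set.seed(1)
## 10000 uniform random numbers (sample size chosen only for illustration)
u <- runif(10000)
## the comparison is vectorised and returns one TRUE/FALSE per number
is_missing <- u < 0.1
class(is_missing)
# [1] "logical"
## the share of TRUE values should be close to 0.1; the exact value depends on the seed
mean(is_missing)
```

Because each number is equally likely to fall anywhere between 0 and 1, roughly one in ten of them ends up below 0.1.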
n()
dplyr::n() is a function that can only be used within calls to dplyr verbs such as mutate(), summarise() and filter(). It returns the number of observations within the current group. Assuming that your data is ungrouped, this will be equivalent to nrow(diamonds).
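As a small sketch of the grouped versus ungrouped behaviour, the example below uses the built-in mtcars data rather than diamonds. The column name rows and the object names ungrouped and grouped are made up for illustration.

```r
library(dplyr)

## ungrouped: n() is the total number of rows, the same as nrow(mtcars)
ungrouped <- mtcars %>%
  mutate(rows = n())

## grouped: n() is the size of the group that each row belongs to
grouped <- mtcars %>%
  group_by(cyl) %>%
  mutate(rows = n()) %>%
  ungroup()

unique(ungrouped$rows)
# [1] 32
```

In the question's code the data is ungrouped, so runif(n()) generates exactly one random number per row of diamonds.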
NA_character_
It's not obvious, but there are different types of NA value within R. NA values are normally coerced to the correct type, but if_else() is strict about types: in older versions of dplyr, a plain NA (which is logical) could not be combined with a character vector, so the type of NA had to be given explicitly. NA_character_ just means a missing character value. Other, similar reserved names in R include NA_integer_ and NA_real_.
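The different NA types can be inspected directly in base R. This is only a quick illustration; it does not involve if_else() itself.

```r
## a plain NA is a logical value
typeof(NA)
# [1] "logical"
typeof(NA_character_)
# [1] "character"
typeof(NA_integer_)
# [1] "integer"
typeof(NA_real_)
# [1] "double"

## base R silently coerces a plain NA when it is combined with strings
typeof(c(NA, "Ideal"))
# [1] "character"
```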
as.character(cut)
The "cut" data within the diamonds data frame is an ordered factor with five levels. The values of ordered factors are actually integers, each of which pertains to a string stored within the levels attribute of the factor. as.character is a generic function, which means it does slightly different things depending on its input. When the input of as.character is a factor, it returns a character vector containing, for each element, the label of that element's level. This sounds complicated, but in practice it's very intuitive:
my.factor <- factor(c("level 1", "level 2", "level 3", "level 2"))
## implicitly calling `print.factor`
my.factor
# [1] level 1 level 2 level 3 level 2
# Levels: level 1 level 2 level 3
## peeking under the hood
unclass(my.factor)
# [1] 1 2 3 2
# attr(,"levels")
# [1] "level 1" "level 2" "level 3"
## `as.character` returns the levels pertaining to each element
as.character(my.factor)
# [1] "level 1" "level 2" "level 3" "level 2"
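One plausible reason for converting to character first is that a factor's integer codes can leak out when it is combined with other values. The sketch below shows this with base ifelse() on a small hypothetical ordered factor, cut_demo; it is an illustration of the general behaviour, not a statement about why the book's authors wrote the code this way.

```r
## a small ordered factor standing in for diamonds$cut (hypothetical data)
cut_demo <- factor(c("Good", "Ideal", "Fair"),
                   levels = c("Fair", "Good", "Ideal"),
                   ordered = TRUE)

## the underlying integer codes
as.integer(cut_demo)
# [1] 2 3 1

## base ifelse() drops the factor attributes and returns the codes, not the labels
codes <- ifelse(c(TRUE, FALSE, FALSE), NA, cut_demo)
codes
# [1] NA  3  1

## converting to character first keeps the labels
labels <- ifelse(c(TRUE, FALSE, FALSE), NA_character_, as.character(cut_demo))
labels
# [1] NA      "Ideal" "Fair"
```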
Putting it all together
The call to if_else achieves the following:
Generate a vector of random numbers between zero and one whose length is equivalent to the number of rows in the 'diamonds' dataframe. For each of these random numbers, do the following: If the random number is less than 0.1, return a missing character value (NA_character_). Otherwise, return the level-name of the corresponding element of diamonds$cut.
The call to mutate simply overwrites the previous diamonds$cut (used in the calculation) with this new character vector.
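To check this description against the data, here is a minimal sketch that runs the mutate step on its own and inspects the result. The name diamonds2 and the call to set.seed() are additions for illustration and reproducibility; they are not in the original code.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

## seed added only so that the result is reproducible
set.seed(1)
diamonds2 <- diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut)))

## the column is now a character vector rather than an ordered factor
class(diamonds2$cut)
# [1] "character"

## roughly 10% of the values have been replaced by NA
mean(is.na(diamonds2$cut))
```

Passing diamonds2 on to ggplot() + geom_bar(mapping = aes(x = cut)) then draws one bar per cut, plus a bar for the NA values.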