Here is my session in R:
bash$ R
R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(dplyr)
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> x <- select(starwars, name)
> y <- select(starwars, 'name')
> assertthat::are_equal(x, y)
[1] TRUE
>
Could you explain, please, why it's possible to address dplyr and name as if they were variable names, i.e. without quotes, in the require and select functions?
CodePudding user response:
This is known as non-standard evaluation (NSE) and is a large part of programming (and meta-programming) with R. There are many Q&A's on Stack Overflow about specific aspects of NSE, but I can't find any that cover the concept broadly enough to answer your question, so will venture a brief explanation here.
I should preface this by saying that the subject is treated more authoritatively in other places, such as the Programming with dplyr vignette, and the Non-Standard Evaluation chapter of "Advanced R" by Hadley Wickham.
The concept of NSE relies on the fact that function arguments in R are evaluated lazily. That is, the actual object in memory that a name refers to is only retrieved when R needs it. For example, let's define a function that takes an argument which it doesn't actually use:
print_hello <- function(unused_variable) {
print('hello')
}
Now, if we do:
print_hello(bananas)
#> [1] "hello"
The function runs without a problem. The object bananas doesn't exist, but R never had to use it, so it didn't bother to check whether it existed. The symbolic name bananas only ever existed within a promise passed to the code within the function. A promise in this context is the unevaluated name bananas plus the calling environment, in which the object with that name can be retrieved if needed.
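To make the timing of evaluation visible, here is a minimal sketch in which the argument has a side effect. The function name lazy_demo is made up for illustration; force() is base R and does nothing more than evaluate its argument.

```r
# Hypothetical helper: shows when its argument is actually evaluated.
lazy_demo <- function(x) {
  cat("entering function\n")
  force(x)  # the promise for x is evaluated here, and not before
  cat("leaving function\n")
  invisible(NULL)
}

# The braces form a single expression whose side effect reveals the timing.
lazy_demo({
  cat("evaluating the argument\n")
  1
})
#> entering function
#> evaluating the argument
#> leaving function
```

The argument's message appears between the other two, which shows that the expression was not evaluated at the point of the call but only when the function body asked for it.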
Of course, if the name doesn't exist in the search path when R comes to use it, we will get an error at that point:
print_hello2 <- function(unused_variable) {
print(unused_variable)
print('hello')
}
print_hello2(bananas)
#> Error in print(unused_variable) : object 'bananas' not found
Within the body of this function, R needs to print the object bananas, but when it looks it up, the object doesn't exist, so R throws an error.
The idea of non-standard evaluation is that we can essentially hijack the promise object before it is evaluated, and perform useful operations on it. For example, suppose we want to take the variable name and put it in a string, whether the variable exists or not. We can capture the name without evaluating it using substitute, and convert it into a string using deparse, which allows us to write:
compare_to_apples <- function(fruit) {
paste('I prefer', deparse(substitute(fruit)), 'to apples')
}
compare_to_apples(bananas)
#> [1] "I prefer bananas to apples"
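It may help to look at what substitute returns before deparse turns it into a string. This is a small sketch; capture_expr is a made-up name, and neither apples nor bananas needs to exist.

```r
# Hypothetical helper: return the unevaluated argument as an R object.
capture_expr <- function(x) substitute(x)

sym <- capture_expr(bananas)
class(sym)
#> [1] "name"

expr <- capture_expr(apples + bananas * 2)
class(expr)
#> [1] "call"

# deparse() converts either kind of object back into source text.
deparse(expr)
#> [1] "apples + bananas * 2"
```

A bare name is captured as a symbol and a compound expression as a call, so the function receives the code itself rather than its value.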
Although this example isn't very useful, we can write functions that make the end-user's life a bit easier by removing the need for them to quote column names within a function. This makes for easier-to-write and easier-to-read code. For example, we could write a function like this:
select_one <- function(data, column) {
data[deparse(substitute(column))]
}
mtcars[1:10,] |> select_one(am)
#> am
#> Mazda RX4 1
#> Mazda RX4 Wag 1
#> Datsun 710 1
#> Hornet 4 Drive 0
#> Hornet Sportabout 0
#> Valiant 0
#> Duster 360 0
#> Merc 240D 0
#> Merc 230 0
#> Merc 280 0
The main place this type of syntax is used in R is in functions that take unquoted data frame column names. It is especially familiar to users of the tidyverse, but it is also used a lot in base R (for example in $, subset, with and within). Base R also uses it in calls to library and require for package names.
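As a quick illustration of the base R functions just mentioned, the following sketch uses the built-in mtcars data set. The variable names avg_mpg and four_cyl are made up for the example.

```r
# with() evaluates an expression using the data frame's columns as variables.
avg_mpg <- with(mtcars, mean(mpg))

# subset() takes unquoted column names in both its filter and select arguments.
four_cyl <- subset(mtcars, cyl == 4, select = c(mpg, cyl))
names(four_cyl)
#> [1] "mpg" "cyl"

# `$` takes an unquoted name; `[[` is the standard-evaluation equivalent.
identical(mtcars$am, mtcars[["am"]])
#> [1] TRUE
```

None of mpg, cyl or am exists as a variable in the global environment here; each function looks the name up inside the data frame instead.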
The main disadvantage of NSE is ambiguity. We don't want R to be confused between variables in our global environment and in our data frame, and we sometimes want to store column names in a character vector and pass that as a method of selecting column names. For example, as an end-user, we might expect that if we did:
am <- c('gear', 'hp', 'mpg')
select_one(mtcars[1:10, ], am)
Then we would get three columns selected, but instead we get the same result as before, because substitute captures the name am itself rather than the character vector stored in it.
Some of the complex underlying machinery of the tidyverse exists to prevent and reduce these ambiguities by ensuring that names are evaluated in the most appropriate context.
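As a rough sketch of how this looks from the user's side, dplyr's select lets you say explicitly that a name refers to a character vector in your environment by wrapping it in all_of. The example assumes dplyr 1.0.0 or later is installed, and is guarded so that it does nothing otherwise; by_name and by_vector are made-up variable names.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  am <- c("gear", "hp", "mpg")

  # A bare name that matches a column is treated as that column.
  by_name <- dplyr::select(mtcars, am)
  print(names(by_name))

  # all_of() marks `am` as an external character vector of column names.
  by_vector <- dplyr::select(mtcars, dplyr::all_of(am))
  print(names(by_vector))
}
```

With dplyr available, the first call should select the single column am and the second the three columns named in the vector.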
Base R's library and require functions take a different approach, employing a specific on/off switch for non-standard evaluation by using another parameter called character.only. We could add a similar mechanism to our own function:
select_one <- function(data, column, NSE = TRUE) {
if(NSE) data[deparse(substitute(column))] else data[column]
}
am <- c('gear', 'hp', 'mpg')
select_one(mtcars[1:10,], am, NSE = TRUE)
#> am
#> Mazda RX4 1
#> Mazda RX4 Wag 1
#> Datsun 710 1
#> Hornet 4 Drive 0
#> Hornet Sportabout 0
#> Valiant 0
#> Duster 360 0
#> Merc 240D 0
#> Merc 230 0
#> Merc 280 0
select_one(mtcars[1:10,], am, NSE = FALSE)
#> gear hp mpg
#> Mazda RX4 4 110 21.0
#> Mazda RX4 Wag 4 110 21.0
#> Datsun 710 4 93 22.8
#> Hornet 4 Drive 3 110 21.4
#> Hornet Sportabout 3 175 18.7
#> Valiant 3 105 18.1
#> Duster 360 3 245 14.3
#> Merc 240D 4 62 24.4
#> Merc 230 4 95 22.8
#> Merc 280 4 123 19.2
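For comparison, this is roughly how the character.only switch behaves in library itself. The variable name pkg_to_load is made up; the stats package is used only because it ships with every R installation.

```r
pkg_to_load <- "stats"

# With character.only = TRUE the argument is evaluated normally,
# so the package named by the string in the variable is attached.
library(pkg_to_load, character.only = TRUE)

# Without it, library() uses the literal name typed in the call and
# looks for a package called "pkg_to_load", which cannot exist.
res <- tryCatch(
  library(pkg_to_load),
  error = function(e) conditionMessage(e)
)
res
```

The second call fails with a message saying that no such package was found, which is the same ambiguity as in select_one: the function sees the name that was typed, not the value it holds.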
Another disadvantage of functions that employ NSE is that it makes them more difficult to work with inside other functions. For example, we might expect the following function to return two columns of our data frame:
select_two <- function(data, column1, column2) {
cbind(select_one(data, column1), select_one(data, column2))
}
But it doesn't:
select_two(mtcars[1:10,], am, cyl)
#> Error in `[.data.frame`(data, deparse(substitute(column))) :
#> undefined columns selected
This is because the NSE employed in select_one causes the code to look for columns inside mtcars called column1 and column2, which don't exist. To use select_one inside another function we need to take account of its NSE, for example by carefully building and evaluating any calls to it:
select_two <- function(data, column1, column2) {
call1 <- as.call(list(select_one,
data = quote(data),
column = substitute(column1)))
call2 <- as.call(list(select_one,
data = quote(data),
column = substitute(column2)))
cbind(eval(call1), eval(call2))
}
select_two(mtcars[1:10,], am, cyl)
#> am cyl
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> Valiant 0 6
#> Duster 360 0 8
#> Merc 240D 0 4
#> Merc 230 0 4
#> Merc 280 0 6
So although NSE makes the end-user experience a bit nicer, it makes programming with such functions more difficult.
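A somewhat shorter way to write the same wrapper, offered as a sketch rather than the canonical solution, is to let do.call build the calls. The name select_two_alt is made up, and select_one is repeated so the block runs on its own.

```r
select_one <- function(data, column) {
  data[deparse(substitute(column))]
}

# Hypothetical variant of select_two: capture each column name as a symbol
# and let do.call() place it in the call as if the user had typed it.
select_two_alt <- function(data, column1, column2) {
  cbind(
    do.call(select_one, list(data, substitute(column1))),
    do.call(select_one, list(data, substitute(column2)))
  )
}

names(select_two_alt(mtcars[1:10, ], am, cyl))
#> [1] "am"  "cyl"
```

This works because do.call passes the captured symbol through as the unevaluated argument expression, so select_one sees am and cyl rather than column1 and column2.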
Created on 2023-01-21 with reprex v2.0.2