m <- 10
mtcars %>% dplyr::mutate(disp = .data$disp * .env$m)
is equivalent to
m <- 10
mtcars %>% dplyr::mutate(disp = cur_data()$disp * .env$m)
Can you give an example where cur_data() and .data will yield different results?
I am told that cur_data() and .data are not interchangeable in all contexts.
CodePudding user response:
Here is one example, taken from elsewhere, where the two behave differently: cur_data() returns a result while .data throws an error.
library(dplyr)
library(rstatix)
data %>%
  summarise(across(where(is.numeric),
                   ~ cur_data() %>%
                     levene_test(reformulate("Treatment", response = cur_column())))) %>%
  unclass %>%
  bind_rows(.id = 'flux')
# A tibble: 3 × 5
flux df1 df2 statistic p
<chr> <int> <int> <dbl> <dbl>
1 flux1 1 8 0.410 0.540
2 flux2 1 8 2.85 0.130
3 flux3 1 8 1.11 0.323
data %>%
  summarise(across(where(is.numeric),
                   ~ .data %>%
                     levene_test(reformulate("Treatment", response = cur_column())))) %>%
  unclass %>%
  bind_rows(.id = 'flux')
Error: Problem with `summarise()` input `..1`.
ℹ `..1 = across(...)`.
✖ cannot coerce class ‘"rlang_data_pronoun"’ to a data.frame
Run `rlang::last_error()` to see where the error occurred.
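The error comes from levene_test() trying to coerce its input to a data.frame. As a minimal sketch of why that fails for one object and not the other, the check below asks both objects whether they are data frames from inside summarise(). It uses mtcars rather than the data above, and assumes a dplyr version in which cur_data() is still available (it is deprecated from dplyr 1.1.0 in favour of pick(), so newer versions may emit a warning). The name checks is only illustrative.

```r
library(dplyr)

# Ask both objects, from inside a dplyr verb, whether they are data frames.
checks <- mtcars %>%
  summarise(
    pronoun_is_df  = is.data.frame(.data),       # .data is an rlang pronoun
    cur_data_is_df = is.data.frame(cur_data())   # cur_data() returns a tibble
  )

checks
```

Any function that expects a data frame as input, such as levene_test() in the pipeline above, can therefore be fed cur_data() but not .data.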
data
data <- data.frame(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("S1 ", "S2 ", "S3 "), class = "factor"),
plot = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L),
.Label = c(" Tree 1 ", " Tree 2 ", " Tree 3 "), class = "factor"),
Treatment = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("T1", "T2"), class = "factor"),
flux1 = c(11.52188065, 8.43156699, 4.495312274, -1.866676811, 3.861102035, -0.814742373, 6.51039536, 4.767950345, 10.36544542, 1.065963875),
flux2 = c(0.142259208, 0.04060245, 0.807631744, 0.060127596, -0.157762562, 0.062464942, 0.043147603, 0.495001652, 0.34363348, 0.134183704),
flux3 = c(0.147506197, 1.131009714, 0.038860728, 0.0176834, 0.053191593, 0.047591306, 0.00573377, -0.034926075, 0.123379247, 0.018882469))
CodePudding user response:
Within group_by, .data still includes all columns, but cur_data() excludes the group_by column(s). For example, below cur_data()[["cyl"]] is NULL because cyl is a grouping column, so x does not appear in the result, whereas y does.
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(x = cur_data()[["cyl"]], y = .data[["cyl"]]) %>%
  ungroup %>%
  names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb" "y"
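Another way to see the same thing is to compare the column names of cur_data() with those of cur_data_all(), which keeps the grouping columns. The sketch below does this per group; the names res and dropped are only illustrative, and it assumes both functions are still available in the installed dplyr version.

```r
library(dplyr)

# For each group, list the columns that cur_data_all() has but cur_data() lacks.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(dropped = setdiff(names(cur_data_all()), names(cur_data())))

res
```

In every group the only column missing from cur_data() should be the grouping column cyl.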
CodePudding user response:
To add to the existing answers:
In the vignette linked in the comments we find the following quote:
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it.
It is important to understand that .data is a special construct that is only there to help us access variables. It is neither a data.frame nor a function. Apart from [[ and $, most other functions won't work with .data. Even [ won't work. Say we want to access more than one variable with .data. If .data were a data.frame the following would work, but it doesn't:
library(dplyr)
mtcars %>%
  transmute(new = list(.data[c("disp", "hp")]))
#> Error: Problem with `mutate()` column `new`.
#> i `new = list(.data[c("disp", "hp")])`.
#> x `[` is not supported by .data pronoun, use `[[` or $ instead.
cur_data(), on the other hand, is a function that returns the current data, without grouping variables, as a tibble (even if the underlying data is just a data.frame).
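Because cur_data() returns a tibble, [ works on it, so the multi-column access that fails with .data can be written with cur_data() instead. The following is only a sketch: the column name total is made up for illustration, and it assumes cur_data() is still available in the installed dplyr version.

```r
library(dplyr)

# Select two columns from the current data with `[` and sum them row-wise.
res <- mtcars %>%
  mutate(total = rowSums(cur_data()[c("disp", "hp")]))

head(res[c("disp", "hp", "total")])
```

The same call with .data[c("disp", "hp")] would stop with the "`[` is not supported by .data pronoun" error shown above.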
In terms of speed, cur_data() adds only a very small overhead compared to .data or to accessing the variable without a prefix. Let's take a medium-sized dataset as an example:
library(dplyr)
library(nycflights13)
bench::mark(iterations = 5000L,
            "none" = mutate(flights, new = arr_time),
            ".data" = mutate(flights, new = .data$arr_time),
            "cur_data()" = mutate(flights, new = cur_data()$arr_time))
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 none 1.56ms 1.69ms 578. 132.9MB 15.1
#> 2 .data 1.53ms 1.72ms 551. 24.7KB 14.4
#> 3 cur_data() 1.6ms 1.77ms 535. 33.5KB 14.7
Created on 2021-12-23 by the reprex package (v0.3.0)
In terms of interchangeability I see the following differences:
.data can't be used as a data.frame; this means it won't return the underlying data, unlike cur_data().
.data can only be used to access one variable at a time, using [[ or $, whereas cur_data() returns a tibble and will work with all functions that apply to tibbles and data.frames.
In terms of speed there is no big overhead in using cur_data(), at least not for medium-sized data sets. This should be verified with bigger data with more columns.
.data can be used to access grouping variables, which is not possible with cur_data(). However, cur_data_all() is a similar function that also returns the current data, but including the grouping variables. This latter function should be usable wherever .data is; at least I cannot come up with a case where .data works and cur_data_all() does not.
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  transmute(x = cur_data()[["cyl"]],
            y = .data[["cyl"]],
            z = cur_data_all()[["cyl"]])
#> # A tibble: 32 x 3
#> # Groups: cyl [3]
#> cyl y z
#> <dbl> <dbl> <dbl>
#> 1 6 6 6
#> 2 6 6 6
#> 3 4 4 4
#> 4 6 6 6
#> 5 8 8 8
#> 6 6 6 6
#> 7 8 8 8
#> 8 4 4 4
#> 9 4 4 4
#> 10 6 6 6
#> # … with 22 more rows