difference between .data and cur

m <- 10
mtcars %>% dplyr::mutate(disp = .data$disp * .env$m)

is equivalent to

m <- 10
mtcars %>% dplyr::mutate(disp = cur_data()$disp * .env$m)

Can you give an example where cur_data() and .data will yield different results?

I am told that cur_data() and .data are not interchangeable in all contexts.

CodePudding user response：

Here is one example, taken from another question, where the two behave differently: the cur_data() version returns a result, while the .data version throws an error.

library(dplyr)
library(rstatix)
# Works: cur_data() returns a tibble that levene_test() can use
data %>%
  summarise(across(where(is.numeric),
                   ~ cur_data() %>%
                     levene_test(reformulate("Treatment", response = cur_column())))) %>%
  unclass() %>%
  bind_rows(.id = 'flux')
# A tibble: 3 × 5
  flux    df1   df2 statistic     p
  <chr> <int> <int>     <dbl> <dbl>
1 flux1     1     8     0.410 0.540
2 flux2     1     8     2.85  0.130
3 flux3     1     8     1.11  0.323
# Fails: .data is a pronoun, not a data frame
data %>%
  summarise(across(where(is.numeric),
                   ~ .data %>%
                     levene_test(reformulate("Treatment", response = cur_column())))) %>%
  unclass() %>%
  bind_rows(.id = 'flux')

Error: Problem with summarise() input ..1.
ℹ ..1 = across(...).
✖ cannot coerce class ‘"rlang_data_pronoun"’ to a data.frame
Run rlang::last_error() to see where the error occurred.

Data used in the example above:

data <- data.frame(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), 
                                    .Label = c("S1 ", "S2 ", "S3 "), class = "factor"), 
                   plot = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), 
                                    .Label = c(" Tree 1 ", " Tree 2 ", " Tree 3 "), class = "factor"), 
                   Treatment = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("T1", "T2"), class = "factor"), 
                   flux1 = c(11.52188065, 8.43156699, 4.495312274, -1.866676811, 3.861102035, -0.814742373, 6.51039536, 4.767950345, 10.36544542, 1.065963875), 
                   flux2 = c(0.142259208, 0.04060245, 0.807631744, 0.060127596, -0.157762562, 0.062464942, 0.043147603, 0.495001652, 0.34363348, 0.134183704), 
                   flux3 = c(0.147506197, 1.131009714, 0.038860728, 0.0176834, 0.053191593, 0.047591306, 0.00573377, -0.034926075, 0.123379247, 0.018882469))

CodePudding user response：

On grouped data, .data still gives access to all columns, but cur_data() excludes the grouping column(s). In the example below, cur_data()[["cyl"]] is NULL because cyl is a grouping column, so x does not appear in the result, whereas y does.

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate(x = cur_data()[["cyl"]], y = .data[["cyl"]]) %>%
  ungroup %>%
  names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb" "y"

CodePudding user response：

To add to the existing answers:

In the vignette linked in the comments we find the following quote:

Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it.

It is important to understand that .data is a special construct that exists only to help us access variables. It is neither a data.frame nor a function. Apart from [[ and $, most functions won't work with .data. Even [ won't work. Say we want to access more than one variable with .data. If .data were a data.frame, the following would work, but it doesn't:

library(dplyr)

mtcars %>% 
  transmute(new = list(.data[c("disp", "hp")]))
#> Error: Problem with `mutate()` column `new`.
#> i `new = list(.data[c("disp", "hp")])`.
#> x `[` is not supported by .data pronoun, use `[[` or $ instead.

cur_data(), on the other hand, is a function that returns the current data, without the grouping variables, as a tibble (even if the underlying data is a plain data.frame).
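Because cur_data() returns a tibble, the multi-column selection that fails with .data works with it. The following is only a minimal sketch: it assumes a dplyr version that still provides cur_data() (it is deprecated in favour of pick() from dplyr 1.1.0 on, so newer versions emit a warning), and the column name total is made up for illustration.

```r
library(dplyr)

# cur_data() is a tibble, so `[` can select several columns at once.
# `total` is a hypothetical column name used only for this illustration.
res <- mtcars %>%
  transmute(total = rowSums(cur_data()[c("disp", "hp")]))

head(res)
```

Here rowSums() receives an ordinary two-column tibble, which is exactly what .data cannot provide.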

In terms of speed, cur_data() has only a very small overhead compared to .data or to accessing the variable without a prefix. Let's take a medium-sized dataset as an example:

library(dplyr)
library(nycflights13)

bench::mark(iterations = 5000L,
            "none" = mutate(flights, new = arr_time),
            ".data" = mutate(flights, new = .data$arr_time),
            "cur_data()" = mutate(flights, new = cur_data()$arr_time))

#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 none         1.56ms   1.69ms      578.   132.9MB     15.1
#> 2 .data        1.53ms   1.72ms      551.    24.7KB     14.4
#> 3 cur_data()    1.6ms   1.77ms      535.    33.5KB     14.7

Created on 2021-12-23 by the reprex package (v0.3.0)

In terms of interchangeability, I see the following differences:

.data cannot be used as a data.frame: unlike cur_data(), it does not return the underlying data.
.data can only be used to access one variable at a time, via [[ or $, whereas cur_data() returns a tibble and works with all functions that apply to tibbles and data.frames.
In terms of speed, there is no big overhead when using cur_data(), at least not for medium-sized data sets. This should be verified with bigger data with more columns.
.data can be used to access grouping variables, which is not possible with cur_data(). However, cur_data_all() is a similar function that also returns the current data, but including the grouping variables. For accessing a single variable, this latter function should be interchangeable with .data; at least I cannot come up with a case where one works and the other does not.

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  transmute(x = cur_data()[["cyl"]],
            y = .data[["cyl"]],
            z = cur_data_all()[["cyl"]]) 

#> # A tibble: 32 x 3
#> # Groups:   cyl [3]
#>      cyl     y     z
#>    <dbl> <dbl> <dbl>
#>  1     6     6     6
#>  2     6     6     6
#>  3     4     4     4
#>  4     6     6     6
#>  5     8     8     8
#>  6     6     6     6
#>  7     8     8     8
#>  8     4     4     4
#>  9     4     4     4
#> 10     6     6     6
#> # … with 22 more rows
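The same difference can be checked per group by looking at the column names each function returns. This is a small sketch under the same assumption as above (a dplyr version in which cur_data() and cur_data_all() are available, possibly with a deprecation warning); the result column names in_cur_data and in_cur_data_all are invented for this example.

```r
library(dplyr)

# For each group, test whether the grouping column `cyl` is present
# in the tibble returned by cur_data() and by cur_data_all().
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(in_cur_data     = "cyl" %in% names(cur_data()),
            in_cur_data_all = "cyl" %in% names(cur_data_all()))

res
```

in_cur_data is FALSE and in_cur_data_all is TRUE for every group, which matches the NULL result for x in the output above.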