What effect does setting the attribute of a vector have in dplyr::summarize()?

I just ran into some weird behavior of dplyr where summarize kept referring to objects from a previous group.

Here is a simple reproducible example to illustrate the surprising behavior:

library(dplyr, warn.conflicts = FALSE)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              sum(y)
            })
#> # A tibble: 3 × 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      0.602 0.602
#> 2 b      1.22  0.602
#> 3 c     -0.310 0.602

Created on 2022-10-31 by the reprex package (v2.0.1)

I expected z1 and z2 to be identical. I don't understand why setting an attribute on the vector y means that, in later iterations, the reference to the "correct" elements of y is shadowed.

The problem is easily fixed by using sum(.data$y) in the last line, but I would like to understand the scoping rules within the non-standard evaluation of summarize. Any pointers to helpful documentation, or explanations of why the current behavior makes sense in the tidyverse non-standard evaluation framework, are appreciated.
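For reference, the .data workaround mentioned above would look roughly like this. It is a sketch on the same toy data; set.seed(1) and the name res are added here only so the result can be checked, and are not part of the original example.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)  # added for reproducibility; not in the original example

res <- tibble(x = rep(letters[1:3], times = 4),
              y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              # This still creates a local variable named y ...
              attr(y, "test") <- "test"
              # ... but .data$y asks for the column of the current group,
              # so the local copy is not the one being summed.
              sum(.data$y)
            })

res
```

With this change z1 and z2 should agree for every group.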

I am using R 4.1.1 with dplyr 1.0.7.

Answer:

This is a scoping issue. When you assign to y inside summarize (and attr(y, "test") <- "test" is an assignment to y), the y values of the first group are copied into a local variable called y that is distinct from the y column in your data frame. Because it is a local variable, name lookup finds it before the y in the passed data frame. Since the same environment is reused for subsequent groups' calculations inside summarize, this local variable persists, and every later group sees the first group's values.

We can see this by replacing the attribute assignment with a plain y <- y:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              y <- y
              sum(y)
            }) 
#> # A tibble: 3 x 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      1.15   1.15
#> 2 b      2.76   1.15
#> 3 c     -0.690  1.15
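The same shadowing can be reproduced in base R, without dplyr. The sketch below is only an analogy and is not meant to describe dplyr's actual implementation: the names groups, current, data_env and run_per_group are hypothetical, and the "column" y is modelled as an active binding in a parent environment, with one child environment reused for every group.

```r
# Toy stand-in for grouped data: one numeric vector per group.
groups <- list(a = c(1, 2), b = c(10, 20), c = c(100, 200))
current <- 1L

# The "column": an active binding that returns the current group's values.
data_env <- new.env()
makeActiveBinding("y", function() groups[[current]], data_env)

# Evaluate `expr` once per group, reusing ONE child environment throughout.
run_per_group <- function(expr) {
  eval_env <- new.env(parent = data_env)
  vapply(seq_along(groups), function(i) {
    current <<- i          # move on to the next group
    eval(expr, eval_env)   # same environment for every group
  }, numeric(1))
}

# Assigning to y creates a local y in the child environment on the first
# group; later groups find that stale local copy first.
shadowed <- run_per_group(quote({
  attr(y, "test") <- "test"
  sum(y)
}))

# Using a different name leaves the lookup of y untouched.
separate <- run_per_group(quote({
  new_y <- y
  attr(new_y, "test") <- "test"
  sum(new_y)
}))

shadowed  # 3 3 3
separate  # 3 30 300
```

In the first call every group reports the sum of the first group, which mirrors the z2 column above; in the second call each group gets its own sum.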

If we remove the local copy of y from the local frame before the expression returns, this doesn't happen:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              x <- sum(y)
              rm(y)
              x
            }) 
#> # A tibble: 3 x 3
#>   x         z1     z2
#>   <chr>  <dbl>  <dbl>
#> 1 a      1.15   1.15 
#> 2 b      2.76   2.76 
#> 3 c     -0.690 -0.690

Or better still, don't write to a local variable with the same name as a variable in your data frame:

library(dplyr, warn.conflicts = FALSE)

set.seed(1)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>% 
  summarize(z1 = sum(y),
            z2 = {
              new_y <- y
              attr(new_y, "test") <- "test"
              sum(new_y)
            }) 
#> # A tibble: 3 x 3
#>   x         z1     z2
#>   <chr>  <dbl>  <dbl>
#> 1 a      1.15   1.15 
#> 2 b      2.76   2.76 
#> 3 c     -0.690 -0.690

Created on 2022-10-31 with reprex v2.0.2
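Another way to avoid writing to a local variable inside summarize is to move the modification into a small function, since a function's arguments live in that call's own frame. A minimal sketch, assuming the same toy data; sum_with_attr and res are hypothetical names introduced for illustration:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)

# Hypothetical helper: v is local to each call, so modifying it
# cannot leave anything behind for the next group.
sum_with_attr <- function(v) {
  attr(v, "test") <- "test"
  sum(v)
}

res <- tibble(x = rep(letters[1:3], times = 4),
              y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = sum_with_attr(y))

res
```

Here nothing is assigned inside the summarize expression itself, so z1 and z2 should match for each group.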