I just ran into some surprising behavior in dplyr: summarize kept referring to objects from a previous group.
Here is a simple reproducible example that illustrates it:
library(dplyr, warn.conflicts = FALSE)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              sum(y)
            })
#> # A tibble: 3 × 3
#> x z1 z2
#> <chr> <dbl> <dbl>
#> 1 a 0.602 0.602
#> 2 b 1.22 0.602
#> 3 c -0.310 0.602
Created on 2022-10-31 by the reprex package (v2.0.1)
I expected z1 and z2 to be identical. I don't understand why setting an attribute on the vector y means that, in later iterations, the reference to the "correct" elements of y is shadowed.
The problem is easily fixed by using sum(.data$y) in the last line, but I would like to understand the scoping rules within the non-standard evaluation of summarize. Any pointers to helpful documentation, or explanations of why the current behavior makes sense in the tidyverse non-standard evaluation framework, are appreciated.
I am using R 4.1.1 with dplyr 1.0.7.
CodePudding user response:
This is a scoping problem. The replacement call attr(y, "test") <- "test" is shorthand for y <- `attr<-`(y, "test", value = "test"), so it assigns to y. When you write to the variable y inside summarize, the first group's y values are copied into a local variable called y that is distinct from the y column in your data frame. Because it is a local variable, name lookup finds it before the y column of the passed data frame. Since the same environment is reused for subsequent groups' calculations inside summarize, this local variable persists for each group.
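The mechanism can be sketched in base R, without dplyr. This is a toy model under stated assumptions, not dplyr's actual implementation: the names groups, columns, mask and current are hypothetical, and the active binding only stands in for however dplyr exposes the current group's column.

```r
# Toy data: three groups with known sums (a = 1, b = 20, c = 300).
groups <- list(a = c(0.5, 0.5), b = c(10, 10), c = c(100, 200))

# Hypothetical stand-in for the data columns: an active binding for y
# that returns the values of whichever group is current.
current <- "a"
columns <- new.env()
makeActiveBinding("y", function() groups[[current]], columns)

# Read-only expression: nothing is assigned, so every lookup of y
# falls through to the active binding.
clean <- vapply(names(groups), function(g) {
  current <<- g
  eval(quote(sum(y)), new.env(parent = columns))
}, numeric(1))
# clean is 1, 20, 300

# Evaluation environment that is reused for every group (the assumption
# that makes the local variable persist).
mask <- new.env(parent = columns)

expr <- quote({
  attr(y, "test") <- "test"  # assigns a local y inside mask
  sum(y)
})

shadowed <- vapply(names(groups), function(g) {
  current <<- g
  eval(expr, mask)
}, numeric(1))
# shadowed is 1, 1, 1: group a's y was copied into mask and hides the column

# The local copy is still sitting in the shared environment afterwards.
exists("y", envir = mask, inherits = FALSE)
```

In this model, the first evaluation copies group a's values into mask, and every later lookup of y stops at that copy instead of reaching the binding in columns.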
The same effect appears with a plain y <- y assignment:
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              y <- y
              sum(y)
            })
#> # A tibble: 3 x 3
#> x z1 z2
#> <chr> <dbl> <dbl>
#> 1 a 1.15 1.15
#> 2 b 2.76 1.15
#> 3 c -0.690 1.15
As long as we remove the local copy of the y variable from the local frame, this doesn't happen:
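library(dplyr, warn.conflicts = FALSE)
set.seed(1)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              # use a name that does not clash with the grouping column x
              total <- sum(y)
              # drop the local y so the next group sees the column again
              rm(y)
              total
            })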
#> # A tibble: 3 x 3
#> x z1 z2
#> <chr> <dbl> <dbl>
#> 1 a 1.15 1.15
#> 2 b 2.76 2.76
#> 3 c -0.690 -0.690
Or better still, don't write to a local variable with the same name as a variable in your data frame:
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              new_y <- y
              attr(new_y, "test") <- "test"
              sum(new_y)
            })
#> # A tibble: 3 x 3
#> x z1 z2
#> <chr> <dbl> <dbl>
#> 1 a 1.15 1.15
#> 2 b 2.76 2.76
#> 3 c -0.690 -0.690
Created on 2022-10-31 with reprex v2.0.2
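If the attribute is only needed for a single call, another option is to avoid assignment altogether. The sketch below uses base R's structure(), which returns a modified copy without binding any name. The small data frame is made up for illustration so that the group sums are easy to check by hand.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data with known group sums: a = 1 + 4, b = 2 + 5, c = 3 + 6.
df <- tibble(x = rep(c("a", "b", "c"), times = 2),
             y = c(1, 2, 3, 4, 5, 6))

res <- df %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            # structure() attaches the attribute to a copy of y;
            # nothing is assigned, so no local y is created
            z2 = sum(structure(y, test = "test")))

res$z1  # 5 7 9
res$z2  # 5 7 9
```

Because no name is written inside the summarize call, there is no local variable that could carry over from one group to the next.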