Home > database >  Group by dynamically generated column with pattern in `dplyr`
Group by dynamically generated column with pattern in `dplyr`

Time:10-06

I am trying to group by a column that is dynamically generated inside previous steps in my function. The column will have a pattern (suffix). I could use dplyr::contains(), but it is not the most accurate way, since I only have one column and could potentially create nasty bugs if other columns with suffix appear in the future.

fun <- function(df, target){
  # previous steps generate a dynamic columns
  # dynamic columns have suffixes
  # let's assume I only care about Length
  df %>% 
    group_by("{target}.Length")
  # do more stuff
}

I am using dplyr == 1.0.8 and thought that I this type of thing was supported. I am probably missing something really simple, and "Petal" does not get pasted properly and interpreted as a proper column name. I tried several combinations of {} and {{}} and it is not working as I would expect. For example:

> fun(iris, "Petal") %>% head()
# A tibble: 6 × 6
# Groups:   "{target}.Length" [1]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species `"{target}.Length"`
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>              
1          5.1         3.5          1.4         0.2 setosa  {target}.Length    
2          4.9         3            1.4         0.2 setosa  {target}.Length    
3          4.7         3.2          1.3         0.2 setosa  {target}.Length    
4          4.6         3.1          1.5         0.2 setosa  {target}.Length    
5          5           3.6          1.4         0.2 setosa  {target}.Length    
6          5.4         3.9          1.7         0.4 setosa  {target}.Length 

I am using the same type of notation for mutate() successfully, but it doesn't for group_by().

CodePudding user response:

While I use this kind of glue pattern quite often for assignments inside mutate and summarise I haven't seen any examples of using it in other dplyr verbs. However, one option would be use the .data pronoun and paste0 which is not that elegant but works:

library(dplyr, warn=FALSE)

fun <- function(df, target){
  df %>% 
    group_by(.data[[paste0(target, ".Length")]])
}

fun(iris, "Petal") %>% head()
#> # A tibble: 6 × 5
#> # Groups:   Petal.Length [4]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> 6          5.4         3.9          1.7         0.4 setosa

CodePudding user response:

We can use !! sym(...) on a string to evaluate it early as object (name).

We can also use the .data pronoun, but when using it inside a package you should import it to avoid R CMD check notes (see here).

library(dplyr)

fun <- function(df, target){
  df %>% 
    group_by(!! sym(paste0(target,".Length")))
}

fun(iris, "Petal") %>% head()

#> # A tibble: 6 x 5
#> # Groups:   Petal.Length [4]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> 6          5.4         3.9          1.7         0.4 setosa

Why does this work?

Lets have a look at it step by step when building our function above:

fun1 <- function(df){
  df %>% 
    group_by("Sepal.Length")
}

⬇️ groups by new variable which is named after its value / call

iris %>% 
  fun1()
#> # A tibble: 150 x 6
#> # Groups:   "Sepal.Length" [1]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species `"Sepal.Length"`
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>           
#>  1          5.1         3.5          1.4         0.2 setosa  Sepal.Length    
#>  2          4.9         3            1.4         0.2 setosa  Sepal.Length    
#>  3          4.7         3.2          1.3         0.2 setosa  Sepal.Length    
#>  4          4.6         3.1          1.5         0.2 setosa  Sepal.Length    
#>  5          5           3.6          1.4         0.2 setosa  Sepal.Length    
#>  6          5.4         3.9          1.7         0.4 setosa  Sepal.Length    
#>  7          4.6         3.4          1.4         0.3 setosa  Sepal.Length    
#>  8          5           3.4          1.5         0.2 setosa  Sepal.Length    
#>  9          4.4         2.9          1.4         0.2 setosa  Sepal.Length    
#> 10          4.9         3.1          1.5         0.1 setosa  Sepal.Length    
#> # ... with 140 more rows

fun2 <- function(df){
  df %>% 
    group_by(as.name("Sepal.Length"))
}

⬇️ error: group_by needs a vetor as input, but has received a name / symbol

iris %>% 
  fun2()
#> Error in `add_computed_columns()`:
#> ! Problem adding computed columns in `group_by()`.
#> x Problem with `mutate()` input `..1`.
#> i `..1 = as.name("Sepal.Length")`.
#> x `..1` must be a vector, not a symbol.
#> Caused by error in `mutate_cols()`:
#> ! Problem with `mutate()` input `..1`.
#> i `..1 = as.name("Sepal.Length")`.
#> x `..1` must be a vector, not a symbol.



fun3 <- function(df){
  df %>% 
    group_by(eval(as.name("Sepal.Length")))
}

⬇️ works, but creates new column which is named after call to eval:

iris %>% 
  fun3()
#> # A tibble: 150 x 6
#> # Groups:   eval(as.name("Sepal.Length")) [35]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species `eval(as.name("Sep~
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>                 <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa                  5.1
#>  2          4.9         3            1.4         0.2 setosa                  4.9
#>  3          4.7         3.2          1.3         0.2 setosa                  4.7
#>  4          4.6         3.1          1.5         0.2 setosa                  4.6
#>  5          5           3.6          1.4         0.2 setosa                  5  
#>  6          5.4         3.9          1.7         0.4 setosa                  5.4
#>  7          4.6         3.4          1.4         0.3 setosa                  4.6
#>  8          5           3.4          1.5         0.2 setosa                  5  
#>  9          4.4         2.9          1.4         0.2 setosa                  4.4
#> 10          4.9         3.1          1.5         0.1 setosa                  4.9
#> # ... with 140 more rows


fun4 <- function(df){
  df %>% 
    group_by(!! as.name("Sepal.Length"))
}

⬇️ works: !! does the same as eval, but works only inside rlang powered functions and it is evaluated faster than the normal eval. For this reason group_by correctly gets the name of the column to evaluate.

iris %>% 
  fun4()
#> # A tibble: 150 x 5
#> # Groups:   Sepal.Length [35]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

fun5 <- function(df){
  df %>% 
    group_by(!! sym("Sepal.Length"))
}

works: sym is almost simmilar to base R’s as.name, but it's shorter.

iris %>% 
  fun5()

#> # A tibble: 150 x 5
#> # Groups:   Sepal.Length [35]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

Created on 2022-10-06 by the reprex package (v2.0.1)

  • Related