How can I use dplyr group_by within a function, conditional on a flag?

Time:07-08

I want to define a custom function that groups and summarises some data using dplyr and, depending on a Boolean flag, groups by an additional variable. I can achieve this with a full if ... else control block, as in this trivial example:

library(tidyverse)
data(Titanic)

Titanic <- as_tibble(Titanic)

foo <- function(by_age = FALSE) {
  if (by_age) {
    bar <- Titanic %>%
      group_by(Survived, Age)
  } else {
    bar <- Titanic %>%
      group_by(Survived)
  }
  
  bar %>%
    summarise(n = sum(n))
}

foo()
foo(by_age = TRUE)

But this seems very clumsy. Is there a way to achieve this with a single block of dplyr code, conditionally adding Age as a second grouping variable? I've tried ifelse(by_age, Age, NA) in my group_by statement, and some of the techniques listed in this SO post, but to no avail.

CodePudding user response:

Edit

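Sorry, I didn't read your linked SO post; if you want to avoid the ... approach for some reason, this is one potential solution. An if without an else returns NULL when its condition is FALSE, and, as the output of foo() below shows, group_by() ignores that NULL: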

library(tidyverse)
data(Titanic)

Titanic <- as_tibble(Titanic)

foo <- function(by_age = FALSE) {
  Titanic %>%
    group_by(Survived, if(by_age) Age) %>%
    summarise(n = sum(n))
}

foo()
#> # A tibble: 2 × 2
#>   Survived     n
#>   <chr>    <dbl>
#> 1 No        1490
#> 2 Yes        711
foo(by_age = TRUE)
#> `summarise()` has grouped output by 'Survived'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Survived [2]
#>   Survived `if (by_age) Age`     n
#>   <chr>    <chr>             <dbl>
#> 1 No       Adult              1438
#> 2 No       Child                52
#> 3 Yes      Adult               654
#> 4 Yes      Child                57

Created on 2022-07-07 by the reprex package (v2.0.1)

To avoid the "Age" column being called "if (by_age) Age" you can use:

library(tidyverse)
data(Titanic)

Titanic <- as_tibble(Titanic)

foo <- function(by_age = FALSE) {
  Titanic %>%
    group_by(Survived, !!sym(ifelse(by_age, "Age", ""))) %>%
    summarise(n = sum(n))
}

foo()
#> # A tibble: 2 × 2
#>   Survived     n
#>   <chr>    <dbl>
#> 1 No        1490
#> 2 Yes        711
foo(by_age = TRUE)
#> `summarise()` has grouped output by 'Survived'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Survived [2]
#>   Survived Age       n
#>   <chr>    <chr> <dbl>
#> 1 No       Adult  1438
#> 2 No       Child    52
#> 3 Yes      Adult   654
#> 4 Yes      Child    57

Created on 2022-07-07 by the reprex package (v2.0.1)
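A further variation, offered only as a sketch and not part of the answer above, is to build the grouping columns as a character vector and hand it to across(all_of()). It assumes dplyr 1.0.0 or later; the names foo_chr and grp_cols are made up for this illustration.

```r
library(dplyr)

data(Titanic)
Titanic <- as_tibble(Titanic)

# foo_chr and grp_cols are hypothetical names used only in this sketch.
foo_chr <- function(by_age = FALSE) {
  # `if (by_age) "Age"` evaluates to NULL when by_age is FALSE,
  # and c() drops NULL, leaving just "Survived".
  grp_cols <- c("Survived", if (by_age) "Age")

  Titanic %>%
    group_by(across(all_of(grp_cols))) %>%
    summarise(n = sum(n), .groups = "drop")
}

res_plain  <- foo_chr()
res_by_age <- foo_chr(by_age = TRUE)
```

Because the columns are selected by name, the output column is called Age without any unquoting, and all_of() raises an error if a name in grp_cols does not exist in the data.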

Original answer

One solution is to use ... (dot-dot-dot) to pass in the argument if/when you want, e.g.

library(tidyverse)
data(Titanic)

Titanic <- as_tibble(Titanic)

foo <- function(...) {
  Titanic %>%
    group_by(Survived, ...) %>%
    summarise(n = sum(n))
}

foo()
#> # A tibble: 2 × 2
#>   Survived     n
#>   <chr>    <dbl>
#> 1 No        1490
#> 2 Yes        711
foo(Age)
#> `summarise()` has grouped output by 'Survived'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Survived [2]
#>   Survived Age       n
#>   <chr>    <chr> <dbl>
#> 1 No       Adult  1438
#> 2 No       Child    52
#> 3 Yes      Adult   654
#> 4 Yes      Child    57

# You can also pass in multiple 'extra' arguments
foo(Age, Sex)
#> `summarise()` has grouped output by 'Survived', 'Age'. You can override using
#> the `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups:   Survived, Age [4]
#>   Survived Age   Sex        n
#>   <chr>    <chr> <chr>  <dbl>
#> 1 No       Adult Female   109
#> 2 No       Adult Male    1329
#> 3 No       Child Female    17
#> 4 No       Child Male      35
#> 5 Yes      Adult Female   316
#> 6 Yes      Adult Male     338
#> 7 Yes      Child Female    28
#> 8 Yes      Child Male      29

Created on 2022-07-07 by the reprex package (v2.0.1)

NB: Using ... comes with two downsides (from Advanced R; https://adv-r.hadley.nz/functions.html?q=...#fun-dot-dot-dot):

  • When you use it to pass arguments to another function, you have to carefully explain to the user where those arguments go. This makes it hard to understand what you can do with functions like lapply() and plot().
  • A misspelled argument will not raise an error. This makes it easy for typos to go unnoticed.
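The second downside can be seen with base R's sum(), which also collects its inputs through .... This is a small illustration of the general point, not something specific to the foo() function above; the variable names are made up for the example.

```r
# Base R only. sum() accepts ..., so the misspelled `na_rm` is not rejected:
# it is treated as one more value to add up (TRUE counts as 1).
typo_total <- sum(1, 2, na_rm = TRUE)
typo_total
#> [1] 4

# With the correct spelling, na.rm is matched as a named argument
# and the NA is removed before summing.
correct_total <- sum(1, 2, NA, na.rm = TRUE)
correct_total
#> [1] 3
```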

CodePudding user response:

You can do this with the curly-curly operator ({{ }}) from the rlang package, giving the additional grouping variable a default of NULL:

library(dplyr)
library(rlang)

data(Titanic)

Titanic <- as_tibble(Titanic)

foo <- function(grp = NULL) {
  Titanic %>%
    group_by(Survived, {{grp}}) %>%
    summarise(n = sum(n))
}

foo()
#> # A tibble: 2 × 2
#>   Survived     n
#>   <chr>    <dbl>
#> 1 No        1490
#> 2 Yes        711

foo(Age)
#> `summarise()` has grouped output by 'Survived'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Survived [2]
#>   Survived Age       n
#>   <chr>    <chr> <dbl>
#> 1 No       Adult  1438
#> 2 No       Child    52
#> 3 Yes      Adult   654
#> 4 Yes      Child    57

Created on 2022-07-07 by the reprex package (v2.0.1)

CodePudding user response:

One approach is to split the group_by into two group_by statements.

foo <- function(by_age = FALSE) {
  Titanic %>%
    group_by(Survived) %>%
    { if (by_age) group_by(., Age, .add = TRUE) else . } %>%
    summarise(n = sum(n), .groups = "drop")
}

giving:

foo()
## # A tibble: 2 x 2
##   Survived     n
##   <chr>    <dbl>
## 1 No        1490
## 2 Yes        711

foo(TRUE)
## # A tibble: 4 x 3
##   Survived Age       n
##   <chr>    <chr> <dbl>
## 1 No       Adult  1438
## 2 No       Child    52
## 3 Yes      Adult   654
## 4 Yes      Child    57
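The two-step approach relies on .add = TRUE. As a minimal sketch (the variable names here are made up for illustration), group_vars() shows what happens with and without it:

```r
library(dplyr)

data(Titanic)
Titanic <- as_tibble(Titanic)

by_survived <- Titanic %>% group_by(Survived)

# By default, a second group_by() replaces the existing grouping ...
replaced <- group_vars(group_by(by_survived, Age))
replaced
#> [1] "Age"

# ... whereas .add = TRUE appends to it.
added <- group_vars(group_by(by_survived, Age, .add = TRUE))
added
#> [1] "Survived" "Age"
```

Without .add = TRUE, foo(TRUE) would therefore summarise by Age alone.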