How to use a variable as a parameter of the dplyr::slice_max() function in R

Time:12-18

Given a data frame, re:

re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
             value = runif(12),
             n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)))

# A tibble: 12 x 3
   group  value n_slice
   <chr>  <dbl>   <dbl>
 1 A     0.853        2
 2 A     0.726        2
 3 A     0.783        2
 4 A     0.0426       2
 5 B     0.320        1
 6 B     0.683        1
 7 B     0.732        1
 8 B     0.319        1
 9 C     0.118        3
10 C     0.0259       3
11 C     0.818        3
12 C     0.635        3

I'd like to slice by group, keeping a different number of rows in each group (the number given by n_slice).

I tried the code below, but I get an error saying that "n" must be a constant:

re %>% 
   group_by(group) %>% 
   slice_max(value, n = n_slice)

Error: `n` must be a constant in `slice_max()`.

Expected output:

  group value n_slice
  <chr> <dbl>   <dbl>
1 A     0.853       2
2 A     0.783       2
3 B     0.732       1
4 C     0.818       3
5 C     0.635       3
6 C     0.118       3

CodePudding user response:

In this case, one option is group_modify(), which applies slice_max() to each group separately, so n can be taken from that group's n_slice:

library(dplyr)
re %>% 
   group_by(group) %>% 
   group_modify(~ .x %>%
          slice_max(value, n = first(.x$n_slice))) %>%
   ungroup

-output

# A tibble: 6 × 3
  group value n_slice
  <chr> <dbl>   <dbl>
1 A     0.931       2
2 A     0.931       2
3 B     0.722       1
4 C     0.591       3
5 C     0.519       3
6 C     0.494       3

Another option is to summarise each group into a list column using cur_data() and then unnest it:

library(tidyr)
re %>%
    group_by(group) %>%
    summarise(out = list(cur_data() %>% 
        slice_max(value, n = first(n_slice)))) %>% 
    unnest(out)

-output

# A tibble: 6 × 3
  group value n_slice
  <chr> <dbl>   <dbl>
1 A     0.931       2
2 A     0.931       2
3 B     0.722       1
4 C     0.591       3
5 C     0.519       3
6 C     0.494       3
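
In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). A minimal sketch of the same idea using pick(everything()), assuming dplyr >= 1.1.0 is installed. The data is regenerated with set.seed(42) (as in the answers further down) so the block runs on its own, and res is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Same shape as the question's data; the seed is an assumption for reproducibility
set.seed(42)
re <- tibble(group = rep(c("A", "B", "C"), each = 4),
             value = runif(12),
             n_slice = rep(c(2, 1, 3), each = 4))

res <- re %>%
  group_by(group) %>%
  # pick(everything()) returns the current group's non-grouping columns as a tibble
  summarise(out = list(pick(everything()) %>%
                         slice_max(value, n = first(n_slice)))) %>%
  unnest(out)
```

As with the cur_data() version, the grouping column comes back from summarise() and the remaining columns from unnest().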

CodePudding user response:

I don't think slice_max() supports this, perhaps because n_slice need not be constant within a group, in which case the request would be ambiguous. Try using filter() instead:

set.seed(42)
re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
        value = runif(12),
        n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)) )
re
# # A tibble: 12 x 3
#    group value n_slice
#    <chr> <dbl>   <dbl>
#  1 A     0.915       2
#  2 A     0.937       2
#  3 A     0.286       2
#  4 A     0.830       2
#  5 B     0.642       1
#  6 B     0.519       1
#  7 B     0.737       1
#  8 B     0.135       1
#  9 C     0.657       3
# 10 C     0.705       3
# 11 C     0.458       3
# 12 C     0.719       3

re %>%
  group_by(group) %>%
  filter(rank(-value) <= n_slice[1])
# # A tibble: 6 x 3
# # Groups:   group [3]
#   group value n_slice
#   <chr> <dbl>   <dbl>
# 1 A     0.915       2
# 2 A     0.937       2
# 3 B     0.737       1
# 4 C     0.657       3
# 5 C     0.705       3
# 6 C     0.719       3
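
Note that n_slice[1] silently takes the first value in each group. If n_slice might vary within a group, a quick check before filtering can surface that. A minimal sketch, where bad_groups and n_values are illustrative names:

```r
library(dplyr)

set.seed(42)
re <- tibble(group = rep(c("A", "B", "C"), each = 4),
             value = runif(12),
             n_slice = rep(c(2, 1, 3), each = 4))

# Groups whose n_slice takes more than one distinct value
bad_groups <- re %>%
  group_by(group) %>%
  summarise(n_values = n_distinct(n_slice)) %>%
  filter(n_values > 1)

# Stop early if the slice size is ambiguous for any group
stopifnot(nrow(bad_groups) == 0)
```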

Notes:

  • Because of the potential for ties in the data, it might be useful to use rank(., ties.method = ...) (see ?rank) or dplyr::dense_rank.

  • If the column you are slicing on does not support negation (e.g., Date or POSIXt), then you can change rank(-value) <= n_slice[1] to n() - rank(value) + 1L <= n_slice[1] for the same effect (or, more simply, n() - rank(value) < n_slice[1]). Another option is rank(desc(value)), thanks to @IceCreamToucan for the suggestion.
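
Both notes can be illustrated with a small sketch. The data frame ev and its when column are made up for this example: group A has a tie on value, and when is a Date that cannot be negated directly.

```r
library(dplyr)

# Hypothetical data: a tie in group A, plus a Date column
ev <- tibble(
  group   = c("A", "A", "A", "B", "B"),
  value   = c(5, 5, 1, 2, 3),
  when    = as.Date("2021-12-01") + c(0, 3, 1, 4, 2),
  n_slice = c(1, 1, 1, 2, 2)
)

# Default ties.method = "average": the tied rows in A both get rank 1.5,
# so neither passes `<= 1` and group A drops out entirely
by_average <- ev %>%
  group_by(group) %>%
  filter(rank(-value) <= n_slice[1]) %>%
  ungroup()

# ties.method = "min": tied rows share the lowest rank, so both are kept
by_min <- ev %>%
  group_by(group) %>%
  filter(rank(-value, ties.method = "min") <= n_slice[1]) %>%
  ungroup()

# ties.method = "first": ties are broken by position, so exactly n rows are kept
by_first <- ev %>%
  group_by(group) %>%
  filter(rank(-value, ties.method = "first") <= n_slice[1]) %>%
  ungroup()

# Date column: desc() works where unary minus does not
latest <- ev %>%
  group_by(group) %>%
  filter(rank(desc(when)) <= n_slice[1]) %>%
  ungroup()
```

Which ties.method is appropriate depends on whether you want to keep every tied row or cap each group at exactly n_slice rows.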


If performance is an issue (you have many more rows), then a plain filter() per group is the fastest of the answers so far, at least on this small sample :-)

bench::mark(
  akrun1 = re %>% group_by(group) %>% group_modify(~ .x %>% slice_max(value, n = first(.x$n_slice))) %>% ungroup,
  akrun2 = re %>% group_by(group) %>% summarise(out = list(cur_data() %>% slice_max(value, n = first(n_slice)))) %>% tidyr::unnest(out),
  r2evans = re %>% group_by(group) %>% filter(rank(-value) <= n_slice[1]) %>% ungroup() %>% arrange(group, -value)
)
# # A tibble: 3 x 13
#   expression     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory       time     gc       
#   <bch:expr> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>       <list>   <list>   
# 1 akrun1       7.2ms   9.05ms     108.     7.55KB     13.8    47     6      436ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~
# 2 akrun2      9.56ms  11.32ms      87.0      20KB     12.1    36     5      414ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~
# 3 r2evans     4.21ms   5.27ms     186.     4.87KB     14.3    78     6      420ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~

(Adding the arrange(.) at the end of mine to mimic the output exactly.)

If you don't have many more rows, or even if you do, readability is frankly more important. All of these answers produce the same result, so pick whichever makes the most sense to you (and to your future self looking back at old code six months from now); a small runtime penalty is usually worth it.

CodePudding user response:

Not that this question is in need of yet another way to replicate slice_max(), but just for fun, you can use arrange() followed by slice():

library(dplyr, warn.conflicts = FALSE)
set.seed(42)
re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
        value = runif(12),
        n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)) )

re %>% 
  group_by(group) %>% 
  arrange(desc(value)) %>% 
  slice(seq(first(n_slice))) %>% 
  ungroup
#> # A tibble: 6 × 3
#>   group value n_slice
#>   <chr> <dbl>   <dbl>
#> 1 A     0.937       2
#> 2 A     0.915       2
#> 3 B     0.737       1
#> 4 C     0.719       3
#> 5 C     0.705       3
#> 6 C     0.657       3

Created on 2021-12-17 by the reprex package (v2.0.1)

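Surprisingly, the arrange() + slice() approach seems faster on larger data (300,000 rows), roughly twice as fast by median time in the benchmark below: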

library(bench) 
library(dplyr, warn.conflicts = FALSE)

set.seed(42)
n <- 1e5
re <- tibble(group = c(rep("A", n), rep("B", n), rep("C", n)),
        value = runif(n*3),
        n_slice = c(rep(sample(n, 1), n), rep(sample(n, 1), n), rep(sample(n, 1), n)) )


bench::mark(
  akrun1 = re %>% group_by(group) %>% group_modify(~ .x %>% slice_max(value, n = first(.x$n_slice))) %>% ungroup,
  akrun2 = re %>% group_by(group) %>% summarise(out = list(cur_data() %>% slice_max(value, n = first(n_slice)))) %>% tidyr::unnest(out),
  r2evans = re %>% group_by(group) %>% filter(rank(-value) <= n_slice[1]) %>% ungroup() %>% arrange(group, -value),
  arrange = re %>% group_by(group) %>% arrange(desc(value)) %>% slice(seq(first(n_slice))) %>% ungroup %>% arrange(group, -value)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 akrun1      167.7ms  169.9ms      5.71      31MB     5.71
#> 2 akrun2      166.2ms  172.4ms      5.82    29.5MB     9.70
#> 3 r2evans       173ms  175.2ms      5.67    31.8MB     3.78
#> 4 arrange      66.7ms   75.2ms     11.9     29.5MB    17.9

Created on 2021-12-17 by the reprex package (v2.0.1)
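
One caveat for the arrange() + slice() approach above: seq(n) counts down when n is 0, so a group that asks for zero rows may still contribute a row. A sketch using seq_len() instead, on made-up data (re0 and top_rows are illustrative names):

```r
library(dplyr)

# Hypothetical data: group A asks for zero rows, group B for one
re0 <- tibble(group = c("A", "A", "B", "B"),
              value = c(0.2, 0.9, 0.5, 0.1),
              n_slice = c(0, 0, 1, 1))

seq(0)      # counts down: 1 0
seq_len(0)  # integer(0)

top_rows <- re0 %>%
  group_by(group) %>%
  arrange(desc(value)) %>%               # sorts all rows by value, descending
  slice(seq_len(first(n_slice))) %>%     # first n_slice rows within each group
  ungroup()
```

If every n_slice is at least 1, seq() and seq_len() behave the same here.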
