Given a data.frame:
library(dplyr)
re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
             value = runif(12),
             n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)))
re
# A tibble: 12 x 3
group value n_slice
<chr> <dbl> <dbl>
1 A 0.853 2
2 A 0.726 2
3 A 0.783 2
4 A 0.0426 2
5 B 0.320 1
6 B 0.683 1
7 B 0.732 1
8 B 0.319 1
9 C 0.118 3
10 C 0.0259 3
11 C 0.818 3
12 C 0.635 3
I'd like to slice by group, taking a different number of rows in each group (given by n_slice).
I tried the code below, but I get an error saying that `n` must be a constant:
re %>%
  group_by(group) %>%
  slice_max(value, n = n_slice)
Error: `n` must be a constant in `slice_max()`.
Expected output:
group value n_slice
<chr> <dbl> <dbl>
1 A 0.853 2
2 A 0.783 2
3 B 0.732 1
4 C 0.818 3
5 C 0.635 3
6 C 0.118 3
CodePudding user response:
In this case, one option is group_modify():
library(dplyr)
re %>%
  group_by(group) %>%
  group_modify(~ .x %>%
    slice_max(value, n = first(.x$n_slice))) %>%
  ungroup()
-output
# A tibble: 6 × 3
group value n_slice
<chr> <dbl> <dbl>
1 A 0.931 2
2 A 0.931 2
3 B 0.722 1
4 C 0.591 3
5 C 0.519 3
6 C 0.494 3
Another option is to summarise() using cur_data() and then unnest():
library(tidyr)
re %>%
  group_by(group) %>%
  summarise(out = list(cur_data() %>%
    slice_max(value, n = first(n_slice)))) %>%
  unnest(out)
-output
# A tibble: 6 × 3
group value n_slice
<chr> <dbl> <dbl>
1 A 0.931 2
2 A 0.931 2
3 B 0.722 1
4 C 0.591 3
5 C 0.519 3
6 C 0.494 3
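As of dplyr 1.1.0, cur_data() is deprecated in favour of pick(). The following is a minimal sketch of the same idea for newer versions, assuming dplyr >= 1.1.0 is installed. It uses reframe(), which is intended for results with several rows per group, so no list column or unnest() is needed. The seeded data is only there to make the sketch reproducible.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0 (pick(), reframe() and .by)

set.seed(42)
re <- tibble(
  group   = rep(c("A", "B", "C"), each = 4),
  value   = runif(12),
  n_slice = rep(c(2, 1, 3), each = 4)
)

# pick(everything()) is the current group's data (without the grouping column);
# first(n_slice) is evaluated per group, so slice_max() sees a single number.
res <- re %>%
  reframe(
    slice_max(pick(everything()), value, n = first(n_slice)),
    .by = group
  )

res
```

Because .by groups only for this one call, the result is already ungrouped.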
CodePudding user response:
I don't think slice_max supports this, perhaps because it's not hard to imagine data where n_slice is not constant within a group (in which case the action is ambiguous). Try using filter:
set.seed(42)
re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
             value = runif(12),
             n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)))
re
# # A tibble: 12 x 3
# group value n_slice
# <chr> <dbl> <dbl>
# 1 A 0.915 2
# 2 A 0.937 2
# 3 A 0.286 2
# 4 A 0.830 2
# 5 B 0.642 1
# 6 B 0.519 1
# 7 B 0.737 1
# 8 B 0.135 1
# 9 C 0.657 3
# 10 C 0.705 3
# 11 C 0.458 3
# 12 C 0.719 3
re %>%
  group_by(group) %>%
  filter(rank(-value) <= n_slice[1])
# # A tibble: 6 x 3
# # Groups: group [3]
# group value n_slice
# <chr> <dbl> <dbl>
# 1 A 0.915 2
# 2 A 0.937 2
# 3 B 0.737 1
# 4 C 0.657 3
# 5 C 0.705 3
# 6 C 0.719 3
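The ambiguity mentioned above (n_slice not being constant within a group) can be checked before filtering. This is only a sketch: the `bad` tibble and the `n_values` column name are made up for illustration.

```r
library(dplyr)

# Hypothetical data in which group "A" carries two different n_slice values
bad <- tibble(
  group   = c("A", "A", "B", "B"),
  value   = c(0.1, 0.9, 0.5, 0.4),
  n_slice = c(1, 2, 1, 1)
)

# Count the distinct n_slice values per group and keep the offending groups
inconsistent <- bad %>%
  group_by(group) %>%
  summarise(n_values = n_distinct(n_slice), .groups = "drop") %>%
  filter(n_values > 1)

inconsistent
```

If `inconsistent` has any rows, n_slice[1] (or first(n_slice)) silently picks whichever value happens to come first in that group.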
Notes:
Because of the potential for ties in the data, it might be useful to use rank(., ties.method = ...) (see ?rank) or dplyr::dense_rank.
If the column you are slicing on does not support negation (e.g., Date or POSIXt), then you can change rank(-value) <= n_slice[1] to n() - rank(value) + 1L <= n_slice[1] for the same effect (or more simply n() - rank(value) < n_slice[1]). Another option is rank(desc(value)), thanks to @IceCreamToucan for the suggestion.
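To make the note about non-negatable columns concrete, here is a minimal sketch with a Date column. The `dates` tibble is made up for illustration, and the values are sampled without replacement so that there are no ties.

```r
library(dplyr)

set.seed(42)
# Hypothetical data: `value` is a Date, so rank(-value) is not available
dates <- tibble(
  group   = rep(c("A", "B"), each = 4),
  value   = as.Date("2021-01-01") + sample(0:100, 8),
  n_slice = rep(c(2, 1), each = 4)
)

# Variant 1: rank on the descending sort key
via_desc <- dates %>%
  group_by(group) %>%
  filter(rank(desc(value)) <= n_slice[1]) %>%
  ungroup()

# Variant 2: derive the descending rank from the ascending one
via_n <- dates %>%
  group_by(group) %>%
  filter(n() - rank(value) < n_slice[1]) %>%
  ungroup()

identical(via_desc, via_n)
```

Without ties the two variants keep the same rows; with ties they can differ, because the default ties.method = "average" produces fractional ranks.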
If performance is an issue (you have many more rows), then using just filter per group is the fastest of the answers so far, at least on this small example :-)
bench::mark(
akrun1 = re %>% group_by(group) %>% group_modify(~ .x %>% slice_max(value, n = first(.x$n_slice))) %>% ungroup,
akrun2 = re %>% group_by(group) %>% summarise(out = list(cur_data() %>% slice_max(value, n = first(n_slice)))) %>% tidyr::unnest(out),
r2evans = re %>% group_by(group) %>% filter(rank(-value) <= n_slice[1]) %>% ungroup() %>% arrange(group, -value)
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 akrun1 7.2ms 9.05ms 108. 7.55KB 13.8 47 6 436ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~
# 2 akrun2 9.56ms 11.32ms 87.0 20KB 12.1 36 5 414ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~
# 3 r2evans 4.21ms 5.27ms 186. 4.87KB 14.3 78 6 420ms <tibble [~ <Rprofmem[,~ <bch:tm~ <tibble ~
(Adding the arrange(.) at the end of mine to mimic the output exactly.)
If you don't have many more rows, or even if you do, readability is frankly more important. All answers produce the same results here, so pick whichever makes the most sense to you (and to your future self looking back at old code six months on); a small runtime penalty is usually worth it.
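The answers agree on the data above because it contains no ties. As a sketch of how the ties.method choice mentioned in the notes changes the result, consider a small made-up group with a tie at the cut-off (the `ties` tibble and the `by_rank()` helper are illustrative names only):

```r
library(dplyr)

# Hypothetical data: the two 3s are tied for second place, and n_slice is 2
ties <- tibble(group = "A", value = c(5, 3, 3, 1), n_slice = 2)

# Helper that applies the filter/rank approach with a given ties.method
by_rank <- function(method) {
  ties %>%
    group_by(group) %>%
    filter(rank(-value, ties.method = method) <= n_slice[1]) %>%
    ungroup()
}

res_average <- by_rank("average")  # ranks 1, 2.5, 2.5, 4: both tied rows are dropped
res_min     <- by_rank("min")      # ranks 1, 2, 2, 4: both tied rows are kept
res_first   <- by_rank("first")    # ranks 1, 2, 3, 4: exactly n_slice rows

c(average = nrow(res_average), min = nrow(res_min), first = nrow(res_first))
```

On this input, "min" keeps every row tied at the cut-off, like slice_max() with its default with_ties = TRUE, while "first" returns exactly n_slice rows, like with_ties = FALSE.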
CodePudding user response:
Not that this question needs yet another way to replicate slice_max, but just for fun, you can use arrange followed by slice:
library(dplyr, warn.conflicts = FALSE)
set.seed(42)
re <- tibble(group = c(rep("A", 4), rep("B", 4), rep("C", 4)),
             value = runif(12),
             n_slice = c(rep(2, 4), rep(1, 4), rep(3, 4)))
re %>%
  group_by(group) %>%
  arrange(desc(value)) %>%
  # seq_len() rather than seq() so that n_slice == 0 selects no rows
  slice(seq_len(first(n_slice))) %>%
  ungroup()
#> # A tibble: 6 × 3
#> group value n_slice
#> <chr> <dbl> <dbl>
#> 1 A 0.937 2
#> 2 A 0.915 2
#> 3 B 0.737 1
#> 4 C 0.719 3
#> 5 C 0.705 3
#> 6 C 0.657 3
Created on 2021-12-17 by the reprex package (v2.0.1)
Surprisingly, this also seems to be faster on a larger data set (a median of about 75 ms versus roughly 170 ms for the other answers in the benchmark below):
library(bench)
library(dplyr, warn.conflicts = FALSE)
set.seed(42)
n <- 1e5
re <- tibble(group = c(rep("A", n), rep("B", n), rep("C", n)),
             value = runif(n*3),
             n_slice = c(rep(sample(n, 1), n), rep(sample(n, 1), n), rep(sample(n, 1), n)))
bench::mark(
akrun1 = re %>% group_by(group) %>% group_modify(~ .x %>% slice_max(value, n = first(.x$n_slice))) %>% ungroup,
akrun2 = re %>% group_by(group) %>% summarise(out = list(cur_data() %>% slice_max(value, n = first(n_slice)))) %>% tidyr::unnest(out),
r2evans = re %>% group_by(group) %>% filter(rank(-value) <= n_slice[1]) %>% ungroup() %>% arrange(group, -value),
arrange = re %>% group_by(group) %>% arrange(desc(value)) %>% slice(seq(first(n_slice))) %>% ungroup %>% arrange(group, -value)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 akrun1 167.7ms 169.9ms 5.71 31MB 5.71
#> 2 akrun2 166.2ms 172.4ms 5.82 29.5MB 9.70
#> 3 r2evans 173ms 175.2ms 5.67 31.8MB 3.78
#> 4 arrange 66.7ms 75.2ms 11.9 29.5MB 17.9
Created on 2021-12-17 by the reprex package (v2.0.1)
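By default, bench::mark() also checks that all expressions return equal results, which is why the benchmarks need the trailing arrange() calls. A minimal sketch of that equivalence check outside of bench, assuming tie-free data such as the small seeded example (the names `via_filter` and `via_arrange` are illustrative only):

```r
library(dplyr)

set.seed(42)
re <- tibble(
  group   = rep(c("A", "B", "C"), each = 4),
  value   = runif(12),
  n_slice = rep(c(2, 1, 3), each = 4)
)

# filter/rank approach, sorted so that row order is comparable
via_filter <- re %>%
  group_by(group) %>%
  filter(rank(-value) <= n_slice[1]) %>%
  ungroup() %>%
  arrange(group, desc(value))

# arrange/slice approach, sorted the same way
via_arrange <- re %>%
  group_by(group) %>%
  arrange(desc(value)) %>%
  slice(seq_len(first(n_slice))) %>%
  ungroup() %>%
  arrange(group, desc(value))

isTRUE(all.equal(via_filter, via_arrange))
```

With ties in value the two approaches can disagree, so this check is worth repeating on your own data before choosing by speed alone.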