Speeding up group_by operations in dplyr


I have a tibble with a lot of groups, and I want to do group-wise operations on it (highly simplified mutate below).

z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
            category = sample.int(2, 3000000, replace = TRUE)) %>%
  arrange(k1, category)
t1 <- z %>% 
  group_by(k1) %>%
  mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>% 
  ungroup()

This operation is very slow. If I instead handle the grouping "manually", the code is harder to read and more annoying to write, but much (~20x) faster.

z %>%
  mutate(x = if_else(category == 1 & lead(category) == 2 & k1 == lead(k1), "pie", "monkey"),
         x = if_else(category == 1 & k1 != lead(k1), NA_character_, x)) 

So clearly there is some way to use the keys to speed up the process. Is there a better way to do this? I tried data.table, but it was still much slower than the manual technique.

zDT <- z %>% data.table::as.data.table()
zDT[, x := if_else(category == 1 & lead(category) == 2, "pie", "monkey"), by = "k1"]

Any advice for a natural, fast way to do this operation?

CodePudding user response:

We can speed this up by indexing into a character vector instead of calling if_else; the %in% comparison also handles the NA that the lead shift produces at the end of each group (illustrated below).

library(data.table)
> system.time(setDT(z)[, x := c("monkey", "pie")[ 
     1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1])
   user  system elapsed 
 18.203   0.146  16.635 
> system.time({t1 <- z %>% 
   group_by(k1) %>%
   mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>% 
   ungroup()
 })
   user  system elapsed 
 37.319   0.321  37.523 
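
A minimal illustration (toy values, not from the original answer) of why the %in% comparison matters here: the lead shift leaves an NA on the last row of each group, and == propagates that NA while %in% treats it as FALSE, so those rows fall through to "monkey".

x <- c(2, NA)  # NA stands in for the shifted category on a group's last row
x == 2         # TRUE    NA
x %in% 2       # TRUE FALSE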

CodePudding user response:

Doing these grouped comparisons is relatively expensive. It's better to vectorize over the whole table when possible. Notice that ifelse is faster than if_else, and data.table's shift is faster than dplyr's lead.

library(data.table)
library(dplyr)

z <- setorder(data.table(k1 = rep(seq(1, 600000, 1), 5),
                         category = sample.int(2, 3000000, replace = TRUE)))
t1 <- copy(z)
t2 <- copy(z)
t3 <- copy(z)
t4 <- copy(z)
t5 <- copy(z)
microbenchmark::microbenchmark(
  if_else = t1[, x := if_else(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
  ifelse = t2[, x := ifelse(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
  shift = t3[, x := ifelse(category == 1L & shift(category, -1) == 2L, "pie", "monkey"), k1],
  ifelse3 = t4[, x := ifelse(category == 1L, ifelse(k1 == shift(k1, -1), ifelse(shift(category, -1) == 2L, "pie", "monkey"), NA_character_), "monkey")],
  logic = t5[, x := c("monkey", NA_character_, "monkey", "pie")[((k1 == shift(k1, -1, 0L))*((shift(category, -1, 0L) == 2) + 1L) + 1L)*(category == 1) + 1L]],
  times = 1,
  check = "identical"
)
#> Unit: milliseconds
#>     expr        min         lq       mean     median         uq        max neval
#>  if_else 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484     1
#>   ifelse 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634     1
#>    shift  9057.7585  9057.7585  9057.7585  9057.7585  9057.7585  9057.7585     1
#>  ifelse3  1544.2912  1544.2912  1544.2912  1544.2912  1544.2912  1544.2912     1
#>    logic    81.9844    81.9844    81.9844    81.9844    81.9844    81.9844     1

The complexity in the logic is mostly due to the NA behavior. If "monkey" could take the place of NA, t5 could instead be:

t5[, x := c("monkey", "pie")[((category == 1)*(k1 == shift(k1, -1, 0L))*(shift(category, -1, 0L) == 2)) + 1L]]
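
As a minimal sketch of the lookup-vector trick with toy inputs (not the real data): in arithmetic, logicals coerce to 0/1, so adding 1L turns a condition straight into an index into a character vector, with no ifelse call at all.

library(data.table)
cat_ <- c(1L, 2L, 1L, 1L)
# FALSE -> index 1 -> "monkey"; TRUE -> index 2 -> "pie"
c("monkey", "pie")[(cat_ == 1L & shift(cat_, -1L, 0L) == 2L) + 1L]
#> [1] "pie"    "monkey" "monkey" "monkey"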

CodePudding user response:

EDIT: duckdb for the win! 10x faster than @akrun's data.table solution with the same output.

EDIT #2: small nit in the OP: because no default is specified, lead() produces NAs at the end of each group, which duckdb was replicating but which @akrun's data.table answer treats as "monkey".
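
A toy illustration of the nit (a hypothetical three-row example, not the benchmark data): without a default, lead() is NA on the last row of each group, and where category == 1 that NA propagates through if_else().

library(dplyr)
d <- tibble(k1 = c(1, 1, 2), category = c(1, 2, 1))
d %>%
  group_by(k1) %>%
  mutate(no_default   = if_else(category == 1 & lead(category) == 2, "pie", "monkey"),
         with_default = if_else(category == 1 & lead(category, default = 0) == 2, "pie", "monkey")) %>%
  ungroup()
# no_default is NA on the last row of group 2 (category 1 with no next row);
# with_default yields "monkey" there instead.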

For benchmarking curiosity I looked at the duckdb and collapse packages, which both offer a version of dplyr masking / translation to a faster back-end. The collapse version was a smidge faster than data.table, but duckdb was about 10x as fast.

Unit: milliseconds
   expr      min       lq    mean   median       uq     max neval
 duckdb 809.5969 825.1131 851.222 845.6702 868.2173 900.495    10


Unit: seconds
     expr      min       lq     mean   median       uq      max neval
 collapse 8.363416 8.456532 8.633155 8.582542 8.835366 8.926974    10
       dt 9.211959 9.243295 9.330174 9.324183 9.433316 9.457501    10

I split the benchmarking into two parts because it looked like I couldn't have both collapse and duckdb masking dplyr at the same time.

Part 1

library(DBI); library(duckdb)
con <- dbConnect(duckdb())
duckdb_register(con, "z_duck", z)

microbenchmark::microbenchmark(times = 10,
  duckdb = tbl(con, "z_duck") |>
    group_by(k1) |>
    mutate(x = if_else(category == 1 & lead(category, default = 0) == 2, # EDIT: set a default for when lead(category) is NA at the end of a group, to match the data.table answer
                       "pie", "monkey")) |>
    ungroup() |>
    collect())

Part 2 (after restarting R fresh)

library(data.table)
# collapse_mask must be set before collapse is loaded for the masking to take effect
options(collapse_mask = "all")
library(collapse)

microbenchmark::microbenchmark(times = 5,
  collapse = z |> 
    group_by(k1) |>
    mutate(x = if_else(category == 1 & lead(category) == 2,  
                       "pie", "monkey")) |>
    ungroup() |>
    collect(),
  
  dt = setDT(z)[, x := c("monkey", "pie")[ 
    1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1]
)

I adjusted the duckdb formula by adding default = 0 to the lead() term to conform to the data.table answer. That confirms the same calculation is happening:

# dt and duckdb hold the results of the data.table and duckdb pipelines above
compare = data.frame(k1 = z$k1, category = z$category, 
                     dt = dt$x, duckdb = duckdb$x)
compare %>%
  count(duckdb == dt)
#  duckdb == dt       n
#1         TRUE 3000000

CodePudding user response:

One option is to load {dtplyr}, which lets you use dplyr syntax and converts it to data.table syntax. To use {dtplyr}, add lazy_dt() before the {dplyr} steps and use as_tibble() at the end of the pipeline to evaluate the generated data.table code.

duckdb is faster, though; see the comparison at the bottom.

Show generated data.table code

(not necessary, just used in this answer to explain the process)

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
            category = sample.int(2, 3000000, replace = TRUE)) %>%
  arrange(k1, category)

z %>% 
  lazy_dt() %>% 
  group_by(k1) %>%
  mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>% 
  ungroup() %>% 
  show_query()
#> copy(`_DT1`)[, `:=`(x = fifelse(category == 1 & shift(category, 
#>     type = "lead") == 2, "pie", "monkey")), by = .(k1)]
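
Note from the generated code that dtplyr translates if_else() to data.table's fifelse() and lead() to shift(type = "lead"), consistent with the earlier answer's observation that those are the faster variants.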

Created on 2022-08-12 by the reprex package (v2.0.1.9000)

Compare times

bench::mark(
duck = 
  tbl(con, "z_duck") |>
    group_by(k1) |>
    mutate(x = if_else(category == 1 & lead(category) == 2,  
                       "pie", "monkey")) |>
    ungroup() |>
    collect()
, dt = 
  z %>% 
    lazy_dt() %>% 
    group_by(k1) %>%
    mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>% 
    ungroup() %>% 
    as_tibble()
, dplyr =
  z %>% 
    group_by(k1) %>%
    mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>% 
    ungroup()
)
# # A tibble: 3 × 13
#   expression      min   median  itr/sec mem_alloc  gc/sec n_itr  n_gc total_time result  
#   <bch:expr> <bch:tm> <bch:tm>    <dbl>   <bch:b>   <dbl> <int> <dbl>   <bch:tm> <list>  
# 1 duck       691.13ms 691.13ms  1.45       34.4MB   0         1     0   691.13ms <tibble>
# 2 dt           10.64s   10.64s  0.0939    107.6MB   0.939     1    10     10.64s <tibble>
# 3 dplyr         1.68m    1.68m  0.00995   880.3MB   1.20      1   121      1.68m <tibble>