I have a tibble with a lot of groups, and I want to do group-wise operations on it (highly simplified mutate below).
z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
category = sample.int(2, 3000000, replace = TRUE)) %>%
arrange(k1, category)
t1 <- z %>%
group_by(k1) %>%
mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
ungroup()
This operation is very slow. If I instead do the grouping "manually", the code is harder to read and more annoying to write, but roughly 20x faster:
z %>%
mutate(x = if_else(category == 1 & lead(category) == 2 & k1 == lead(k1), "pie", "monkey"),
x = if_else(category == 1 & k1 != lead(k1), NA_character_, x))
So clearly there is some way of using the keys to speed up the process. Is there a better way to do this? I tried data.table, but it was still much slower than the manual technique.
zDT <- z %>% data.table::as.data.table()
zDT[, x := if_else(category == 1 & lead(category) == 2, "pie", "monkey"), by = "k1"]
Any advice for a natural, fast way to do this operation?
CodePudding user response:
We can speed it up without having to use ifelse:
library(data.table)
> system.time(setDT(z)[, x := c("monkey", "pie")[
    1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1])
user system elapsed
18.203 0.146 16.635
> system.time({t1 <- z %>%
group_by(k1) %>%
mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
ungroup()
})
user system elapsed
37.319 0.321 37.523
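Part of both the speed and the NA handling comes from %in% here: for the last row of each group shift(category, type = "lead") is NA, and NA %in% 2 evaluates to FALSE (whereas NA == 2 gives NA), so those rows index position 1 and get "monkey". A minimal base-R illustration:
# %in% treats NA as "no match", while == propagates NA
c(NA, 2) %in% 2
#> [1] FALSE  TRUE
c(NA, 2) == 2
#> [1]    NA  TRUE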
CodePudding user response:
Doing these grouped comparisons is going to be relatively expensive. It's better to vectorize over the whole table if possible. Notice that ifelse is faster than if_else, and data.table's shift is faster than lead.
library(data.table)
library(dplyr)
z <- setorder(data.table(k1 = rep(seq(1, 600000, 1), 5),
category = sample.int(2, 3000000, replace = TRUE)))
t1 <- copy(z)
t2 <- copy(z)
t3 <- copy(z)
t4 <- copy(z)
t5 <- copy(z)
microbenchmark::microbenchmark(
if_else = t1[, x := if_else(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
ifelse = t2[, x := ifelse(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
shift = t3[, x := ifelse(category == 1L & shift(category, -1) == 2L, "pie", "monkey"), k1],
ifelse3 = t4[, x := ifelse(category == 1L, ifelse(k1 == shift(k1, -1), ifelse(shift(category, -1) == 2L, "pie", "monkey"), NA_character_), "monkey")],
logic = t5[, x := c("monkey", NA_character_, "monkey", "pie")[((k1 == shift(k1, -1, 0L))*((shift(category, -1, 0L) == 2) + 1L) + 1L)*(category == 1) + 1L]],
times = 1,
check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> if_else 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484 1
#> ifelse 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634 1
#> shift 9057.7585 9057.7585 9057.7585 9057.7585 9057.7585 9057.7585 1
#> ifelse3 1544.2912 1544.2912 1544.2912 1544.2912 1544.2912 1544.2912 1
#> logic 81.9844 81.9844 81.9844 81.9844 81.9844 81.9844 1
The complexity in the logic is mostly due to the NA behavior. If monkey could take the place of NA, t5 could instead be:
t5[, x := c("monkey", "pie")[((category == 1)*(k1 == shift(k1, -1, 0L))*(shift(category, -1, 0L) == 2)) + 1L]]
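To see how the lookup-vector arithmetic works, here is a minimal sketch on toy data (hypothetical values, using the simplified NA-free version above):
library(data.table)
toy <- data.table(k1 = c(1L, 1L, 2L, 2L), category = c(1L, 2L, 1L, 1L))
# each logical factor is 0/1, so their product is 1 exactly when all three
# conditions hold; adding 1L maps FALSE -> "monkey" and TRUE -> "pie"
idx <- (toy$category == 1L) *
  (toy$k1 == shift(toy$k1, -1, 0L)) *
  (shift(toy$category, -1, 0L) == 2L) + 1L
c("monkey", "pie")[idx]
#> [1] "pie"    "monkey" "monkey" "monkey"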
CodePudding user response:
EDIT: duckdb for the win! 10x faster than @akrun's data.table solution with the same output.
EDIT #2: small nit in the OP where the lead default is not specified, leading to NAs that duckdb was replicating but which are treated as "monkey" in @akrun's data.table answer.
For benchmarking curiosity I looked at the duckdb and collapse packages, which both offer a version of dplyr masking / translation to a faster back-end. The collapse version was a smidge faster than data.table, but duckdb was 10x as fast.
Unit: milliseconds
expr min lq mean median uq max neval
duckdb 809.5969 825.1131 851.222 845.6702 868.2173 900.495 10
Unit: seconds
expr min lq mean median uq max neval
collapse 8.363416 8.456532 8.633155 8.582542 8.835366 8.926974 10
dt 9.211959 9.243295 9.330174 9.324183 9.433316 9.457501 10
I split the benchmarking into two parts because it looked like I couldn't have both collapse and duckdb masking dplyr at the same time.
Part 1
library(DBI); library(duckdb)
con <- dbConnect(duckdb())
duckdb_register(con, "z_duck", z)
microbenchmark::microbenchmark(times = 10,
duckdb = tbl(con, "z_duck") |>
group_by(k1) |>
mutate(x = if_else(category == 1 & lead(category, default = 0) == 2, # EDIT: set a default for when lead(category) is NA at the end of a group, to match the data.table answer
"pie", "monkey")) |>
ungroup() |>
collect())
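When finished with the duckdb side, the registration and connection can be released (standard DBI/duckdb housekeeping, not part of the benchmark):
# release the registered data frame and close the in-process database
duckdb_unregister(con, "z_duck")
dbDisconnect(con, shutdown = TRUE)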
Part 2 (after restarting R fresh)
library(data.table)
options(collapse_mask = "all") # must be set before collapse is loaded
library(collapse)
microbenchmark::microbenchmark(times = 5,
collapse = z |>
group_by(k1) |>
mutate(x = if_else(category == 1 & lead(category) == 2,
"pie", "monkey")) |>
ungroup() |>
collect(),
dt = setDT(z)[, x := c("monkey", "pie")[
1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1]
)
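One caveat (my reading of how the repetitions interact, not something the output shows): setDT(z) converts z in place and := then writes the x column into it, so every repetition after the first overwrites an existing column and z itself ends up modified. Benchmarking against a fresh copy keeps z untouched, e.g. replacing the dt arm with:
# run := against a copy so repeated iterations don't mutate z in place
dt = copy(setDT(z))[, x := c("monkey", "pie")[
1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1]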
I adjusted the duckdb formula by adding default = 0 to the lead() term to conform to the data.table answer. That confirms the same calc is happening:
compare = data.frame(k1 = z$k1, category = z$category,
dt = dt$x, duckdb = duckdb$x)
compare %>%
count(duckdb == dt)
# duckdb == dt n
#1 TRUE 3000000
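Equivalently, assuming the dt and duckdb objects hold the results computed above, base R gives a one-line check:
identical(dt$x, duckdb$x) # expect TRUE, matching the count above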
CodePudding user response:
One option is to load {dtplyr}, which lets you use dplyr syntax and converts it to data.table syntax. To use {dtplyr}, add lazy_dt() before the {dplyr} steps and use as_tibble() at the end of the pipeline to evaluate the generated data.table code. duckdb is faster, though; see the comparison at the bottom.
Show generated data.table code
(not necessary, just used in this answer to explain the process)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
category = sample.int(2, 3000000, replace = TRUE)) %>%
arrange(k1, category)
z %>%
lazy_dt() %>%
group_by(k1) %>%
mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
ungroup() %>%
show_query()
#> copy(`_DT1`)[, `:=`(x = fifelse(category == 1 & shift(category,
#> type = "lead") == 2, "pie", "monkey")), by = .(k1)]
Created on 2022-08-12 by the reprex package (v2.0.1.9000)
Compare times
bench::mark(
duck =
tbl(con, "z_duck") |>
group_by(k1) |>
mutate(x = if_else(category == 1 & lead(category) == 2,
"pie", "monkey")) |>
ungroup() |>
collect()
, dt =
z %>%
lazy_dt() %>%
group_by(k1) %>%
mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
ungroup() %>%
as_tibble()
, dplyr =
z %>%
group_by(k1) %>%
mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
ungroup()
)
# # A tibble: 3 × 13
# expres…¹ min median itr/s…² mem_a…³ gc/se…⁴ n_itr n_gc total_…⁵ result
# <bch:ex> <bch:tm> <bch:tm> <dbl> <bch:b> <dbl> <int> <dbl> <bch:tm> <list>
# 1 duck 691.13ms 691.13ms 1.45 34.4MB 0 1 0 691.13ms <tibble>
# 2 dt 10.64s 10.64s 0.0939 107.6MB 0.939 1 10 10.64s <tibble>
# 3 dplyr 1.68m 1.68m 0.00995 880.3MB 1.20 1 121 1.68m <tibble>