I've written a query in R on a large-ish data frame (200k rows, about 50 columns).
Small reprex:
library(dplyr)
library(lubridate)
df <- data.frame(index = c(1, 1, 1, 1,
                           2, 2, 2, 2),
                 date_1 = as.Date(c("2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01",
                                    "2020-07-01", "2021-07-01", "2022-07-01", "2023-07-01")))
df <- df %>%
  group_by(index) %>%
  mutate(date_2 = lead(date_1) %m-% days(1)) %>%
  ungroup()
It's very slow. The group_by() stage (which is important) chops my real-life data frame up into about 40-50k groups.
Is there any easy way to speed this up, please (preferably using tidyverse stuff)? I'm wondering whether the slowness is likely to be due to the group_by(), the lead(), or the lubridate stuff (%m-% days(1)).
Thank you.
CodePudding user response:
It's difficult to benchmark "well" on data this small, so I'll scale it up a bit:
library(dplyr)
library(data.table)
library(lubridate)
df200 <- bind_rows(lapply(1:200, function(i) transform(df, index = index + 2*i)))
df200DT <- as.data.table(df200)
out1 <- df200 %>%
  group_by(index) %>%
  mutate(date_2 = lead(date_1) %m-% days(1)) %>%
  ungroup()
out2 <- df200DT[, date_2 := shift(date_1 %m-% days(1), type="lead"), by = index]
all.equal(out1, out2, check.attributes = FALSE)
# [1] TRUE
A simple benchmark reveals slight differences. (Note that I used two different R sessions so that I could compare dplyr-1.0.10 with dplyr-1.1.0, which was recently released with some big changes.)
bench::mark(
  # in a different process
  dplyr_1.0.10 = df200 %>% group_by(index) %>% mutate(date_2 = lead(date_1) %m-% days(1)) %>% ungroup(),
  # in this process :-)
  dplyr_1.1.0 = df200 %>% group_by(index) %>% mutate(date_2 = lead(date_1) %m-% days(1)) %>% ungroup(),
  dt = df200DT[, date_2 := shift(date_1 %m-% days(1), type="lead"), by = index],
  check = FALSE, min_iterations = 50)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr_1.0.10 360ms 374ms 2.63 594KB 14.5 50 275 19s <NULL> <Rprofmem [1,706 × 3]> <bench_tm [50]> <tibble [50 × 3]>
# 2 dplyr_1.1.0 350ms 369ms 2.62 594KB 19.3 50 368 19.1s <NULL> <Rprofmem [1,662 × 3]> <bench_tm [50]> <tibble [50 × 3]>
# 3 dt 295ms 310ms 3.17 298KB 19.9 50 315 15.8s <NULL> <Rprofmem [811 × 3]> <bench_tm [50]> <tibble [50 × 3]>
So it doesn't appear to be something that dplyr-1.1.0 or data.table can improve by itself. However, if you remove the use of %m-%, you get some significant improvements. (This works easily here because subtracting days from a Date is straightforward; if you were subtracting months, it would be a little more problematic.)
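As a quick aside (not part of the benchmark below), the month case is trickier because month lengths vary and %m-% rolls otherwise-invalid dates back to the last day of the month, so no fixed day-offset can stand in for it:
# Why a fixed day-offset cannot replace %m-% months(1): month lengths vary,
# and lubridate rolls invalid dates back to the end of the month.
as.Date("2021-03-15") %m-% months(1)  # "2021-02-15" (28 days earlier)
as.Date("2021-04-15") %m-% months(1)  # "2021-03-15" (31 days earlier)
as.Date("2021-03-31") %m-% months(1)  # "2021-02-28" (rolled back; "2021-02-31" does not exist)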
bench::mark(
  dplyr_1.1.0_nolubridate = df200 %>% group_by(index) %>% mutate(date_2 = lead(date_1 - 1L)) %>% ungroup(),
  dt_nolubridate = df200DT[, date_2 := shift(date_1 - 1L, type="lead"), by = index],
  ave = cbind(df200, date_2 = ave(df200$date_1, df200$index, FUN = \(x) dplyr::lead(x) %m-% days(1))),
  ave1 = cbind(df200, date_2 = ave(df200$date_1, df200$index, FUN = \(x) dplyr::lead(x) - 1)),
  check = FALSE, min_iterations = 50)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr_1.1.0_nolubridate 27.5ms 31.7ms 31.8 221KB 21.6 50 34 1.57s <NULL> <Rprofmem [467 × 3]> <bench_tm [50]> <tibble [50 × 3]>
# 2 dt_nolubridate 5.77ms 6.62ms 137. 42.01KB 25.7 69 13 505.49ms <NULL> <Rprofmem [8 × 3]> <bench_tm [69]> <tibble [69 × 3]>
# 3 ave 360.27ms 399.78ms 2.48 5.6MB 18.1 50 366 20.18s <NULL> <Rprofmem [3,177 × 3]> <bench_tm [50]> <tibble [50 × 3]>
# 4 ave1 38.47ms 45.15ms 20.9 5.28MB 23.0 50 55 2.39s <NULL> <Rprofmem [1,260 × 3]> <bench_tm [50]> <tibble [50 × 3]>
I added GKi's ave code for comparison, and I did not rerun dplyr-1.0.10 (since it did not appear to change much). Note that while dplyr is quite a bit faster without lubridate, the data.table solution is significantly faster still without it, thanks to its internal optimization of some code paths.
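One more sketch (not benchmarked above, just an option): if you do need to keep the lubridate semantics, say you really are subtracting months, you can still avoid calling %m-% once per group by doing the date arithmetic on the whole column before grouping, leaving only the cheap lead() inside the grouped mutate(). The helper column name date_tmp is just for illustration:
out3 <- df200 %>%
  mutate(date_tmp = date_1 %m-% days(1)) %>%   # one vectorised lubridate call for the whole column
  group_by(index) %>%
  mutate(date_2 = lead(date_tmp)) %>%          # only the lead() runs per group
  ungroup() %>%
  select(-date_tmp)
Because %m-% works elementwise, shifting before or after the subtraction gives the same result, so out3 should match out1 while making a single lubridate call instead of one per group.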