Here's a reprex.
library(data.table)
# create a dummy data.table
dt <- data.table(
  ts      = seq.POSIXt(Sys.time(), by = 60, length.out = 1000),
  regn    = rpois(1000, 100),
  VR      = rnorm(1000, 50, sd = 5),
  rv      = rnorm(1000, 500, sd = 20),
  topic   = sample(c("A", "B", "C", "D", "E"), size = 1000, replace = TRUE),
  hrblock = xts::align.time(seq.POSIXt(Sys.time(), by = 60, length.out = 1000), n = 60)
)
# run some groupings
# The groupings are exactly what I need in my production code, except
# that here they are working on random data. Still, the code
# demonstrates the same slowdown that I see in the actual code.
microbenchmark::microbenchmark(
  dt[, .(.N,
         t1 = first(ts), t2 = last(ts),
         r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
         r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
         watts = fifelse(regn == 101, mean(VR), NA_real_),
         l1 = first(rv), l2 = last(rv)),
     .(hrblock, topic, regn)]
)
#> Unit: milliseconds
#> expr
#> dt[, .(.N, t1 = first(ts), t2 = last(ts), r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_), r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_), watts = fifelse(regn == 101, mean(VR), NA_real_), l1 = first(rv), l2 = last(rv)), .(hrblock, topic, regn)]
#> min lq mean median uq max neval
#> 51.30181 54.83056 57.41794 56.55636 57.99337 90.92381 100
Created on 2022-12-19 by the reprex package (v2.0.1)
So a 1,000-row data.table takes close to 56 milliseconds for this grouping, which looks quite slow. In real life I run these summaries on hundreds of thousands to a million rows, and the user interface becomes very sluggish.
Am I making a fundamental mistake in the grouping? I tried setkey before execution, but it did not speed up the code; a sketch of that attempt is below.
I am hoping for a 5- to 10-fold improvement in response time. Any help would be highly appreciated.
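For reference, the setkey attempt looked roughly like this (a sketch; the exact call is not shown above, and the key order is my reconstruction):
# key on the grouping columns before running the grouped query
setkey(dt, hrblock, topic, regn)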
CodePudding user response:
If a large share of your groups contain only one row, I strongly suggest splitting the task: run the grouped operation only on the groups with N > 1, and a simple vectorized operation on the groups with N == 1. That way you make use of vectorization and avoid a large number of unnecessary per-group function calls.
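As an aside (my addition, not part of the original answer): wrapping aggregates such as first(), last(), and mean() inside fifelse() prevents data.table's internal GForce optimization of j, so each group falls back to ordinary R evaluation. Whether GForce is applied can be checked by running a query with verbose = TRUE:
# simple j: the verbose output reports whether GForce optimized the query
dt[, .(m = mean(VR)), by = .(hrblock, topic, regn), verbose = TRUE]
# with fifelse() around the aggregate, j is evaluated per group in R instead
dt[, .(m = fifelse(regn == 101, mean(VR), NA_real_)),
   by = .(hrblock, topic, regn), verbose = TRUE]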
microbenchmark::microbenchmark(
  at_once = dt[, .(.N,
                   t1 = first(ts), t2 = last(ts),
                   r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
                   r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
                   watts = fifelse(regn == 101, mean(VR), NA_real_),
                   l1 = first(rv), l2 = last(rv)),
               .(hrblock, topic, regn)],
  splitted = {
    # flag the group sizes once, then handle singleton groups without grouping
    dt[, N := .N, by = .(hrblock, topic, regn)]
    dt1 <- dt[N > 1,
              .(N,
                t1 = first(ts), t2 = last(ts),
                r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
                r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
                watts = fifelse(regn == 101, mean(VR), NA_real_),
                l1 = first(rv), l2 = last(rv)),
              .(hrblock, topic, regn)]
    dt2 <- dt[N == 1,
              .(hrblock, topic, regn,
                N,
                t1 = ts,
                t2 = ts,
                r1 = fifelse(regn %in% c(100, 101, 102), VR, NA_real_),
                r2 = fifelse(regn %in% c(100, 101, 102), VR, NA_real_),
                watts = fifelse(regn == 101, VR, NA_real_),
                l1 = rv,
                l2 = rv)]
    rbind(dt1, dt2)
  }
)
#>
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> at_once 33.666042 34.058334 37.66860 34.898542 39.61844 136.997209 100
#> splitted 2.610042 2.667168 3.02075 2.972376 3.05921 8.958875 100
# check that both approaches return the same result (at_once and splitted here
# refer to the two results, each computed once outside the benchmark)
all.equal(splitted, at_once)
#> [1] TRUE
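One caveat (my addition, not in the original answer): rbind(dt1, dt2) stacks the N > 1 groups before the singleton groups, so the row order of the combined result can differ from the single-call version. If downstream code depends on the order, re-sort the combined table by the grouping columns, for example:
res <- rbind(dt1, dt2)
# setorder() sorts by reference, so it stays cheap even on large tables
setorder(res, hrblock, topic, regn)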