Here's a reprex.
library(data.table)
# create a dummy data.table
dt <- data.table(
  ts      = seq.POSIXt(Sys.time(), by = 60, length.out = 1000),
  regn    = rpois(1000, 100),
  VR      = rnorm(1000, 50, sd = 5),
  rv      = rnorm(1000, 500, sd = 20),
  topic   = sample(c("A", "B", "C", "D", "E"), size = 1000, replace = TRUE),
  hrblock = xts::align.time(seq.POSIXt(Sys.time(), by = 60, length.out = 1000), n = 60)
)
# run some groupings
# The groupings are exactly what I need in my production code, except
# that here they are working on random data. Still, the code
# demonstrates the same slowdown that I see in the actual code.
microbenchmark::microbenchmark(
  dt[, .(.N,
         t1 = first(ts), t2 = last(ts),
         r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
         r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
         watts = fifelse(regn == 101, mean(VR), NA_real_),
         l1 = first(rv), l2 = last(rv)),
     .(hrblock, topic, regn)]
)
#> Unit: milliseconds
#> expr
#> dt[, .(.N, t1 = first(ts), t2 = last(ts), r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_), r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_), watts = fifelse(regn == 101, mean(VR), NA_real_), l1 = first(rv), l2 = last(rv)), .(hrblock, topic, regn)]
#> min lq mean median uq max neval
#> 51.30181 54.83056 57.41794 56.55636 57.99337 90.92381 100
Created on 2022-12-19 by the reprex package (v2.0.1)
So a 1,000-row data.table takes close to 56 milliseconds for this grouping, which looks quite slow. In real life I run these summaries on hundreds of thousands to a million rows, and the user interface becomes very sluggish.
Am I making a fundamental mistake in the grouping? I tried setkey before execution, but it did not speed up the code; a sketch of that attempt is below.
I am hoping for a 5- to 10-fold improvement in response time. Any help would be highly appreciated.
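For reference, the setkey attempt looked roughly like this (a sketch; the exact call is not shown above, and the key order is my reconstruction):
# key on the grouping columns before running the grouped query
setkey(dt, hrblock, topic, regn)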
CodePudding user response:
If a large share of your groups contain only one row, I strongly suggest splitting the task: run the grouped operation only on the groups with N > 1, and a simple vectorized operation on the groups with N == 1. That way you make use of vectorization and avoid a large number of unnecessary per-group function calls.
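As an aside (my addition, not part of the original answer): wrapping aggregates such as first(), last(), and mean() inside fifelse() prevents data.table's internal GForce optimization of j, so each group falls back to ordinary R evaluation. Whether GForce is applied can be checked by running a query with verbose = TRUE:
# simple j: the verbose output reports whether GForce optimized the query
dt[, .(m = mean(VR)), by = .(hrblock, topic, regn), verbose = TRUE]
# with fifelse() around the aggregate, j is evaluated per group in R instead
dt[, .(m = fifelse(regn == 101, mean(VR), NA_real_)),
   by = .(hrblock, topic, regn), verbose = TRUE]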
microbenchmark::microbenchmark(
  at_once = dt[, .(.N,
                   t1 = first(ts), t2 = last(ts),
                   r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
                   r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
                   watts = fifelse(regn == 101, mean(VR), NA_real_),
                   l1 = first(rv), l2 = last(rv)),
               .(hrblock, topic, regn)],
  splitted = {
    # flag the group sizes once, then handle singleton groups without grouping
    dt[, N := .N, by = .(hrblock, topic, regn)]
    dt1 <- dt[N > 1,
              .(N,
                t1 = first(ts), t2 = last(ts),
                r1 = fifelse(regn %in% c(100, 101, 102), first(VR), NA_real_),
                r2 = fifelse(regn %in% c(100, 101, 102), last(VR), NA_real_),
                watts = fifelse(regn == 101, mean(VR), NA_real_),
                l1 = first(rv), l2 = last(rv)),
              .(hrblock, topic, regn)]
    dt2 <- dt[N == 1,
              .(hrblock, topic, regn,
                N,
                t1 = ts,
                t2 = ts,
                r1 = fifelse(regn %in% c(100, 101, 102), VR, NA_real_),
                r2 = fifelse(regn %in% c(100, 101, 102), VR, NA_real_),
                watts = fifelse(regn == 101, VR, NA_real_),
                l1 = rv,
                l2 = rv)]
    rbind(dt1, dt2)
  }
)
#>
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> at_once 33.666042 34.058334 37.66860 34.898542 39.61844 136.997209 100
#> splitted 2.610042 2.667168 3.02075 2.972376 3.05921 8.958875 100
# check that both approaches return the same result (at_once and splitted here
# refer to the two results, each computed once outside the benchmark)
all.equal(splitted, at_once)
#> [1] TRUE
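One caveat (my addition, not in the original answer): rbind(dt1, dt2) stacks the N > 1 groups before the singleton groups, so the row order of the combined result can differ from the single-call version. If downstream code depends on the order, re-sort the combined table by the grouping columns, for example:
res <- rbind(dt1, dt2)
# setorder() sorts by reference, so it stays cheap even on large tables
setorder(res, hrblock, topic, regn)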