I am trying to improve the performance of the following code. The objective is to find the most recent observation as of a given time (in this case from_time + 5000) for each id:
library(data.table)
from_time <- Sys.time()
d <- cumsum(sample(c(0.02, 0.1, 0.2), 200000, TRUE))
ts <- from_time + d
DF <- expand.grid(t = ts, id = 1:300)
N <- nrow(DF)
DT = data.table(t = DF$t,
id = DF$id,
x = sample(100L, N, TRUE),
y = sample(1000L, N, TRUE),
z = rnorm(N))
key(DT) # NULL
system.time(
result <- DT[
t <= from_time + 5000
, .SD[order(-t), .(x, y, z)][1]
, by = .(id)
]
)
This gave me:
user system elapsed
1.78 0.78 1.72
I tried adding a key on id and t, but it doesn't seem to help:
setkey(DT, id, t)
key(DT) #[1] "id" "t"
system.time(
result <- DT[
t <= from_time + 5000
, .SD[order(-t), .(x, y, z)][1]
, by = .(id)
]
)
user system elapsed
1.90 0.31 1.83
Am I right to think that the key did not improve performance? Is there a way to improve performance here?
CodePudding user response:
One area for improved performance is the ordering by t. You don't want/need to order t within each .SD. You could try something like this instead:
DT[t <= (from_time + 5000)][order(-t), first(.SD), by = id][, t := NULL]
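To see why this works, here is a minimal self-contained sketch using hypothetical toy data (not the original DT): after filtering, the table is ordered by descending t once, and first(.SD) then picks the most recent surviving row in each group.

```r
library(data.table)

# Hypothetical toy data: 3 ids, 4 timestamps each
toy <- data.table(
  t  = rep(1:4, times = 3),
  id = rep(1:3, each = 4),
  x  = 1:12
)
cutoff <- 3

# Filter once, order the whole filtered table once by descending t,
# then take the first row per id and drop the t column
res <- toy[t <= cutoff][order(-t), first(.SD), by = id][, t := NULL]
res
# id 1 -> x = 3, id 2 -> x = 7, id 3 -> x = 11
```

The key point is that order(-t) runs once over the filtered table rather than once per group inside .SD.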
CodePudding user response:
The key did not improve performance because the expression order(-t) in i (.SD[order(-t), .(x, y, z)]) destroys the key that you set. Instead, do not reorder the data after setting the key, and use the function last to benefit from internal optimization:
library(data.table)
setkey(DT, id, t)
system.time(DT[t <= from_time + 5000, last(.SD), id][, !"t"])
user system elapsed
0.49 0.01 0.36
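Since the task is "most recent observation as of a time", a rolling join on the keyed table is another idiom worth knowing; it is not shown in the answers above, so treat this as a hedged alternative sketch on the same hypothetical toy data rather than the answerers' method. With roll = TRUE, each (id, cutoff) lookup row is matched to the row with the largest t <= cutoff for that id (last observation carried forward).

```r
library(data.table)

# Hypothetical toy data, keyed on (id, t) so rolling joins apply
toy <- data.table(
  t  = rep(1:4, times = 3),
  id = rep(1:3, each = 4),
  x  = 1:12
)
setkey(toy, id, t)
cutoff <- 3

# One lookup row per id at time `cutoff`; roll = TRUE rolls the join
# backward to the most recent observation at or before that time
res <- toy[.(unique(toy$id), cutoff), roll = TRUE]
res
# one row per id, carrying the x value of its latest row with t <= 3
```

This avoids grouping altogether, which can help when there are many groups but only one "as of" time is needed.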