I am trying to improve the performance of the following code. The objective is to find the most recent observation as of a given time (in this case from_time + 5000) for each id:
library(data.table)
from_time <- Sys.time()
d <- cumsum(sample(c(0.02, 0.1, 0.2), 200000, TRUE))
ts <- from_time + d
DF <- expand.grid(t = ts, id = 1:300)
N <- nrow(DF)
DT = data.table(t = DF$t,
id = DF$id,
x = sample(100L, N, TRUE),
y = sample(1000L, N, TRUE),
z = rnorm(N))
key(DT) # NULL
system.time(
result <- DT[
t <= from_time + 5000
, .SD[order(-t), .(x, y, z)][1]
, by = .(id)
]
)
This gave me:
user system elapsed
1.78 0.78 1.72
I tried adding a key on id and t, but it doesn't seem to help:
setkey(DT, id, t)
key(DT) #[1] "id" "t"
system.time(
result <- DT[
t <= from_time + 5000
, .SD[order(-t), .(x, y, z)][1]
, by = .(id)
]
)
user system elapsed
1.90 0.31 1.83
Am I right to think that the key did not improve performance? Is there a way to improve performance here?
CodePudding user response:
One area for improved performance is the ordering by t. You don't want/need to order t within each .SD. You could try something like this instead:
DT[t <= (from_time + 5000)][order(-t), first(.SD), by = id][, t := NULL]
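To see why this works, here is a minimal self-contained sketch using hypothetical toy data (not the original DT): after filtering, the table is ordered by descending t once, and first(.SD) then picks the most recent surviving row in each group.

```r
library(data.table)

# Hypothetical toy data: 3 ids, 4 timestamps each
toy <- data.table(
  t  = rep(1:4, times = 3),
  id = rep(1:3, each = 4),
  x  = 1:12
)
cutoff <- 3

# Filter once, order the whole filtered table once by descending t,
# then take the first row per id and drop the t column
res <- toy[t <= cutoff][order(-t), first(.SD), by = id][, t := NULL]
res
# id 1 -> x = 3, id 2 -> x = 7, id 3 -> x = 11
```

The key point is that order(-t) runs once over the filtered table rather than once per group inside .SD.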
CodePudding user response:
The key did not improve performance because the expression order(-t) in i (.SD[order(-t), .(x, y, z)]) destroys the key that you set. Instead, do not reorder the data after setting the key, and use the function last to benefit from internal optimization:
library(data.table)
setkey(DT, id, t)
system.time(DT[t <= from_time + 5000, last(.SD), id][, !"t"])
user system elapsed
0.49 0.01 0.36
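Since the task is "most recent observation as of a time", a rolling join on the keyed table is another idiom worth knowing; it is not shown in the answers above, so treat this as a hedged alternative sketch on the same hypothetical toy data rather than the answerers' method. With roll = TRUE, each (id, cutoff) lookup row is matched to the row with the largest t <= cutoff for that id (last observation carried forward).

```r
library(data.table)

# Hypothetical toy data, keyed on (id, t) so rolling joins apply
toy <- data.table(
  t  = rep(1:4, times = 3),
  id = rep(1:3, each = 4),
  x  = 1:12
)
setkey(toy, id, t)
cutoff <- 3

# One lookup row per id at time `cutoff`; roll = TRUE rolls the join
# backward to the most recent observation at or before that time
res <- toy[.(unique(toy$id), cutoff), roll = TRUE]
res
# one row per id, carrying the x value of its latest row with t <= 3
```

This avoids grouping altogether, which can help when there are many groups but only one "as of" time is needed.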