I have a data.table that I want to aggregate by group (customer ID) and look at how their first spend compares to the second spend, third spend etc.
Select e.g. the first row by group using .SD[1]
:
agg_dt <- dt[ , .SD[1], by = customer_id]
gives a really quick result (1.94 seconds for a dataset with 5 million rows).
However, if I put the index number (here 1) into a variable (which I want to do so that I can turn this into a function that I can call repeatedly with different numbers) it suddenly goes much slower (5.23 MINUTES).
number_var <- 1
agg_dt <- dt[ , .SD[number_var], by = customer_id]
As far as I am concerned using a number directly and using a number via a variable should make no difference, so what is causing this and how can I fix it?
To show the code fully, I have made a toy example on kaggle:
https://www.kaggle.com/code/keithcooper/data-table-weird-result/notebook
It is much smaller, but even there the difference is 0.00995 secs to 2.71042 secs.
CodePudding user response:
Indexing .SD
with a number is "GForce optimized", whereas indexing with the number stored in a variable is not optimized.
This can be verified by setting verbose = TRUE
("turns on status and information messages to the console")
library(data.table)
d = data.table(g = rep(1:2, each = 2), v = 1:4)
For each group, index .SD
with a number, e.g. 1:
d[ , .SD[1], by = g, verbose = TRUE]
#...snip...
# lapply optimization changed j from '.SD[1]' to 'list(v[1])'
# GForce optimized j to 'list(`g[`(v, 1))'
# Making each group and running j (GForce TRUE)
#...snip...
# g v
# 1: 1 1
# 2: 2 3
Index .SD
with a number stored in a variable
ix = 1
d[ , .SD[ix], by = g, verbose = TRUE]
#...snip...
# lapply optimization is on, j unchanged as '.SD[ix]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... The result of j is a named list.
# It's very inefficient to create the same names over and over again for each group.
Performance can be improved considerably by using .I[ix]
instead. However, neither that operation is GForce optimized.
d[d[, .I[ix], by = g, verbose = TRUE]$V1]
#...snip...
# lapply optimization is on, j unchanged as '.I[ix]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE)
As noted by @jangorecki, in the development version of data.table
, the env
argument can be used (here env = list(ix = ix)
, with 'ix' as above). Then GForce is optimized also when the index is stored in a variable.
# data.table::update_dev_pkg()
library(data.table)
# data.table 1.14.7 IN DEVELOPMENT
Indexing .SD
:
d[ , .SD[ix], by = g, env = list(ix = ix), verbose = TRUE]
#...snip...
# lapply optimization changed j from '.SD[1]' to 'list(v[1])'
# GForce optimized j to 'list(`g[`(v, 1))'
# Making each group and running j (GForce TRUE)
Indexing .I
:
d[d[, .I[ix], by = g, env = list(ix = ix), verbose = TRUE]$V1]
#...snip...
# lapply optimization is on, j unchanged as '.I[1]'
# GForce optimized j to '`g[`(.I, 1)'
# Making each group and running j (GForce TRUE)
See also ?datatable.optimize
.