Home > Software engineering >  Why indexing .SD with a number is faster than when the number is stored in a variable
Why indexing .SD with a number is faster than when the number is stored in a variable

Time:01-29

I have a data.table that I want to aggregate by group (customer ID) and look at how their first spend compares to the second spend, third spend etc.

Select e.g. the first row by group using .SD[1]:

agg_dt <- dt[ , .SD[1], by = customer_id]

gives a really quick result (1.94 seconds for a dataset with 5 million rows).

However, if I put the index number (here 1) into a variable (which I want to do so that I can turn this into a function that I can call repeatedly with different numbers) it suddenly goes much slower (5.23 MINUTES).

number_var <- 1
agg_dt <- dt[ , .SD[number_var], by = customer_id]

As far as I am concerned using a number directly and using a number via a variable should make no difference, so what is causing this and how can I fix it?

To show the code fully, I have made a toy example on kaggle:

https://www.kaggle.com/code/keithcooper/data-table-weird-result/notebook

It is much smaller, but even there the difference is 0.00995 secs to 2.71042 secs.

CodePudding user response:

Indexing .SD with a number is "GForce optimized", whereas indexing with the number stored in a variable is not optimized.

This can be verified by setting verbose = TRUE ("turns on status and information messages to the console")

library(data.table)
d = data.table(g = rep(1:2, each = 2), v = 1:4)

For each group, index .SD with a number, e.g. 1:

d[ , .SD[1], by = g, verbose = TRUE]
#...snip... 
# lapply optimization changed j from '.SD[1]' to 'list(v[1])'
# GForce optimized j to 'list(`g[`(v, 1))'
# Making each group and running j (GForce TRUE)
#...snip...
#    g v
# 1: 1 1
# 2: 2 3 

Index .SD with a number stored in a variable

ix = 1
d[ , .SD[ix], by = g, verbose = TRUE]
#...snip...
# lapply optimization is on, j unchanged as '.SD[ix]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... The result of j is a named list.
# It's very inefficient to create the same names over and over again for each group.

Performance can be improved considerably by using .I[ix] instead. However, neither that operation is GForce optimized.

d[d[, .I[ix], by = g, verbose = TRUE]$V1]
#...snip...
# lapply optimization is on, j unchanged as '.I[ix]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE)

As noted by @jangorecki, in the development version of data.table, the env argument can be used (here env = list(ix = ix), with 'ix' as above). Then GForce is optimized also when the index is stored in a variable.

# data.table::update_dev_pkg()
library(data.table)
# data.table 1.14.7 IN DEVELOPMENT 

Indexing .SD:

d[ , .SD[ix], by = g, env = list(ix = ix), verbose = TRUE]
#...snip...
# lapply optimization changed j from '.SD[1]' to 'list(v[1])'
# GForce optimized j to 'list(`g[`(v, 1))'
# Making each group and running j (GForce TRUE)  

Indexing .I:

d[d[, .I[ix], by = g, env = list(ix = ix), verbose = TRUE]$V1]
#...snip...
# lapply optimization is on, j unchanged as '.I[1]'
# GForce optimized j to '`g[`(.I, 1)'
# Making each group and running j (GForce TRUE)

See also ?datatable.optimize.

  • Related