Which of the two is more efficient for adding new columns to a data.table, and why?

Consider the two methods below for adding columns to an existing data.table. One chains the data.table calls using [], and the other adds a single variable per call with :=.

Both methods use the same amount of memory (1.66 GB) at the end, but one of the two looks around 15 ~ 20% faster. My question is: is this speed increase a fluke or real? And if it is real, what causes it?

library(pacman)
p_load(reprex, data.table,magrittr,pryr)
mem_used()
#> 53.7 MB
# initialise a large data.table 
dt1 <- data.table(x= 1:5e7, y = 5e7:1)
mem_used()
#> 454 MB

Method 1

system.time(dt2 <- dt1[,a:=log(x)][,b:=log(y)][,c := a + b])
#>    user  system elapsed 
#>   1.379   0.589   1.968
mem_used()
#> 1.66 GB

Release the memory to start again:

rm(dt1,dt2)
gc()
#>           used (Mb) gc trigger   (Mb)  max used (Mb)
#> Ncells  795598 42.5    1291680   69.0   1291680   69
#> Vcells 1424205 10.9  234120480 1786.2 201449048 1537
mem_used()
#> 55.9 MB
dt1 <- data.table(x= 1:5e7, y = 5e7:1)
mem_used()
#> 456 MB

Method 2

system.time({
dt1[,a:=log(x)]
dt1[,b:=log(y)]
dt1[,c := a + b]
})
#>    user  system elapsed 
#>   1.207   0.472   1.679
mem_used()
#> 1.66 GB

As you can see, Method 2 is 15 ~ 17% faster. Why?

Created on 2022-07-08 by the reprex package (v2.0.1)

Answer:

The difference isn't significant: repeating the calculation many times (100 by default) with microbenchmark shows no difference between the two.

microbenchmark::microbenchmark(chain={dt2 <- dt1[,a:=log(x)][,b:=log(y)][,c := a + b]},
                               seq = {dt1[,a:=log(x)]
                                 dt1[,b:=log(y)]
                                 dt1[,c := a + b]})

Unit: seconds
  expr      min       lq     mean   median       uq      max neval
 chain 3.056398 3.123273 3.207696 3.204743 3.270068 3.500883   100
   seq 3.060816 3.131185 3.208122 3.222308 3.273654 3.483277   100
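
One likely reason the two forms time the same: := updates the table by reference and returns it, so the chained version should perform the same three in-place updates as the sequential one. A minimal sketch of that check, assuming a much smaller table than in the question (the size here is an arbitrary choice for illustration):

```r
library(data.table)

# Small table so the check runs quickly (size chosen only for illustration)
dt1 <- data.table(x = 1:1e5, y = 1e5:1)

# Chained form: each [] call updates dt1 in place and returns it
dt2 <- dt1[, a := log(x)][, b := log(y)][, c := a + b]

# Compare memory addresses: dt2 should be the same object as dt1, not a copy
same_object <- identical(address(dt1), address(dt2))
print(same_object)

# The new columns are visible through dt1 as well
print(names(dt1))
```

If same_object is TRUE, the assignment to dt2 in the chained form only adds a second name for the same table, so it should cost essentially nothing extra.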

A single system.time() run isn't precise enough to measure a 20% difference.
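
To see how noisy a single measurement can be, one option is to repeat the same system.time() call several times on fresh tables and look at the spread. This is only a sketch: time_once is a hypothetical helper, and the table size and number of repetitions are assumptions chosen to keep the run short.

```r
library(data.table)

# Hypothetical helper: time the three sequential updates once on a fresh table
time_once <- function(n = 1e6) {
  dt <- data.table(x = 1:n, y = n:1)
  system.time({
    dt[, a := log(x)]
    dt[, b := log(y)]
    dt[, c := a + b]
  })[["elapsed"]]
}

# Repeat the identical measurement; any spread is run-to-run noise,
# since the code being timed never changes
timings <- replicate(10, time_once())
print(summary(timings))
```

If the gap between the minimum and maximum here is comparable to the gap seen between Method 1 and Method 2, the original difference cannot be distinguished from noise.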