Consider the two methods below for adding columns to an existing data.table: one chains data.table calls with [], the other adds one variable at a time with :=. Both methods utilize the same memory (1.66 GB) at the end, but one of the two looks around 15~20% faster.
My question is: is this speed increase a fluke or real? And if real, why is it so?
library(pacman)
p_load(reprex, data.table, magrittr, pryr)
mem_used()
#> 53.7 MB
# initialise a large data.table
dt1 <- data.table(x = 1:5e7, y = 5e7:1)
mem_used()
#> 454 MB
Method 1
system.time(dt2 <- dt1[, a := log(x)][, b := log(y)][, c := a + b])
#> user system elapsed
#> 1.379 0.589 1.968
mem_used()
#> 1.66 GB
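That 1.66 GB figure is about what the column sizes predict; a quick back-of-the-envelope check:
# x and y are integers (4 bytes each); a, b and c are doubles (8 bytes each)
2 * 5e7 * 4 / 1e9
#> [1] 0.4
3 * 5e7 * 8 / 1e9
#> [1] 1.2
0.4 GB + 1.2 GB, plus the ~54 MB baseline, is roughly the 1.66 GB reported.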
# release the memory to start again
rm(dt1,dt2)
gc()
#>            used (Mb) gc trigger   (Mb)  max used (Mb)
#> Ncells   795598 42.5    1291680   69.0   1291680   69
#> Vcells  1424205 10.9  234120480 1786.2 201449048 1537
mem_used()
#> 55.9 MB
dt1 <- data.table(x = 1:5e7, y = 5e7:1)
mem_used()
#> 456 MB
Method 2
system.time({
  dt1[, a := log(x)]
  dt1[, b := log(y)]
  dt1[, c := a + b]
})
#> user system elapsed
#> 1.207 0.472 1.679
mem_used()
#> 1.66 GB
As you can see, Method 2 is around 15~17% faster. Why?
Created on 2022-07-08 by the reprex package (v2.0.1)
Answer:
The difference isn't significant: repeating the calculation many times (default = 100) with microbenchmark
shows there is no difference
microbenchmark::microbenchmark(
  chain = {dt2 <- dt1[, a := log(x)][, b := log(y)][, c := a + b]},
  seq   = {dt1[, a := log(x)]
           dt1[, b := log(y)]
           dt1[, c := a + b]})
Unit: seconds
  expr      min       lq     mean   median       uq      max neval
 chain 3.056398 3.123273 3.207696 3.204743 3.270068 3.500883   100
   seq 3.060816 3.131185 3.208122 3.222308 3.273654 3.483277   100
A single system.time() run isn't precise enough to establish a 15~20% difference.
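This is also why no difference should be expected in the first place: := adds columns by reference in both cases, and chaining with [] just keeps operating on the same object, so neither version copies the data. A minimal sketch illustrating this with data.table's address():
library(data.table)
dt <- data.table(x = 1:10)
before <- address(dt)                  # memory address of dt
dt2 <- dt[, a := x * 2]                # := modifies dt in place; [] returns the same object
identical(before, address(dt))         # TRUE: no copy was made
identical(address(dt), address(dt2))   # TRUE: chaining returns the same object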