I am new to data.table, but I need to speed up some dplyr code, and so far I have divided the processing time by 20, so needless to say I'd like to master that library. Understand from that introduction that processing time is of the essence.
I have to modify rows using a loop because some columns are inter-connected. The use of a loop is not the matter addressed here: I will use a loop and there is no workaround, for other reasons not shown. I already know what I'll use in my code to get the results needed, because it's obviously the fastest, BUT I know there's a .SD way to do it, which would give me a deeper understanding of that library, hence I seek your advice.
So, just to be clear: consider this post as an exercise that will help me understand a subtlety in the use of .SD. I will present a simple table and a simple function (mean), but these are far from the actual data (I use a self-made windowed mean). But if anyone can get the same results by using "mean" and .SD, then the problem is solved and I will learn something I don't know yet. Sorry for the authoritative tone I use here, I just mean to be clear: I want to know what went wrong in my approach.
The very simplified table and goals are:
library(data.table)
temp <- data.table(a=c(0,10), b=c(15,25))
# initialize 1st row
temp[1, `:=`(worksA=a, worksB=b)]
# in the (not shown) loop, starting at row 2, worksA & worksB update a mean with fresh data:
temp[2, `:=`(worksA=mean(temp$a[1:2]), worksB=mean(temp$b[1:2]))]
Thus you get what I want (but note that I will use a self-made "mean" function with a rolling window, sketched just after the table, so actually using cummean will not do):
a b worksA worksB
1: 0 15 0 15
2: 10 25 5 20
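To give an idea of what I mean by a windowed mean, here is a very simplified sketch (not my real code; the name and the window size k are just placeholders):
# mean of the last k values up to row i (simplified stand-in for my actual function)
windowed_mean <- function(x, i, k=2) mean(x[max(1, i-k+1):i])
windowed_mean(temp$a, 2)   # 5, same as mean(temp$a[1:2]) in this toy case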
My first failure was:
temp[2, `:=`(tryA=mean(a[1:2]),tryB=mean(b[1:2]))]
creates NA's. I guessed that I could not use row selection to create a column by reference, so I dealt with it using "$" (the working solution shown above); still, I suspected .SD would do the trick, so trial number 2:
temp[2, c("tryA", "tryB"):=lapply(.SD[1:2], mean), .SDcols=c("a", "b")]
Same result. Funny fact: if you use na.rm:
temp[2, `:=`(tryA=mean(a[1:2], na.rm=TRUE),tryB=mean(b[1:2], na.rm=TRUE))]
or:
temp[2, c("tryA", "tryB"):=lapply(.SD[1:2], mean, na.rm=TRUE), .SDcols=c("a", "b")]
you get tryA & tryB in row 2 updated with the values of a & b from that same row, as if the mean were calculated with row 2 values only. Regarding that matter, I also tried without the row selection in the first parameter (without the "2" after the first bracket):
temp[, c("tryA", "tryB"):=lapply(.SD[1:2], mean), .SDcols=c("a", "b")]
which of course gives:
a b worksA worksB tryA tryB
1: 0 15 0 15 5 20
2: 10 25 5 20 5 20
i.e. the values I wanted printed on all rows. Better, but not what I want.
Microbenchmark tells me that my working solution is close to 20x faster than the lapply version anyway, so I give up on that. But can anyone explain why my attempts (except for the last one, which is crystal clear) were wrongly coded, and how I could have edited one row at a time with a user-defined function and .SD?
Thanks in advance
CodePudding user response:
You could use Reduce with the accumulate=T option to generate cumulative lists of each column:
library(data.table)
temp <- data.table(a=c(0,10), b=c(15,25))
temp[,lapply(.SD, function(x) Reduce(x,f=function(x,y) c(x,y),accumulate=T))]
# a b
# <list> <list>
#1: 0 15
#2: 0,10 15,25
So that you can apply any summary function to them (mean in this example) with sapply:
temp[, paste0('works', colnames(temp)) := lapply(.SD, function(x)
         sapply(Reduce(x, f=function(x,y) c(x,y), accumulate=T),
                function(x) mean(unlist(x))))][]
#> a b worksa worksb
#> <num> <num> <num> <num>
#> 1: 0 15 0 15
#> 2: 10 25 5 20
Or with a recursive mean:
recursive.mean <- function(x) tail(stats::filter(x/2,1/2),1)
temp[, paste0('works', colnames(temp)) := lapply(.SD, function(x)
         sapply(Reduce(x, f=function(x,y) c(x,y), accumulate=T),
                function(x) recursive.mean(unlist(x))))][]
# a b worksa worksb
# <num> <num> <num> <num>
#1: 0 15 0.0 3.75
#2: 10 25 2.5 6.25
CodePudding user response:
OK OK, what about a good old self-answer?
Don't take it as a narcissistic move, since I understand now how my question was... dumb. It was all a misuse of the data.table language: I wanted to edit & select at the same time.
.SD is a subset of data, so temp[2, lapply(.SD[1:2], mean), .SDcols=c("a", "b")] could not give anything but NA's, because I asked R to return several rows of the subset (.SD[1:2]) out of a single-row data.table (temp[2,). That first part ([2,) was treated by R as a selection, not a modification.
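A quick check makes that visible (just an illustration on the same temp table; it also explains why adding na.rm=TRUE fell back to the single-row values):
temp[2, a[1:2]]
# [1] 10 NA   <- inside i=2, column "a" holds a single value, so indexing 1:2 pads with NA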
temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")] gave the results I wanted, so that was the untouchable part I had to write into the table; the modification request itself has to sit one level up in the code, where the "2" should be.
Thus temp[2, c("worksA", "worksB"):=temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")]] does exactly what I meant: the first part (LHS) asks for a modification, the second part (the RHS of :=) is what I want to insert.
.SD will return an output, not edit itself in the table it is extracted from. Sooo obvious, my bad.
As for benchmarks, $-referencing turned out to be about 2.8x faster than .SD, as expected:
library(microbenchmark)
microbenchmark("$ref" = {temp[2, `:=`(worksA=mean(temp$a[1:2]), worksB=mean(temp$b[1:2]))]},
               ".SD"  = {temp[2, c("worksA", "worksB"):=temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")]]},
               times=10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# $ref 390.601 427.2185 593.3664 479.921 558.4865 38580.97 10000 a
# .SD 1102.362 1224.5490 1681.0160 1373.182 1576.1190 269964.59 10000 b
Sorry for the fuss, problem solved, I know more now (pun!), let's go back to work. Thanks a lot to you all!