I am new to data.table, but I need to speed up some dplyr code, and so far I have divided the processing time by 20, so needless to say I'd like to master that library. Understand from that introduction that processing time is of the essence.
I have to modify rows using a loop because some columns are inter-connected. The use of a loop is not the matter addressed here: I will use a loop and there is no workaround, for other reasons not shown. I already know what I'll use in my code to get the results needed, because it's obviously the fastest, BUT I know there's a .SD way to do it, which would give me a deeper understanding of that library, hence I seek your advice.
So, just to be clear: consider this post as an exercise that will help me understand a subtlety in the use of .SD. I will present a simple table and a simple function (mean), but these are far from the actual data (I use a self-made windowed mean). But if anyone can get the same results by using "mean" and .SD, then the problem is solved and I will learn something I don't know yet. Sorry for the authoritative tone I use here, I just mean to be clear: I want to know what went wrong in my approach.
The very simplified table and goals are:
library(data.table)
temp <- data.table(a=c(0,10), b=c(15,25))
# initialize 1st row
temp[1, `:=`(worksA=a, worksB=b)]
# in the (not shown) loop, starting at row 2, worksA & worksB update a mean with fresh data:
temp[2, `:=`(worksA=mean(temp$a[1:2]), worksB=mean(temp$b[1:2]))]
Thus you get what I want (but note that I will use a self-made "mean" function with a rolling window, sketched just after the table, so actually using cummean will not do):
a b worksA worksB
1: 0 15 0 15
2: 10 25 5 20
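To give an idea of what I mean by a windowed mean, here is a very simplified sketch (not my real code; the name and the window size k are just placeholders):
# mean of the last k values up to row i (simplified stand-in for my actual function)
windowed_mean <- function(x, i, k=2) mean(x[max(1, i-k+1):i])
windowed_mean(temp$a, 2)   # 5, same as mean(temp$a[1:2]) in this toy case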
My first failure was:
temp[2, `:=`(tryA=mean(a[1:2]),tryB=mean(b[1:2]))]
creates NA's. I guessed that I could not use row selection to create a column by reference, so I dealt with it using "$" (the working solution shown above); still, I suspected .SD would do the trick, so trial number 2:
temp[2, c("tryA", "tryB"):=lapply(.SD[1:2], mean), .SDcols=c("a", "b")]
Same result. Funny fact: if you use na.rm:
temp[2, `:=`(tryA=mean(a[1:2], na.rm=TRUE),tryB=mean(b[1:2], na.rm=TRUE))]
or:
temp[2, c("tryA", "tryB"):=lapply(.SD[1:2], mean, na.rm=TRUE), .SDcols=c("a", "b")]
you get tryA & tryB in row 2 updated with the values of a & b from that same row, as if the mean were calculated with row 2 values only. Regarding that matter, I also tried without the row selection in the first parameter (without the "2" after the first bracket):
temp[, c("tryA", "tryB"):=lapply(.SD[1:2], mean), .SDcols=c("a", "b")]
which of course gives:
a b worksA worksB tryA tryB
1: 0 15 0 15 5 20
2: 10 25 5 20 5 20
i.e. the values I wanted printed on all rows. Better, but not what I want.
Microbenchmark tells me that my working solution is close to 20x faster than the lapply version anyway, so I give up on that. But can anyone explain why my attempts (except for the last one, which is crystal clear) were wrongly coded, and how I could have edited one row at a time with a user-defined function and .SD?
Thanks in advance
CodePudding user response:
You could use Reduce with the accumulate=T option to generate cumulative lists of each column:
library(data.table)
temp <- data.table(a=c(0,10), b=c(15,25))
temp[,lapply(.SD, function(x) Reduce(x,f=function(x,y) c(x,y),accumulate=T))]
# a b
# <list> <list>
#1: 0 15
#2: 0,10 15,25
So that you can apply any summary function to them (mean in this example) with sapply:
temp[, paste0('works', colnames(temp)) := lapply(.SD, function(x)
         sapply(Reduce(x, f=function(x,y) c(x,y), accumulate=T),
                function(x) mean(unlist(x))))][]
#> a b worksa worksb
#> <num> <num> <num> <num>
#> 1: 0 15 0 15
#> 2: 10 25 5 20
Or with a recursive mean:
recursive.mean <- function(x) tail(stats::filter(x/2,1/2),1)
temp[, paste0('works', colnames(temp)) := lapply(.SD, function(x)
         sapply(Reduce(x, f=function(x,y) c(x,y), accumulate=T),
                function(x) recursive.mean(unlist(x))))][]
# a b worksa worksb
# <num> <num> <num> <num>
#1: 0 15 0.0 3.75
#2: 10 25 2.5 6.25
CodePudding user response:
OK OK, what about a good old self-answer?
Don't take it as a narcissistic move, since I understand now how my question was... dumb. It was all a misuse of the data.table language: I wanted to edit & select at the same time.
.SD is a subset of data, so temp[2, lapply(.SD[1:2], mean), .SDcols=c("a", "b")] could not give anything but NA's, because I asked R to return several rows of the subset (.SD[1:2]) out of a single-row data.table (temp[2,). That first part ([2,) was treated by R as a selection, not a modification.
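A quick check makes that visible (just an illustration on the same temp table; it also explains why adding na.rm=TRUE fell back to the single-row values):
temp[2, a[1:2]]
# [1] 10 NA   <- inside i=2, column "a" holds a single value, so indexing 1:2 pads with NA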
temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")] gave the results I wanted, so that was the untouchable part I had to write into the table; the modification request itself has to sit one level up in the code, where the "2" should be.
Thus temp[2, c("worksA", "worksB"):=temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")]] does exactly what I meant: the first part (LHS) asks for a modification, the second part (the RHS of :=) is what I want to insert.
.SD will return an output, not edit itself in the table it is extracted from. Sooo obvious, my bad.
As for benchmarks, $-referencing turned out to be about 2.8x faster than .SD, as expected:
library(microbenchmark)
microbenchmark("$ref" = {temp[2, `:=`(worksA=mean(temp$a[1:2]), worksB=mean(temp$b[1:2]))]},
               ".SD"  = {temp[2, c("worksA", "worksB"):=temp[, lapply(.SD[1:2], mean), .SDcols=c("a", "b")]]},
               times=10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# $ref 390.601 427.2185 593.3664 479.921 558.4865 38580.97 10000 a
# .SD 1102.362 1224.5490 1681.0160 1373.182 1576.1190 269964.59 10000 b
Sorry for the fuss, problem solved, I know more now (pun!), let's go back to work. Thanks a lot to you all!