Benchmarking a row-wise cumulative product on a large data.table


Suppose I have a large data.table with about 1,000 columns and 10,000 rows, built like this:

library(data.table)

dt <- data.table(col1 = runif(10^4))
for (i in 2:10^3) set(dt, j = paste0('col', i), value = dt[[i - 1]] * 0.95^(i - 1))

Think of these as representing 'daily mortality rates'. I want to calculate monthly survival rates, so I have the following chunk of code:

# A data.table is a list of columns, so Reduce('*', ..., accumulate = TRUE)
# multiplies column by column, which yields a row-wise cumulative product
dt[, paste0('surv_rate_', 1:10^3) := Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)]

I was not able to find any benchmarks for a row-wise cumulative product like the one shown above.

Can you think of any better/cleaner/faster methods for doing this in a data.table way?

(I also thought of as.data.table(t(cumprod(t(dt)))), but this appears to take forever on a table of this size.)
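(For reference, cumprod() on a matrix returns one flattened vector rather than per-row products, so that expression would also need apply() to be correct. A minimal sketch of a working matrix-based version, still slow because it materialises the intermediate matrices:)

m <- (1 - as.matrix(dt[, paste0('col', 1:10^3)]))^30
# apply(..., 1, cumprod) returns one column per input row, hence the outer t()
res_mat <- as.data.table(t(apply(m, 1, cumprod)))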

CodePudding user response:

As mentioned in the comments, another possible solution is to apply() along rows:

# apply(..., 1, cumprod) computes the cumulative product per row;
# transpose() turns the resulting list of row vectors back into columns
res1 <- copy(dt)[, paste0('surv_rate_', 1:10^3) :=
                   transpose(apply((1 - .SD)^30, 1, cumprod, simplify = FALSE))]

res2 <- copy(dt)[, paste0('surv_rate_', 1:10^3) :=
                   Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)]

all.equal(res1, res2)
[1] TRUE
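
If converting to a plain matrix is acceptable, a further possibility (a sketch only, assuming the matrixStats package is installed; not part of the benchmark below) is matrixStats::rowCumprods(), which runs the row-wise cumulative product in compiled code:

library(matrixStats)
# Sketch: rowCumprods() computes the cumulative product along each row in C,
# avoiding the R-level loop over rows that apply() implies
res3 <- copy(dt)[, paste0('surv_rate_', 1:10^3) :=
                   as.data.table(rowCumprods((1 - as.matrix(.SD))^30))]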

However, your Reduce() solution remains about 30% faster (comparing mean times):

Unit: milliseconds
                   expr       min       lq     mean   median       uq      max neval
 apply/transpose (res1) 1011.7095 1246.993 1743.854 1546.797 2119.166 2772.646    10
          Reduce (res2)  793.5415 1046.194 1314.569 1249.636 1405.414 2496.858    10
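
(For completeness, timings like these can be reproduced with the microbenchmark package; a minimal sketch with labels of my own choosing, and results that will of course vary by machine:)

library(microbenchmark)
# Hypothetical harness for the table above; 10 runs per expression
microbenchmark(
  `apply/transpose (res1)` = copy(dt)[, paste0('surv_rate_', 1:10^3) :=
      transpose(apply((1 - .SD)^30, 1, cumprod, simplify = FALSE))],
  `Reduce (res2)` = copy(dt)[, paste0('surv_rate_', 1:10^3) :=
      Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)],
  times = 10
)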
