Suppose I have a large data.table with about 1,000 columns and 10,000 rows, like this:
library(data.table)

dt <- data.table(col1 = runif(10^4))
for (i in 2:10^3) set(dt, j = paste0('col', i), value = dt[[i - 1]] * 0.95^(i - 1))
Think of these as representing 'daily mortality rates'. I want to calculate monthly survival rates, so I have the following chunk of code:
dt[, paste0('surv_rate_', 1:10^3) := Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)]
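To see why this is a row-wise cumulative product: a data.table is a list of columns, and Reduce with accumulate = TRUE keeps every intermediate result, so element k of the output is the elementwise product of the first k (transformed) columns. A tiny illustration:

Reduce('*', list(c(1, 2), c(3, 4), c(5, 6)), accumulate = TRUE)
# returns list(c(1, 2), c(3, 8), c(15, 48)) -- the running products for each "row"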
I was not able to find any benchmarking on row-wise cumulative products like the one shown above.
Can you think of any better/cleaner/faster methods for doing this in a data.table way?
(I thought of as.data.table(t(cumprod(t(dt)))) as well, but this appears to take forever with a table this size.)
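Worth noting: base cumprod() on a matrix does not operate row-wise or column-wise at all; it flattens the matrix and returns one long vector in column-major order, so the t()/cumprod()/t() route would also be incorrect. A matrix-based version needs a dedicated row-wise routine; a sketch, assuming the matrixStats package is available:

library(matrixStats)
# convert the relevant columns to a matrix, take the row-wise cumulative
# product of the transformed values, and convert back to a data.table
m    <- as.matrix(dt[, paste0('col', 1:10^3), with = FALSE])
surv <- as.data.table(rowCumprods((1 - m)^30))
setnames(surv, paste0('surv_rate_', 1:10^3))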
CodePudding user response:
As mentioned in the comments, another possible solution is to use apply along the rows:
res1 <- copy(dt)[, paste0('surv_rate_', 1:10^3) :=
                   # apply over rows; simplify = FALSE (R >= 4.1) keeps a list of
                   # per-row vectors, which transpose() turns into per-column vectors
                   transpose(apply((1 - .SD)^30, 1, cumprod, simplify = FALSE))]
res2 <- copy(dt)[, paste0('surv_rate_', 1:10^3) :=
                   Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)]
all.equal(res1, res2)
[1] TRUE
However, your solution remains about 30% faster:
Unit: milliseconds
                                                                                                               expr       min        lq      mean    median        uq      max neval
      copy(dt)[, `:=`(paste0("surv_rate_", 1:10^3), transpose(apply((1-.SD)^30, 1, cumprod, simplify = F)))] 1011.7095 1246.993 1743.854 1546.797 2119.166 2772.646    10
 copy(dt)[, `:=`(paste0("surv_rate_", 1:10^3), Reduce("*", (1-dt[, paste0("col", 1:10^3)])^30, accumulate = T))]  793.5415 1046.194 1314.569 1249.636 1405.414 2496.858    10
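For reference, a timing like the one above can be produced with the microbenchmark package. A sketch (the exact call isn't shown in the output; times = 10 matches the neval column):

library(microbenchmark)
microbenchmark(
  copy(dt)[, paste0('surv_rate_', 1:10^3) :=
             transpose(apply((1 - .SD)^30, 1, cumprod, simplify = FALSE))],
  copy(dt)[, paste0('surv_rate_', 1:10^3) :=
             Reduce('*', (1 - dt[, paste0('col', 1:10^3)])^30, accumulate = TRUE)],
  times = 10
)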