how to speed up a bootstrapped vector generation in R


Toy data.table

Consider this data.table

library(pacman)
p_load(data.table,magrittr,dplyr,glue)

dt <- data.table(x = c(1,3,4,5,8,12,13,20,21,25), 
           y = c(1,1,2,2,8,2,4,6,5,5),keep.rownames = T)
dt[,newval:=NA_real_]
dt[,rn:=as.integer(rownames(dt))]
dt[1,newval:=y] 
dt[,x_pre := shift(x,n = 1)]
dt[,x_nxt := shift(x,n = -1)]
setcolorder(dt,"rn")
dt[]
#>     rn  x y newval x_pre x_nxt
#>  1:  1  1 1      1    NA     3
#>  2:  2  3 1     NA     1     4
#>  3:  3  4 2     NA     3     5
#>  4:  4  5 2     NA     4     8
#>  5:  5  8 8     NA     5    12
#>  6:  6 12 2     NA     8    13
#>  7:  7 13 4     NA    12    20
#>  8:  8 20 6     NA    13    21
#>  9:  9 21 5     NA    20    25
#> 10: 10 25 5     NA    21    NA

# note the last 2 columns are simply the shifted values of x

Use of a for loop

The following is an inefficient function using a for loop in R for bootstrapping a data.table column.


# function using a for loop on each observation
 func_loop <- function(dt){
   # create a for loop for updating the newval column iteratively
   for(i in seq_len(nrow(dt))[-c((nrow(dt) - c(0:1)))]){
     dt[i + 2, newval := y] # temporary value to be erased later
     dt[, new_pre := shift(newval, n = 1)]
     dt[, new_nxt := shift(newval, n = -1)]
     # the following line of code uses the previously computed value (new_pre)
     dt[rn > 1, newval := ifelse(rn == i + 1, new_pre + (new_nxt - new_pre) * (x - x_pre) / (x_nxt - x_pre), newval)]
     dt[rn == i + 2, newval := NA_real_]
   }
   dt
 }

Call the for-loop function

 # call the function 
 func_loop(dt)[]
#>     rn  x y   newval x_pre x_nxt  new_pre  new_nxt
#>  1:  1  1 1 1.000000    NA     3       NA 1.666667
#>  2:  2  3 1 1.666667     1     4 1.000000 1.833333
#>  3:  3  4 2 1.833333     3     5 1.666667 3.375000
#>  4:  4  5 2 3.375000     4     8 1.833333 2.785714
#>  5:  5  8 8 2.785714     5    12 3.375000 3.757143
#>  6:  6 12 2 3.757143     8    13 2.785714 4.037500
#>  7:  7 13 4 4.037500    12    20 3.757143 4.879688
#>  8:  8 20 6 4.879688    13    21 4.037500       NA
#>  9:  9 21 5 4.903750    20    25 4.879688 5.000000
#> 10: 10 25 5       NA    21    NA       NA       NA

# benchmark the speed
 microbenchmark::microbenchmark(func_loop(dt))
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>  func_loop(dt) 23.00165 24.24735 26.19917 25.11379 27.11327 39.43801   100

Created on 2022-07-19 by the reprex package (v2.0.1)

As expected, this gives terrible performance: about 30 ms for 10 rows, which extrapolates to roughly 50 minutes for a million rows. I have several million rows to compute.

I am aware of the froll* family of functions and use them extensively, but I am unable to apply frollapply here, since this algorithm depends on the previously computed value.

I have also tried data.table::set, and that does not reduce the time drastically, because we still have to call dt[] repeatedly, which is an expensive call. See Henrik's comments below.

I am looking to improve the performance by an order of magnitude or more, not just by 20 or 40%. With a good vectorised algorithm I would expect roughly 1/10th to 1/50th of the current response time.
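
To make the dependency explicit, here is the recurrence the loop computes, written as a plain base-R sketch over the toy table (recur_interp is a hypothetical helper, not part of the original post; each value needs the one computed before it, which is what rules out frollapply):

# plain-R sketch of the recurrence computed by func_loop() above
recur_interp <- function(x, y) {
  n <- length(x)
  newval <- rep(NA_real_, n)
  newval[1] <- y[1]                    # first value is simply y
  for (i in 2:(n - 1)) {
    # each entry depends on the entry computed in the previous iteration
    newval[i] <- newval[i - 1] +
      (y[i + 1] - newval[i - 1]) * (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
  }
  newval                               # last entry stays NA
}
recur_interp(dt$x, dt$y)               # reproduces the newval column shown above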

CodePudding user response:

A simple Rcpp function will be much faster.

library(data.table)

Rcpp::cppFunction(
  "NumericVector iterInterp(const NumericVector& x, const NumericVector& y) {
    const int n = x.size();
    NumericVector newval(n);
    newval(0) = y(0);
    newval(n - 1) = NA_REAL;
    
    for (int i = 1; i < n - 1; i++) {
      newval(i) = newval(i - 1) + (y(i + 1) - newval(i - 1))*(x(i) - x(i - 1))/(x(i + 1) - x(i - 1));
    }
    
    return newval;
  }"
)

dt <- data.table(
  x = c(1,3,4,5,8,12,13,20,21,25),
  y = c(1,1,2,2,8,2,4,6,5,5)
)

microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: microseconds
#>        expr   min    lq    mean median    uq   max neval
#>  iterInterp 153.5 156.9 164.894  159.7 163.8 391.8   100

dt
#>     x y   newval
#> 1   1 1 1.000000
#> 2   3 1 1.666667
#> 3   4 2 1.833333
#> 4   5 2 3.375000
#> 5   8 8 2.785714
#> 6  12 2 3.757143
#> 7  13 4 4.037500
#> 8  20 6 4.879688
#> 9  21 5 4.903750
#> 10 25 5       NA

Naively extrapolating the roughly 160 µs per call, that comes out to just under 3 minutes for 10M rows; but most of that time is fixed overhead that does not scale with the size of the data.table, as benchmarking shows:

dt <- data.table(
  x = rep(c(1,3,4,5,8,12,13,20,21,25), 1e6) + 25*rep(0:(1e6 - 1L), each = 10),
  y = rep(c(1,1,2,2,8,2,4,6,5,5), 1e6)
)

microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: milliseconds
#>        expr     min       lq     mean   median       uq     max neval
#>  iterInterp 157.585 159.0541 178.3298 168.0882 172.2245 274.102   100

That's a fraction of a second for 10M rows.
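
As a quick sanity check (a sketch: it rebuilds the question's prepared 10-row table so that the question's func_loop() can run on it, then compares the two results):

# rebuild the prepared toy table from the question
dt_small <- data.table(x = c(1,3,4,5,8,12,13,20,21,25),
                       y = c(1,1,2,2,8,2,4,6,5,5))
dt_small[, `:=`(newval = NA_real_, rn = .I)]
dt_small[1, newval := y]
dt_small[, `:=`(x_pre = shift(x, n = 1), x_nxt = shift(x, n = -1))]

# the Rcpp column agrees with the loop version (NA positions included)
all.equal(func_loop(copy(dt_small))$newval,
          iterInterp(dt_small$x, dt_small$y))   # TRUE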

CodePudding user response:

This frollapply solution (using the vol column and window width n from the full data in the original question) seems to get quite close to the desired output:

dt[,newval_roll:=shift(frollapply(vol,n=n,align = 'center',FUN=median,fill=NA),-1)]

dt$newval==dt$newval_roll
#[1]    NA    NA    NA    NA FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[19]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[55]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[91]  TRUE  TRUE  TRUE  TRUE    NA    NA    NA    NA    NA    NA

Not sure I fully understand the approxfun bit as it applies to already existing coordinates, meaning that no approximation is needed.

If filling of the first elements is needed, you could also use zoo::rollapply, which allows partial windows (see the sketch below). Another possibility is to run the loop only for the first elements.
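
A minimal sketch of that zoo::rollapply variant (assuming the same vol column and window width n used in the frollapply line above, both of which come from the full data in the original question):

library(zoo)

# partial = TRUE evaluates the shortened windows at the edges instead of
# returning NA there, so the first elements are filled as well
dt[, newval_roll2 := shift(
  zoo::rollapply(vol, width = n, FUN = median, align = "center", partial = TRUE),
  n = -1)]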
