Toy data.table
Consider this data.table:
library(pacman)
p_load(data.table,magrittr,dplyr,glue)
dt <- data.table(x = c(1,3,4,5,8,12,13,20,21,25),
                 y = c(1,1,2,2,8,2,4,6,5,5), keep.rownames = T)
dt[,newval:=NA_real_]
dt[,rn:=as.integer(rownames(dt))]
dt[1,newval:=y]
dt[,x_pre := shift(x,n = 1)]
dt[,x_nxt := shift(x,n = -1)]
setcolorder(dt,"rn")
dt[]
#> rn x y newval x_pre x_nxt
#> 1: 1 1 1 1 NA 3
#> 2: 2 3 1 NA 1 4
#> 3: 3 4 2 NA 3 5
#> 4: 4 5 2 NA 4 8
#> 5: 5 8 8 NA 5 12
#> 6: 6 12 2 NA 8 13
#> 7: 7 13 4 NA 12 20
#> 8: 8 20 6 NA 13 21
#> 9: 9 21 5 NA 20 25
#> 10: 10 25 5 NA 21 NA
# note the last 2 columns are simply the shifted values of x
Use of a for loop
The following is an inefficient function that uses a for loop in R to bootstrap a data.table column.
# function using a for loop on each observation
func_loop <- function(dt){
  # update the newval column iteratively, one row per loop pass
  for(i in seq_len(nrow(dt))[-c(nrow(dt) - c(0:1))]){
    dt[i + 2, newval := y] # temporary value to be erased later
    dt[, new_pre := shift(newval, n = 1)]
    dt[, new_nxt := shift(newval, n = -1)]
    # the following line uses the previously computed value (new_pre)
    dt[rn > 1, newval := ifelse(rn == i + 1,
                                new_pre + (new_nxt - new_pre) * (x - x_pre) / (x_nxt - x_pre),
                                newval)]
    dt[rn == i + 2, newval := NA_real_]
  }
  dt
}
Call the for-loop function
# call the function
func_loop(dt)[]
#> rn x y newval x_pre x_nxt new_pre new_nxt
#> 1: 1 1 1 1.000000 NA 3 NA 1.666667
#> 2: 2 3 1 1.666667 1 4 1.000000 1.833333
#> 3: 3 4 2 1.833333 3 5 1.666667 3.375000
#> 4: 4 5 2 3.375000 4 8 1.833333 2.785714
#> 5: 5 8 8 2.785714 5 12 3.375000 3.757143
#> 6: 6 12 2 3.757143 8 13 2.785714 4.037500
#> 7: 7 13 4 4.037500 12 20 3.757143 4.879688
#> 8: 8 20 6 4.879688 13 21 4.037500 NA
#> 9: 9 21 5 4.903750 20 25 4.879688 5.000000
#> 10: 10 25 5 NA 21 NA NA NA
# benchmark the speed
microbenchmark::microbenchmark(func_loop(dt))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> func_loop(dt) 23.00165 24.24735 26.19917 25.11379 27.11327 39.43801 100
Created on 2022-07-19 by the reprex package (v2.0.1)
As expected, this gives terrible performance: about 30 ms for just 10 rows, which extrapolates to roughly 50 minutes for a million rows. I have several million rows to compute.
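The 50-minute figure is simply linear scaling of the benchmark above:
# ~30 ms / 10 rows  = ~3 ms per row
# 1e6 rows * 3 ms per row = 3000 s, i.e. about 50 minutes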
I am aware of the froll* family of functions and use them extensively, but I am unable to apply frollapply here, since the algorithm depends on previously computed values: each newval is built from the newval of the preceding row, not just from the raw inputs.
I have also tried data.table::set, and that doesn't reduce the time drastically, because dt[] still has to be called repeatedly inside the loop, which is an expensive call. See Henrik's comments below.
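For reference, a set()-based rewrite of one iteration might look roughly like the sketch below. This is a reconstruction for illustration, not the code actually tried; the shift() updates still go through [.data.table, which is where most of the time goes.
# hypothetical set()-based variant of func_loop() (a reconstruction, not the original attempt)
func_loop_set <- function(dt){
  for(i in seq_len(nrow(dt))[-c(nrow(dt) - c(0:1))]){
    set(dt, i = i + 2L, j = "newval", value = dt$y[i + 2])  # temporary value
    dt[, new_pre := shift(newval, n = 1)]                   # still an expensive [.data.table call
    dt[, new_nxt := shift(newval, n = -1)]
    set(dt, i = i + 1L, j = "newval",
        value = dt$new_pre[i + 1] +
          (dt$new_nxt[i + 1] - dt$new_pre[i + 1]) *
          (dt$x[i + 1] - dt$x_pre[i + 1]) / (dt$x_nxt[i + 1] - dt$x_pre[i + 1]))
    set(dt, i = i + 2L, j = "newval", value = NA_real_)     # erase the temporary value
  }
  dt
}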
I am looking to improve the performance by several orders of magnitude, not just by 20 or 40%. With a good vectorised algorithm I would expect 1/10th to 1/50th of the current run times.
CodePudding user response:
A simple Rcpp function will be much faster.
library(data.table)
Rcpp::cppFunction(
  "NumericVector iterInterp(const NumericVector& x, const NumericVector& y) {
    const int n = x.size();
    NumericVector newval(n);
    newval(0) = y(0);
    newval(n - 1) = NA_REAL;
    for (int i = 1; i < n - 1; i++) {
      newval(i) = newval(i - 1) + (y(i + 1) - newval(i - 1)) * (x(i) - x(i - 1)) / (x(i + 1) - x(i - 1));
    }
    return newval;
  }"
)
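For readers who want to check the recurrence without a compiler, a direct base-R translation is sketched below (my own sketch, not part of the answer). It is still an interpreted O(n) loop, so it will not match the Rcpp speed, but it makes the dependency on the previously computed value explicit and can be used to verify results on small inputs.
# plain-R sketch of the same recurrence, for verification only (assumes length(x) >= 3)
iterInterpR <- function(x, y) {
  n <- length(x)
  newval <- numeric(n)
  newval[1] <- y[1]
  newval[n] <- NA_real_
  for (i in 2:(n - 1)) {
    newval[i] <- newval[i - 1] +
      (y[i + 1] - newval[i - 1]) * (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
  }
  newval
}
# e.g. all.equal(iterInterpR(dt$x, dt$y), iterInterp(dt$x, dt$y)) should be TRUE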
dt <- data.table(
x = c(1,3,4,5,8,12,13,20,21,25),
y = c(1,1,2,2,8,2,4,6,5,5)
)
microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> iterInterp 153.5 156.9 164.894 159.7 163.8 391.8 100
dt
#> x y newval
#> 1 1 1 1.000000
#> 2 3 1 1.666667
#> 3 4 2 1.833333
#> 4 5 2 3.375000
#> 5 8 8 2.785714
#> 6 12 2 3.757143
#> 7 13 4 4.037500
#> 8 20 6 4.879688
#> 9 21 5 4.903750
#> 10 25 5 NA
Scaled linearly, that comes out to under 3 minutes for 10M rows, except the overhead does not scale with the size of the data.table, as the following benchmark shows:
dt <- data.table(
x = rep(c(1,3,4,5,8,12,13,20,21,25), 1e6) + 25*rep(0:(1e6 - 1L), each = 10),
y = rep(c(1,1,2,2,8,2,4,6,5,5), 1e6)
)
microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> iterInterp 157.585 159.0541 178.3298 168.0882 172.2245 274.102 100
That's a fraction of a second for 10M rows.
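Putting the two benchmarks side by side (simple arithmetic on the timings reported above):
# func_loop():  ~30 ms  / 10 rows   ≈ 3 ms per row
# iterInterp(): ~170 ms / 1e7 rows  ≈ 17 ns per row
# i.e. a speed-up on the order of 10^5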
CodePudding user response:
This frollapply solution seems to get quite close to the desired output:
dt[,newval_roll:=shift(frollapply(vol,n=n,align = 'center',FUN=median,fill=NA),-1)]
dt$newval==dt$newval_roll
#[1] NA NA NA NA FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[55] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[91] TRUE TRUE TRUE TRUE NA NA NA NA NA NA
I am not sure I fully understand the approxfun bit, as it applies to already existing coordinates, meaning that no approximation is needed.
If filling of the first elements is needed, you could also use zoo::rollapply, which allows a partial calculation. Another possibility is to run the loop only for the first elements.
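For illustration, a partial rolling calculation might look roughly like this (a sketch only; it reuses the vol column and window width n assumed in the frollapply code above, which come from the original question rather than the toy data shown here):
# sketch: rolling median with partial windows at the edges, then shifted as before
dt[, newval_roll_partial := shift(
  zoo::rollapply(vol, width = n, FUN = median, align = "center",
                 partial = TRUE, fill = NA),
  -1)]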