I have a model with multiple conditions that returns a value on which the next prediction depends. Let's say that, given time series A and B, the model returns a value of variable C, which in turn is used to estimate a value of D. In the next iteration, along with the new A and B, the model also uses the estimated D as input:
df <- data.frame(A = sample(-5:5, 10000, replace = TRUE),
                 B = sample(-5:5, 10000, replace = TRUE),
                 C = 0,
                 D = 0)
for (i in 1:nrow(df)) {
  if (df$A[i] < 0 & df$B[i] > 0) {
    df$C[i] <- df$B[i]
  } else if (df$A[i] == 0 & df$B[i] == 0) {
    df$C[i] <- 0
  } else {
    df$C[i] <- df$A[i] + df$B[i] - df$D[i]
  }
  df$D[i + 1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i]) # this is a cumulative sum-reset function
}
Though the code works, it is very slow, since I have hundreds of thousands of observations. I would appreciate any suggestion that could speed it up.
CodePudding user response:
Since each row depends on the result of the previous row, this is difficult to write in a way that takes advantage of R's vectorization. In cases like this, we get a massive advantage by writing the code in Rcpp.
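To see why, note that if the D column were already known, C could be computed in a single vectorized call; it is the feedback of each C[i] into the next row's D that forces sequential evaluation. A quick sketch, using the df built above:

## If D were known in advance, C would vectorize cleanly:
C_vec <- with(df, ifelse(A < 0 & B > 0, B,
              ifelse(A == 0 & B == 0, 0, A + B - D)))
## But D[i + 1] depends on C[i] and D[i], so D (and hence C)
## has to be computed one row at a time.

The loop itself, however, translates almost line for line into C++: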
library(Rcpp)
cppFunction('
DataFrame f_Rcpp(DataFrame df) {
  // C and D are double columns, so these NumericVectors share memory
  // with df and the loop updates it in place.
  NumericVector A = df["A"];
  NumericVector B = df["B"];
  NumericVector C = df["C"];
  NumericVector D = df["D"];
  for (int i = 0; i < (df.nrows() - 1); ++i) {
    if (A[i] < 0 && B[i] > 0) {
      C[i] = B[i];
    } else if (A[i] == 0 && B[i] == 0) {
      C[i] = 0;
    } else {
      C[i] = A[i] + B[i] - D[i];
    }
    // Cumulative sum of C with a reset to zero, as in the R loop.
    if (D[i] <= -C[i]) {
      D[i + 1] = 0;
    } else {
      D[i + 1] = D[i] + C[i];
    }
  }
  return df;
}
')
If we wrap your own code as a function so we can compare it, we see that our Rcpp function gives the same results:
f_R <- function(df) {
  for (i in 1:(nrow(df) - 1)) {
    if (df$A[i] < 0 & df$B[i] > 0) {
      df$C[i] <- df$B[i]
    } else if (df$A[i] == 0 & df$B[i] == 0) {
      df$C[i] <- 0
    } else {
      df$C[i] <- df$A[i] + df$B[i] - df$D[i]
    }
    df$D[i + 1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])
  }
  return(df)
}
res1 <- f_R(df)
res2 <- f_Rcpp(df)
identical(res1, res2)
#> [1] TRUE
But look what happens when we benchmark:
microbenchmark::microbenchmark(f_R(df), f_Rcpp(df), times = 10)
#> Unit: microseconds
#>        expr         min        lq         mean      median          uq         max neval cld
#>     f_R(df) 1746032.401 1793779.0 1794274.9209 1802222.051 1810686.801 1815285.001    10   b
#>  f_Rcpp(df)     567.701     585.9     610.1607     601.851     642.801     650.101    10  a
The Rcpp function processes all 10,000 rows in less than a millisecond, as opposed to almost 2 seconds in base R. The Rcpp version is almost 3,000 times faster.
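As an aside, if adding a C++ toolchain is not an option, a large share of the pure-R cost comes from repeatedly indexing into the data frame inside the loop. A sketch of the same loop on plain atomic vectors, writing the columns back once at the end (not benchmarked above, but this kind of change typically speeds up the R loop by an order of magnitude or more):

f_R_vec <- function(df) {
  # Pull the columns out once; subsetting atomic vectors is much
  # cheaper than df$A[i]-style indexing on every iteration.
  A <- df$A; B <- df$B; C <- df$C; D <- df$D
  for (i in 1:(nrow(df) - 1)) {
    if (A[i] < 0 && B[i] > 0) {
      C[i] <- B[i]
    } else if (A[i] == 0 && B[i] == 0) {
      C[i] <- 0
    } else {
      C[i] <- A[i] + B[i] - D[i]
    }
    D[i + 1] <- if (D[i] <= -C[i]) 0 else D[i] + C[i]
  }
  # Write the results back into the data frame in one step.
  df$C <- C
  df$D <- D
  df
}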
CodePudding user response:
An alternative approach, if you don't mind using another library, {dplyr}. Admittedly, this alternative, while (perhaps) more readable, is about 200 times slower than @Allan Cameron's Rcpp solution.
library(dplyr)
f_dplyr <- function(df) {
  df |>
    mutate(C = ifelse(!any(A, B),
                      0,
                      ifelse(A < 0 & B > 0,
                             B,
                             A + B - D)),
           lag_C = lag(C), ## default: lag by 1
           lag_D = lag(D)) |>
    rowwise() |>
    mutate(D = ifelse(lag_D <= lag_C,
                      0,
                      sum(lag_C, lag_D, na.rm = TRUE)))
}
output:
> f_dplyr(df) |> head()
# A tibble: 6 x 6
# Rowwise:
      A     B     C     D lag_C lag_D
  <int> <int> <dbl> <dbl> <dbl> <dbl>
1    -4    -2    -6    NA    NA    NA
2    -5    -2    -6    -6    -6     0
3     3     1    -6    -6    -6     0
4     1    -2    -6    -6    -6     0
5     4    -4    -6    -6    -6     0
6     4    -3    -6    -6    -6     0
speed:
> microbenchmark(f_dplyr(df), times = 10)
Unit: milliseconds
        expr      min       lq     mean   median      uq      max neval
 f_dplyr(df) 112.5365 115.7435 122.5075 122.0079 127.432 136.4511    10
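For completeness, the same recurrence can also be expressed as a functional scan, e.g. with purrr::accumulate2(), which threads D through as the accumulator. A sketch on the same df as above, untimed here:

library(purrr)

## One step of the recurrence: given D[i], A[i] and B[i],
## compute C[i] and return D[i + 1].
step <- function(d, a, b) {
  c_val <- if (a < 0 && b > 0) b else if (a == 0 && b == 0) 0 else a + b - d
  if (d <= -c_val) 0 else d + c_val
}

## .init = 0 supplies D[1]; drop the final element so D has nrow(df) values.
D_scan <- head(unlist(accumulate2(df$A, df$B, step, .init = 0)), nrow(df))

## With D known, C follows in a single vectorized pass.
C_scan <- ifelse(df$A < 0 & df$B > 0, df$B,
          ifelse(df$A == 0 & df$B == 0, 0, df$A + df$B - D_scan))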