How to speed up an R loop with sequential operations

Time:12-27

I have a model with multiple conditions that returns a value on which the next prediction depends. Given two time series A and B, the model returns a value C, which is in turn used to estimate a value D. In the next iteration, along with the new A and B, the model also uses the estimated D as input:

df = data.frame(A = sample(-5:5, 10000, replace = TRUE),
                B = sample(-5:5, 10000, replace = TRUE),
                C = 0,
                D = 0)

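for (i in 1:(nrow(df) - 1)) {  # stop one row early so D[i + 1] stays inside the data frame

  if (df$A[i] < 0 & df$B[i] > 0) {
    df$C[i] <- df$B[i]

  } else if (df$A[i] == 0 & df$B[i] == 0) {
    df$C[i] <- 0

  } else {
    df$C[i] <- df$A[i] + df$B[i] - df$D[i]
  }

  # cumulative sum with reset: D restarts at 0 whenever D + C would be 0 or below
  df$D[i + 1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])

}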

Though the code works, it is very slow because I have hundreds of thousands of observations. I would appreciate any suggestions that could speed it up.

CodePudding user response:

Because each row depends on the result of the previous row, it is difficult to write this in a way that takes advantage of R's vectorization. In cases like this, writing the loop in C++ with Rcpp gives a massive speed advantage.

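library(Rcpp)

cppFunction('

DataFrame f_Rcpp(DataFrame df) {

  // A and B are integer columns, so these are converted copies.
  NumericVector A = df["A"];
  NumericVector B = df["B"];
  // C and D are already double, so these vectors share memory with df
  // and the loop below updates those columns in place.
  NumericVector C = df["C"];
  NumericVector D = df["D"];

  for(int i = 0; i < (df.nrows() - 1); ++i) {

    if (A[i] < 0 && B[i] > 0) {
      C[i] = B[i];

    } else if(A[i] == 0 && B[i] == 0) {
      C[i] = 0;

    } else {
      C[i] = A[i] + B[i] - D[i];
    }

    // cumulative sum with reset
    if(D[i] <= -C[i]) {
      D[i + 1] = 0;
    } else {
      D[i + 1] = D[i] + C[i];
    }
  }
  return(df);
}

')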

If we wrap your own code as a function so we can compare it, we see that our Rcpp function gives the same results:

f_R <- function(df) {
  for(i in 1:(nrow(df) - 1)) {

    if (df$A[i] < 0 & df$B[i] > 0) {
      df$C[i] <- df$B[i]

    } else if(df$A[i] == 0 & df$B[i] == 0) {
      df$C[i] <- 0

    } else {
      df$C[i] <- df$A[i] + df$B[i] - df$D[i]
    }

    df$D[i + 1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])

  }
  return(df)
}

res1 <- f_R(df)
res2 <- f_Rcpp(df)

identical(res1, res2)
#> [1] TRUE

But look what happens when we benchmark:

microbenchmark::microbenchmark(f_R(df), f_Rcpp(df), times = 10)
#> Unit: microseconds
#>       expr         min        lq         mean      median          uq         max neval cld
#>    f_R(df) 1746032.401 1793779.0 1794274.9209 1802222.051 1810686.801 1815285.001    10   b
#> f_Rcpp(df)     567.701     585.9     610.1607     601.851     642.801     650.101    10  a 

The Rcpp function processes all 10,000 rows in well under a millisecond (median about 0.6 ms), compared with about 1.8 seconds for the base R loop. That makes the Rcpp version almost 3,000 times faster.
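
Before reaching for C++, it may be worth checking how much of the cost comes from the data frame itself: each `df$C[i] <- ...` assignment goes through the data-frame replacement method, which is far more expensive than assigning into a plain vector. The following is a minimal sketch, not a benchmarked result, that runs the same recurrence on vectors extracted from the data frame and writes them back once at the end. The name `f_vec` and the small example data are assumptions made for illustration.

```r
# Hypothetical helper: same recurrence as the loop above, run on plain vectors.
f_vec <- function(df) {
  A <- df$A
  B <- df$B
  C <- df$C
  D <- df$D

  for (i in seq_len(nrow(df) - 1)) {
    if (A[i] < 0 && B[i] > 0) {
      C[i] <- B[i]
    } else if (A[i] == 0 && B[i] == 0) {
      C[i] <- 0
    } else {
      C[i] <- A[i] + B[i] - D[i]
    }
    # cumulative sum with reset: restart at 0 when D + C is not positive
    D[i + 1] <- if (D[i] <= -C[i]) 0 else D[i] + C[i]
  }

  # write the results back to the data frame once
  df$C <- C
  df$D <- D
  df
}

# Small illustrative input (assumed values, not taken from the question)
small <- data.frame(A = c(-1L, -3L, 2L, 0L, 1L),
                    B = c( 2L, -1L, 1L, 0L, 1L),
                    C = 0,
                    D = 0)
out <- f_vec(small)
```

Because the logic is unchanged, comparing `f_vec(df)` with `f_R(df)` via `identical()` or `all.equal()` is a quick sanity check before timing it with `microbenchmark`. It should not be expected to match the Rcpp version, but it avoids the per-iteration data frame overhead without needing a compiler.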

CodePudding user response:

An alternative approach, if you don't mind using another library, is {dplyr}. Admittedly, this alternative, while (perhaps) more readable, is about 200 times slower than @Allan Cameron's Rcpp solution.

library(dplyr)

f_dplyr <- function(df){
    df |>
    mutate(C = ifelse(!any(A, B),
                      0,
               ifelse(A < 0 & B > 0,
                      B,
                      A + B - D
                      )
               ),
           lag_C = lag(C), ## default: lag by 1
           lag_D = lag(D)
           ) |>
    rowwise() |>
    mutate(D = ifelse(lag_D <= lag_C,
                      0,
                      sum(lag_C, lag_D, na.rm = TRUE)
                      )
           )
}

output:

> f_dplyr(df) |> head()
# A tibble: 6 x 6
# Rowwise: 
      A     B     C     D lag_C lag_D
  <int> <int> <dbl> <dbl> <dbl> <dbl>
1    -4    -2    -6    NA    NA    NA
2    -5    -2    -6    -6    -6     0
3     3     1    -6    -6    -6     0
4     1    -2    -6    -6    -6     0
5     4    -4    -6    -6    -6     0
6     4    -3    -6    -6    -6     0
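
In the output above, C is -6 in every row, which does not match the loop versions. A likely cause is that `any()` collapses all of its arguments into a single logical value, so the outer `ifelse()` receives a length-one test and returns a single value that `mutate()` recycles across rows. In addition, `lag(D)` reads the D column as it stood before the pipeline ran (all zeros), so the running value is not carried from row to row. The sketch below, using made-up vectors, contrasts the collapsed test with an elementwise one:

```r
# Made-up inputs for illustration
A <- c(-4L, 0L, 3L)
B <- c(-2L, 0L, 1L)

# any() reduces everything to ONE value, so the test has length 1
collapsed <- !any(A != 0, B != 0)

# an elementwise comparison keeps one test per row
elementwise <- A == 0 & B == 0

length(collapsed)    # 1
length(elementwise)  # 3, one value per row
```

Even with an elementwise test, each D still depends on the D computed for the previous row, so a sequential pass (a loop or the Rcpp function) remains the natural fit for this recurrence.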

speed:

> microbenchmark::microbenchmark(f_dplyr(df), times = 10)
Unit: milliseconds
        expr      min       lq     mean   median      uq      max neval
 f_dplyr(df) 112.5365 115.7435 122.5075 122.0079 127.432 136.4511    10