I'm trying to speed up this calculation and at least get rid of the inner loop, because it's really slow over a lot of ids. For each ID, I have a vector of "x" values and I'm trying to calculate "y" values that depend on the current value of "x" and the previous value of "y". In other words, for each ID, y[1] = x[1], and for subsequent rows, y[i] = y[i-1]*x[i].
Can anyone help me speed this up? I have tried to find solutions but can't figure it out.
Here is some reproducible code that does what I want, but slowly:
#Calculate y[i]=y[i-1]*x[i]
#Make some dummy data
dat <- data.frame(id = c(rep(1,10), rep(2,10)),
                  x = c(0.95,1,1,1,0.98,0.96,0.93,1,0.94,0.9,
                        0.97,1,1,0.94,0.93,1,1,0.97,0.99,0.95),
                  y = rep(0,20))
#Calculate y[i]=x[i]*y[i-1]
idx <- 1
#Loop over individuals
ids <- unique(dat$id)
for (j in 1:length(ids)){
  #Extract data for individual j
  temp <- dat[dat$id == ids[j], ]
  #Initialize y[1]=x[1]
  temp$y[1] <- temp$x[1]
  #Calculate y[i]=x[i]*y[i-1] for remaining rows starting with row 2
  for (i in 2:nrow(temp)){
    temp$y[i] <- temp$x[i]*temp$y[i-1]
  }
  #Store results in "dat"
  dat$y[idx:(idx + nrow(temp) - 1)] <- temp$y
  idx <- idx + nrow(temp)
  rm(temp)
}
CodePudding user response:
Here's an option where you pivot the data wider so the calculation can be done for all IDs at once: you only loop over the observations within an ID, not over each ID as well.
library(dplyr)
library(tidyr)
library(microbenchmark)

# Reshape x into a matrix with one column per id
xmat <- dat %>%
  group_by(id) %>%
  mutate(obs = seq_along(id)) %>%
  select(-y) %>%
  pivot_wider(names_from = "id", values_from = "x", names_sep = "_") %>%
  as.matrix() %>%
  .[, -1]   # drop the obs column
# Row 1 of y is row 1 of x; each later row is the previous y row times the current x row
ymat <- xmat[1, , drop = FALSE]
for(i in 2:nrow(xmat)){
  ymat <- rbind(ymat, xmat[i, ] * ymat[(i - 1), ])
}
colnames(ymat) <- paste0("y_", colnames(ymat))
colnames(xmat) <- paste0("x_", colnames(xmat))
mats <- as.data.frame(cbind(xmat, ymat))
mats %>% pivot_longer(everything(), names_pattern="(.*)_(.*)", names_to=c(".value", "id"))
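If you want to confirm the reshaped result matches the loop output, a quick check (not part of the original answer, and assuming dat$y has already been filled in by the loop from the question) is to pivot back to long format and compare the y columns:
# Sanity check: the pivoted result should reproduce the loop result
long <- mats %>%
  pivot_longer(everything(),
               names_pattern = "(.*)_(.*)",
               names_to = c(".value", "id")) %>%
  arrange(as.numeric(id))
all.equal(long$y, dat$y)   # should be TRUE, up to floating-point noise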
Whether and how much this speeds things up depends on how much data you have. Here are some benchmarks, where f2() (labeled loops) wraps your original code and f1() (labeled pivot) wraps the code above. With your original 20-row data, pivoting is much slower:
microbenchmark(pivot=f1(), loops=f2(), times = 250)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# pivot 8861.625 9032.000 9734.104 9164.167 9768.792 25119.918 250 b
# loops 223.293 235.334 258.029 261.334 271.876 398.293 250 a
However, as you get more data it gets faster relative to the loops. Here's with 100 IDs and 20 observations per ID:
microbenchmark(pivot=f1(), loops=f2(), times = 25)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# pivot 11.13217 12.22788 14.35621 12.58250 13.08083 35.59029 25 a
# loops 17.23871 18.42488 18.60614 18.54304 19.08754 20.24063 25 b
Here, there is not much gain, but the pivoting way is slightly faster on average. With 1000 IDs and 20 obs per ID, the pivoting way is much faster:
microbenchmark(pivot=f1(), loops=f2(), times = 25)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# pivot 30.25879 31.13842 32.52211 31.31271 31.97708 46.16475 25 a
# loops 270.93421 292.02204 310.56463 293.39604 306.38221 605.06092 25 b
With 1000 IDs the pivoting way is roughly 10x faster than the loops. All this is to say that if you've got tons of data, the pivoting way might make sense.
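For reference, the larger test datasets could be generated along these lines; the exact generation code isn't shown here, so treat this as a sketch (n_id, n_obs and the runif() range are assumptions):
# Sketch of the larger dummy data used for benchmarking (assumed, not from the answer)
set.seed(1)
n_id  <- 1000    # 100 or 1000 IDs in the benchmarks above
n_obs <- 20      # observations per ID
dat <- data.frame(id = rep(seq_len(n_id), each = n_obs),
                  x  = runif(n_id * n_obs, 0.9, 1),
                  y  = 0)
# f1() and f2() in the microbenchmark() calls simply wrap the pivot code above and
# the original double loop, run against this larger dat.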
CodePudding user response:
library(tidyverse)
dat <- data.frame(id = c(rep(1,10), rep(2,10)),
                  x = c(0.95,1,1,1,0.98,0.96,0.93,1,0.94,0.9,
                        0.97,1,1,0.94,0.93,1,1,0.97,0.99,0.95),
                  y = rep(0,20)) %>%
  as_tibble()
dat %>%
  group_by(id) %>%
  mutate(y = case_when(row_number() == 1 ~ x,
                       TRUE ~ cumprod(y + x)))
#> # A tibble: 20 × 3
#> # Groups: id [2]
#> id x y
#> <dbl> <dbl> <dbl>
#> 1 1 0.95 0.95
#> 2 1 1 0.95
#> 3 1 1 0.95
#> 4 1 1 0.95
#> 5 1 0.98 0.931
#> 6 1 0.96 0.894
#> 7 1 0.93 0.831
#> 8 1 1 0.831
#> 9 1 0.94 0.781
#> 10 1 0.9 0.703
#> 11 2 0.97 0.97
#> 12 2 1 0.97
#> 13 2 1 0.97
#> 14 2 0.94 0.912
#> 15 2 0.93 0.848
#> 16 2 1 0.848
#> 17 2 1 0.848
#> 18 2 0.97 0.823
#> 19 2 0.99 0.814
#> 20 2 0.95 0.774
Created on 2022-06-29 by the reprex package (v2.0.1)
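A small aside, not part of the answer above: because y is initialised to 0, cumprod(y + x) is just the cumulative product of x within each id, so the same values can be reproduced without dplyr using base R's ave():
# Base-R cross-check (a sketch): grouped cumulative product of x within each id
y_base <- ave(dat$x, dat$id, FUN = cumprod)
head(round(y_base, 3))   # 0.950 0.950 0.950 0.950 0.931 0.894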