I'm trying to speed up this calculation and at least get rid of the inner loop, because it's really slow over a lot of ids. For each ID, I have a vector of "x" values and I'm trying to calculate "y" values that depend on the current value of "x" and the previous value of "y". In other words, for each ID, y[1] = x[1], and for subsequent rows, y[i] = y[i-1]*x[i].
Can anyone help me speed this up? I have tried to find solutions but can't figure it out.
Here is some reproducible code that does what I want, but slowly:
#Calculate y[i]=y[i-1]*x[i]
#Make some dummy data
dat <- data.frame(id = c(rep(1,10), rep(2,10)),
                  x = c(0.95,1,1,1,0.98,0.96,0.93,1,0.94,0.9,
                        0.97,1,1,0.94,0.93,1,1,0.97,0.99,0.95),
                  y = rep(0,20))
#Calculate y[i]=x[i]*y[i-1]
idx <- 1
#Loop over individuals
ids <- unique(dat$id)
for (j in 1:length(ids)){
  #Extract data for individual j
  temp <- dat[dat$id == ids[j], ]
  #Initialize y[1]=x[1]
  temp$y[1] <- temp$x[1]
  #Calculate y[i]=x[i]*y[i-1] for remaining rows starting with row 2
  for (i in 2:nrow(temp)){
    temp$y[i] <- temp$x[i]*temp$y[i-1]
  }
  #Store results in "dat"
  dat$y[idx:(idx + nrow(temp) - 1)] <- temp$y
  idx <- idx + nrow(temp)
  rm(temp)
}
CodePudding user response:
Here's an option where you pivot the data wider so the calculation can be done for all IDs at once: you only loop over the observations within an ID, not over each ID as well.
library(dplyr)
library(tidyr)
library(microbenchmark)

# Reshape x into a matrix with one column per id
xmat <- dat %>%
  group_by(id) %>%
  mutate(obs = seq_along(id)) %>%
  select(-y) %>%
  pivot_wider(names_from = "id", values_from = "x", names_sep = "_") %>%
  as.matrix() %>%
  .[, -1]   # drop the obs column
# Row 1 of y is row 1 of x; each later row is the previous y row times the current x row
ymat <- xmat[1, , drop = FALSE]
for(i in 2:nrow(xmat)){
  ymat <- rbind(ymat, xmat[i, ] * ymat[(i - 1), ])
}
colnames(ymat) <- paste0("y_", colnames(ymat))
colnames(xmat) <- paste0("x_", colnames(xmat))
mats <- as.data.frame(cbind(xmat, ymat))
mats %>% pivot_longer(everything(), names_pattern="(.*)_(.*)", names_to=c(".value", "id"))
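If you want to confirm the reshaped result matches the loop output, a quick check (not part of the original answer, and assuming dat$y has already been filled in by the loop from the question) is to pivot back to long format and compare the y columns:
# Sanity check: the pivoted result should reproduce the loop result
long <- mats %>%
  pivot_longer(everything(),
               names_pattern = "(.*)_(.*)",
               names_to = c(".value", "id")) %>%
  arrange(as.numeric(id))
all.equal(long$y, dat$y)   # should be TRUE, up to floating-point noise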
Whether and how much this speeds things up depends on how much data you have. Here are some benchmarks, where f2() (labeled loops) wraps your original code and f1() (labeled pivot) wraps the code above. With your original 20-row data, pivoting is much slower:
microbenchmark(pivot=f1(), loops=f2(), times = 250)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# pivot 8861.625 9032.000 9734.104 9164.167 9768.792 25119.918 250 b
# loops 223.293 235.334 258.029 261.334 271.876 398.293 250 a
However, as you get more data it gets faster relative to the loops. Here's with 100 IDs and 20 observations per ID:
microbenchmark(pivot=f1(), loops=f2(), times = 25)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# pivot 11.13217 12.22788 14.35621 12.58250 13.08083 35.59029 25 a
# loops 17.23871 18.42488 18.60614 18.54304 19.08754 20.24063 25 b
Here, there is not much gain, but the pivoting way is slightly faster on average. With 1000 IDs and 20 obs per ID, the pivoting way is much faster:
microbenchmark(pivot=f1(), loops=f2(), times = 25)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# pivot 30.25879 31.13842 32.52211 31.31271 31.97708 46.16475 25 a
# loops 270.93421 292.02204 310.56463 293.39604 306.38221 605.06092 25 b
With 1000 IDs the pivoting way is roughly 10x faster than the loops. All this is to say that if you've got tons of data, the pivoting way might make sense.
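For reference, the larger test datasets could be generated along these lines; the exact generation code isn't shown here, so treat this as a sketch (n_id, n_obs and the runif() range are assumptions):
# Sketch of the larger dummy data used for benchmarking (assumed, not from the answer)
set.seed(1)
n_id  <- 1000    # 100 or 1000 IDs in the benchmarks above
n_obs <- 20      # observations per ID
dat <- data.frame(id = rep(seq_len(n_id), each = n_obs),
                  x  = runif(n_id * n_obs, 0.9, 1),
                  y  = 0)
# f1() and f2() in the microbenchmark() calls simply wrap the pivot code above and
# the original double loop, run against this larger dat.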
CodePudding user response:
library(tidyverse)
dat <- data.frame(id = c(rep(1,10), rep(2,10)),
                  x = c(0.95,1,1,1,0.98,0.96,0.93,1,0.94,0.9,
                        0.97,1,1,0.94,0.93,1,1,0.97,0.99,0.95),
                  y = rep(0,20)) %>%
  as_tibble()
dat %>%
  group_by(id) %>%
  mutate(y = case_when(row_number() == 1 ~ x,
                       TRUE ~ cumprod(y + x)))
#> # A tibble: 20 × 3
#> # Groups: id [2]
#> id x y
#> <dbl> <dbl> <dbl>
#> 1 1 0.95 0.95
#> 2 1 1 0.95
#> 3 1 1 0.95
#> 4 1 1 0.95
#> 5 1 0.98 0.931
#> 6 1 0.96 0.894
#> 7 1 0.93 0.831
#> 8 1 1 0.831
#> 9 1 0.94 0.781
#> 10 1 0.9 0.703
#> 11 2 0.97 0.97
#> 12 2 1 0.97
#> 13 2 1 0.97
#> 14 2 0.94 0.912
#> 15 2 0.93 0.848
#> 16 2 1 0.848
#> 17 2 1 0.848
#> 18 2 0.97 0.823
#> 19 2 0.99 0.814
#> 20 2 0.95 0.774
Created on 2022-06-29 by the reprex package (v2.0.1)
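A small aside, not part of the answer above: because y is initialised to 0, cumprod(y + x) is just the cumulative product of x within each id, so the same values can be reproduced without dplyr using base R's ave():
# Base-R cross-check (a sketch): grouped cumulative product of x within each id
y_base <- ave(dat$x, dat$id, FUN = cumprod)
head(round(y_base, 3))   # 0.950 0.950 0.950 0.950 0.931 0.894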