Calculate cumulative mean for dataset randomized 100 times

I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.

# example data
library(dplyr)

ID <- seq.int(1, 100)
val <- rnorm(100)
df <- data.frame(ID, val)

I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.

df2 <- df %>% 
       mutate(cm = cummean(val))

However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.

I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.

Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!

CodePudding user response：

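Here is an approach using the purrr package. I'm not sure what cummean() is calculating here (maybe someone can share that in the comments), so I included an alternative column, cm2, computed as cumsum(val) / seq_along(val), for comparison. Note that each iteration draws fresh values with rnorm() rather than reshuffling df.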

library(tidyverse)

set.seed(2000)

num_iterations <- 100
num_sample <- 100

1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i, 
        id = 1:num_sample, 
        val = rnorm(num_sample), 
        cm = cummean(val), 
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
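If the goal is to reshuffle the existing values rather than draw new ones, a minimal sketch under that assumption could swap rnorm() for sample(). The data frame below is only a stand-in for the question's df, and the running mean is computed as cumsum(val) / seq_along(val) so the result does not depend on how the installed dplyr version implements cummean().

```r
library(dplyr)
library(purrr)

set.seed(2000)

# Stand-in for the question's data (assumed layout: ID, val)
df <- data.frame(ID = 1:100, val = rnorm(100))

num_iterations <- 100

shuffled <- map_dfr(seq_len(num_iterations), function(i) {
  tibble(
    iteration = i,                      # which randomization this row belongs to
    ID = seq_len(nrow(df)),             # number of values included in the mean
    val = sample(df$val),               # the same values in a new random order
    cm = cumsum(val) / seq_along(val)   # running mean over the shuffled order
  )
})
```

Every iteration uses the same 100 values, so the last cm in each iteration should match mean(df$val); only the path taken to get there differs.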

CodePudding user response：

You can use mutate() with map_dfc() to add 100 columns, each holding the cummean() of a reshuffled val:

library(dplyr)
library(purrr)

df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
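For comparison, the same wide layout (one column per randomization) can be sketched in base R with replicate(). This is only an illustration: the column names iter_1, iter_2, ... are made up here, and the running mean is written out as cumsum() / seq_along() instead of calling cummean().

```r
set.seed(2000)

# Stand-in for the question's val vector
val <- rnorm(100)

# Each call to sample(val) reshuffles the values; replicate() collects the
# 100 running-mean vectors as the columns of a 100 x 100 matrix.
cm_wide <- replicate(100, cumsum(sample(val)) / seq_along(val))
colnames(cm_wide) <- paste0("iter_", seq_len(ncol(cm_wide)))

# Row i holds the mean of the first i values in each randomization
wide <- data.frame(ID = seq_along(val), cm_wide)
```

The wide form is convenient for a quick matplot(); a long form with an explicit iteration column is usually easier to work with in dplyr or ggplot2.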

CodePudding user response：

We may use rerun() from purrr (deprecated since purrr 1.0.0; map(seq_len(n), ~ f1(df, val)) is the current equivalent):

library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
  dat %>%
    sample_n(size = n()) %>%
    mutate(cm = cummean({{ valcol }}))
}
n <- 100
out <- rerun(n, f1(df, val))

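The output of rerun() is a list. bind_rows() combines it into one data frame, and its .id argument adds a column holding each element's position in the list (or its name, if the list is named), which here identifies the iteration: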

out1 <- bind_rows(out, .id = 'ID')
> head(out1)
  ID        val          cm
1  1  0.3376980  0.33769804
2  1 -1.5699384 -0.61612019
3  1  1.3387892  0.03551628
4  1  0.2409634  0.08687807
5  1  0.7373232  0.21696708
6  1 -0.8012491  0.04726439
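Since the question also mentions a plain loop, here is a minimal base R sketch of that route. The column names iteration, n, and orig_ID are chosen for illustration only, and the data frame is a stand-in for the question's df.

```r
set.seed(2000)

# Stand-in for the question's data (assumed layout: ID, val)
df <- data.frame(ID = 1:100, val = rnorm(100))

n_iter <- 100
results <- vector("list", n_iter)   # pre-allocate one slot per randomization

for (i in seq_len(n_iter)) {
  shuffled <- df[sample(nrow(df)), ]   # reorder the rows at random
  results[[i]] <- data.frame(
    iteration = i,                                        # randomization number
    n = seq_len(nrow(shuffled)),                          # values included in the mean
    orig_ID = shuffled$ID,                                # original row identifier
    val = shuffled$val,
    cm = cumsum(shuffled$val) / seq_len(nrow(shuffled))   # running mean
  )
}

# Stack the 100 data frames into one long data frame
out_long <- do.call(rbind, results)
```

Keeping n separate from orig_ID preserves both pieces of information: how many values went into each mean, and which original row landed in that position.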