Calculate cumulative mean for dataset randomized 100 times

Time:09-23

I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.

# example data
ID <- seq.int(1,100)
val <- rnorm(100)
df <- data.frame(ID, val)

I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.

df2 <- df %>% 
       mutate(cm = cummean(val))

However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.

I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.

Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!

CodePudding user response:

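Here is an approach using the purrr package. Note that it draws a fresh set of values with rnorm() in each iteration rather than reshuffling an existing data frame. Also, I'm not sure exactly what cummean is calculating (maybe someone can share that in the comments), so I included an alternative, the column cm2, as a comparison.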

library(tidyverse)

set.seed(2000)

num_iterations <- 100
num_sample <- 100

1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i, 
        id = 1:num_sample, 
        val = rnorm(num_sample), 
        cm = cummean(val), 
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
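
If the goal is to reshuffle the values already in df rather than draw new ones each time, the same pattern can be adapted. The following is only a sketch under some assumptions: df has the ID and val columns from the question, the running mean is computed as cumsum(val) / seq_along(val) (the cm2 formula above), and the name shuffled is made up for illustration.

```r
library(dplyr)

set.seed(2000)

# Example data in the shape the question describes
df <- data.frame(ID = 1:100, val = rnorm(100))

num_iterations <- 100

# One tibble per iteration, stacked into a single long data frame
shuffled <- bind_rows(lapply(seq_len(num_iterations), function(i) {
  tibble(
    iteration = i,                      # which randomization this row came from
    ID = seq_len(nrow(df)),             # how many values are in the running mean
    val = sample(df$val),               # same values, random order
    cm = cumsum(val) / seq_along(val)   # cumulative mean of the shuffled values
  )
}))
```

Each iteration uses all 100 values, so the last cm of every iteration should equal mean(df$val); only the path to that value differs between shuffles.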

CodePudding user response:

You can call map_dfc inside mutate to create 100 shuffled samples of val and apply cummean to each. This adds one new column per iteration:

library(dplyr)
library(purrr)

df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
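
The question also asks for a column identifying the iteration, which is easier to get from a long layout than from 100 separate columns. One possible way to do that, sketched here with tidyr, is to name the columns explicitly and then pivot. The iter_ prefix and the names wide and long are assumptions chosen for this example, and the running mean is computed with cumsum() / seq_along().

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(ID = 1:100, val = rnorm(100))

# Matrix with one column of cumulative means per shuffle
wide <- sapply(1:100, function(i) {
  v <- sample(df$val)
  cumsum(v) / seq_along(v)
})
colnames(wide) <- paste0("iter_", 1:100)

# Attach ID (here: the number of values in each mean) and reshape to long form
long <- bind_cols(df["ID"], as.data.frame(wide)) %>%
  pivot_longer(
    cols = -ID,
    names_to = "iteration",
    names_prefix = "iter_",
    values_to = "cm"
  ) %>%
  mutate(iteration = as.integer(iteration))
```

The result has one row per combination of ID and iteration, which is convenient for plotting all 100 running-mean curves with a grouping variable.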

CodePudding user response:

We may use rerun from purrr (note that rerun is deprecated as of purrr 1.0.0):

library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
  dat %>%
    sample_n(size = n()) %>%
    mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))

The output of rerun is a list. To combine its elements into a single data frame with a new column that identifies which list element (iteration) each row came from, use bind_rows with the .id argument:

out1 <- bind_rows(out, .id = 'ID')
> head(out1)
  ID        val          cm
1  1  0.3376980  0.33769804
2  1 -1.5699384 -0.61612019
3  1  1.3387892  0.03551628
4  1  0.2409634  0.08687807
5  1  0.7373232  0.21696708
6  1 -0.8012491  0.04726439
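
For newer purrr versions, where rerun is deprecated, a similar result can be obtained with map. This is a sketch rather than a drop-in replacement: the helper name shuffle_cummean and the columns n_values and iteration are made up for illustration, slice_sample(prop = 1) is assumed to be available (dplyr 1.0.0 or later), and the running mean is computed with cumsum() instead of cummean().

```r
library(dplyr)
library(purrr)

set.seed(2000)
df <- data.frame(ID = 1:100, val = rnorm(100))

# Shuffle all rows, then compute the running mean of the chosen column
shuffle_cummean <- function(dat, valcol) {
  dat %>%
    slice_sample(prop = 1) %>%          # all rows, in random order
    mutate(
      n_values = row_number(),          # how many values are in the mean so far
      cm = cumsum({{ valcol }}) / n_values
    )
}

n_iter <- 100
out <- map(seq_len(n_iter), ~ shuffle_cummean(df, val))

# .id records the list position; convert it from character to integer
out1 <- bind_rows(out, .id = "iteration") %>%
  mutate(iteration = as.integer(iteration))
```

Here the original ID column travels with each row, so it shows which observation landed in which position, while n_values counts how many values went into each cumulative mean.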