R: Looping a Loop-CodePudding

I am working with the R programming language. I am trying to simulate the following "game:

There is a population of 100 units
You randomly sample 10 of these units, record the id's of the units you saw, and then put them back into the population
You then take a second sample, record the id's of the units you saw in this second sample along with the first sample, and then put the second sample back into the population
Repeat this many times

I wrote the following code in R that performs the above procedure:

library(dplyr)

var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)

my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100


results <- list()
results2<- list()

for (i in 1:100)
    
{
    
    iteration_i = i
    
    sample_i = my_data[sample(nrow(my_data), 10), ]
    
    
    results_tmp = data.frame(iteration_i, sample_i)
    
    results[[i]] <- results_tmp
    
}

results_df <- do.call(rbind.data.frame, results)

test_1 <- data.frame(results_df %>% 
    group_by(id) %>% 
    filter(iteration_i == min(iteration_i)) %>% 
    distinct)


summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))

cumulative = cumsum(summary_file$Count)

summary_file$Cumulative = cumulative

summary_file$unobserved = 100 - cumulative

The result looks something like this:

> summary_file
   iteration_i Count Cumulative unobserved
1            1    10         10         90
2            2     8         18         82
3            3     9         27         73
4            4     8         35         65
5            5     6         41         59
6            6     5         46         54
7            7     7         53         47
8            8     7         60         40
9            9     4         64         36
10          10     3         67         33
11          11     4         71         29
12          12     4         75         25
13          13     1         76         24
14          14     4         80         20
15          15     1         81         19
16          16     2         83         17
17          17     2         85         15
18          18     1         86         14
19          20     1         87         13
20          22     1         88         12
21          23     2         90         10
22          24     1         91          9
23          25     1         92          8
24          27     2         94          6
25          28     1         95          5
26          30     1         96          4
27          35     1         97          3
28          37     1         98          2
29          44     1         99          1
30          46     1        100          0

I would now like to repeat this "game" many times.

I would like to keep the "summary_file" for each "game" (e.g. summary_file_1, summary_file_2, summary_file_3, etc.)
I would then like to create a "total" summary file that shows the number of iterations that were required in each game to observe all units.

This total_summary_file would look something like this:

 game_id iterations_required
1  game_1                  47
2  game_2                  45
3  game_3                  44
4  game_4                  42
5  game_5                  42

Currently, I am just copy/pasting my earlier code several times and storing the results, then I append everything at the end and calculate the summary statistics - but I am trying to find a way to "loop the loop" and do everything at once. I do not know if it is possible to introduce a command like "results_df_i <- do.call(rbind.data.frame, results_i)" into the loop and efficiently create everything at the same time instead of manually copy/pasting the earlier loop.

Can someone please show me how to do this?

Thanks!

Note: Ideally, I would like to keep as much of the same code I have used already.

CodePudding user response：

You're making this a lot less efficient than it could be. To get, say, 100 repeated samples of 10 from the set 1:100 (with replacement), we can do replicate(100, sample(100, 10, TRUE)).

We can then coerce this into a vector and count the number of unique values every 10 entries along the vector until we get to 100. This gives us the number of iterations required to exhaust the samples.

If we put this inside an sapply, we don't even need an explicit loop, which means we can create the results data frame in a single call:

set.seed(1)

n_games <- 10

results <- data.frame(game_id = paste("game", seq(n_games), sep = "_"),
           iterations_required = sapply(seq(n_games), function(x) {
  samp <- c(replicate(100, sample(100, 10, TRUE)))
  sum(sapply(1:100 * 10, function(n) length(unique(samp[1:n]))) < 100)
  }))

results
#>    game_id iterations_required
#> 1   game_1                  59
#> 2   game_2                  44
#> 3   game_3                  54
#> 4   game_4                  59
#> 5   game_5                  57
#> 6   game_6                  58
#> 7   game_7                  96
#> 8   game_8                  60
#> 9   game_9                  71
#> 10 game_10                  33

^{Created on 2022-06-11 by the reprex package (v2.0.1)}

CodePudding user response：

Using Markov chains, we can produce the cumulative distribution function for the number of iterations required for a game (up to machine precision). The resulting CDF can be sampled directly using findInterval.

We can simplify things slightly by starting with the second iteration, since the first iteration will always result in 90 unseen units.

First, set up a matrix for all possible transitions:

m <- matrix(c(rep(90:1, each = 11), sequence(rep(11,90), 90:1, -1)), ncol = 2, dimnames = list(NULL, c("from", "to")))
m <- m[m[,2] >= 0L,]

Then create a transition matrix with row 1 representing the state where all units have been seen and row 91 representing the state where 10 units have been seen:

mTrans <- matrix(0, 91, 91)

The number of previously unseen units selected follows the hypergeometric distribution.

mTrans[m   1L] <- dhyper(m[,1] - m[,2], m[,1], 100L - m[,1], 10L)

Row 1 represents an absorbing state since all units have been seen.

mTrans[1, 1] <- 1

mTrans contains the probabilities of each state after the second iteration.

Initialize a while loop and calculate the CDF.

mm <- mTrans %*% mTrans
maxIter <- 1000L
p <- numeric(maxIter)
iter <- 3L

while (p[iter] < 1) {
  if ((iter <- iter   1L) > maxIter) {
    p <- c(p, numeric(maxIter))
    maxIter <- maxIter*2L
  }
  
  mm <- mm %*% mTrans
  p[iter] <- mm[91, 1]
}

p <- p[1:iter]
iter
#> [1] 345

Machine precision limits the CDF to less than 345 iterations. Plot the CDF:

plot(p, xlab = "iterations", ylab = "cumulative probability")

Using findInterval we can quickly generate a large number of random samples of the iterations required.

ngames <- 1e6L # one million games
results <- data.frame(game_id = 1:ngames, iterations_required = findInterval(runif(ngames), p))
head(results)
#>   game_id iterations_required
#> 1       1                  73
#> 2       2                  69
#> 3       3                  40
#> 4       4                  41
#> 5       5                  44
#> 6       6                  43

Get a histogram of the sample number of iterations required.

hist(results$iterations_required)