I am working with the R programming language. I am trying to simulate the following "game:
- There is a population of 100 units
- You randomly sample 10 of these units, record the id's of the units you saw, and then put them back into the population
- You then take a second sample, record the id's of the units you saw in this second sample along with the first sample, and then put the second sample back into the population
- Repeat this many times
I wrote the following code in R that performs the above procedure:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
The result looks something like this:
> summary_file
iteration_i Count Cumulative unobserved
1 1 10 10 90
2 2 8 18 82
3 3 9 27 73
4 4 8 35 65
5 5 6 41 59
6 6 5 46 54
7 7 7 53 47
8 8 7 60 40
9 9 4 64 36
10 10 3 67 33
11 11 4 71 29
12 12 4 75 25
13 13 1 76 24
14 14 4 80 20
15 15 1 81 19
16 16 2 83 17
17 17 2 85 15
18 18 1 86 14
19 20 1 87 13
20 22 1 88 12
21 23 2 90 10
22 24 1 91 9
23 25 1 92 8
24 27 2 94 6
25 28 1 95 5
26 30 1 96 4
27 35 1 97 3
28 37 1 98 2
29 44 1 99 1
30 46 1 100 0
I would now like to repeat this "game" many times.
I would like to keep the "summary_file" for each "game" (e.g. summary_file_1, summary_file_2, summary_file_3, etc.)
I would then like to create a "total" summary file that shows the number of iterations that were required in each game to observe all units.
This total_summary_file would look something like this:
game_id iterations_required
1 game_1 47
2 game_2 45
3 game_3 44
4 game_4 42
5 game_5 42
Currently, I am just copy/pasting my earlier code several times and storing the results, then I append everything at the end and calculate the summary statistics - but I am trying to find a way to "loop the loop" and do everything at once. I do not know if it is possible to introduce a command like "results_df_i <- do.call(rbind.data.frame, results_i)"
into the loop and efficiently create everything at the same time instead of manually copy/pasting the earlier loop.
Can someone please show me how to do this?
Thanks!
Note: Ideally, I would like to keep as much of the same code I have used already.
CodePudding user response:
You're making this a lot less efficient than it could be. To get, say, 100 repeated samples of 10 from the set 1:100 (with replacement), we can do replicate(100, sample(100, 10, TRUE))
.
We can then coerce this into a vector and count the number of unique values every 10 entries along the vector until we get to 100. This gives us the number of iterations required to exhaust the samples.
If we put this inside an sapply
, we don't even need an explicit loop, which means we can create the results data frame in a single call:
set.seed(1)
n_games <- 10
results <- data.frame(game_id = paste("game", seq(n_games), sep = "_"),
iterations_required = sapply(seq(n_games), function(x) {
samp <- c(replicate(100, sample(100, 10, TRUE)))
sum(sapply(1:100 * 10, function(n) length(unique(samp[1:n]))) < 100)
}))
results
#> game_id iterations_required
#> 1 game_1 59
#> 2 game_2 44
#> 3 game_3 54
#> 4 game_4 59
#> 5 game_5 57
#> 6 game_6 58
#> 7 game_7 96
#> 8 game_8 60
#> 9 game_9 71
#> 10 game_10 33
Created on 2022-06-11 by the reprex package (v2.0.1)
CodePudding user response:
Using Markov chains, we can produce the cumulative distribution function for the number of iterations required for a game (up to machine precision). The resulting CDF can be sampled directly using findInterval
.
We can simplify things slightly by starting with the second iteration, since the first iteration will always result in 90 unseen units.
First, set up a matrix for all possible transitions:
m <- matrix(c(rep(90:1, each = 11), sequence(rep(11,90), 90:1, -1)), ncol = 2, dimnames = list(NULL, c("from", "to")))
m <- m[m[,2] >= 0L,]
Then create a transition matrix with row 1 representing the state where all units have been seen and row 91 representing the state where 10 units have been seen:
mTrans <- matrix(0, 91, 91)
The number of previously unseen units selected follows the hypergeometric distribution.
mTrans[m 1L] <- dhyper(m[,1] - m[,2], m[,1], 100L - m[,1], 10L)
Row 1 represents an absorbing state since all units have been seen.
mTrans[1, 1] <- 1
mTrans
contains the probabilities of each state after the second iteration.
Initialize a while loop and calculate the CDF.
mm <- mTrans %*% mTrans
maxIter <- 1000L
p <- numeric(maxIter)
iter <- 3L
while (p[iter] < 1) {
if ((iter <- iter 1L) > maxIter) {
p <- c(p, numeric(maxIter))
maxIter <- maxIter*2L
}
mm <- mm %*% mTrans
p[iter] <- mm[91, 1]
}
p <- p[1:iter]
iter
#> [1] 345
Machine precision limits the CDF to less than 345 iterations. Plot the CDF:
plot(p, xlab = "iterations", ylab = "cumulative probability")
Using findInterval
we can quickly generate a large number of random samples of the iterations required.
ngames <- 1e6L # one million games
results <- data.frame(game_id = 1:ngames, iterations_required = findInterval(runif(ngames), p))
head(results)
#> game_id iterations_required
#> 1 1 73
#> 2 2 69
#> 3 3 40
#> 4 4 41
#> 5 5 44
#> 6 6 43
Get a histogram of the sample number of iterations required.
hist(results$iterations_required)