R: Counting the Number of Times a Condition Appears in a Data Frame

Time:11-22

I am working with the R programming language. I thought of the following question I would like to test:

  • Suppose there is a coin that has a 5% chance of landing on HEADS and a 95% chance of landing on tails

Based on a computer simulation, I want to find out the following :

  • The minimum number of flips before observing HEADS, TAILS, HEADS

  • The average number of flips before observing HEADS, TAILS, HEADS

  • The maximum number of flips before observing HEADS, TAILS, HEADS

I tried to write a simulation in R that flips this coin three times; I call this a "run". The simulation then performs 100 "runs":

results <- list()

for (i in 1:100){

response_i <- c("H","T")
response_i <- sample(response_i, 3, replace=TRUE, 
                        prob=c(0.05, 0.95))
response_i <- as.factor(response_i)

iteration_i = i

run_i = data.frame(response_i, iteration_i)

 results[[i]] <- run_i

}

This looks as follows (e.g. run #22, #23, #24):

[[22]]
  response_i iteration_i
1          T          22
2          T          22
3          T          22

[[23]]
  response_i iteration_i
1          H          23
2          T          23
3          T          23

[[24]]
  response_i iteration_i
1          T          24
2          T          24
3          T          24

My Question:

I would like to modify the above code so that:

1) The simulation automatically stops after you see the first H, T, H (right now, I run the simulation for 100 runs and hope this is enough to observe at least one H, T, H)

2) Once the first H, T, H appears and the simulation automatically stops, I would like to record at which "run" this took place (i.e. what was the value of "iteration_i"? )

3) I would then like to repeat this entire simulation 100 times (100 runs * 100 times = 10,000 runs of three flips each)

Once this is completed, I will be able to make histograms which show the minimum number of coin flips, the average number of coin flips and the maximum number of coin flips before H, T, H was observed (using "ggplot2", I don't think this will be too difficult).

#sample data
number_of_runs_before_HTH_appeared = c(15,10, 11, 8, 12, 21, 32, 7, 9, 20, 22, 3, 16, 7 )

hist(number_of_runs_before_HTH_appeared, main = "Number of Runs before HTH")


But can someone please help me format/modify my code to simplify what I am trying to achieve? At the moment, I used the following R code to "tidy up" the 100 "runs" from the first simulation:

results_df <- do.call(rbind.data.frame, results)

 head(results_df)
  response_i iteration_i
1          T           1
2          T           1
3          T           1
4          T           2
5          T           2
6          T           2

I then manually repeated this many times, e.g.:

results_df_1 <- do.call(rbind.data.frame, results)
results_df_1$index = 1

#re-run original simulation
results_df_2<- do.call(rbind.data.frame, results)
results_df_2$index = 2

#re-run original simulation (many times)
results_df_n<- do.call(rbind.data.frame, results)
results_df_n$index = n

final <- data.frame(results_df_1, results_df_2, results_df_n)

I then imported this "final" file into Microsoft Excel and tried to manually collapse the "final file" to answer my three original questions - but I was hoping that someone could show me how to do this by modifying my original code in R.

Can someone please help me with this?

Thanks!

CodePudding user response:

This might get you started

library(data.table)
n <- 100
simulations <- 100
# build data.table with simulations
DT <- data.table(sim = rep(1:simulations, each = n),
                 step = 1:n,
                 flip = lapply(
                   lapply(seq.int(n * simulations), 
                          function(x) sample(c("H","T"), 3, replace=TRUE, prob=c(0.05, 0.95))),
                   paste0, collapse = ""))
DT
#        sim step flip
#     1:   1    1  TTT
#     2:   1    2  TTT
#     3:   1    3  TTT
#     4:   1    4  TTT
#     5:   1    5  TTT
#    ---              
#  9996: 100   96  THT
#  9997: 100   97  TTT
#  9998: 100   98  TTT
#  9999: 100   99  TTT
# 10000: 100  100  TTT

# find the first step where flip == "HTH" for each sim
DT[flip == "HTH", .(first_hth = min(step)), by = .(sim)]
#     sim first_hth
#  1:   2        88
#  2:   6        30
#  3:   7        16
#  4:  19        38
#  5:  26        42
#  6:  28        91
#  7:  34        87
#  8:  36        96
#  9:  38        98
# 10:  45        74
# 11:  48        93
# 12:  51        46
# 13:  53        30
# 14:  60        70
# 15:  68        27
# 16:  69        53
# 17:  88        35
# 18:  91        88
# 19:  94        74
# 20:  95       100
# 21:  97        81
# 22:  98        24
#     sim first_hth
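
Only 22 of the 100 simulations appear in the table above, because a simulation that never produces HTH within its 100 runs has no row. The following is a minimal sketch, assuming each run is an independent draw of three flips as in the question, that checks whether this share is plausible and summarises the values shown. The variable names (`p_hth`, `p_hit_100`, `hth_summary`, `share_found`) are made up for illustration, and the `first_hth` vector is copied by hand from the output above.

```r
# Per-run probability of H, T, H with P(H) = 0.05 and P(T) = 0.95
p_hth <- 0.05 * 0.95 * 0.05          # 0.002375

# Chance that at least one of 100 independent runs is HTH (roughly 0.21)
p_hit_100 <- 1 - (1 - p_hth)^100

# first_hth values copied from the output above (22 of 100 simulations)
first_hth <- c(88, 30, 16, 38, 42, 91, 87, 96, 98, 74, 93,
               46, 30, 70, 27, 53, 35, 88, 74, 100, 81, 24)

# Summary over the simulations that did produce HTH within 100 runs
hth_summary <- c(min  = min(first_hth),
                 mean = mean(first_hth),
                 max  = max(first_hth))

# Share of simulations in which HTH was found at all
share_found <- length(first_hth) / 100

print(p_hit_100)
print(hth_summary)
print(share_found)
```

Because every simulation is cut off at 100 runs, this summary only describes the simulations that succeeded early; it probably understates the typical waiting time, so it should not be read as the answer to the original min/average/max question.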

CodePudding user response:

A combination of a for loop and a while loop will serve your purpose.

The for loop repeats the experiment as many times as you want, and the while loop stops each experiment as soon as the first H, T, H combination occurs.

Below is code you can use before proceeding with the further steps.

results <- list()

for (j in 1:100) {

  response_i <- ''
  i <- 1

  while (response_i != 'HTH') {
    response_i <- c("H","T")
    response_i <- sample(response_i, 3, replace=TRUE, 
                         prob=c(0.05, 0.95))
    response_i <- paste(response_i, collapse = '')

    iteration_i = i
    if (response_i == 'HTH') {
      run_i = data.frame(response_i, iteration_i)
      results[[j]] <- run_i
      }
    i <- i + 1
  }
}

data <- do.call('rbind', results)

head(data)

      response_i iteration_i
1        HTH        1206
2        HTH          46
3        HTH         116
4        HTH        1633
5        HTH         889
6        HTH          43
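
The three summary numbers from the question (minimum, average, maximum) can then be read off the recorded run indices. As a hedged alternative to the loop, and not the method used in the answer above: if each run is an independent draw of three flips, the index of the first HTH run follows a geometric distribution, so it can be drawn directly with `rgeom()`. The sketch below assumes that model; `n_sims`, `runs_until_hth`, `run_summary` and `expected_runs` are illustrative names, and the seed is arbitrary.

```r
set.seed(1)                           # arbitrary seed, only for reproducibility

p_hth  <- 0.05 * 0.95 * 0.05          # probability that a single run is HTH
n_sims <- 100                         # number of repeated experiments

# rgeom() returns the number of failed runs before the first success,
# so adding 1 gives the index of the run in which HTH first appears
runs_until_hth <- rgeom(n_sims, prob = p_hth) + 1

# Minimum, average and maximum number of runs until the first HTH
run_summary <- c(min  = min(runs_until_hth),
                 mean = mean(runs_until_hth),
                 max  = max(runs_until_hth))

# Theoretical mean of the geometric waiting time, about 421 runs
expected_runs <- 1 / p_hth

print(run_summary)
print(expected_runs)

# Base-R histogram, in the same style as the one in the question
hist(runs_until_hth, main = "Number of Runs before HTH")
```

The same `min()`, `mean()` and `max()` calls should work on `data$iteration_i` from the loop above. A theoretical mean of about 421 runs is consistent with the loop's output containing values well above 100.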