Home > Enterprise >  birthday paradox function in R
birthday paradox function in R

Time:06-23

I'm a beginner in R and am trying to create a birthday paradox function and managed to reach this point, and the result is approximately 0.5, as expected.

k <- 23 
sims <- 1000 
event <- 0 
for (i in 1:sims) {
  days <- sample(1:365, k, replace = TRUE)
  days.unique <- unique(days) 
  if (length(days.unique) < k) {
    event <- event   1 } 
  answer <- event/sims}
  answer

However, when I tried to put that into a function, the result was always 0.001. Here is the code:

bdayfunction<- function(k){
 sims <- 1000 
 event <- 0 
 for (i in 1:sims) {
  days <- sample(1:365, k, replace = TRUE)
  days.unique <- unique(days) 
  if (length(days.unique) < k) {
    event <- event   1 } 
 answer <- event/sims
 return (answer)
 }
}

What have I done wrong?

CodePudding user response:

Your return is not in the right place: it is in the loop (the same holds for your answer calculation by the way).

This works:

bdayfunction<- function(k){
  sims <- 1000 
  event <- 0 
  for (i in 1:sims) {
    days <- sample(1:365, k, replace = TRUE)
    days.unique <- unique(days) 
    if (length(days.unique) < k) {
      event <- event   1 }   
  }
  answer <- event/sims
  return (answer)
}

In R, you can make use of libraries that allows you to do grouping operation. The two main ones are data.table and dplyr. Here, instead of doing a loop, you could try to create a long data.frame with all your simulations, to then calculate the unique number of days per simulation and then count the number of occurrence below k. With dplyr:

library(dplyr)

bdayfunction_dplyr <- function(k){  
  df <- data.frame(sim = rep(1:sims,each = k),
                   days = sample(1:365, k*sims, replace = TRUE))
  return(
    df %>%
    group_by(sim) %>%
    summarise(plouf = length(unique(days))< k) %>%
    summarise(out = sum(plouf)/1000) %>%
    pull(out)
    )  
}

In data.table:

library(data.table)

bdayfunction_data.table <- function(k){
  dt <- data.table(sim = rep(1:sims,each = k),
                   days = sample(1:365, k*sims, replace = TRUE))

  return(dt[,length(unique(days)),sim][V1<k,.N/1000])
}

You can test that they provide the same result:

set.seed(123)
bdayfunction(23)
[1] 0.515

set.seed(123)
bdayfunction_dplyr(23)
[1] 0.515

set.seed(123)
bdayfunction_data.table(23)
[1] 0.515

Now lets compare the speed:

library(microbenchmark)

microbenchmark(initial = bdayfunction(23),
               dplyr = bdayfunction_dplyr(23),
               data.table = bdayfunction_data.table(23))

Unit: milliseconds
       expr     min       lq      mean  median       uq      max neval cld
    initial  7.3252  7.56900  8.435564  7.7441  8.15995  24.7681   100  a 
      dplyr 12.3488 12.96285 16.846118 13.3777 14.71370 295.6716   100   b
 data.table  5.9186  6.24115  6.540183  6.4494  6.75640   8.1466   100  a 

You see that data.table is slightly faster than your initial loop, and shorter to write.

  • Related