R simulating sample with unique observations

Time:09-22

I have a dataset with two grouping variables and one outcome variable. I am trying to simulate sampling from this dataset, but I only want samples where no ID from either variable repeats.

My data is structured like this but with a few hundred rows:

structure(list(wteam = c("a", "a", "b", "c", "c", "d" ), week = c(1, 1, 1, 2, 2, 2), dif = c(0.649077088, 0.089812768, 0.173061282, 0.362544332, 0.459545808, 0.331745704)), row.names = c(NA, 6L), class = "data.frame")

I'm trying to sample 23 observations from the dataset such that neither wteam nor week is repeated in the sample.

My current approach is horribly inefficient:

    sims<-10
    weeks<-23
    outs<-as.data.frame(matrix(0,ncol = sims, nrow = weeks))    
    start<-as.data.frame(matrix(0,ncol = 3, nrow = 23))
    names(start)<-names(nfl_cur)
    
    for(i in 1:sims) {

    start[1,]<- nfl_cur %>% sample_n(1)
    nfl_cur2 <- subset(nfl_cur, !(wteam %in% start$wteam))
    nfl_cur2 <- subset(nfl_cur, !(week %in% start$week))

    start[2,]<-nfl_cur2 %>% sample_n(1)
    nfl_cur3 <- subset(nfl_cur2, !(wteam %in% start$wteam))
    nfl_cur3 <- subset(nfl_cur2, !(week %in% start$week))
    
    start[3,]<-nfl_cur3 %>% sample_n(1)
    nfl_cur4 <- subset(nfl_cur3, !(wteam %in% start$wteam))
    nfl_cur4 <- subset(nfl_cur3, !(week %in% start$week))
        ...
    outs[,i]<-start$dif  
    }

I then repeat this pattern until I reach 23 rows. However, when I run the code, the "outs" data frame is filled with 0s after the first iteration, I assume because nfl_cur is still being filtered against start.

Any help would be appreciated!

CodePudding user response:

If I understood correctly, this might help you:

    # Libraries
    library(dplyr)

    # Example data
    df <-
      structure(list(wteam = c("a", "a", "b", "c", "c", "d" ), week = c(1, 1, 1, 2, 2, 2), dif = c(0.649077088, 0.089812768,  0.173061282, 0.362544332, 0.459545808, 0.331745704)), row.names = c(NA,  6L), class = "data.frame")

    # Sample 1 row for each wteam/week combination
    df %>% 
      group_by(wteam, week) %>% 
      sample_n(1)

    # A tibble: 4 x 3
    # Groups:   wteam, week [4]
      wteam  week    dif
      <chr> <dbl>  <dbl>
    1 a         1 0.0898
    2 b         1 0.173 
    3 c         2 0.363 
    4 d         2 0.332 
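
Note that grouping by both columns returns one row per wteam/week combination, so the same wteam or week can still appear more than once (a and b both fall in week 1 in the output shown). If the goal is that neither column repeats, one possible approach is a greedy draw: pick a random row, drop every remaining row that shares its wteam or week, and repeat. The sketch below uses base R only; `sample_unique` is a hypothetical helper name, and the sample size of 2 is an assumption chosen because the example data has only two weeks.

```r
# Example data from the question
df <- data.frame(
  wteam = c("a", "a", "b", "c", "c", "d"),
  week  = c(1, 1, 1, 2, 2, 2),
  dif   = c(0.649077088, 0.089812768, 0.173061282,
            0.362544332, 0.459545808, 0.331745704)
)

# Hypothetical helper: draw up to n rows with no repeated wteam or week.
sample_unique <- function(data, n) {
  picked <- data[0, ]   # empty data frame with the same columns
  pool <- data          # rows still eligible to be drawn
  while (nrow(picked) < n && nrow(pool) > 0) {
    row <- pool[sample(nrow(pool), 1), ]
    picked <- rbind(picked, row)
    # Drop every remaining row that shares a wteam or a week with the pick
    pool <- pool[pool$wteam != row$wteam & pool$week != row$week, ]
  }
  picked
}

set.seed(1)
s <- sample_unique(df, 2)

# Repeat the draw several times, one column of dif values per simulation
outs <- replicate(5, sample_unique(df, 2)$dif)
```

Because the pool is rebuilt from the full data on every call, nothing carries over between simulations. A greedy draw can run out of eligible rows before reaching n on real data, in which case it returns fewer rows (and `replicate()` returns a list instead of a matrix), so checking `nrow()` and redrawing may be needed.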

CodePudding user response:

An option with data.table

    library(data.table)
    setDT(df)[, .SD[sample(seq_len(.N), 1)], .(wteam, week)]

Output:

       wteam week        dif
    1:     a    1 0.08981277
    2:     b    1 0.17306128
    3:     c    2 0.36254433
    4:     d    2 0.33174570
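
If no wteam and no week may repeat across the whole sample, a rough data.table sketch is to shuffle the rows and then keep the first row per week, followed by the first row per wteam among those. This is only one possible approach, not a uniform sampler, and it can return fewer rows than there are weeks when the same wteam is drawn for two different weeks. The names `dt`, `shuffled` and `s2` are illustrative.

```r
library(data.table)

# Example data from the question
df <- data.frame(
  wteam = c("a", "a", "b", "c", "c", "d"),
  week  = c(1, 1, 1, 2, 2, 2),
  dif   = c(0.649077088, 0.089812768, 0.173061282,
            0.362544332, 0.459545808, 0.331745704)
)

dt <- as.data.table(df)   # copy, so df itself is not modified by reference

set.seed(42)
shuffled <- dt[sample(.N)]                 # random row order

# unique(..., by = ) keeps the first row for each value of the given column:
# first deduplicate on week, then on wteam among the surviving rows
s2 <- unique(unique(shuffled, by = "week"), by = "wteam")
```

With the example data the two weeks have disjoint teams, so `s2` always has two rows; on real data it is worth checking `nrow(s2)` and redrawing if the sample comes up short.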