I have a dataset with two grouping variables and one outcome variable. I am trying to simulate sampling from this dataset, but I only want samples where no ID from either variable repeats.
My data is structured like this but with a few hundred rows:
structure(list(wteam = c("a", "a", "b", "c", "c", "d"),
               week = c(1, 1, 1, 2, 2, 2),
               dif = c(0.649077088, 0.089812768, 0.173061282,
                       0.362544332, 0.459545808, 0.331745704)),
          row.names = c(NA, 6L), class = "data.frame")
I'm trying to sample 23 observations from the dataset such that neither wteam nor week is repeated within the sample.
My current approach is horribly inefficient:
sims<-10
weeks<-23
outs<-as.data.frame(matrix(0,ncol = sims, nrow = weeks))
start<-as.data.frame(matrix(0,ncol = 3, nrow = 23))
names(start)<-names(nfl_cur)
for(i in 1:sims) {
start[1,]<- nfl_cur %>% sample_n(1)
nfl_cur2 <- subset(nfl_cur, !(wteam %in% start$wteam))
nfl_cur2 <- subset(nfl_cur, !(week %in% start$week))
start[2,]<-nfl_cur2 %>% sample_n(1)
nfl_cur3 <- subset(nfl_cur2, !(wteam %in% start$wteam))
nfl_cur3 <- subset(nfl_cur2, !(week %in% start$week))
start[3,]<-nfl_cur3 %>% sample_n(1)
nfl_cur4 <- subset(nfl_cur3, !(wteam %in% start$wteam))
nfl_cur4 <- subset(nfl_cur3, !(week %in% start$week))
...
outs[,i]<-start$dif
}
and then I repeat this pattern until I get to 23 rows. However, when I run the code, the "outs" data frame is filled with 0s after the first iteration, I assume because nfl_cur is still being filtered against the values left over in start.
Any help would be appreciated!
CodePudding user response:
If I understood the question correctly, this might help:
#Libraries
library(dplyr)
#Example Data
df <-
structure(list(wteam = c("a", "a", "b", "c", "c", "d" ), week = c(1, 1, 1, 2, 2, 2), dif = c(0.649077088, 0.089812768, 0.173061282, 0.362544332, 0.459545808, 0.331745704)), row.names = c(NA, 6L), class = "data.frame")
#Sample one row from each wteam/week combination
df %>%
group_by(wteam,week) %>%
sample_n(1)
# A tibble: 4 x 3
# Groups:   wteam, week [4]
  wteam  week    dif
  <chr> <dbl>  <dbl>
1 a         1 0.0898
2 b         1 0.173
3 c         2 0.363
4 d         2 0.332
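Note that in the printed result above each week shows up twice, so grouping by both columns gives one row per wteam/week pair rather than a sample in which no team and no week repeats. A quick way to check any candidate sample against the requirement from the question could look like the sketch below. The helper name `has_no_repeats` is made up for illustration, and `grouped` simply retypes the printed output so the snippet runs on its own in base R.

```r
# Rows as printed in the grouped output above (retyped by hand)
grouped <- data.frame(
  wteam = c("a", "b", "c", "d"),
  week  = c(1, 1, 2, 2),
  dif   = c(0.0898, 0.173, 0.363, 0.332)
)

# Hypothetical helper: TRUE only if no wteam and no week occurs twice
has_no_repeats <- function(s) {
  anyDuplicated(s$wteam) == 0 && anyDuplicated(s$week) == 0
}

has_no_repeats(grouped)             # FALSE: weeks 1 and 2 each appear twice
has_no_repeats(grouped[c(1, 3), ])  # TRUE: a/week 1 and c/week 2
```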
CodePudding user response:
An option with data.table:
library(data.table)
setDT(df)[, .SD[sample(seq_len(.N), 1)], .(wteam, week)]
Output:
   wteam week        dif
1:     a    1 0.08981277
2:     b    1 0.17306128
3:     c    2 0.36254433
4:     d    2 0.33174570
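For the sequential draw described in the question (pick a row, drop everything that shares its team or week, repeat), a loop avoids writing out nfl_cur2, nfl_cur3 and so on by hand. The following is only a sketch in base R on the example data: `draw_unique` is a hypothetical helper, not something from dplyr or data.table, and `n <- 2` is used because the toy data has only two weeks (the real data would use 23). Each simulation starts again from the full data, so nothing from a previous iteration leaks into the next one.

```r
# Example data from the question
df <- data.frame(
  wteam = c("a", "a", "b", "c", "c", "d"),
  week  = c(1, 1, 1, 2, 2, 2),
  dif   = c(0.649077088, 0.089812768, 0.173061282,
            0.362544332, 0.459545808, 0.331745704)
)

# Hypothetical helper: draw up to n rows with no repeated wteam or week
draw_unique <- function(data, n) {
  pool   <- data        # rows still eligible
  picked <- data[0, ]   # empty frame with the same columns
  while (nrow(picked) < n && nrow(pool) > 0) {
    row    <- pool[sample(nrow(pool), 1), ]
    picked <- rbind(picked, row)
    # drop every remaining row that shares the team or the week just drawn
    pool <- pool[pool$wteam != row$wteam & pool$week != row$week, ]
  }
  picked
}

set.seed(1)  # only to make the sketch reproducible
sims <- 10
n    <- 2    # assumption for the toy data; 23 in the question

# One independent draw per simulation, each from the full data
draws <- replicate(sims, draw_unique(df, n), simplify = FALSE)

# One column of dif values per simulation; short draws are padded with NA
outs <- sapply(draws, function(s) s$dif[seq_len(n)])
```

Because the draw is greedy, it can run out of eligible rows before reaching n on real data. In that case the affected column of `outs` ends in NA, and that simulation could be redrawn or discarded.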