Sample percentages in different groups without repeating (R)

I have a data frame such as:

mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1,2,3,4,5), 142, replace = TRUE))

I'm trying to sample 80% and 20% of the rows within each block value, without repeating any row, and put each fraction into new data frames called train (80%) and test (20%). Important: my blocks will not always split exactly 80-20, but I want to get as close as possible to that ratio.

How should I proceed?

I was using sample_frac, but I wasn't able to avoid repeated rows or to join the data afterwards.

User response:

We may use slice_sample with prop = 0.8 after grouping by 'block'. Create a row-index column (row_number()) before grouping, so that the 'test' data can be built by removing the observations that were already taken into train. Note that slice_sample rounds prop times the group size towards zero, so each block contributes at most 80% of its rows to train.

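library(dplyr)

# Tag each row with its position so the sampled rows can be identified later
train <- mydata %>%
   mutate(rn = row_number()) %>%
   group_by(block) %>%
   slice_sample(prop = 0.8) %>%   # samples without replacement by default
   ungroup()

# test is every row whose index was not drawn into train
test <- mydata[setdiff(seq_len(nrow(mydata)), train$rn), ]
train$rn <- NULL   # drop the helper column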
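
To check that the split behaves as described, the sketch below rebuilds it on a seeded copy of the example data and verifies that train and test share no rows. The seed, the `indexed` and `train_share` names, and the use of anti_join in place of setdiff are assumptions made for illustration, not part of the answer above.

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the sketch is reproducible
mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1, 2, 3, 4, 5), 142, replace = TRUE))

# Keep the row index in a separate object so both splits can use it
indexed <- mydata %>% mutate(rn = row_number())

train <- indexed %>%
  group_by(block) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()

# anti_join keeps the rows of `indexed` whose rn does not occur in train
test <- anti_join(indexed, train, by = "rn")

# No row is shared, and together the two sets cover every row
stopifnot(length(intersect(train$rn, test$rn)) == 0)
stopifnot(nrow(train) + nrow(test) == nrow(mydata))

# Share of each block that ended up in train (at most 0.8 per block)
train_share <- table(train$block) / table(indexed$block)
print(round(train_share, 3))
```

Because the count per block is rounded down, each share printed at the end should sit at or slightly below 0.8.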

User response:

Alternatively, use rsample::initial_split with strata = block:

library(rsample)
split <- initial_split(mydata, prop = .8, strata = block)

training(split)
testing(split)

prop.table(table(training(split)$block))
#>         1         2         3         4         5 
#> 0.1769912 0.2300885 0.2035398 0.1769912 0.2123894 
prop.table(table(testing(split)$block))
#>         1         2         3         4         5 
#> 0.1724138 0.2413793 0.2068966 0.1724138 0.2068966
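
Since the question asks for a split as close to 80-20 as possible, a base R variant that rounds each block's train count to the nearest integer, rather than down, may also be worth considering. This is only a sketch and not taken from either answer; the seed and the `train_idx` name are assumptions.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1, 2, 3, 4, 5), 142, replace = TRUE))

# Group the row indices by block, then draw round(0.8 * n) of them per block
train_idx <- unlist(lapply(
  split(seq_len(nrow(mydata)), mydata$block),
  function(idx) {
    n_train <- round(0.8 * length(idx))
    # sample.int picks positions without replacement, so no row repeats
    idx[sample.int(length(idx), n_train)]
  }
))

train <- mydata[train_idx, ]
test  <- mydata[-train_idx, ]   # everything not drawn into train

# Rows per block in each set
print(table(train$block))
print(table(test$block))
```

Sampling positions with sample.int, instead of calling sample on the index vector directly, avoids the surprise that sample(x, ...) treats a single number x as the range 1:x when a block has only one row.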