I have a data frame such as:
mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1,2,3,4,5), 142, replace = TRUE))
I'm trying to sample 80% and 20% of each block value, without repeating rows, and to put each fraction into new data frames called train (80%) and test (20%). Important: my blocks will not always split exactly 80-20, but I'm trying to get as close as possible to that ratio.
How should I proceed?
I was using sample_frac but wasn't able to avoid repeated rows or to join the data afterwards.
CodePudding user response:
We may use slice_sample with prop (the proportion) set to 0.8 after grouping by 'block'. Create a sequence column (row_number()) before grouping so that it can be used to create the 'test' data by removing those observations that were already taken in train.
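library(dplyr)

# Tag each row with its position, then sample 80% of the rows within each block
train <- mydata %>%
  mutate(rn = row_number()) %>%
  group_by(block) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()

# Rows whose index was not sampled into train form the test set
test <- mydata[setdiff(seq_len(nrow(mydata)), train$rn), ]

# Drop the helper index column
train$rn <- NULL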
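For comparison, the same idea (index the rows, sample 80% of the indices within each block, take the rest as test) can be sketched in base R together with a quick check that the two sets do not overlap. This is only an illustrative sketch: the seed value and the helper names idx_by_block and train_idx are assumptions, not part of the answer above.

```r
set.seed(42)  # assumption: any fixed seed, only to make the sketch reproducible

mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1, 2, 3, 4, 5), 142, replace = TRUE))

# Row indices of mydata, grouped by block
idx_by_block <- split(seq_len(nrow(mydata)), mydata$block)

# Within each block, draw floor(0.8 * n) row indices without replacement.
# sample.int() on the length avoids the sample(x) pitfall when a block has one row.
train_idx <- unlist(lapply(idx_by_block, function(i) {
  i[sample.int(length(i), floor(0.8 * length(i)))]
}), use.names = FALSE)

train <- mydata[train_idx, ]
test  <- mydata[setdiff(seq_len(nrow(mydata)), train_idx), ]

# Sanity checks: no row is in both sets, and every row is in exactly one
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)
stopifnot(nrow(train) + nrow(test) == nrow(mydata))

# Per-block row counts in each set
print(table(train$block))
print(table(test$block))
```

Because the per-block count is rounded down here, a block whose size is not a multiple of 5 ends up slightly under 80% in train, which matches the "as close as possible" requirement in the question.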
CodePudding user response:
With rsample::initial_split with strata = block:
library(rsample)

# Stratified 80/20 split on block
split <- initial_split(mydata, prop = .8, strata = block)
training(split)  # the 80% part
testing(split)   # the 20% part

# Block proportions are similar in both parts
prop.table(table(training(split)$block))
#>         1         2         3         4         5
#> 0.1769912 0.2300885 0.2035398 0.1769912 0.2123894
prop.table(table(testing(split)$block))
#>         1         2         3         4         5
#> 0.1724138 0.2413793 0.2068966 0.1724138 0.2068966
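To see how close each block gets to 80-20, one option is to compare per-block counts in the training set with the full data. The sketch below assumes a fixed seed for reproducibility and converts block to a factor first, on the assumption (from my reading of the rsample documentation, which says numeric strata may be binned) that this guarantees each block is treated as its own category. The names data_split and train_share are illustrative.

```r
library(rsample)

set.seed(123)  # assumption: any fixed seed, only for reproducibility

mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1, 2, 3, 4, 5), 142, replace = TRUE))

# Assumption: make block categorical so strata correspond to the blocks themselves
mydata$block <- factor(mydata$block)

data_split <- initial_split(mydata, prop = 0.8, strata = block)
train <- training(data_split)
test  <- testing(data_split)

# Share of each block's rows that landed in train (ideally close to 0.8)
train_share <- table(train$block) / table(mydata$block)
print(round(train_share, 2))
```

The two tables line up by block because block is a factor, so every level appears in both even if a set happened to contain no rows for it.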