Creating multiple training subsets using sample() in R

I have a training dataset of 60,000 observations, from which I want to create 9 training subsets. I want to sample randomly without replacement; I need 3 datasets of 500 observations, 3 datasets of 1,000 observations, and 3 datasets of 2,000 observations.

How can I do this using sample() in R?

CodePudding user response:

Assuming your data.frame is named df, you can do:

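# Three subsets each of 500, 1,000 and 2,000 rows (10,500 rows in total)
sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
# Draw all row indices in one call, so no row lands in more than one subset
sampling <- sample(nrow(df), sum(sample_sizes))
training_sets <- split(df[sampling, , drop = FALSE], rep(1:9, sample_sizes))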

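This samples without replacement across the whole dataset, so no observation appears in more than one training set. If you instead want sampling without replacement within each training set only (so the same observation may appear in several training sets):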

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
# Draw each subset separately, so a row may recur across (but not within) subsets
sampling <- unlist(lapply(sample_sizes, function(n) sample(nrow(df), n)))
training_sets <- split(df[sampling, , drop = FALSE], rep(1:9, sample_sizes))
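
To see the difference between the two variants, here is a minimal sketch that runs both on a stand-in data.frame and checks for shared rows. The id column, the fixed seed, and the helper names unique_within and repeats_across are assumptions made for illustration; they are not part of the question's data.

```r
set.seed(1)  # assumption: seed fixed only so the sketch is repeatable
df <- data.frame(id = seq_len(60000))  # hypothetical stand-in for the real data

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
groups <- rep(seq_along(sample_sizes), sample_sizes)

# Variant 1: one draw for all subsets, so subsets cannot share rows
idx_disjoint <- sample(nrow(df), sum(sample_sizes))
sets_disjoint <- split(df[idx_disjoint, , drop = FALSE], groups)

# Variant 2: one draw per subset, so rows may be shared between subsets
idx_independent <- unlist(lapply(sample_sizes, function(n) sample(nrow(df), n)))
sets_independent <- split(df[idx_independent, , drop = FALSE], groups)

# Hypothetical helper: TRUE if no id repeats inside any single subset
unique_within <- function(sets) {
  all(sapply(sets, function(d) anyDuplicated(d$id) == 0))
}

# Hypothetical helper: how many id occurrences, pooled over all subsets,
# are repeats of an id already seen
repeats_across <- function(sets) {
  ids <- unlist(lapply(sets, function(d) d$id))
  sum(duplicated(ids))
}

print(sapply(sets_disjoint, nrow))        # rows per subset
print(unique_within(sets_disjoint))       # no repeats inside a subset
print(repeats_across(sets_disjoint))      # repeats across subsets, variant 1
print(unique_within(sets_independent))
print(repeats_across(sets_independent))   # repeats across subsets, variant 2
```

With the first variant the repeat count is zero by construction. With the second it will usually be well above zero, since 10,500 rows are drawn from 60,000 in nine separate passes.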

CodePudding user response:

I'm not positive if you want the output to look like the screenshot, but if so, here you go:

library(tidyverse)

df <- tibble(rand = runif(6e4))

tibble(`Sample Size` = rep(c(500,1000,2000), each = 3)) |>
  mutate(name = rep(paste(c("First", "Second", "Third"), "Random Sample"), 3),
         samp = map2(`Sample Size`, row_number(), 
                     \(x,y) {set.seed(y); df[sample(1:nrow(df), size = x),]})) |>
  pivot_wider(names_from = name, values_from = samp)
#> # A tibble: 3 x 4
#>   `Sample Size` `First Random Sample` `Second Random Sample` Third Random Samp~1
#>           <dbl> <list>                <list>                 <list>             
#> 1           500 <tibble [500 x 1]>    <tibble [500 x 1]>     <tibble [500 x 1]> 
#> 2          1000 <tibble [1,000 x 1]>  <tibble [1,000 x 1]>   <tibble>           
#> 3          2000 <tibble [2,000 x 1]>  <tibble [2,000 x 1]>   <tibble>           
#> # ... with abbreviated variable name 1: `Third Random Sample`
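
If you would rather avoid the tidyverse dependency, the same per-sample seeding idea can be sketched in base R. This is only an illustration under assumptions: the draw_rows helper and the "size_label" naming scheme are made up here, and calling set.seed() inside a function resets the global RNG state, which may or may not be what you want.

```r
df <- data.frame(rand = runif(6e4))  # hypothetical stand-in data

sizes <- rep(c(500, 1000, 2000), each = 3)
# Hypothetical naming scheme, e.g. "500_First", "1000_Second"
labels <- paste(sizes, rep(c("First", "Second", "Third"), 3), sep = "_")

# Hypothetical helper: seed each draw with its position so that any single
# subset can be regenerated on its own (note: this resets the global RNG)
draw_rows <- function(size, seed) {
  set.seed(seed)
  sample(nrow(df), size)
}

row_ids <- Map(draw_rows, sizes, seq_along(sizes))
names(row_ids) <- labels

# Named list of nine data.frames, one per subset
samples <- lapply(row_ids, function(i) df[i, , drop = FALSE])

# Re-drawing the fourth subset with the same seed yields the same row indices
same_again <- identical(draw_rows(sizes[4], 4), row_ids[["1000_First"]])
print(same_again)
```

Individual subsets can then be pulled out by name, for example samples[["2000_Third"]].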