I have a training dataset of 60,000 observations, from which I want to create 9 training subsets. I want to sample randomly without replacement: I need 3 datasets of 500 observations, 3 of 1,000 observations, and 3 of 2,000 observations.
How can I do this using sample() in R?
CodePudding user response:
Assuming your data.frame is named df, you can do:
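# sizes of the nine training sets
sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
# draw all 10,500 row indices at once, without replacement
sampling <- sample(nrow(df), sum(sample_sizes))
# cut the sampled rows into nine sets of the requested sizes
training_sets <- split(df[sampling, ], rep(1:9, sample_sizes))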
This samples without replacement across the whole dataset, so no observation appears in more than one training set. If you want sampling without replacement within each training set (but not across training sets):
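# sizes of the nine training sets
sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
# one independent draw per set, so rows can repeat across sets
sampling <- unlist(lapply(sample_sizes, function(i) sample(nrow(df), i)))
training_sets <- split(df[sampling, ], rep(1:9, sample_sizes))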
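If you want to convince yourself of the difference between the two variants, here is a small self-contained check. It is only a sketch: the `df` built here is a made-up stand-in with an `id` column, and names such as `groups`, `sets_disjoint`, and `sets_indep` are mine, not part of the answer above. The `set.seed()` call is an addition so the draw can be reproduced.

```r
# Stand-in data: a hypothetical 60,000-row data.frame with an id column.
set.seed(42)  # fix the RNG so the draws are reproducible
df <- data.frame(id = seq_len(60000), x = rnorm(60000))

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
groups <- rep(seq_along(sample_sizes), sample_sizes)

# Variant 1: a single draw, so no row is shared between training sets.
idx_disjoint <- sample(nrow(df), sum(sample_sizes))
sets_disjoint <- split(df[idx_disjoint, ], groups)

# Variant 2: one independent draw per set, so rows may repeat across sets.
idx_indep <- unlist(lapply(sample_sizes, function(n) sample(nrow(df), n)))
sets_indep <- split(df[idx_indep, ], groups)

# Row counts of each resulting set; both should equal sample_sizes.
sizes_disjoint <- unname(sapply(sets_disjoint, nrow))
sizes_indep <- unname(sapply(sets_indep, nrow))

# Row indices drawn more than once across all nine sets.
dup_disjoint <- sum(duplicated(idx_disjoint))  # 0 by construction
dup_indep <- sum(duplicated(idx_indep))        # typically greater than 0

# Within each set of variant 2, every id is still unique.
unique_within <- all(sapply(sets_indep, function(s) !any(duplicated(s$id))))
```

Individual sets can then be pulled out with, for example, `sets_disjoint[["4"]]` for the first 1,000-row sample, since `split()` names the list elements "1" through "9".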
CodePudding user response:
I'm not positive if you want the output to look like the screenshot, but if so, here you go:
library(tidyverse)

df <- tibble(rand = runif(6e4))

tibble(`Sample Size` = rep(c(500, 1000, 2000), each = 3)) |>
  mutate(name = rep(paste(c("First", "Second", "Third"), "Random Sample"), 3),
         samp = map2(`Sample Size`, row_number(),
                     \(x, y) {set.seed(y); df[sample(1:nrow(df), size = x), ]})) |>
  pivot_wider(names_from = name, values_from = samp)
#> # A tibble: 3 x 4
#>   `Sample Size` `First Random Sample` `Second Random Sample` Third Random Samp~1
#>           <dbl> <list>                <list>                 <list>
#> 1           500 <tibble [500 x 1]>    <tibble [500 x 1]>     <tibble [500 x 1]>
#> 2          1000 <tibble [1,000 x 1]>  <tibble [1,000 x 1]>   <tibble>
#> 3          2000 <tibble [2,000 x 1]>  <tibble [2,000 x 1]>   <tibble>
#> # ... with abbreviated variable name 1: `Third Random Sample`