Randomizing a distribution of data in a list

Time:04-10

I have a data frame df that I would like to separate into a training set and a test set. Instead of getting only a single training and test set, I would like to get a distribution of them (n = 100).

I tried to do this with lapply, but every element in each list ends up exactly the same. How do I randomize the values in the two lists (i.e., train.data and test.data)?

The expected output is two lists, train.data and test.data, each containing 100 elements that hold different subsets of df.

library(lubridate)
library(tidyverse)
library(caret)

date <- rep_len(seq(dmy("01-01-2013"), dmy("31-12-2013"), by = "days"), 300)
ID <-  rep(c("A","B","C"), 50)
class <-  rep(c("N","M"), 50)
df <- data.frame(value  = runif(length(date), min = 0.5, max = 25),
                 ID, 
                 class)
training.samples <- df$class %>% 
  createDataPartition(p = 0.6, list = FALSE)


n <- 100

train.data  <- lapply(1:n, function(x){
  df[training.samples, ]
})
test.data <- lapply(1:n, function(x){
  df[-training.samples, ]
})

Answer:

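Try using replicate. In the question's code, training.samples is computed once, outside lapply, so every iteration reuses the same row indices. Calling createDataPartition inside a function that replicate evaluates n times draws a new partition on each call: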

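f1 <- function(dat, colnm) {
  # Draw a new stratified 60% sample of row indices on every call
  s1 <- createDataPartition(dat[[colnm]], p = 0.6, list = FALSE)
  # Return the sampled rows and the remaining rows as a pair
  list(train.data = dat[s1, ], test.data = dat[-s1, ])
}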
n <- 100
out <- replicate(n, f1(df, "class"), simplify = FALSE)
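
Here out is a list of n elements, each holding one train.data/test.data pair. To get the two separate lists the question asks for, one option is to pull each component out with lapply. The sketch below is self-contained so it can be run on its own. The seed, the stand-in df, and the names split_once and n_distinct_splits are assumptions for illustration and are not part of the original answer.

```r
library(caret)

set.seed(42)  # assumed seed, only to make this sketch reproducible

# Hypothetical stand-in for the question's df
df <- data.frame(value = runif(300, min = 0.5, max = 25),
                 ID    = rep(c("A", "B", "C"), 100),
                 class = rep(c("N", "M"), 150))

# Same idea as f1: a fresh partition is drawn on every call
split_once <- function(dat, colnm) {
  idx <- createDataPartition(dat[[colnm]], p = 0.6, list = FALSE)
  list(train.data = dat[idx, ], test.data = dat[-idx, ])
}

n <- 100
out <- replicate(n, split_once(df, "class"), simplify = FALSE)

# Reshape the list of pairs into two lists of n data frames each
train.data <- lapply(out, `[[`, "train.data")
test.data  <- lapply(out, `[[`, "test.data")

# Count how many distinct sets of training rows were drawn
n_distinct_splits <- length(unique(lapply(train.data, rownames)))
```

If n_distinct_splits is greater than 1, the training sets are no longer identical across the list.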