Randomly sample rows from a list, with the number of sampled rows depending on the step, in R

I have a list in which each row is a different record of one of several species (a species may appear more than once across the list). Each record belongs to a given dataset, and no species is repeated within the same dataset.

I need to randomly sample records (rows), but I want the number of sampled rows to change with the step number.

In the reproducible example (below), I would like:
step 1: 1 random sample (row),
step 2: 2 random samples (rows) from different datasets
...
step 11: 11 random samples (rows) from different datasets.

#Example:
x1 <- matrix(rnorm(200), nrow= 100, ncol=2)
x2 <- c(replicate(5, "AA"),replicate(15, "BB"),replicate(15, "CC"),
        replicate(10, "DD"),replicate(10, "EE"),replicate(10, "FF"),
        replicate(10, "GG"),replicate(5, "HH"),replicate(5, "II"),
        replicate(15, "JJ"))
# note: cbind() with a character vector coerces the numeric columns to character
df <- data.frame(cbind(x1,x2))
colnames(df) <- c("variable1", "variable2","dataset")

The only thing I have tried is below, but it is still not what I want, because it samples only according to the dataset:

install.packages("sampling")
library(sampling)

ob <- strata(df, "dataset", size = c(1:100), method = "srswr")

Any thoughts, please?

CodePudding user response：

If I understand you correctly, I think you want something like this (note, it ensures that at step n, there are n rows selected from n different datasets -- if that is not what you want, I can adjust):

library(data.table)
setDT(df)

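# for each step i: pick i distinct datasets, then one random row from each
# (the \(i) lambda shorthand needs R >= 4.1)
lapply(1:5, \(i) {
  ds = sample(unique(df$dataset),i)
  df[dataset %chin% ds, .SD[sample(.N,1)], dataset]
})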

Output:

[[1]]
   dataset         variable1          variable2
1:      GG 0.891759430683143 -0.973274707214832

[[2]]
   dataset          variable1          variable2
1:      FF -0.187478493738627 -0.643696750490574
2:      GG  0.776141815765327 -0.825979276855279

[[3]]
   dataset         variable1         variable2
1:      BB 0.251972001607678  1.19219655379958
2:      CC  1.48277044726544  1.43059055432907
3:      II 0.621661527125061 -1.29864843731135

[[4]]
   dataset          variable1          variable2
1:      CC  0.521363736653211  0.512012278191707
2:      FF -0.946818003900703  -0.73084715717486
3:      GG  0.891759430683143 -0.973274707214832
4:      II  0.586691851645424 -0.216393669254661

[[5]]
   dataset          variable1          variable2
1:      CC    2.3988956446685  0.993219087408849
2:      EE  0.545675659181279 -0.185124394415505
3:      FF -0.187478493738627 -0.643696750490574
4:      GG -0.335332679807122   -0.2908242586079
5:      JJ  -1.91097794113304  0.886747918349373
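
For comparison, here is a minimal base R sketch of the same idea, in case you would rather not depend on data.table. It assumes the same column names as the example; sample_step() is a hypothetical helper name, and set.seed() is added only so the result is reproducible. Note that the example has only 10 distinct datasets, so a step 11 with 11 rows from different datasets is not possible, and the sketch stops at the number of datasets available.

```r
# Rebuild the example data (numeric columns kept numeric here).
set.seed(42)  # assumption: added for reproducibility only
df <- data.frame(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  dataset   = rep(c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
                  times = c(5, 15, 15, 10, 10, 10, 10, 5, 5, 15))
)

# sample_step() is a hypothetical helper, not part of any package.
sample_step <- function(data, n_datasets) {
  # Pick n_datasets distinct dataset labels (errors if more are requested
  # than exist).
  chosen <- sample(unique(data$dataset), n_datasets)
  # For each chosen dataset, pick one of its row indices at random.
  idx <- vapply(chosen, function(d) {
    rows <- which(data$dataset == d)
    rows[sample.int(length(rows), 1)]
  }, integer(1))
  data[idx, , drop = FALSE]
}

# One list element per step; step i holds i rows from i different datasets.
n_steps <- length(unique(df$dataset))  # 10 for this example
steps <- lapply(seq_len(n_steps), function(i) sample_step(df, i))
```

Indexing through which() and sample.int() avoids the surprise that sample(x, 1) treats a single number x as the range 1:x.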

CodePudding user response：

Following your example, here is a slightly different approach:

For the first sample with one dataset:

ob1 <- strata(df[df$dataset=="AA", ], "dataset", size = 5, method = "srswr")

For the second sample with two datasets:

ob2 <- strata(df[df$dataset %in% c("AA", "BB"), ], "dataset", size = rep(5, 2), method = "srswr")

Alternatively, use sample(unique(df$dataset), 2) to select two datasets at random. To increase the number of datasets, change the 2 to the number of datasets you want, both in sample() and in the size = rep() argument. You can also change the 5 to any number up to the size of the datasets.
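
If it helps to see the two pieces together (random choice of datasets, then a fixed number of rows per dataset), the following is a rough base R sketch. It does not use strata(); it swaps in plain sample.int() with replace = TRUE as a stand-in for method = "srswr", so repeated rows come back as duplicate rows. draw_per_dataset() and its arguments are hypothetical names, and set.seed() is added only for reproducibility.

```r
# Rebuild the example data so the block runs on its own.
set.seed(123)  # assumption: added for reproducibility only
df <- data.frame(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  dataset   = rep(c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
                  times = c(5, 15, 15, 10, 10, 10, 10, 5, 5, 15))
)

# draw_per_dataset() is a hypothetical helper, not a function from sampling.
draw_per_dataset <- function(data, n_datasets, k = 5) {
  # Randomly choose which datasets take part in this sample.
  chosen <- sample(unique(data$dataset), n_datasets)
  idx <- unlist(lapply(chosen, function(d) {
    rows <- which(data$dataset == d)
    # replace = TRUE: sampling with replacement, so a row can repeat.
    rows[sample.int(length(rows), k, replace = TRUE)]
  }))
  data[idx, , drop = FALSE]
}

# Two randomly chosen datasets, five rows drawn from each.
ob2 <- draw_per_dataset(df, n_datasets = 2, k = 5)
table(ob2$dataset)
```

Changing n_datasets plays the role of the 2 above, and k plays the role of the 5.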