I have a list in which each row is a different record of one of several species (a species may repeat across the list). Each record belongs to a given dataset (no species is repeated within the same dataset).
I need to randomly sample records (rows), but I want the number of samples to change with the "step" number.
In the reproducible example (below), I would like:
step 1: 1 random sample (row),
step 2: 2 random samples (rows) from different datasets
...
step 11: 11 random samples (rows) from different datasets.
#Example:
x1 <- matrix(rnorm(200), nrow= 100, ncol=2)
x2 <- c(replicate(5, "AA"),replicate(15, "BB"),replicate(15, "CC"),
replicate(10, "DD"),replicate(10, "EE"),replicate(10, "FF"),
replicate(10, "GG"),replicate(5, "HH"),replicate(5, "II"),
replicate(15, "JJ"))
df <- data.frame(cbind(x1,x2))
colnames(df) <- c("variable1", "variable2","dataset")
The only thing I have tried is below, but it is still not what I want, because it samples only according to the dataset:
install.packages("sampling")
library(sampling)
ob <- strata(df, "dataset", size = c(1:100), method = "srswr")
Any thoughts, please?
CodePudding user response:
If I understand you correctly, I think you want something like this (note, it ensures that at step n, there are n rows selected from n different datasets -- if that is not what you want, I can adjust):
library(data.table)
setDT(df)
lapply(1:5, \(i) {
ds = sample(unique(df$dataset),i)
df[dataset %chin% ds, .SD[sample(.N,1)], dataset]
})
Output:
[[1]]
dataset variable1 variable2
1: GG 0.891759430683143 -0.973274707214832
[[2]]
dataset variable1 variable2
1: FF -0.187478493738627 -0.643696750490574
2: GG 0.776141815765327 -0.825979276855279
[[3]]
dataset variable1 variable2
1: BB 0.251972001607678 1.19219655379958
2: CC 1.48277044726544 1.43059055432907
3: II 0.621661527125061 -1.29864843731135
[[4]]
dataset variable1 variable2
1: CC 0.521363736653211 0.512012278191707
2: FF -0.946818003900703 -0.73084715717486
3: GG 0.891759430683143 -0.973274707214832
4: II 0.586691851645424 -0.216393669254661
[[5]]
dataset variable1 variable2
1: CC 2.3988956446685 0.993219087408849
2: EE 0.545675659181279 -0.185124394415505
3: FF -0.187478493738627 -0.643696750490574
4: GG -0.335332679807122 -0.2908242586079
5: JJ -1.91097794113304 0.886747918349373
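For readers without data.table, the same idea can be sketched in base R. This is only an illustration under stated assumptions: `sample_step()` is a hypothetical helper name, the seed is fixed purely for repeatability, and the example data is rebuilt with numeric columns instead of going through `cbind()`. Note that the example contains only 10 distinct datasets (AA to JJ), so a step 11 with rows from 11 different datasets is not possible here; the loop below stops at the number of datasets available.

```r
# Rebuild the example data from the question, keeping the numeric columns numeric.
set.seed(1)  # assumption: fixed seed only so the sketch is repeatable
df <- data.frame(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  dataset   = rep(c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
                  times = c(5, 15, 15, 10, 10, 10, 10, 5, 5, 15))
)

# sample_step() is a hypothetical helper: choose n distinct datasets,
# then draw exactly one random row from each of them.
sample_step <- function(data, n) {
  ds <- sample(unique(data$dataset), n)
  rows <- vapply(ds, function(d) {
    idx <- which(data$dataset == d)      # row positions belonging to dataset d
    idx[sample.int(length(idx), 1)]      # one of those positions at random
  }, integer(1))
  data[rows, , drop = FALSE]
}

# Step i returns i rows from i different datasets, up to the number of datasets.
n_datasets <- length(unique(df$dataset))
samples <- lapply(seq_len(n_datasets), function(i) sample_step(df, i))
```

Indexing through `sample.int(length(idx), 1)` rather than `sample(idx, 1)` avoids the surprise where `sample()` treats a single number as the range `1:number`.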
CodePudding user response:
Following your example, here is a slightly different approach:
For the first sample with one dataset:
ob1 <- strata(df[df$dataset=="AA", ], "dataset", size = 5, method = "srswr")
For the second sample with two datasets:
ob2 <- strata(df[df$dataset %in% c("AA", "BB"), ], "dataset", size = rep(5, 2), method = "srswr")
Or use sample(unique(df$dataset), 2) to randomly select two datasets. To increase the number of datasets, just change the number 2 to the number of datasets you want in both sample() and the size=rep() argument. You can change the 5 to any number up to the size of the datasets.
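The generalisation described above (n random datasets, k rows from each) can be sketched as follows. To keep the example dependency-free, it uses base R sampling with replacement as a stand-in for strata(..., method = "srswr") rather than the sampling package itself; `sample_k_per_dataset()` is a hypothetical helper name, and the seed is an assumption added only for repeatability.

```r
# Rebuild the example data from the question.
set.seed(42)  # assumption: fixed seed only so the sketch is repeatable
df <- data.frame(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  dataset   = rep(c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
                  times = c(5, 15, 15, 10, 10, 10, 10, 5, 5, 15))
)

# Hypothetical helper: pick n datasets at random, then draw k rows
# with replacement from each (a base R stand-in for method = "srswr").
sample_k_per_dataset <- function(data, n, k) {
  ds <- sample(unique(data$dataset), n)
  picked <- lapply(ds, function(d) {
    idx <- which(data$dataset == d)
    data[idx[sample.int(length(idx), k, replace = TRUE)], , drop = FALSE]
  })
  do.call(rbind, picked)
}

# Two random datasets, five rows from each, as in the ob2 example.
ob2 <- sample_k_per_dataset(df, n = 2, k = 5)
table(ob2$dataset)
```

Changing `n` and `k` here plays the same role as changing the 2 and the 5 in the strata() calls.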