I need code, or an idea, for the following case: we have a dataset of 1000 rows, and I want to draw subsamples of 800 rows multiple times (I don't know how many times I should repeat). How can I ensure that every row is picked in at least one run? I need the code in R.
To make the question clearer, let's define the row names as:
rownames(dataset) = A,B,C,D,E,F,G,H,J,I
If I subsample 3 times:
A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J
Here the row I does not appear in any of the subsample sets. I would like to subsample 80 or 90 percent of the data many times, but I need every row to be chosen in at least one of the subsample sets. In the example above, the element I should be picked in at least one of the subsamples.
CodePudding user response:
One way to do this is to designate a set of "forced" random picks: give each row a single guaranteed appearance, and decide ahead of time which subsample that guaranteed appearance will be in. Then fill the rest of each subsample by random sampling without replacement from the remaining rows.
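A quick check of why the guarantee matters, as a rough sketch that assumes each subsample is drawn independently and uniformly without replacement: a given row is left out of one subsample with probability 1 - size / n, so it is left out of all k subsamples with that probability raised to the power k. The names n, size and k are only for illustration.

```r
n <- 1000    # rows in the dataset
size <- 800  # rows per subsample
k <- 3       # number of independent subsamples

# Probability that one particular row is absent from every subsample
p_missed <- (1 - size / n)^k

# Expected number of rows that never appear in any subsample
expected_missed <- n * p_missed
expected_missed
```

With these numbers about 8 rows are expected to be missing after 3 runs. The expectation only drops below 1 from k = 5 onward, and it never reaches exactly zero, which is why the code that follows forces one appearance per row.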
num_rows = 1000
num_subsamples = 1000
subsample_size = 900
full_index = 1:num_rows
dat = data.frame(i = full_index)
# Randomly assign guaranteed subsamples
# Make sure that we don't accidentally assign more than the subsample size
# If we're subsampling 90% of the data, it'll take at most a few tries
# (needs num_subsamples * subsample_size >= num_rows, otherwise it never ends)
biggest_guaranteed_subsample = Inf  # forces at least one pass through the loop
while (biggest_guaranteed_subsample > subsample_size) {
  # Assign the subsample that the row is guaranteed to appear in
  # (one draw per row, so this also works when num_subsamples != num_rows)
  dat$guarantee = sample.int(num_subsamples, num_rows, replace = TRUE)
  # Find the subsample with the most guaranteed slots taken
  biggest_guaranteed_subsample = max(table(dat$guarantee))
}
# Assign subsamples
for (ss in 1:num_subsamples) {
  # Pick out any rows guaranteed a slot in that subsample
  my_sub = dat[dat$guarantee == ss, 'i']
  # And randomly select the rest
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
                            subsample_size - length(my_sub),
                            replace = FALSE))
  # Do your subsample calculation here
}
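The same idea can be wrapped in a function and checked on the ten-row example from the question. This is only a sketch: `covering_subsamples` is a hypothetical helper name, and instead of redrawing the guaranteed slots until they fit, it shuffles the rows and deals them round-robin across the subsamples. That is a swapped-in variation, not the method above, and it means the guaranteed slots are spread evenly rather than assigned independently.

```r
# Hypothetical helper: returns a list of index vectors, each of length `size`,
# that together cover every index in 1:n_rows.
covering_subsamples <- function(n_rows, n_subsamples, size) {
  # Coverage is impossible if all slots combined are fewer than the rows
  stopifnot(size <= n_rows, n_subsamples * size >= n_rows)

  # Shuffle the rows and deal them round-robin, so each row gets exactly one
  # guaranteed subsample and no subsample receives more than
  # ceiling(n_rows / n_subsamples) guaranteed rows
  shuffled <- sample.int(n_rows)
  guarantee <- integer(n_rows)
  guarantee[shuffled] <- rep_len(seq_len(n_subsamples), n_rows)

  lapply(seq_len(n_subsamples), function(ss) {
    forced <- which(guarantee == ss)
    pool <- setdiff(seq_len(n_rows), forced)
    # Index into `pool` via sample.int so a length-one pool is handled safely
    extra <- pool[sample.int(length(pool), size - length(forced))]
    c(forced, extra)
  })
}

set.seed(1)  # only to make the example reproducible
rows <- c("A", "B", "C", "D", "E", "F", "G", "H", "J", "I")
subs <- covering_subsamples(length(rows), n_subsamples = 3, size = 8)

# Check that every row index appears in at least one subsample
all(seq_along(rows) %in% unlist(subs))

# Map the indices back to row names
lapply(subs, function(idx) rows[idx])
```

For the original problem the call would be `covering_subsamples(1000, n_subsamples, 800)`, and each element of the returned list can be used to index the dataset, for example `dataset[subs[[1]], ]`.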