Subsampling from a set so that each member is picked at least once in R

I need code or an idea for the following case: we have a dataset of 1000 rows. I want to draw subsamples of 800 rows multiple times (I don't know how many times I should repeat). How can I ensure that every row is picked in at least one run? I need the code in R.

To make the question clearer, let's define the row names as:

rownames(dataset) = A,B,C,D,E,F,G,H,J,I

If I subsample 3 times:

A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J

Here, I is not in any of the subsample sets. I would like to subsample 80 or 90 percent of the data many times, but I expect every row to be chosen in at least one of the subsample sets. In the example above, the element I should be picked in at least one of the subsamples.

CodePudding user response:

One way to do this is to designate a set of "forced" random picks: give each row a single guaranteed appearance, and decide ahead of time which subsample that guaranteed appearance will be in. Then fill the rest of each subsample by random sampling without replacement from the remaining rows.

num_rows = 1000
num_subsamples = 1000
subsample_size = 900

full_index = 1:num_rows

dat = data.frame(i = full_index)

# Randomly assign guaranteed subsamples
# Make sure that we don't accidentally assign more than the subsample size
# If we're subsampling 90% of the data, it'll take at most a few tries
biggest_guaranteed_subsample = num_rows
while (biggest_guaranteed_subsample > subsample_size) {
  # Assign the subsample that each row is guaranteed to appear in
  # (one draw per row, so size must be num_rows, not num_subsamples)
  dat$guarantee = sample(1:num_subsamples, size = num_rows, replace = TRUE)
  # Find the subsample with the most guaranteed slots taken
  biggest_guaranteed_subsample = max(table(dat$guarantee))
}


# Assign subsamples
for (ss in 1:num_subsamples) {
  # Pick out any rows guaranteed a slot in that subsample
  my_sub = dat[dat$guarantee == ss, 'i']
  # And randomly select the rest
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)], 
                            subsample_size - length(my_sub), 
                            replace = FALSE))
  # Do your subsample calculation here
}
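
To check that the approach above really covers every row, it can help to collect the subsamples and test them. The following is only a minimal sketch under some assumptions: the function name make_covering_subsamples and the small 10-row example are hypothetical and not part of the original answer, and the retry loop can become slow when subsample_size is close to num_rows / num_subsamples.

```r
# Hypothetical helper (not from the original answer): builds all subsamples
# up front and returns them as a list of integer index vectors.
make_covering_subsamples <- function(num_rows, num_subsamples, subsample_size) {
  stopifnot(subsample_size <= num_rows,
            num_subsamples * subsample_size >= num_rows)
  full_index <- seq_len(num_rows)

  # Redraw the guaranteed assignment until no subsample is over-full
  repeat {
    guarantee <- sample.int(num_subsamples, size = num_rows, replace = TRUE)
    if (max(tabulate(guarantee, nbins = num_subsamples)) <= subsample_size) break
  }

  lapply(seq_len(num_subsamples), function(ss) {
    # Rows that were promised a slot in this subsample
    forced <- full_index[guarantee == ss]
    rest <- setdiff(full_index, forced)
    # Sample positions rather than values to avoid sample()'s
    # special handling of a length-one numeric vector
    extra <- rest[sample.int(length(rest), subsample_size - length(forced))]
    c(forced, extra)
  })
}

set.seed(1)
subs <- make_covering_subsamples(num_rows = 10, num_subsamples = 3,
                                 subsample_size = 8)

# Every row index appears in at least one subsample (TRUE by construction)
all(seq_len(10) %in% unlist(subs))
# No subsample contains a duplicate row (TRUE by construction)
all(vapply(subs, anyDuplicated, integer(1)) == 0L)
```

With the 10-row example from the question, this corresponds to drawing 3 subsamples of 8 rows in which row I can no longer be left out of all of them. Use dataset[subs[[k]], ] to pull the rows for the k-th subsample.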