Sampling randomly and non-randomly in one sample-CodePudding

Is there a way to sample X number of random rows and X non-random rows in a single sample? For example, I want to get 1,000 samples of 4 rows of iris. I want to randomly sample 3 rows of iris and the fourth row will be the same one in each sample (this is to mimic a hybrid sampling design).

I can sample 3 random rows 1000x and the fixed row 1000x and then merge the two data frames together, but for a few reasons this is not an ideal situation. The code to do that looks something like the following:

df<- iris

fixed_sample<- iris[7,]

random<- list()
fixed<- list()

counter<- 0
for (i in 1:1000) {
  # sample 4 randomly selected transects 100 time
  tempsample_random<- df[sample(1:nrow(df), 3, replace=F),]
  tempsample_fixed<- fixed_sample[sample(1:nrow(fixed_sample), 1, replace=F), ]
  
  random[[i]]=tempsample_random
  fixed[[i]]=tempsample_fixed
  
  
  counter<- counter 1
  print(counter)
}


random_results<- do.call(rbind, random)
fixed_results<- do.call(rbind, fixed)

From here I would make a new column as a grouping variable and then merge them together based on that group. So every four rows of the final data frame has 3 random rows and row number 7 (fixed_sample) in each sample.

I've looked into using splitstackshape::stratified, but haven't gotten it to work the way I need it to. I'll be doing this over several levels of sampling effort (sample 2, 3, 4, 5 rows, etc. 1,000x each) so it would be ideal to be able to pull the fixed and random rows in the same sample from the beginning.

Any help would be greatly appreciated.

CodePudding user response：

I think you can do this in a single line using lapply. In this case we will draw 3 samples, but you can change seq(3) to seq(1000) to get your 1000 samples. I have followed your example and selected row 7 as the fixed row.

lapply(seq(3), function(i) iris[c(sample(seq(nrow(iris))[-7], 3), 7),])
#> [[1]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 67           5.6         3.0          4.5         1.5 versicolor
#> 105          6.5         3.0          5.8         2.2  virginica
#> 111          6.5         3.2          5.1         2.0  virginica
#> 7            4.6         3.4          1.4         0.3     setosa
#> 
#> [[2]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 147          6.3         2.5          5.0         1.9 virginica
#> 131          7.4         2.8          6.1         1.9 virginica
#> 126          7.2         3.2          6.0         1.8 virginica
#> 7            4.6         3.4          1.4         0.3    setosa
#> 
#> [[3]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 143          5.8         2.7          5.1         1.9  virginica
#> 145          6.7         3.3          5.7         2.5  virginica
#> 60           5.2         2.7          3.9         1.4 versicolor
#> 7            4.6         3.4          1.4         0.3     setosa

^{Created on 2022-05-18 by the reprex package (v2.0.1)}

CodePudding user response：

Here's a method:

fixed_row <- 7
set.seed(42)
random <- replicate(1000, df[c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3)),], simplify = FALSE)
random[1:3]
# [[1]]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 7           4.6         3.4          1.4         0.3     setosa
# 50          5.0         3.3          1.4         0.2     setosa
# 66          6.7         3.1          4.4         1.4 versicolor
# 75          6.4         2.9          4.3         1.3 versicolor
# [[2]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 123          7.7         2.8          6.7         2.0 virginica
# 50           5.0         3.3          1.4         0.2    setosa
# [[3]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 129          6.4         2.8          5.6         2.1 virginica
# 48           4.6         3.2          1.4         0.2    setosa
# 25           4.8         3.4          1.9         0.2    setosa

The intent is that we sample all rows except the fixed row that you intend to include in all samples, then prepend it to the list of row indices. Using the premise of setdiff(.., fixed_row) allows you to use arbitrary sets here, so it would be feasible for fixed_row to have zero or more row indices with the desired end result.

set.seed(42)
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]  7 50 66 75
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 147 123  50
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 129  48  25

(Note that the use of set.seed is just for reproducibility here on StackOverflow, you should likely not use that in production.)

CodePudding user response：


df <- iris

fixed_row = 2
resample_count = 1000

keep_rows <- unlist(
  Map(1:resample_count,
      f = function(x) c(fixed_row, sample(1:nrow(df),3))
      )
)

resamples <- iris[keep_rows,]