Is there a way to sample X number of random rows and X non-random rows in a single sample?
For example, I want to get 1,000 samples of 4 rows of iris
. I want to randomly sample 3 rows of iris
and the fourth row will be the same one in each sample (this is to mimic a hybrid sampling design).
I can sample 3 random rows 1000x and the fixed row 1000x and then merge the two data frames together, but for a few reasons this is not an ideal situation. The code to do that looks something like the following:
df<- iris
fixed_sample<- iris[7,]
random<- list()
fixed<- list()
counter<- 0
for (i in 1:1000) {
# sample 4 randomly selected transects 100 time
tempsample_random<- df[sample(1:nrow(df), 3, replace=F),]
tempsample_fixed<- fixed_sample[sample(1:nrow(fixed_sample), 1, replace=F), ]
random[[i]]=tempsample_random
fixed[[i]]=tempsample_fixed
counter<- counter 1
print(counter)
}
random_results<- do.call(rbind, random)
fixed_results<- do.call(rbind, fixed)
From here I would make a new column as a grouping variable and then merge them together based on that group. So every four rows of the final data frame has 3 random rows and row number 7 (fixed_sample
) in each sample.
I've looked into using splitstackshape::stratified
, but haven't gotten it to work the way I need it to. I'll be doing this over several levels of sampling effort (sample 2, 3, 4, 5 rows, etc. 1,000x each) so it would be ideal to be able to pull the fixed and random rows in the same sample from the beginning.
Any help would be greatly appreciated.
CodePudding user response:
I think you can do this in a single line using lapply
. In this case we will draw 3 samples, but you can change seq(3)
to seq(1000)
to get your 1000 samples. I have followed your example and selected row 7 as the fixed row.
lapply(seq(3), function(i) iris[c(sample(seq(nrow(iris))[-7], 3), 7),])
#> [[1]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 67 5.6 3.0 4.5 1.5 versicolor
#> 105 6.5 3.0 5.8 2.2 virginica
#> 111 6.5 3.2 5.1 2.0 virginica
#> 7 4.6 3.4 1.4 0.3 setosa
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 147 6.3 2.5 5.0 1.9 virginica
#> 131 7.4 2.8 6.1 1.9 virginica
#> 126 7.2 3.2 6.0 1.8 virginica
#> 7 4.6 3.4 1.4 0.3 setosa
#>
#> [[3]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 143 5.8 2.7 5.1 1.9 virginica
#> 145 6.7 3.3 5.7 2.5 virginica
#> 60 5.2 2.7 3.9 1.4 versicolor
#> 7 4.6 3.4 1.4 0.3 setosa
Created on 2022-05-18 by the reprex package (v2.0.1)
CodePudding user response:
Here's a method:
fixed_row <- 7
set.seed(42)
random <- replicate(1000, df[c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3)),], simplify = FALSE)
random[1:3]
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 7 4.6 3.4 1.4 0.3 setosa
# 50 5.0 3.3 1.4 0.2 setosa
# 66 6.7 3.1 4.4 1.4 versicolor
# 75 6.4 2.9 4.3 1.3 versicolor
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 7 4.6 3.4 1.4 0.3 setosa
# 147 6.3 2.5 5.0 1.9 virginica
# 123 7.7 2.8 6.7 2.0 virginica
# 50 5.0 3.3 1.4 0.2 setosa
# [[3]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 7 4.6 3.4 1.4 0.3 setosa
# 129 6.4 2.8 5.6 2.1 virginica
# 48 4.6 3.2 1.4 0.2 setosa
# 25 4.8 3.4 1.9 0.2 setosa
The intent is that we sample all rows except the fixed row that you intend to include in all samples, then prepend it to the list of row indices. Using the premise of setdiff(.., fixed_row)
allows you to use arbitrary sets here, so it would be feasible for fixed_row
to have zero or more row indices with the desired end result.
set.seed(42)
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1] 7 50 66 75
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1] 7 147 123 50
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1] 7 129 48 25
(Note that the use of set.seed
is just for reproducibility here on StackOverflow, you should likely not use that in production.)
CodePudding user response:
df <- iris
fixed_row = 2
resample_count = 1000
keep_rows <- unlist(
Map(1:resample_count,
f = function(x) c(fixed_row, sample(1:nrow(df),3))
)
)
resamples <- iris[keep_rows,]