I tried to sample 25 samples by using lapply,
a = list(c(1:5),c(100:105),c(110:115),c(57:62),c(27:32))
lapply(a,function(x)sample(x,5))
is it possible to use base::sample to do the vectorized sampling?
i.e.
sample(c(5,5),a)
CodePudding user response:
No. There's no option to stratify the sampling vector with sample()
. lapply()
is the way to go.
CodePudding user response:
It is not possible using base::sample
; however, this kind of vectorized sampling is possible by using runif
.
I don't have a good way to vectorize sampling without replacement for an arbitrary number of samples from each vector in x
. But we can sample each element of each vector.
Here's a function that vectorizes sampling over a list of vectors. It will return a single vector of samples:
multisample <- function(x, n = lengths(x), replace = FALSE) {
if (replace) {
unlist(x)[rep.int(lengths(x), n)*runif(sum(n)) 1 rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)]
} else {
unlist(x)[rank(runif(sum(lengths(x))) rep.int(seq_along(x), lengths(x)))]
}
}
The equivalent function using lapply
:
multisample2 <- function(x, n = lengths(x), replace = FALSE) {
if (replace) {
unlist(lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1)))
} else {
unlist(lapply(x, sample))
}
}
Example usage:
x <- list(c(1:9), c(11:18), c(21:27), c(31:36), c(41:45))
# sampling without replacement
multisample(x)
#> [1] 9 3 5 8 7 2 1 4 6 18 11 17 12 16 14 13 15 22 26 25 21 27 24 23 36
#> [26] 31 35 34 33 32 45 43 42 44 41
multisample2(x)
#> [1] 3 6 7 9 2 1 8 4 5 17 16 11 15 14 13 12 18 23 22 26 21 27 24 25 33
#> [26] 32 35 34 31 36 42 43 41 44 45
# sampling with replacement
n <- 7:3 # the number of samples from each vector
multisample(x, n, 1)
#> [1] 9 8 5 9 3 5 3 12 18 12 17 12 16 26 26 24 26 27 33 33 35 32 44 44 43
multisample2(x, n, 1)
#> [1] 9 8 3 7 8 7 8 15 14 15 16 18 14 27 27 21 27 27 33 36 33 34 45 44 41
The vectorized version is considerably faster:
x <- lapply(sample(10:15, 1e4, 1), seq)
n <- sample(10, 1e4, 1)
microbenchmark::microbenchmark(multisample = multisample(x),
multisample2 = multisample2(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> multisample 7.4963 7.993501 8.629845 8.273701 8.732952 13.2050 100
#> multisample2 36.4702 40.518801 41.929437 41.701352 43.040650 63.4695 100
microbenchmark::microbenchmark(multisample = multisample(x, n, 1),
multisample2 = multisample2(x, n, 1))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> multisample 2.326502 2.39170 2.842023 2.7672 3.183101 4.161801 100
#> multisample2 33.700001 37.61035 39.468619 39.1137 40.055901 72.030602 100
If a list of vectors in desired instead, the functions can be modified:
multisample <- function(x, n = lengths(x), replace = FALSE) {
i <- rep.int(seq_along(x), n)
if (replace) {
split(unlist(x)[rep.int(lengths(x), n)*runif(sum(n)) 1 rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)], i)
} else {
split(unlist(x)[rank(runif(sum(lengths(x))) i)], i)
}
}
multisample2 <- function(x, n = lengths(x), replace = FALSE) {
if (replace) {
lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1))
} else {
lapply(x, sample)
}
}
The vectorized version is still much faster.