Is it possible to sample multiple times by using sample function in r?-CodePudding

I tried to sample 25 samples by using lapply,

a = list(c(1:5),c(100:105),c(110:115),c(57:62),c(27:32))

lapply(a,function(x)sample(x,5))

is it possible to use base::sample to do the vectorized sampling?

i.e.

sample(c(5,5),a)

CodePudding user response：

No. There's no option to stratify the sampling vector with sample(). lapply() is the way to go.

CodePudding user response：

It is not possible using base::sample; however, this kind of vectorized sampling is possible by using runif.

I don't have a good way to vectorize sampling without replacement for an arbitrary number of samples from each vector in x. But we can sample each element of each vector.

Here's a function that vectorizes sampling over a list of vectors. It will return a single vector of samples:

multisample <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    unlist(x)[rep.int(lengths(x), n)*runif(sum(n))   1   rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)]
  } else {
    unlist(x)[rank(runif(sum(lengths(x)))   rep.int(seq_along(x), lengths(x)))]
  }
}

The equivalent function using lapply:

multisample2 <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    unlist(lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1)))
  } else {
    unlist(lapply(x, sample))
  }
}

Example usage:

x <- list(c(1:9), c(11:18), c(21:27), c(31:36), c(41:45))

# sampling without replacement
multisample(x)
#>  [1]  9  3  5  8  7  2  1  4  6 18 11 17 12 16 14 13 15 22 26 25 21 27 24 23 36
#> [26] 31 35 34 33 32 45 43 42 44 41
multisample2(x)
#>  [1]  3  6  7  9  2  1  8  4  5 17 16 11 15 14 13 12 18 23 22 26 21 27 24 25 33
#> [26] 32 35 34 31 36 42 43 41 44 45

# sampling with replacement
n <- 7:3 # the number of samples from each vector
multisample(x, n, 1)
#>  [1]  9  8  5  9  3  5  3 12 18 12 17 12 16 26 26 24 26 27 33 33 35 32 44 44 43
multisample2(x, n, 1)
#>  [1]  9  8  3  7  8  7  8 15 14 15 16 18 14 27 27 21 27 27 33 36 33 34 45 44 41

The vectorized version is considerably faster:

x <- lapply(sample(10:15, 1e4, 1), seq)
n <- sample(10, 1e4, 1)

microbenchmark::microbenchmark(multisample = multisample(x),
                               multisample2 = multisample2(x))
#> Unit: milliseconds
#>          expr     min        lq      mean    median        uq     max neval
#>   multisample  7.4963  7.993501  8.629845  8.273701  8.732952 13.2050   100
#>  multisample2 36.4702 40.518801 41.929437 41.701352 43.040650 63.4695   100
microbenchmark::microbenchmark(multisample = multisample(x, n, 1),
                               multisample2 = multisample2(x, n, 1))
#> Unit: milliseconds
#>          expr       min       lq      mean  median        uq       max neval
#>   multisample  2.326502  2.39170  2.842023  2.7672  3.183101  4.161801   100
#>  multisample2 33.700001 37.61035 39.468619 39.1137 40.055901 72.030602   100

If a list of vectors in desired instead, the functions can be modified:

multisample <- function(x, n = lengths(x), replace = FALSE) {
  i <- rep.int(seq_along(x), n)
  if (replace) {
    split(unlist(x)[rep.int(lengths(x), n)*runif(sum(n))   1   rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)], i)
  } else {
    split(unlist(x)[rank(runif(sum(lengths(x)))   i)], i)
  }
}

multisample2 <- function(x, n = lengths(x), replace = FALSE) {
  if (replace) {
    lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1))
  } else {
    lapply(x, sample)
  }
}

The vectorized version is still much faster.