Any speedier way to randomly subset vectors inside a list?

I'm looking for a speedy solution for randomly subsetting vectors nested in a list.

If we simulate the following data, we get a list l holding 3 million vectors, each of length 5. I want the length of each vector to vary, so I thought I would apply a function that randomly subsets each vector. The problem is that this method is not as fast as I had hoped.

Simulate the data (the list l):

library(stringi)

set.seed(123)
vec_n <- 15e6
vec_vals  <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)

my_named_vec <- setNames(vec_vals, vec_names)

split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
}

l <- split_func(my_named_vec, n = vec_n / 5)

head(l)
#> [[1]]
#>    HmPsw    Qk8NP    Quo3T    8f0GH    nZmjN 
#>        1  3000001  6000001  9000001 12000001 
#> 
#> [[2]]
#>    2WtYS    ZaHFl    6YjId    jbGuA    tAG65 
#>        2  3000002  6000002  9000002 12000002 
#> 
#> [[3]]
#>    xSgZ6    jM5Uw    ujPOc    CTV5F    5JRT5 
#>        3  3000003  6000003  9000003 12000003 
#> 
#> [[4]]
#>    tF2Kx    r4ZCI    Ooklo    VOLHU    M6z6H 
#>        4  3000004  6000004  9000004 12000004 
#> 
#> [[5]]
#>    tgdze    w8d1B    FYERK    jlClo    NQfsF 
#>        5  3000005  6000005  9000005 12000005 
#> 
#> [[6]]
#>    hXaH9    gsY1u    CjBwC    Oqqty    dxJ4c 
#>        6  3000006  6000006  9000006 12000006

Now that we have l, I want to subset each vector randomly, meaning that the number of elements kept per vector is itself random. One option is the following utility function:

randomly_subset_vec <- function(x) {
  my_range <- 1:length(x)
  x[-sample(my_range, sample(my_range))]
}

lapply(head(l), randomly_subset_vec)
#> [[1]]
#>   Quo3T 
#> 6000001 
#> 
#> [[2]]
#>   6YjId   jbGuA 
#> 6000002 9000002 
#> 
#> [[3]]
#>   xSgZ6   jM5Uw   ujPOc   CTV5F 
#>       3 3000003 6000003 9000003 
#> 
#> [[4]]
#>   Ooklo 
#> 6000004 
#> 
#> [[5]]
#> named integer(0)
#> 
#> [[6]]
#>    CjBwC    Oqqty    dxJ4c 
#>  6000006  9000006 12000006

But running this procedure over the entire l takes forever. I've tried rrapply, a fast package for working with lists, and it takes "only" 110 seconds on my machine.

library(rrapply)
library(tictoc)

tic()
l_subsetted <- rrapply(object = l, f = randomly_subset_vec)
toc()
#> 110.23 sec elapsed

I will be happy with either of the following:

Is there a speedier alternative to:

rrapply(object = l, f = randomly_subset_vec)

Or more generally, is there a speedier way to start with my_named_vec and arrive at l_subsetted?

CodePudding user response：

Simplify the sampling function:

randomly_subset_vec_2 <- function(x) {
  my_range <- length(x)
  x[-sample(my_range, sample(my_range, 1))]
}

This alone can give a significant speed-up.
I have not tested this, but given the problem description, removing some elements (the minus sign before sample) is the same as keeping the others. Why not extract the elements to keep directly (no minus sign)?

Simpler and faster: sampling directly from x is the fastest so far.

randomly_subset_vec_3 <- function(x) {
  sample(x, sample(length(x), 1))
}
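
Two details may be worth checking before swapping functions. The versions with the minus sign remove between 1 and n elements, so they keep 0 to n-1, while sampling directly keeps 1 to n. Also, sample(x, ...) treats a single number x as the range 1:x, which matters if a vector can have length one. A minimal sketch under those assumptions, where keep_random is a hypothetical name that does not appear in the thread:

```r
# Hypothetical helper (not from the thread): keep between 1 and length(x)
# elements, indexing via sample.int() so length-one vectors behave as expected.
keep_random <- function(x) {
  n <- length(x)
  if (n == 0L) return(x)
  x[sample.int(n, sample.int(n, 1L))]
}

set.seed(42)
x <- c(a = 10L, b = 20L, c = 30L, d = 40L, e = 50L)

# Subset sizes over many draws: always between 1 and 5, never 0
sizes <- lengths(replicate(1000, keep_random(x), simplify = FALSE))
table(sizes)

# A length-one vector is returned unchanged rather than resampled from 1:7
single <- c(only = 7L)
keep_random(single)
```

If empty results are required, as in the question's original output, the size would have to be drawn from 0 to n-1 instead.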

CodePudding user response：

Very rough, and I'm not particularly proud of this. I'm sure there is a more elegant way, but this ran in a matter of seconds on my machine:

> # Make some fake data
> out <- lapply(1:3000000, function(i){sample(LETTERS, 5, replace = FALSE)})
> out[1:5]
[[1]]
[1] "D" "H" "C" "Y" "V"

[[2]]
[1] "M" "E" "H" "G" "S"

[[3]]
[1] "R" "P" "O" "L" "M"

[[4]]
[1] "C" "U" "G" "Q" "X"

[[5]]
[1] "Q" "L" "W" "O" "V"

> # Create list with ids to sample
> id <- lapply(1:3000000, function(i){sample(1:5, sample(1:5, 1), replace = FALSE)})
> id[1:5]
[[1]]
[1] 2

[[2]]
[1] 2 3 4 1 5

[[3]]
[1] 4

[[4]]
[1] 5

[[5]]
[1] 1 2

> # Extract the ids from the original data using the id list.
> # Like I said I'm not particularly proud of this but it gets the job
> # done quick enough on my computer
> out <- lapply(1:3000000, function(i){out[[i]][id[[i]]]})
> out[1:5]
[[1]]
[1] "H"

[[2]]
[1] "E" "H" "G" "M" "S"

[[3]]
[1] "L"

[[4]]
[1] "X"

[[5]]
[1] "Q" "L"

CodePudding user response：

Maybe we can replace randomly_subset_vec with something simpler using sample and sample.int (the 5 is the hard-coded vector length):

lapply(l, function(x) x[sample.int(5, sample(5, 1))])
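
A possible variation on the same idea, as a rough sketch rather than a benchmarked solution: draw all the subset sizes in one vectorised call, so that only one sample.int() call per vector remains. The names l_small, n_keep and l_sub are made up for this illustration, and l_small is just a small stand-in for l.

```r
set.seed(1)

# Small stand-in for the 3-million-element list `l` (assumption: named vectors)
l_small <- replicate(1000, setNames(1:5, letters[1:5]), simplify = FALSE)

# One vectorised call decides how many elements to keep from every vector.
# runif() does not return exactly 0 or 1, so each size is in 1..length(x).
n_keep <- ceiling(runif(length(l_small)) * lengths(l_small))

# One sample.int() call per vector picks which positions to keep
l_sub <- Map(function(x, k) x[sample.int(length(x), k)], l_small, n_keep)

head(l_sub, 3)
```

Using lengths() instead of a hard-coded 5 keeps the sketch valid when the vectors do not all have the same length.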

CodePudding user response：

It seems that the largest bottleneck is running all the sample calls, so we could try the following. One option is the solution by Julius Vainora. First, we generate funFast with Rcpp:

library(inline)
library(Rcpp)
src <- 
'
int num = as<int>(size), x = as<int>(n);
Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x);
Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob);
Rcpp::NumericVector rnd = rexp(x) / pr;
for(int i = 0; i < vx.size(); ++i) vx[i] = i;
std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd));
vx = vx[seq(0, num - 1)] + 1;
return vx;
'
incl <- 
'
struct Comp{
  Comp(const Rcpp::NumericVector& v ) : _v(v) {}
  bool operator ()(int a, int b) { return _v[a] < _v[b]; }
  const Rcpp::NumericVector& _v;
};
'
funFast <- cxxfunction(signature(n = "Numeric", size = "integer", prob = "numeric"),
                       src, plugin = "Rcpp", include = incl)

Then, define an alternative to your randomly_subset_vec using funFast instead of sample:

'randomly_subset_vec_2' <- function(x) {
  range <- length(x)
  probs <- rep(1/range, range)
  
  o <- funFast(range, size = funFast(range, size = 1, prob = probs), prob = probs)
  return(x[-o])
}

tic();obj <- rrapply(object = l, f = randomly_subset_vec_2);toc();

CodePudding user response：

Your subsets don't include the full set, so this first removes one random element from each vector, then keeps or drops each remaining element at random:

system.time({
  lenl <- lengths(l)
  # use stack to unlist the list while keeping the originating list index for each value
  temp <- stack(setNames(l, seq_along(l)))[
    # randomly remove one value from each vector
    -(ceiling(runif(length(l))*lenl) + c(0, head(cumsum(lenl), -1))),][
      # randomly keep the remaining elements
      sample(c(FALSE, TRUE), sum(lenl) - length(l), replace = TRUE),]
  # re-list
  l_subsetted <- unname(split(setNames(temp$values, rownames(temp)), temp$ind))
})

   user  system elapsed 
 25.360   0.220  25.576
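
The same flatten, filter, re-split idea can be sketched without stack(), shown here on a small list. This is an illustration, not a timed replacement: random_subset_flat is a hypothetical name, and unlike the code above it keeps between 1 and length(x) elements of each vector, so the full set can occur.

```r
# Hypothetical helper (not from the thread): subset every vector of a list
# using a handful of vectorised calls instead of one sample() per vector.
random_subset_flat <- function(l) {
  lens <- lengths(l)
  flat <- unlist(l)                          # inner names are preserved
  g    <- rep.int(seq_along(l), lens)        # originating list index per value
  o    <- order(g, runif(length(flat)))      # shuffle values within each vector
  k    <- ceiling(runif(length(l)) * lens)   # how many to keep: 1..length(x)
  keep <- sequence(lens) <= k[g]             # first k of each shuffled vector
  grp  <- factor(g[keep], levels = seq_along(l))
  unname(split(flat[o][keep], grp))          # re-list
}

set.seed(7)
l_small <- replicate(200, setNames(sample.int(100, 5), letters[1:5]),
                     simplify = FALSE)
res <- random_subset_flat(l_small)
head(res, 3)
```

Because g is already sorted, ordering by g and then by a random key only permutes values inside each vector, which is why g[keep] still labels the kept values correctly.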