Is there a way to vectorise this weighted, sampled ranking operation?-CodePudding

Apologies if this is a little convoluted. I'm running an agent based simulation and would like to 'promote' n individuals at each timestep. I have a logistic model which, for each individual, gives me a predicted probability of their being promoted. I want to randomly select n individuals, weighted by their promotion probabilities, for promotion.

At present, I run this code:

test_frame <- data.frame(
  id = seq(1,10),
  promote_prob = sample(c(0.0000001, 0.5), 10, TRUE)
)

id_list <- data.frame(n = sample(test_frame$id, 
                                 nrow(test_frame), 
                                 prob = test_frame$promote_prob),
                      rank = seq(1, nrow(test_frame)))

test_frame %>%
  left_join(id_list, by = c("id" = "n")) %>%
  mutate(promote_flag = ifelse(rank < 3, 1, 0))

ID_list produces a random, weighted ranking of all rows in the table, based on their promotion probability. But the join operation makes this process very slow - it's the slowest step in the simulation by far. Is there a way to vectorise this series of steps? My experiments with this have not come to much - e.g.:

test_frame %>%
  mutate(n = sample(seq(1:nrow(test_frame)), nrow(test_frame), FALSE, promote_prob)) %>%
  mutate(promote = ifelse(n < 3, 1, 0))

CodePudding user response：

This should work:

set.seed(1)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
test_frame <- data.frame(
  id = seq(1,10),
  promote_prob = sample(c(0.0000001, 0.5), 10, TRUE)
)

test_frame %>%
  mutate(promote = ifelse(id %in% sample(id, 2, replace=FALSE, promote_prob), 1,0))
#>    id promote_prob promote
#> 1   1        1e-07       0
#> 2   2        5e-01       1
#> 3   3        1e-07       0
#> 4   4        1e-07       0
#> 5   5        5e-01       0
#> 6   6        1e-07       0
#> 7   7        1e-07       0
#> 8   8        1e-07       0
#> 9   9        5e-01       1
#> 10 10        5e-01       0

^{Created on 2022-04-26 by the reprex package (v2.0.1)}

over 5000 iterations of this, observations 2, 5, 9 and 10 are chosen with approximately equal probability and the others are chosen not at all. The important bit is the 2 in sample(id, 2, ...) which identifies the number of observations to be promoted.