Apologies if this is a little convoluted. I'm running an agent-based simulation and would like to 'promote' n individuals at each timestep. I have a logistic model which gives me, for each individual, a predicted probability of being promoted. I want to randomly select n individuals for promotion, weighted by those promotion probabilities.
At present, I run this code:
library(dplyr)
test_frame <- data.frame(
  id = seq(1, 10),
  promote_prob = sample(c(0.0000001, 0.5), 10, TRUE)
)
# Weighted random permutation of the ids; the position in it is the rank
id_list <- data.frame(
  n = sample(test_frame$id, nrow(test_frame), prob = test_frame$promote_prob),
  rank = seq(1, nrow(test_frame))
)
# Attach each id's rank and flag the first two for promotion
test_frame %>%
  left_join(id_list, by = c("id" = "n")) %>%
  mutate(promote_flag = ifelse(rank < 3, 1, 0))
id_list holds a random, weighted ranking of all rows in the table, based on their promotion probabilities. But the join operation makes this process very slow - it's the slowest step in the simulation by far. Is there a way to vectorise this series of steps? My experiments with this have not come to much - e.g.:
test_frame %>%
  mutate(n = sample(seq(1:nrow(test_frame)), nrow(test_frame), FALSE, promote_prob)) %>%
  mutate(promote = ifelse(n < 3, 1, 0))
Answer:
This should work:
set.seed(1)
library(dplyr)
test_frame <- data.frame(
  id = seq(1, 10),
  promote_prob = sample(c(0.0000001, 0.5), 10, TRUE)
)
# Draw 2 ids without replacement, weighted by promote_prob, and flag them
test_frame %>%
  mutate(promote = ifelse(id %in% sample(id, 2, replace = FALSE, prob = promote_prob), 1, 0))
#>    id promote_prob promote
#> 1   1        1e-07       0
#> 2   2        5e-01       1
#> 3   3        1e-07       0
#> 4   4        1e-07       0
#> 5   5        5e-01       0
#> 6   6        1e-07       0
#> 7   7        1e-07       0
#> 8   8        1e-07       0
#> 9   9        5e-01       1
#> 10 10        5e-01       0
Created on 2022-04-26 by the reprex package (v2.0.1)
Over 5000 iterations of this, observations 2, 5, 9 and 10 are chosen with approximately equal probability, and the others are essentially never chosen. The important bit is the 2 in sample(id, 2, ...), which sets the number of observations to be promoted.
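If you want to check the frequencies yourself, something along these lines should do it. This is only a rough sketch: promote_n and promote_share are names made up for illustration, and it draws row positions with sample.int() rather than calling sample() on the ids, which is a small variation on the code above that avoids sample()'s special handling of a length-one numeric vector.

```r
# Hypothetical helper (not part of the code above): flag n rows for promotion,
# drawn without replacement and weighted by prob.
promote_n <- function(prob, n) {
  flags <- integer(length(prob))
  # sample.int() draws row positions, so it does not depend on the id values
  flags[sample.int(length(prob), n, replace = FALSE, prob = prob)] <- 1L
  flags
}

set.seed(1)
test_frame <- data.frame(
  id = seq(1, 10),
  promote_prob = sample(c(0.0000001, 0.5), 10, TRUE)
)

# Repeat the draw 5000 times; each column of draws is one vector of 0/1 flags
draws <- replicate(5000, promote_n(test_frame$promote_prob, 2))

# Share of iterations in which each row was promoted
test_frame$promote_share <- rowMeans(draws)
print(test_frame)
```

Each column of draws sums to 2, so the shares add up to 2 across rows. In the simulation itself you would call promote_n once per timestep with whatever n applies.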