Creating Probability Trees in R

I am working with the R programming language.

Suppose I have the following set up:

There are 5 objects : A, B, C, D, E
The probability for each of these objects to be chosen is : 0.2, 0.3, 0.1, 0.3, 0.1
I want to pick 5 of these objects with replacement (e.g. ABACD, DDBCA, etc.)

I want to find out (exact solution):

All combinations that can be made from these 5 objects
The probability for each of these combinations

Currently, I do not know how to do this exactly. I tried simulating a large number of draws, hoping to see each combination often enough to estimate its probability:

library(dplyr)

results <- list()

for (i in 1:100) {
  # Draw 5 objects with replacement, using the given probabilities
  sample_i <- sample(c("A", "B", "C", "D", "E"), size = 5, replace = TRUE,
                     prob = c(0.2, 0.3, 0.1, 0.3, 0.1))

  # One row per drawn object, tagged with the iteration number
  results[[i]] <- data.frame(iteration = i, sample_i)
}

results_df <- do.call(rbind, results)

But this looks like a very complicated way to solve the problem. In the end, I am looking for something like this:

AAAAA : Prob = 0.03
AABDE : Prob = 0.06
DEECB : Prob = 0.07
etc.

Can someone please show me how to do this?

Thanks!

CodePudding user response：

Because each pick is made independently (sampling with replacement), the probability of a given permutation is the product of the probabilities of its selected elements.
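To make the product rule concrete for a single sequence, here is a minimal base-R sketch. The helper name seq_prob is hypothetical (it is not from any package) and only illustrates the arithmetic.

```r
# Per-object probabilities from the question
probs <- c(A = 0.2, B = 0.3, C = 0.1, D = 0.3, E = 0.1)

# Hypothetical helper: probability of one ordered sequence such as "AABDE"
seq_prob <- function(s, probs) {
  picks <- strsplit(s, "")[[1]]  # split the string into single letters
  prod(probs[picks])             # independent draws, so multiply
}

seq_prob("AAAAA", probs)  # 0.2^5
seq_prob("AABDE", probs)  # 0.2 * 0.2 * 0.3 * 0.3 * 0.1
```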

library(RcppAlgos)

# Probabilities
probs <- setNames(c(0.2, 0.3, 0.1, 0.3, 0.1), LETTERS[1:5])

# Generate permutations
perms <- permuteGeneral(names(probs), repetition = TRUE)

# Collapse permutations
perm_res <- do.call(paste, c(asplit(perms, 2), sep = ""))

# Replace with probability values and coerce to numeric
perms[] <- probs[perms]
class(perms) <- "numeric"

# Calculate products
res <- data.frame(perm_res, prob = exp(rowSums(log(perms))))
head(res)

  perm_res    prob
1    AAAAA 0.00032
2    AAAAB 0.00048
3    AAAAC 0.00016
4    AAAAD 0.00048
5    AAAAE 0.00016
6    AAABA 0.00048

# Check total sums to 1
sum(res$prob)
[1] 1
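If RcppAlgos is not available, the same table can probably be built with base R alone. The following is a sketch under that assumption, not a replacement for the code above; the names n_picks, grid, p_mat and res_base are made up for illustration, and the row order differs from permuteGeneral because expand.grid varies the first position fastest.

```r
probs   <- setNames(c(0.2, 0.3, 0.1, 0.3, 0.1), LETTERS[1:5])
n_picks <- 5

# Every ordered sequence of n_picks objects (5^5 rows)
grid <- expand.grid(rep(list(names(probs)), n_picks), stringsAsFactors = FALSE)

# Look up the probability of each pick; as.matrix() and matrix() are both
# column-major, so the lookup keeps the same row/column layout
p_mat <- matrix(probs[as.matrix(grid)], nrow = nrow(grid))

res_base <- data.frame(
  perm_res = do.call(paste0, unname(as.list(grid))),  # e.g. "AAAAB"
  prob     = apply(p_mat, 1, prod)                    # product across picks
)

nrow(res_base)                    # number of sequences
all.equal(sum(res_base$prob), 1)  # probabilities should sum to 1
```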

CodePudding user response：

edit: this solution works for the example, but quickly runs out of memory for probabilities with more significant digits.

Create vector of object labels corresponding to the given probabilities.
Use expand.grid to generate all possible length-5 combinations.
Number of unique rows in the result == number of possible combinations.
Proportion of each combination in result == probability of each combination.

objs <- c(
  rep("A", 2),
  rep("B", 3),
  "C",
  rep("D", 3),
  "E"
)
combos <- expand.grid(
  p1 = objs,
  p2 = objs,
  p3 = objs,
  p4 = objs,
  p5 = objs
)
combos <- paste0(
  combos$p1, 
  combos$p2,
  combos$p3,
  combos$p4,
  combos$p5
)
n_combos <- length(combos)
combos_unique <- unique(combos)

# number of combinations
length(combos_unique)
# 3125

# probability of each combination
setNames(
  sapply(combos_unique, \(x) sum(combos == x) / n_combos),
  combos_unique
)

#   AAAAA   BAAAA   CAAAA   DAAAA   EAAAA   ABAAA   BBAAA   CBAAA   DBAAA   EBAAA 
# 0.00032 0.00048 0.00016 0.00048 0.00016 0.00048 0.00072 0.00024 0.00072 0.00024 
#   ACAAA   BCAAA   CCAAA   DCAAA   ECAAA   ADAAA   BDAAA   CDAAA   DDAAA   EDAAA 
# 0.00016 0.00024 0.00008 0.00024 0.00008 0.00048 0.00072 0.00024 0.00072 0.00024 
...
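The proportion of each combination can also be counted in one pass with table(), instead of comparing every unique string against the full vector. This is a sketch of the same counting idea; grid and combo_probs are illustrative names, and the label vector is rebuilt here so the snippet runs on its own.

```r
# Label vector: each object repeated in proportion to its probability (tenths)
objs <- rep(c("A", "B", "C", "D", "E"), times = c(2, 3, 1, 3, 1))

# All length-5 arrangements of the 10 labels, collapsed to strings
grid   <- expand.grid(rep(list(objs), 5), stringsAsFactors = FALSE)
combos <- do.call(paste0, unname(as.list(grid)))

# table() counts each distinct string; dividing by the total gives proportions
combo_probs <- table(combos) / length(combos)

length(combo_probs)                         # number of distinct combinations
combo_probs[c("AAAAA", "BAAAA", "BBAAA")]   # a few individual probabilities
```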

The problem with this solution is that the number of rows grows rapidly, not only with more objects or a longer combo length, but also with probabilities that have more significant digits. For example, I only needed 3 "B"s (out of 10 labels) to represent a probability of 0.3, but would need 325 (out of 1,000 labels) for a probability of 0.325.
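A rough sense of that growth, assuming the label vector has one entry per unit of the smallest probability step (10 labels for tenths, 1,000 for thousandths). The helper grid_rows is hypothetical and only does the arithmetic; it does not build the grid.

```r
# Rows produced by expand.grid = (number of labels)^(number of picks)
grid_rows <- function(n_labels, n_picks = 5) n_labels^n_picks

grid_rows(10)    # probabilities in tenths: 10^5 rows
grid_rows(1000)  # probabilities in thousandths: 1000^5 rows
```

By contrast, multiplying the probabilities directly always needs only 5^5 rows for this problem, whatever the precision of the probabilities.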