Home > database >  Generate first N combinations of length L from weighted set in order of weight
Generate first N combinations of length L from weighted set in order of weight

Time:05-05

I have a set of letters with weights, which gives their probability of appearing in a string:

a - 0.7
b - 0.1
c - 0.3
...
z - 0.01

As such, the word aaaa has a probability of 0.7*0.7*0.7*0.7 = 0.24. The word aaac would have probability 0.7*0.7*0.7*0.3 = 0.10. All permutations of the same word have the same probability, so we only need to worry about combinations.

I would like to generate the first unique N strings of length L in order of probability (for example here, with 4 letters and length 4, that should be aaaa, aaac, aacc, aaab, accc, aabc, cccc, etc).

Assume that the brute force approach of generating all combinations with their probabilities and sorting by weight to not be possible here. The algorithm, should it exist, must be able to work for any set size and any length of string (e.g. all 256 bytes with weighted probabilities, 1024 length string, generate first trillion.)

CodePudding user response:

Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.

Order the symbols by decreasing probability. There’s a constructive one-to-one correspondence between length-L combinations of S symbols, and binary strings of length S L − 1 with L − 1 zeros (counting out each symbol in unary with L − 1 delimiters). We can do bit-at-a-time enumeration of the possibilities for the latter.

The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable word can be found by repeating the most probable letter still available. By storing the prefixes in a heap, we can open only the ones that appear in the top N.

Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.

symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

import heapq
import math

loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

heap = [(0, 0, "")]
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        heapq.heappush(heap, (min_loss, i, s   loss_symbol_pairs[i][1]))
        if i   1 < len(loss_symbol_pairs):
            heapq.heappush(
                heap,
                (
                    min_loss
                      (L - len(s))
                    * (loss_symbol_pairs[i   1][0] - loss_symbol_pairs[i][0]),
                    i   1,
                    s,
                ),
            )
    else:
        print(s)

Output:

aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz
  • Related