How to find the maximum probability of satisfying the conditions in all combinations of arrays-CodePudding

for example, I got a list of tokens and each token's number of characters(length) is

length = [2, 1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 2]

and here is the list of each token's probability of [not insert a linefeed, insert a linefeed] after the token

prob =  [[9.9978e-01, 2.2339e-04],  [9.9995e-01, 4.9344e-05],  [0.9469, 0.0531],  
         [9.9994e-01, 5.8422e-05],  [0.9964, 0.0036],  [9.9991e-01, 9.4295e-05],  
         [9.9980e-01, 1.9620e-04],  [1.0000e 00, 5.2492e-08],  [9.9998e-01, 1.8293e-05],  
         [9.9999e-01, 5.1220e-06],  [1.0000e 00, 3.9795e-06],  [0.0142, 0.9858]]

and the result for the probabilies is

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

which means inserting a linefeed after the last token. The whole length of this line is 21, and I would like to have a maximum of 20 characters per line. In that case, I have to insert one (in this example, maybe more in other situations) more linefeed to make sure every line has 20 characters at most. In this example, the best answer is

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

since the 3rd token gets the highest probability of inserting a linefeed.

My thought is to make all combinations of these probabilities.(Multiply them instead of adding) I got 12 tokens in this example, each token gets its 0-1 Classification Probability, so there are 2^12 kinds of combination. And I use the binary sequence to record every situation (since it's a 0-1 Classification Problem)and store them in a dictionary in format of [binary sequence, the combination of probabilities].

    for i in range(nums):
        num *= 2
    numx = bin(num)
    for i in range(num):
        numx = bin(numx - 1)
        str1 = numx.encode('ascii').decode('ascii')
        str1 = str1.lstrip('0b')
        probb = 1
        for k in range(len(str1)):
            x = str1[k]
            if int(x) == 0:          # [0, 1]
                probb *= prob_task2[k][0]
            else:
                probb *= prob_task2[k][1]
        dic[str1] = probb

Then I want to sort all kinds of combination, and search the possible result from high to low. I make two loops for making all combinations. And another two loops for searching the combinations from top to low in order to meet the restriction of characters. But I got some troubles with the efficiency, since once there are 40 tokens, I have to count 2^40 kinds of combinations. I am not good at algorithms, so I do want to ask is there an efficient way to solve the problem.

CodePudding user response：

To rephrase, you have a list of tokens of given lengths, each with an independent probability of being followed by a line break, and you want to find the maximum likelihood outcome whose longest line doesn’t exceed the given max.

There is an efficient dynamic program (O(n L) where n is the number of tokens and L is the line length). The idea is that we can prevent the search tree from blowing up exponentially by pruning the less likely possibilities that have the same current line length. In Python:

import collections
import math

length = [2, 1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 2]
prob = [
    [9.9978e-1, 2.2339e-4],
    [9.9995e-1, 4.9344e-5],
    [0.9469, 0.0531],
    [9.9994e-1, 5.8422e-5],
    [0.9964, 0.0036],
    [9.9991e-1, 9.4295e-5],
    [9.998e-1, 1.962e-4],
    [1.0e0, 5.2492e-8],
    [9.9998e-1, 1.8293e-5],
    [9.9999e-1, 5.122e-6],
    [1.0e0, 3.9795e-6],
    [0.0142, 0.9858],
]
max_line_length = 20

line_length_to_best = {length[0]: (0, [])}
for i, (p_no_break, p_break) in enumerate(prob[:-1]):
    line_length_to_options = collections.defaultdict(list)
    for line_length, (likelihood, breaks) in line_length_to_best.items():
        length_without_break = line_length   length[i   1]
        if length_without_break <= max_line_length:
            line_length_to_options[length_without_break].append(
                (likelihood   math.log2(p_no_break), breaks   [0])
            )
        line_length_to_options[length[i   1]].append(
            (likelihood   math.log2(p_break), breaks   [1])
        )
    line_length_to_best = {
        line_length: max(options)
        for (line_length, options) in line_length_to_options.items()
    }
_, breaks = max(line_length_to_best.values())
print(breaks   [1])