Find combination of string sentences - Combinations of frequency tables to target frequency table-CodePudding

I have a list of sentences, for example a list of 1000 sentences.

I would like to find a combination of sentences to match/'match closest' a certain frequency table:

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

I thought about finding all possible combinations from the sentences list by using combinations like in here (so comb(1000, 1); to comb(100, 1000); ) and then compare every combination with the frequency table, so that distance is minimum. So sum all frequency tables from a possible combination and compare this sum with the target, the combination with the smallest difference with the target should be recorded. There could be multiple combinations that match closest.

The problem is that the calculation of all combinations takes way too long to complete, apparently couple of days. Is there a known algorithm that could solve this efficiently? Ideally couple of minutes maximum?

Input sentences:

More RVs were seen in the storage lot than at the campground.

She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days.

The swirled lollipop had issues with the pop rock candy.

The two walked down the slot canyon oblivious to the sound of thunder in the distance.

Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.

He is no James Bond; his name is Roger Moore.

The tumbleweed refused to tumble but was more than willing to prance.

She was disgusted he couldn’t tell the difference between lemonade and > limeade.

He didn’t want to go to the dentist, yet he went anyway.

Find combination of sentences that match the following frequency table closest:

[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]

Example:

Frequency table of sixth sentence

He is no James Bond; his name is Roger Moore.

is [a:2, e:5, g:1, h:1, i:3, j:1, m:3, n:3, o:5, r:3, s:4]

Frequency table takes upper and lower equal and excludes special characters.

CodePudding user response：

This can be reduced to the subsequence sum with least absolute difference with a target problem.

The problem is as follows: You have an array A with integer values, say [1,5,3,2,6], and an integer value T, the target. You want to find the subsequence A' of elements from A such that abs(target - sum(A')) is minimized.

In your case, the individual integer values of A are 2 dimensional where they contain each sentence's frequency table for its characters and the target is also 2 dimensional as it contains counts of characters. You want to minimize the sum of the absolute difference.

This is clearly a dynamic programming problem. Without optimization the time complexity would be exponential where we need to check 2^n possibilities (for each element we have 2 possibilities: we either take it or leave it). I think that's what you referred to in your question by creating all combinations.

But with optimization we can achieve n * T where n is the number of elements in A and T is the value of target. This is of course if we only wanted the closest number itself, not the elements that sum to that number.

To get the elements of the subsequence itself that leads to the optimal solution you have 2 options:

Backtracking, which has the exponential time complexity explained earlier.
DP with path reconstruction where the time complexity remains manageable as explained above.

These problems and algorithms are well known and I don't think they need explaining.

How your specific problem maps to this problem, as far as I understand, is also evident. There are of course some complexities in how you want to implement it. But if the relation between your problem and the subsequence sum problem as described above is not clear, please let me know so I can explain further.

Here are a few links I found that may help you to solve this problem. Please note that they are not a straight forward answer as this problem is relatively complex.

Closest Subsequence Sum Problem on LeetCode. This handles the case where you are only looking for the closest sum, not the path that lead to that sum. The discussion page is full of different ideas with detailed explanations (sort by most votes).
DP and Path Reconstruction: This is part of a series about DP.
Primer on DP
Reconstructing the Path of the Optimal Solution

CodePudding user response：

A greedy algorithm

Your first idea to test all the possible combinations of sentences is too slow. If you have n sentences, then there are 2**n (2 to the power of n) possible combinations of sentences. For instance with n=1000, there are 2**1000 ≈ 10**300 possible combinations. That's a 1 followed by 300 zeroes: more than the number of particles in the universe, and more than the number of different possible games of chess!

Here is a suggestion for a greedy algorithm. It's not particularly optimised, and its running time is O(k * n**2), where n is the number of sentences and k is the length of the longest sentence.

The idea is the following:

Attribute to each sentence the score number of useful characters - number of superfluous characters. For instance, if a sentence contains 20 'a' and the target requires only 15 'a', we're going to count 15 useful 'a' and 5 superfluous 'a', so character 'a' contributes 10 to the score of that sentence.
Add the sentence with the highest score to the result;
Update the target to remove the characters that are already in the result;
Update the score of every sentence to reflect the updated target.
Loop until no sentence has a positive score.

I was too lazy to implement it in C , so here it is in python, using a max-heap and a Counter. After the code I wrote a quick explanation to help you translate it into C .

from collections import Counter
import heapq

sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']

target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})

print(target)

counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences]  # remove punctuation, spaces, uncapitalize, then count frequencies

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))

heapq.heapify(candidates)    # order candidates by score
                             # python's heapq only handles min-heap
                             # but we need a max-heap
                             # so I added a minus sign in front of every score

selection = []
while candidates and candidates[0][0] < 0:  # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates)  # greedily selecting best candidate
    selection.append(sentence)
    target = target - count                             # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates]  # update scores of remaining candidates
    heapq.heapify(candidates)                       # reorder candidates according to new scores

# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']

# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})

# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})

# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})

Explanations:

Python's Counter( ) transforms a sentence into a map character -> frequency;
For two Counters a and b, a & b is multiset-intersection, and a - b is multiset-difference;
For a Counter a, sum(a.values()) is the total count (the sum of all frequencies);
heapq.heapify transforms a list into a min-heap, which is a data structure that allows easy access to the element with minimum score. We actually want the sentence with maximum score, not minimum, so I replaced all the scores with negative numbers.

Non-optimality of the greedy algorithm

I should mention that this greedy algorithm is an approximation algorithm. At every iteration, it chooses the sentence with the highest score; but there is no guarantee that the optimal solution actually contains that sentence.

It is easy to build an example where the greedy algorithm fails to find the optimal solution:

target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]

With this target, the scores are as follow:

[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]

The two "half-alphabets" have a score of 13 each, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." has a score of 17 = 26 - 9, because it contains the 26 letters of the alphabets, plus 9 excess letters (for instance, there are 3 excess 'o' and 2 excess 'e').

The optimal solution, obviously, is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will select the "quick brown fox" sentence first, because it has a higher score.