Find a string having highest partial match with other strings in a list-CodePudding

I have a list A with strings:

['assembly eye tow top', 'tow eye bolts', 'tow eye bolts need me']

I am trying to find a string strA that has the highest partial match score with all the strings in list A.

In other words, create a string that contains common tokens AND tokens that is present in most of the strings!

i.e. strA = 'tow eye bolts'

I tried the following:

Common substring (would not work since string needs to be contigous and has to maintain order)
Common subsequence (would not work since string needs to maintain order)
Find collocations (I do not know how to implement this for the desired output)
Use Python's fuzzywuzzy (I tried this but this only finds similarity scores between two strings)

CodePudding user response：

You could do something like this:

import collections

A = ['assembly eye tow top', 'tow eye bolts', 'tow eye bolts need me']

def highest_partial_match(A):
    d = collections.defaultdict(int)
    for string in A:
        for tok in set(string.split()): # count the numbe of words each token is in
            d[tok]  = 1

    extra_token, extra_token_num = '', 0
    s = []
    for tok, num in d.items():
        if num == len(A): # in all the words
            s.append(tok)
        elif num > extra_token_num and num > len(A) // 2: # get the best token for the extra one
            extra_token = tok
            extra_token_num = num

    if extra_token != '':
        s.append(extra_token)

    return ' '.join(s)

print(highest_partial_match(A))

Returns "tow eye bolts" by getting all the tokens that are in all the words and then picking the best extra token from the rest.

CodePudding user response：

i am not quite sure if this solution fully meets your reqirements.

it puts all words in a dictionary and counting the number of occurrences (value) . after this the result of the previous step will be sorted by the value. finally the most common occurrences are used to create the required string.

from math import ceil
words_rank = {}
inp = ['assembly eye tow top', 'tow eye bolts', 'tow eye bolts need me']
for phrase in inp:
    for word in phrase.split():
        words_rank[word] = words_rank.get(word, 0)   1

res = sorted(words_rank.items(), key = lambda x: x[1], reverse=True)
most_common = ceil(len(inp))
print(' '.join([word[0] for word in res if word[1] >= most_common]))

result: 'eye tow bolts'