Fastest way to parse combinations of words in a string in python


New to Python. I need to parse a body of text and check whether any element of a dict exists in that body. So far I've come up with this, using itertools:

from string import punctuation
from itertools import combinations

def retrieve_text_from_body(body, dict):
    a_list = []
    # strip punctuation from every word in the body
    stripped_body = [i.strip(punctuation) for i in body.split()]
    # check every combination of 1..n words against the dict
    for i in range(1, len(stripped_body) + 1):
        for combination in combinations(stripped_body, i):
            # combination is a tuple of words; dict is expected to map such tuples to values
            if combination in dict:
                a_list.append(dict[combination])
    return a_list

Example input and output:

Body = "The big fat cat sits" (Could be up to 1000 words)
Dict = ["Big Fat", "Little Mouse", "Cat Sits"] (Could be any length)

Part of the combinations that get generated:
['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']

Output: ["Big Fat", "Cat Sits"]

The code above is just very slow in my use case, since I have to run it across millions of rows in a table. Is there a faster approach?

CodePudding user response:

You are searching each combination of (any number of) 30 words in a list of 70k strings. The number of generated combinations is 2**30, or around 1 billion, so you are performing 1 billion searches on 70k items. Since most of the searches will fail, you are performing roughly 70,000 billion comparisons. Yes, it will take ages: the time complexity is O(2**w * s), i.e. exponential in the number of words. Let's make a totally arbitrary hypothesis and say it takes one day to perform this.
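Just to sanity-check the arithmetic above (the 30 words and the 2**30 come straight from the numbers quoted in the question):

from math import comb

# every non-empty combination of 30 words: C(30,1) + C(30,2) + ... + C(30,30) = 2**30 - 1
total = sum(comb(30, k) for k in range(1, 31))
print(total)        # 1073741823
print(2**30 - 1)    # 1073741823, i.e. roughly a billion candidate combinations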

A better-performing approach would be to sort your dict first and then use a binary search. There are still 1 billion combinations to check, but each search is now logarithmic in time, so we have around 16 billion comparisons (each slightly slower, let's say 1.5x), and a time complexity of O(2**w * log s). Still not what we're looking for, but instead of one day we would now spend some 30 seconds (plus the initial sort, whose overhead is irrelevant).
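A minimal sketch of that idea with bisect from the standard library; lower-casing the phrases and joining each word tuple into a single string before the lookup is an assumption about how the combinations would be compared to the dict entries:

from bisect import bisect_left

def phrase_in_sorted(sorted_phrases, phrase):
    # binary search: O(log s) per lookup instead of a linear scan of the list
    i = bisect_left(sorted_phrases, phrase)
    return i < len(sorted_phrases) and sorted_phrases[i] == phrase

sorted_phrases = sorted(p.lower() for p in ["Big Fat", "Little Mouse", "Cat Sits"])
print(phrase_in_sorted(sorted_phrases, "big fat"))    # True
print(phrase_in_sorted(sorted_phrases, "fat mouse"))  # False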

If you can change your dict datatype, convert the list of strings into a set of strings. This gives you O(1) search time, so just O(2**w) overall. Still exponential, but the 30 seconds above now become about 1.25.
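A sketch of the original loop with the list swapped for a set; joining each combination with spaces and lower-casing everything is an assumption about how the word tuples map onto the phrase strings:

from string import punctuation
from itertools import combinations

def retrieve_with_set(body, phrases):
    phrase_set = {p.lower() for p in phrases}  # O(1) average lookup per candidate
    words = [w.strip(punctuation).lower() for w in body.split()]
    found = []
    for i in range(1, len(words) + 1):
        for combo in combinations(words, i):
            candidate = " ".join(combo)
            if candidate in phrase_set:
                found.append(candidate)
    return found

print(retrieve_with_set("The big fat cat sits", ["Big Fat", "Little Mouse", "Cat Sits"]))
# ['big fat', 'cat sits']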

The other way to get a dramatic speed-up, of course, is to reverse the logic: do not create the combos at all, and for each dict entry simply check whether it is in the body, which is O(s * w). Unfortunately we cannot use the "set" optimization here, but with your data this solution should be about 4 times faster than the set-based one, so our 1.25 secs become roughly 0.3.
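A sketch of the reversed logic, checking each phrase against the body with a plain substring test (the case-insensitive handling is an assumption, since the question's example phrases differ from the body only in case):

def search_phrases(body, phrases):
    # one substring scan of the body per phrase: O(s * w) overall
    lowered = body.lower()
    return [p for p in phrases if p.lower() in lowered]

print(search_phrases("The big fat cat sits", ["Big Fat", "Little Mouse", "Cat Sits"]))
# ['Big Fat', 'Cat Sits']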

Just to get an idea of the actual run time we could expect, I did the following:

from timeit import timeit

bigdict = list(range(70000))
words = list(range(100000, 100030))  # worst case: no word is in bigdict
timeit('for b in bigdict: x = b in words', globals=globals(), number=1)
0.021891699987463653

CodePudding user response:

With regex, you can solve this problem in a simple way. Regexes are, however, relatively slow, since they may scan every position of the string multiple times.

import regex as re  # the third-party "regex" module; the standard library "re" also works here

def search_in_body(body, structures):
    # keep every phrase that occurs somewhere in the body, ignoring case
    found_structures = [
        f for f in structures if re.search(f, body, re.IGNORECASE)
    ]
    return found_structures


if __name__ == "__main__":
    Body = "The big fat cat sits"
    Dict = ["Big Fat", "Little Mouse", "Cat Sits"] 
    print(search_in_body(Body, Dict))
>>> ['Big Fat', 'Cat Sits']