I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not
condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.
This process would match the first text in doc
but not the second and the third. Indeed, the second would not match as it contains the word "dogs" and the third because it does not include the word "seaside".
I am wondering if this general problem (with words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding text pre-processing for cleaning. Maybe with a fast regex solution, that perhaps uses Trie(). Is that possible? or any other suggestion?
CodePudding user response:
Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc:
or_satisfied = False
for w in doc:
if word in wAND: wAND.remove(word)
if not or_satisfied and word in wOR: or_satisfied = True
if word in wNOT: return False
return or_satisfied and not wAND
CodePudding user response:
You can build regexps for the word bags you have, and use them:
def make_re(word_set):
return re.compile(
r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
flags=re.I,
)
wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)
def re_match(doc):
if not wOR_re.search(doc):
return False
if wNOT_re.search(doc):
return False
found = set()
expected = len(wAND)
for word in re.finditer(r'\w ', doc):
found.add(word)
if len(found) == expected:
break
return len(found) == expected
A quick timetest seems to say this is 89% faster than the original (and passes the original "test suite"), likely clearly due to the fact that
- documents don't need to be cleaned (since the
\b
s limit matches to words andre.I
deals with case normalization) - regexps are run in native code, which tends to always be faster than Python
name='original' iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match' iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match' iters=10000 time=0.203 iters_per_sec=49363.58
where bag_match
is my original comment suggestion of using set intersections:
def bag_match(doc):
bag = set(clean_doc(doc))
return (
(bag.intersection(wOR) or not wOR) and
(bag.issuperset(wAND) or not wAND) and
(not bag.intersection(wNOT) or not wNOT)
)
If you already have cleaned the documents to an iterable of words (here I just slapped @lru_cache
on clean_doc
, which you probably wouldn't do in real life since your documents are likely to all be unique and caching wouldn't help), then bag_match is much faster:
name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60