Home > Software engineering >  How to efficiently search for similar substring in a large text python?
How to efficiently search for similar substring in a large text python?

Time:12-08

Let me try to explain my issue with an example, I have a large corpus and a substring like below,

corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

Both the substring and corpus are very similar but it not exact,

If I do something like,

import re
re.search(substring, corpus, flags=re.I) # this will fail substring is not exact but rather very similar

In the corpus the substring is like below which is bit different from the substring I have because of that regular expression search is failing, can someone suggest a really good alternative for similar substring lookup,

until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now

I did try difflib library but it was not satisfying my use-case.

Some background information,

The substring I have right now, is obtained some time ago from pre-processed corpus using this regex re.sub("[^a-zA-Z]", " ", corpus).

But now I need to use that substring I have to do the reverse lookup in the corpus text and find the start and ending index in the corpus.

CodePudding user response:

You don't actually need to fuzzy match all that much, at least for the example given; text can only change in spaces within substring, and it can only change by adding at least one non-alphabetic character (which can replace a space, but the space can't be deleted without a replacement). This means you can construct a regex directly from substring with wildcards between words, search (or finditer) the corpus for it, and the resulting match object will tell you where the match(es) begin and end:

import re

# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z] '.join(substring.split()), re.IGNORECASE)

if m := ssre.search(corpus):
    print(m.start(), m.end())

    print(repr(m.group(0)))

Try it online!

which correctly identifies where the match began (index 217) and ended (index 771) in corpus; .group(0) can directly extract the matching text for you if you prefer (it's uncommon to need the indices, so there's a decent chance you were asking for them solely to extract the real text, and .group(0) does that directly). The output is:

217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"

If spaces might be deleted without being replaced, just change the quantifier to * (the regex will run a little slower since it can't short-circuit as easily, but would still work, and should run fast enough).

If you need to handle non-ASCII alphabetic characters, the regex joiner can change from r'[^a-z] ' to the equivalent r'[\W\d_] ' (which means "match all non-word characters [non-alphanumeric and not underscore], plus numeric characters and underscores"); it's a little more awkward to read, but it handles stuff like é properly (treating it as part of a word, not a connector character).

While it's not going to be as flexible as difflib, when you know no words are removed or added, it's just a matter of spacing and punctuation, this works perfectly, and should run significantly faster than a true fuzzy matching solution (that has to do far more work to handle the concept of close matches).

CodePudding user response:

While you can not find an exact match if the strings differ by just even one character - you can find similar strings.

So here I made use of the builtin difflib SequenceMatcher in order to check for the similarity of two differing strings.

In case that you need the indices of where the substring starts within the corpus - that could be easily added. In case you have any questions comment pls.

Hope it helps. - Adapted to your edited question - Implemented @ShadowRanger's note

import re
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Return similarity between 2 strings"""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_match(a, b, threshold=0.7) -> list:
    """Find string b in a - while the strings being different"""
    corpus_lst = a.split()
    substring_lst = b.split()

    nonalpha = re.compile(r"[^a-zA-Z]")
    start_indices = [i for i, x in enumerate(
        corpus_lst) if nonalpha.sub("", x) == substring_lst[0]]
    end_indices = [i for i, x in enumerate(
        corpus_lst) if nonalpha.sub("", x) == substring_lst[-1]]

    rebuild_substring = " ".join(substring_lst)
    max_sim = 0
    for start_idx in start_indices:
        for end_idx in end_indices:
            corpus_search_string = " ".join(corpus_lst[start_idx: end_idx])
            sim = similarity(corpus_search_string, rebuild_substring)
            if sim > max_sim:
                result = [start_idx, end_idx]

    return result

The results are of calling find_similar_match(corpus, substring):

Found a match with similarity : 0.8429752066115702
[38, 156]

CodePudding user response:

Not exactly the best solution but this might help.

match = SequenceMatcher(None, corpus, substring).find_longest_match(0, len(corpus), 0, len(substring))

print(match)
print(corpus[match.a:match.a   match.size])
print(substring[match.b:match.b   match.size])

CodePudding user response:

This may help you to visualise the similarity of the two strings based on the

percentage of words in the corpus that are in the substring.

The code below aims to :

  • use the substring as a bag of words
  • finds these words in the corpus (and if found - makes them uppercase)
  • display the modifications in the corpus
  • calculate the percentage of modified words in the corpus
  • show the number of words in substring that were not in the corpus

This way you can see which of the substring words were matched in the corpus, and then identify the percentage similarity by word (but not necessarily in the right order).

Code:

import re
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

sub_list = set(substring.split(" "))
unused_words = []
for word in sub_list:
    if word in corpus:
        r = r"\b"   word   r"\b"
        ru = f"{word.upper()}"
        corpus = re.sub(r, ru, corpus)
    else:
        unused_words.append(word)

print(corpus)

lower_strings = len(re.findall("[a-z'] ", corpus))
upper_strings = len(re.findall("[A-Z'] ", corpus))
print(f"\nWords Matched = {(upper_strings)/(upper_strings   lower_strings)*100:.1f}%")
print(f"Unused Substring words: {len(unused_words)}")

Output:

very quick service, polite workers(cory, I think THAT'S his name), I
basically just drove there AND got A quote(which SEEMS TO be very fair
priced), THEN DROPPED OFF MY CAR 4 days later(because THEY WERE fully
booked UNTIL THEN), THEN I DROPPED OFF MY CAR ON MY APPOINTMENT DAY, THEN
THE SAME DAY THE SHOP CALLED ME AND NOTIFIED ME THAT THE THE JOB IS DONE I
CAN GO PICKUP MY CAR. WHEN I GO CHECKED OUT MY CAR I WAS AMAZED BY THE JOB
THEY'VE DONE TO IT, AND THEY EVEN GAVE THAT DIRTY CAR A WASH( PROB EVEN
WAXED IT OR COATED IT, CUZ IT WAS SHINY AS HELL), TIRES SHINE, MATS WERE 
VACUUMED TOO. I GAVE THEM A DIRTY, BROKEN CAR, THEY GAVE ME BACK A WHAT 
SEEMS LIKE A BRAND NEW CAR. I'M HAPPY WITH THE RESULT, AND I WILL DEF HAVE 
ALL MY CAR'S WORK DONE BY THIS PLACE FROM NOW.

Words Matched = 82.1%
Unused Substring words: 0
  • Related