Matching strings in Python

My question is about identifying variants of a string in Python. Let me explain what I am trying to do:

A string such as tom and jerry could also be written, in lowercase, as:

tom n jerry
tom_jerry
tom & jerry
tom and jerry

and so on. Even this minimal example has 4 possible forms, and even if I created a dictionary with these variants, I would still miss a string containing tom _ jerry. What can I do to recognize tom and jerry? Creating many rules seems very inefficient. Is there a more efficient way to do this?

CodePudding user response：

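This will find any of those combinations in a sentence by slicing from the first "tom" to the end of the first "jerry" (it assumes both names appear, in that order):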

combo = "tom n jerry"
string = "This is an episode of " + combo + " that deals with something."
# Slice from the start of "tom" to the end of "jerry" (len("jerry") == 5)
substring = string[string.find("tom"):string.find("jerry") + 5]
print(substring)
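
If the two names are fixed and only the connector between them varies, a regular expression is another way to express the same idea. The following is a minimal sketch under that assumption; the pattern and the helper name find_pair are illustrative and not part of the answer above. Note that it also accepts the names with no connector at all (tomjerry).

```python
import re

# Hypothetical pattern: "tom", then any mix of spaces/underscores around an
# optional connector ("and", "n" or "&"), then "jerry". Case is ignored.
PAIR = re.compile(r"tom[\s_]*(?:and|n|&)?[\s_]*jerry", re.IGNORECASE)

def find_pair(text):
    """Return the matched variant found in text, or None if there is none."""
    match = PAIR.search(text)
    return match.group(0) if match else None

for variant in ["tom n jerry", "tom_jerry", "tom & jerry", "tom and jerry", "tom _ jerry"]:
    print(find_pair("This is an episode of " + variant + " that deals with something."))
```

Unlike the slice, this returns None when the names are missing, and it does not match when unrelated words sit between them.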

CodePudding user response：

You could attempt this using a sequence matcher.

from difflib import SequenceMatcher

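def checkMatch(firstWord: str, secondWord: str, strictness: float) -> bool:
    # ratio() is a similarity score between 0.0 (nothing in common) and 1.0 (identical)
    ratio = SequenceMatcher(None, firstWord.strip(), secondWord.strip()).ratio()
    # True when the two strings are more similar than the chosen threshold
    return ratio > strictness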

if __name__ == "__main__":
    originalWord = "tom and jerry"
    toMatch = "tom_jerry" # chose this one as it is the least likely in your example
    toMatch = toMatch.lower() # str.lower() returns a new string, so reassign it; lower both strings for easier matching
    strictness = 0.6 # a ratio above 0.6 means the strings are generally pretty similar
    print(checkMatch(originalWord, toMatch, strictness))

You can learn more about how sequence matcher works here: https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc
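
To check several candidate spellings at once, the same ratio can be applied in a loop. This is a minimal sketch, assuming the 0.6 threshold from above is acceptable for your data; the function name similar_variants and the sample list are hypothetical.

```python
from difflib import SequenceMatcher

def similar_variants(target, candidates, threshold=0.6):
    """Return the candidates whose similarity ratio to target exceeds threshold."""
    target = target.strip().lower()
    matches = []
    for candidate in candidates:
        # Compare normalized copies, but report the candidate as it was given
        ratio = SequenceMatcher(None, target, candidate.strip().lower()).ratio()
        if ratio > threshold:
            matches.append(candidate)
    return matches

candidates = ["tom n jerry", "tom_jerry", "tom & jerry", "tom _ jerry", "mickey mouse"]
print(similar_variants("tom and jerry", candidates))
```

A lower threshold accepts more spellings but also more false positives, so it is worth tuning against real examples.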