My question is about identifying variants of a string in Python. Let me explain what I am trying to do:
A string such as tom and jerry could also be written, in lowercase, as:
- tom n jerry
- tom_jerry
- tom & jerry
- tom and jerry
and so on. As this minimal example shows, there are already 4 possible spellings, and even if I created a dictionary containing all of them, I would still miss a string such as tom _ jerry. What can I do to recognize tom and jerry? Creating many rules seems very inefficient. Is there a more efficient way to do this?
CodePudding user response:
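This will find any of those combinations in a sentence by slicing from the start of "tom" to the end of "jerry":
combo = "tom n jerry"
string = "This is an episode of " + combo + " that deals with something."
# Slice from the start of "tom" to the end of "jerry" (len("jerry") == 5).
# Note: str.find returns -1 when a word is missing, so check for that on real input.
substring = string[string.find("tom"):string.find("jerry") + 5]
print(substring)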
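If the two names are fixed and only the connector between them varies, a regular expression is another way to express the same idea. The following is only a minimal sketch under that assumption; the pattern and the helper name find_pair are illustrative and not part of the code above.

```python
import re

# Hypothetical pattern: "tom", then an optional connector (and, n, &)
# surrounded by any mix of spaces and underscores, then "jerry".
PAIR_RE = re.compile(r"\btom[\s_]*(?:and|n|&)?[\s_]*jerry\b", re.IGNORECASE)


def find_pair(text):
    """Return the first matched variant found in text, or None if there is none."""
    match = PAIR_RE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in ["tom n jerry", "tom_jerry", "tom & jerry", "tom _ jerry"]:
        print(find_pair("This is an episode of " + sample + " today."))
```

Unlike the slicing approach, this returns None when the pair is absent, but it only covers connectors that are listed in the pattern.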
CodePudding user response:
You could attempt this using a sequence matcher.
from difflib import SequenceMatcher

def checkMatch(firstWord: str, secondWord: str, strictness: float) -> bool:
    # ratio() is 1.0 for identical strings and 0.0 when they share nothing
    ratio = SequenceMatcher(None, firstWord.strip(), secondWord.strip()).ratio()
    return ratio > strictness

if __name__ == "__main__":
    originalWord = "tom and jerry"
    toMatch = "tom_jerry"  # chose this one as it is the least likely in your example
    toMatch = toMatch.lower()  # lower() returns a new string, so reassign it; matching is easier if both strings share the same case
    strictness = 0.6  # a threshold of 0.6 requires the two strings to be fairly similar
    print(checkMatch(originalWord, toMatch, strictness))
You can learn more about how sequence matcher works here: https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc
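The same ratio can be applied to a whole list of candidate strings instead of a single pair. This is a rough sketch rather than a definitive solution: the helper name find_variants, the sample list, and the 0.6 threshold are assumptions chosen for illustration, and the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_variants(canonical, candidates, threshold=0.6):
    """Return the candidates whose similarity to canonical exceeds threshold."""
    return [c for c in candidates if similarity(canonical, c) > threshold]


if __name__ == "__main__":
    candidates = ["tom n jerry", "tom_jerry", "tom & jerry", "Tom _ Jerry", "spike and tyke"]
    print(find_variants("tom and jerry", candidates))
```

A fuzzy threshold like this catches spellings nobody listed in advance, at the cost of occasional false positives for unrelated strings that happen to look alike.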