Python: How to determine if a string has an exact match with any string from the list

Assume that I have the list of phrases to compare against as: ["hello", "hi", "bye"]

I want to return True if my text contains any of these words as an exact match. That is, "hi there, how are you?" should return True, but "hithere, how are you?" should return False.

So far I have the code below:

phrases = ['hello', 'hi', 'bye']

def match(text: str) -> bool:
    # "ext in text" is a substring test, so 'hi' is also found inside 'hithere'
    return any(ext in text for ext in phrases)

But it returns true for both inputs.

I also found the function below, which returns the exact matches from a string, but I want to compare against a list of phrases, not a single string. I know I can iterate through the list of words and check them one by one, but I am hoping to find a solution that performs better.

import re
print(re.findall('\\bhello\\b', "hellothere, how are you?"))

Update: By exact match, I mean a match delimited by word boundaries. A boundary can be a space, punctuation, etc., just like what \b matches in a regex.

CodePudding user response：

A regex of the form r"(abc|ef|xxx)" matches "abc", "ef", or "xxx". You can build this regex by joining the phrases with "|", as below. If a phrase might contain regex metacharacters, pass it through re.escape first. Note that re.search returns None if no match is found.

import re

phrases = ['hello', 'hi', 'bye']
def match(text):
    # re.escape keeps any regex metacharacters in a phrase literal
    pattern = r'\b(?:{})\b'.format("|".join(map(re.escape, phrases)))
    return re.search(pattern, text) is not None

print(match("hi there, how are you?"), match("hithere, how are you?"))
# True False
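
Since the question asks about performance, here is a minimal sketch of the same idea with the pattern compiled once instead of being rebuilt on every call. The names PATTERN and match_compiled are only illustrative and are not part of the answer above.

```python
import re

phrases = ['hello', 'hi', 'bye']

# Build and compile the alternation once; re.escape keeps each phrase literal.
PATTERN = re.compile(r'\b(?:' + '|'.join(map(re.escape, phrases)) + r')\b')

def match_compiled(text: str) -> bool:
    # search() scans the whole text and returns None when nothing matches
    return PATTERN.search(text) is not None

print(match_compiled("hi there, how are you?"))  # True
print(match_compiled("hithere, how are you?"))   # False
```

Whether this is noticeably faster depends on how often the function is called, since the re module already caches recently used patterns.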

CodePudding user response：

One possible solution is to first split() the sentence into words, then strip() punctuation marks and the like from each word, and finally check whether that word is in the list. Actually, you should not use a list but a set, which allows lookups in constant (O(1)) time on average instead of linear (O(n)) time as is the case with lists.

phrases = ['hello', 'hi', 'bye']
phraseSet = set(phrases)

def match(text: str, word_set: set[str]) -> bool:
    words = text.split()  # split on any run of whitespace, not only single spaces
    for word in words:
        stripped = word.strip(".?!,:")
        if stripped in word_set:
            return True
    return False

print(match("hi there, how are you?", phraseSet))
print(match("hithere, how are you?", phraseSet))

Obviously one could write the above solution in a more pythonic way.
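
For instance, a more compact version could look like the sketch below. It assumes that stripping everything in string.punctuation is acceptable, and match_words and phrase_set are just illustrative names. Like the loop above, it is case-sensitive and only works for single-word phrases.

```python
import string

phrases = ['hello', 'hi', 'bye']
phrase_set = set(phrases)

def match_words(text: str, word_set: set) -> bool:
    # any() stops at the first word that is found in the set
    return any(word.strip(string.punctuation) in word_set
               for word in text.split())

print(match_words("hi there, how are you?", phrase_set))  # True
print(match_words("hithere, how are you?", phrase_set))   # False
```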

CodePudding user response：

Depending on your exact needs, you can tweak this, but I think this does what you need:

import re

phrases = ['hello', 'hi', 'bye']
text = "Hi there, how are you? How did that Hi8 turn out? Hi, can you hear me? Hello? Uh... Bye!"
expression = rf'(?:^|(?<=\s))(?:{"|".join(phrases)})(?=[,\.!?;:\s]|$)'

result = re.findall(expression, text, flags=re.IGNORECASE)
print(result)

Output:

['Hi', 'Hi', 'Hello', 'Bye']

About that regular expression:

(?:^|(?<=\s)) is a non-capturing group ((?: )) that requires either the start of the string (^) or a whitespace character immediately before the match. (?<=\s) is a lookbehind, so the whitespace itself is not part of the match.
(?:{"|".join(phrases)}) Because the expression is a raw f-string (rf'something'), the part between {} is replaced by the result of the Python expression, so hello|hi|bye in this case. This non-capturing group matches any one of the words.
(?=[,\.!?;:\s]|$) is a lookahead that requires the next character to be punctuation or whitespace, or the end of the string. (Inside a character class the . is already literal, so the backslash is harmless but not required; outside a class, an unescaped . would match any character.)
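
To see where this stricter expression differs from a plain \b pattern, here is a small comparison sketch. The sample strings and the names word_boundary and lookaround are made up for illustration.

```python
import re

phrases = ['hello', 'hi', 'bye']
alternation = "|".join(map(re.escape, phrases))

# Plain word boundaries on both sides
word_boundary = re.compile(rf'\b(?:{alternation})\b', re.IGNORECASE)
# Whitespace (or start) before, listed punctuation/whitespace (or end) after
lookaround = re.compile(
    rf'(?:^|(?<=\s))(?:{alternation})(?=[,.!?;:\s]|$)', re.IGNORECASE)

for sample in ["Hi there", "hi-there", "(hi)", "Hi8"]:
    print(sample, bool(word_boundary.search(sample)),
          bool(lookaround.search(sample)))
# Hi there True True
# hi-there True False
# (hi) True False
# Hi8 False False
```

\b treats any non-word character as a boundary, so it accepts "hi-there" and "(hi)", while the lookaround version only accepts the delimiters it lists explicitly.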