My goal is to go over a text file and count the number of times the phrase 'oh my god' appears. The phrase can appear in different forms, such as 'omg', 'oh-my-god', 'oh my god!'... I've tried this pattern, but it misses some occurrences and doesn't count all of them:
regex = re.compile(r'\b(omg|(oh[^A-Za-z0-9]my[^A-Za-z0-9]god))')
CodePudding user response:
You could write it like this:
\b(?:omg|oh([ -])my\1god)\b
The pattern matches:
\b        a word boundary
(?:       start of a non-capture group for the alternatives
omg       match literally
|         or
oh        match literally
([ -])    capture group 1, match either a space or -
my        match literally
\1        backreference to match the same as group 1
god       match literally
)         close the non-capture group
\b        a word boundary
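As a minimal sketch of how this pattern could be used for counting: the sample text and the names PATTERN and count_omg are made up for illustration, and re.IGNORECASE is an assumption added here so that "OMG" and "Oh my god" are counted too.

```python
import re

# Pattern from the answer above; the re.IGNORECASE flag is an added assumption
PATTERN = re.compile(r"\b(?:omg|oh([ -])my\1god)\b", re.IGNORECASE)


def count_omg(text):
    """Return how many times PATTERN matches in text."""
    # finditer yields one match object per occurrence of the whole pattern;
    # findall would instead return the text of capture group 1 for each match
    return sum(1 for _ in PATTERN.finditer(text))


# Hypothetical sample text
sample = "omg, Oh my god! oh-my-god oh my-god ohmygod"
print(count_omg(sample))
```

Because of the backreference, a mixed form such as "oh my-god" is not counted, and "ohmygod" is skipped because the pattern requires a separator between the words.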
CodePudding user response:
Regex is always difficult, but this should work for you. A helpful resource for testing and refining a regex can be found here: https://pythex.org/
I've written this solution so that you can use it with either a list of phrases (named dictionary in the code) or an entire block of text (i.e. a string).
# Python 3.0
import re

target_string = "oh my god omg oh-my-god oh my god! oh my god! oh my god Oh my god OMG Oh-my-god Oh my god!" \
                "Oh my god! Oh my god Oh My God OmG Oh-My-God Oh My God! Oh My god! Oh My God the ggod game game god godohmygod 132 !@#$%^&*()"

# List of phrases you want to search
dictionary = ['oh my god', 'omg', 'oh-my-god', 'oh my god!', 'oh, my god!', 'oh, my god', 'Oh my god', 'OMG', 'Oh-my-god',
              'Oh my god!',
              'Oh, my god!', 'Oh, my god', 'Oh My God', 'OmG', 'Oh-My-God', 'Oh My God!', 'Oh, My god!', 'Oh, My God',
              'the ggod game', 'game', 'god godohmygod', '132', '!@#$%^&*()']

# Compiled once and shared by both functions. re.IGNORECASE covers variants
# such as "Oh", "My" and "OMG"; the dots in g\.o\.d are escaped so they match literal dots.
regex = re.compile(
    r"(?:oh|o)(?: |-|,|!|\.)*(?:my|m'y|m)*(?: |-|,|!|\.)*(?:god|g-o-d|g\.o\.d|g)(?: |-|,|!|\.)*",
    re.IGNORECASE)


# Loop through the list and print the phrases that match the regular expression
def match_phrase():
    for p in dictionary:
        if regex.match(p):
            print("Matching words in dictionary: ", p)


# Search a string of text and print all matching results
def match_text_string():
    # findall() returns a list with the text of every match
    result = regex.findall(target_string)
    print("Matching words in target_string: ", result)


if __name__ == '__main__':
    match_phrase()
    match_text_string()
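Since the question asks for a count over a text file, the same idea can be sketched as follows. The helper names count_matches and count_in_file, the temporary demo file, and the stricter pattern (word boundaries, at most one punctuation character between the words) are assumptions for illustration and not part of the code above.

```python
import os
import re
import tempfile

# Assumed pattern: "omg", or "oh", "my", "god" separated by an optional
# punctuation character and optional whitespace. The \b anchors keep
# embedded text such as "godohmygod" from matching.
PATTERN = re.compile(r"\b(?:omg|oh[,.!-]?\s*my[,.!-]?\s*god)\b", re.IGNORECASE)


def count_matches(lines):
    """Count pattern matches over any iterable of text lines."""
    # The pattern has no capture groups, so findall returns whole matches
    return sum(len(PATTERN.findall(line)) for line in lines)


def count_in_file(path):
    """Count pattern matches in a text file, reading it line by line."""
    with open(path, encoding="utf-8") as handle:
        return count_matches(handle)


# Demo with a throwaway file standing in for the real text file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Oh my god!\nomg that was close\nOh, my god... godohmygod\n")
print(count_in_file(tmp.name))
os.remove(tmp.name)
```

Reading line by line keeps memory use low for large files, at the cost of missing a phrase that is split across a line break.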