The following regex pattern does almost everything I need it to do, including catching contractions:
re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]+"
However, if I enter the following code:
sent = "I can't understand what I'm doing wrong or if I made a mistake."
re.findall(re_pattern, sent)
It doesn't pick up one-letter words, such as I or a:
["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']
CodePudding user response:
Your pattern only matches words of at least two characters: the second [a-zA-Z]+ also requires at least one letter, with an optional ' in between.
Changing the second quantifier from + to * (zero or more) will do it:
>>> re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
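To see exactly which words the change from + to * recovers, here is a minimal sketch that runs both patterns on the sentence from the question. The variable names (two_or_more, one_or_more, missed) are only illustrative.

```python
import re

sent = "I can't understand what I'm doing wrong or if I made a mistake."

# Original pattern: both letter runs use +, so a match needs at least two letters.
two_or_more = re.compile(r"[a-zA-Z]+'?[a-zA-Z]+")
# Adjusted pattern: the second run uses *, so a single letter is enough.
one_or_more = re.compile(r"[a-zA-Z]+'?[a-zA-Z]*")

found_before = two_or_more.findall(sent)
found_after = one_or_more.findall(sent)

# Words that only the adjusted pattern picks up.
missed = [w for w in found_after if w not in found_before]
print(missed)  # ['I', 'I', 'a']
```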
CodePudding user response:
You need to use
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
Here is a Python demo:
import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print(re.findall(re_pattern, sent))
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
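One practical difference between an optional apostrophe followed by optional letters and an optional group (?:'[a-zA-Z]+)? is what happens to a trailing apostrophe. The sketch below uses a made-up test string, "the dogs' bowl", which is not from the question; whether you want the apostrophe kept depends on your use case.

```python
import re

# Hypothetical test string with a trailing apostrophe (plural possessive).
text = "the dogs' bowl"

# Optional apostrophe, then zero or more letters: the bare apostrophe is kept.
with_star = re.findall(r"[a-zA-Z]+'?[a-zA-Z]*", text)
# Optional group: the apostrophe only matches when letters follow it.
with_group = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

print(with_star)   # ['the', "dogs'", 'bowl']
print(with_group)  # ['the', 'dogs', 'bowl']
```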
Note: If you don't need to extract letter sequences glued to _ or digits, use word boundaries:
re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
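As a rough illustration of what the word boundaries change, the sketch below runs the pattern with and without \b on a made-up string containing an underscore and a digit. The string and the names loose and strict are assumptions for the example only.

```python
import re

# Hypothetical input: letters glued to an underscore and to a digit.
text = "foo_bar it's v2"

# Without boundaries, letter runs are pulled out of larger tokens.
loose = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)
# With \b, a match must start and end at a word boundary, and _ and digits
# count as word characters, so foo_bar and v2 produce no match at all.
strict = re.findall(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b", text)

print(loose)   # ['foo', 'bar', "it's", 'v']
print(strict)  # ["it's"]
```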
And if you plan to match any Unicode words:
re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
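A small sketch of the difference, assuming Python 3, where str patterns are Unicode-aware by default. The sample text is made up, and the accented letters are written as escapes only to keep the source ASCII.

```python
import re

# Hypothetical input: "Zoë can't find the café"
text = "Zo\u00eb can't find the caf\u00e9"

# ASCII-only letters with word boundaries: words containing accented
# letters fail to match, because no boundary exists next to the accent.
ascii_only = re.findall(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b", text)
# [^\W\d_] means "a word character that is not a digit or underscore",
# which covers letters from any script.
unicode_words = re.findall(r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b", text)

print(ascii_only)     # ["can't", 'find', 'the']
print(unicode_words)  # ['Zoë', "can't", 'find', 'the', 'café']
```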
Ah, and if you want to also match digits and underscores as part of "words", just use
re_pattern = r"\w+(?:'\w+)*"
The * after (?:'\w+) allows matching words like rock'n'roll.
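To illustrate, the sketch below compares ? and * after the apostrophe group on a made-up string; v2_beta is just a hypothetical token that shows digits and underscores being kept.

```python
import re

# Hypothetical input with two apostrophes in one word.
text = "rock'n'roll and v2_beta"

# With ?, only one apostrophe part can follow, so the word gets split.
at_most_one = re.findall(r"\w+(?:'\w+)?", text)
# With *, any number of apostrophe parts can follow.
any_number = re.findall(r"\w+(?:'\w+)*", text)

print(at_most_one)  # ["rock'n", 'roll', 'and', 'v2_beta']
print(any_number)   # ["rock'n'roll", 'and', 'v2_beta']
```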