How do I get regex to capture one-letter words and multiple letter words?-CodePudding

The following regex pattern does almost everything I need it to do, including catching contractions:

re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]+"

However, if I enter the following code:

sent = "I can't understand what I'm doing wrong or if I made a mistake."

re.findall(re_pattern, sent)

It doesn't pick up one-letter words, such as I or a:

["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']

CodePudding user response：

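Your pattern only matches words of at least two characters: the first [a-zA-Z]+ needs at least one letter, and the second [a-zA-Z]+ also needs at least one, with an optional ' in between. Changing the second quantifier from + to * makes the trailing letters optional, so one-letter words match too: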

>>> re_pattern = "[a-zA-Z]+\\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
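To see exactly which words the two-letter minimum was dropping, here is a small sketch that runs both patterns on the sentence from the question and lists the difference. The variable names are only illustrative.

```python
import re

sent = "I can't understand what I'm doing wrong or if I made a mistake."

# Original pattern: letters, optional apostrophe, then at least one more
# letter, so every match is at least two characters long.
two_plus = r"[a-zA-Z]+'?[a-zA-Z]+"
# Relaxed pattern: the trailing letters are optional.
one_plus = r"[a-zA-Z]+'?[a-zA-Z]*"

found_before = re.findall(two_plus, sent)
found_after = re.findall(one_plus, sent)

# Words that only the relaxed pattern picks up.
missed = [w for w in found_after if w not in found_before]
print(missed)  # ['I', 'I', 'a']
```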

CodePudding user response：

You need to use

re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

Here is a complete Python example:

import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print( re.findall(re_pattern, sent) )
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
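One practical difference from the earlier answer's variant with a trailing * is how a dangling apostrophe is handled. The following sketch compares the two on a made-up sentence (the sample text is an assumption, not from the question): with the optional group, an apostrophe is only consumed when letters follow it.

```python
import re

text = "the dogs' bowl isn't full"

# Apostrophe and trailing letters are each optional on their own.
star_variant = r"[a-zA-Z]+'?[a-zA-Z]*"
# Apostrophe is only allowed when at least one letter follows it.
group_variant = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

star_words = re.findall(star_variant, text)
group_words = re.findall(group_variant, text)

print(star_words)   # ['the', "dogs'", 'bowl', "isn't", 'full']
print(group_words)  # ['the', 'dogs', 'bowl', "isn't", 'full']
```

Which behaviour is preferable depends on whether a trailing possessive apostrophe should count as part of the word.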

Note: If you do not want to extract letter sequences that are glued to _ or digits, add word boundaries:

re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
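As a rough illustration of what the word boundaries change, this sketch runs the pattern with and without \b on an invented string containing digits and underscores. Since digits and _ count as word characters, no boundary exists between them and a letter, so those fragments are skipped.

```python
import re

text = "abc123 foo_bar I'm ok"

unbounded = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
bounded = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"

loose_words = re.findall(unbounded, text)
strict_words = re.findall(bounded, text)

print(loose_words)   # ['abc', 'foo', 'bar', "I'm", 'ok']
print(strict_words)  # ["I'm", 'ok']
```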

And if you plan to match words made of any Unicode letters:

re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
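The class [^\W\d_] reads as "a word character that is neither a digit nor an underscore", which in Python 3 str patterns amounts to any Unicode letter. A minimal sketch on an invented sample with accented words, compared against the ASCII-only version:

```python
import re

text = "Zürich café l'été"

unicode_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
ascii_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"

unicode_words = re.findall(unicode_pattern, text)
ascii_words = re.findall(ascii_pattern, text)

print(unicode_words)  # ['Zürich', 'café', "l'été"]
print(ascii_words)    # the accented words are not matched as whole words
```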


Ah, and if you want to also match digits and underscores as part of "words", just use

re_pattern = r"\w+(?:'\w+)*"

The * after (?:'\w+) allows matching words like rock'n'roll.
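To make the effect of that quantifier concrete, here is a short sketch comparing a repeatable apostrophe group with one that may occur at most once. The sample text is made up for illustration.

```python
import re

text = "rock'n'roll isn't user_1's thing"

repeatable = r"\w+(?:'\w+)*"   # any number of 'xxx parts
single = r"\w+(?:'\w+)?"       # at most one 'xxx part

repeat_words = re.findall(repeatable, text)
single_words = re.findall(single, text)

print(repeat_words)  # ["rock'n'roll", "isn't", "user_1's", 'thing']
print(single_words)  # ["rock'n", 'roll', "isn't", "user_1's", 'thing']
```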