Regular expression for words that don't start with a hyphen in Python

Time:10-30

I need to write a regular expression that matches words in Python. I am given a sentence and need to find the words in it.

The words 'Hello' and 'It's' should be in the list. The words '--Mom' and '-Mom' should not be in the list. But 'Mom' does end up in the list, because the regex separates the '-' from 'Mom', so 'Mom' is considered a word. How can I make sure that a word starting with '-', like '--Mom', is not treated as a word?

import re

def getWord():
  # letters, optional hyphenated parts, optional apostrophe ending
  return r"((^[A-Z])?[a-z]+)((\-[a-z]*)*)(\')?[a-z]{0,2}"

text=r"""Hello Bob! It's Mary, your mother-in-law, the mistake is your parents'! --Mom"""
com = re.compile(rf"""((?P<WORD>{getWord()}))""", re.MULTILINE | re.IGNORECASE | re.VERBOSE | re.UNICODE)

lst=[(v, k) for match in com.finditer(text)
                for k, v in match.groupdict().items()
                if v is not None and k != 'SPACE']
print(lst)

CodePudding user response:

You may be overcomplicating this; a re.findall search on \w+ already comes close to what you want here. To allow for possessives, just make 's an optional ending after each word. Also, to rule out words that are neither preceded by whitespace nor at the very start of the string, we can preface the pattern with the negative lookbehind (?<!\S).

import re

text = "Hello Bob! It's Mary, your mother-in-law, the mistake is your parents! --Mom"
words = re.findall(r"(?<!\S)\w+(?:'s)?", text)
print(words)

This prints the list below. Note that 'in' and 'law' are missing, because they follow a hyphen rather than whitespace:

['Hello', 'Bob', "It's", 'Mary', 'your', 'mother', 'the', 'mistake', 'is', 'your',
 'parents']
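
If hyphenated words such as mother-in-law should stay in one piece, and a bare trailing apostrophe (as in parents') should be kept, the same lookbehind idea can be extended. The following is only a sketch under those assumptions: the name WORD and the two extra optional groups are illustrative additions, not part of the answer above. It also shows side by side what the lookbehind changes for '--Mom' and '-Mom'.

```python
import re

text = "Hello Bob! It's Mary, your mother-in-law, the mistake is your parents'! --Mom"

# Without the lookbehind, the search simply resumes after the hyphens and finds "Mom".
plain = re.findall(r"\w+", "--Mom -Mom Mom")

# (?<!\S)    : a match may only start at the beginning of the string or after whitespace
# (?:-\w+)*  : keep internal hyphens, so "mother-in-law" stays one word
# (?:'\w*)?  : allow an apostrophe ending such as "It's" or "parents'"
WORD = re.compile(r"(?<!\S)\w+(?:-\w+)*(?:'\w*)?")  # hypothetical name, for illustration

guarded = WORD.findall("--Mom -Mom Mom")
words = WORD.findall(text)

print(plain)
print(guarded)
print(words)
```

This prints:

['Mom', 'Mom', 'Mom']
['Mom']
['Hello', 'Bob', "It's", 'Mary', 'your', 'mother-in-law', 'the', 'mistake', 'is', 'your', "parents'"]

Only the last 'Mom' in the short test string survives, because it is the only one preceded by whitespace instead of a hyphen.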