RegEx - Parsing name and last name from a string

Time:11-17


I'm trying to parse every first name and last name from a string in the Outlook "To" format, and save each full name in a Python list. I'm using Python 3.6.4.
For example, I would like the following string:

"To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;"

to be parsed into:

['John Lennon','Paul McCartney']

I used "Replace all words from word list with another string in python" as a reference and came up with this code:

import re
prohibitedWords = [r'to:',r'To:','\b002',"\<(.*?)\>"]
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
big_regex = re.compile('|'.join(prohibitedWords))
the_message = big_regex.sub("", str(mystring)).strip()
print(the_message)

However, I'm getting the following results:

John Lennon  ; Paul McCartney  ;

This is not optimal as I'm getting lots of spaces which I cannot parse. In addition, I have a feeling this is not the optimal approach for this. Appreciate any advice.
Thanks

CodePudding user response:

Using re.sub with an alternation built from these parts [r'to:',r'To:','\b002',"\<(.*?)\>"] replaces every match with an empty string.

Once all the characters that you want to remove are gone, you end up with a string like John Lennon Paul McCartney, where you can no longer tell which part belongs to which name if you then want to split it.

Also, removing the matches but not the whitespace around them can leave unexpected gaps, or concatenate parts that should stay separate.
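To make the problem concrete, here is a minimal sketch of the substitution approach. The example.com addresses are hypothetical stand-ins for the redacted ones in the question, and the second clean-up step (the name `flattened`) is only an illustration of what happens if the separators are removed too.

import re

# Hypothetical example.com addresses replace the redacted ones from the question.
mystring = ('To: John Lennon <john@example.com> \b002; '
            'Paul McCartney <paul@example.com> \b002;')

# Same alternation as in the question; '\b002' in a non-raw string is a
# backspace character followed by "002".
big_regex = re.compile('|'.join([r'to:', r'To:', '\b002', r'<(.*?)>']))
the_message = big_regex.sub('', mystring).strip()
print(repr(the_message))

# Removing the separators and collapsing whitespace as well leaves one
# flat string with no marker between the two names.
flattened = re.sub(r'[;\s]+', ' ', the_message).strip()
print(repr(flattened))

The first print shows the doubled spaces and stray semicolons from the question; the second shows a single string in which the boundary between the two names is gone.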

You could make the match more specific by matching the possible leading parts, and capture the part that you want instead of replacing.

(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>
  • (?:\\b[Tt]o:|\b002;) Match either To: or to: at a word boundary, or a backspace char followed by 002;
  • \s* Match optional whitespace chars
  • (.+?) Capture 1 or more chars in group 1, as few as possible
  • \s* Match optional whitespace chars
  • <[^<>@]+@[^<>@]+> Match a single @ between angle brackets, with at least 1 char on each side
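The same breakdown can be kept next to the pattern itself. This is a sketch, not the answer's exact code: it assumes a raw string with re.VERBOSE, writes the backspace character as \x08 instead of relying on "\b" in a non-raw string, and uses hypothetical example.com addresses; the names NAME_PATTERN and sample are made up for illustration.

import re

# Raw string + re.VERBOSE; \x08 is the backspace character that the
# pattern above writes as "\b" inside a non-raw Python string.
NAME_PATTERN = re.compile(r"""
    (?:\b[Tt]o:|\x08002;)   # 'To:' or 'to:' at a word boundary, or backspace + '002;'
    \s*                     # optional whitespace
    (.+?)                   # group 1: the name, as few characters as possible
    \s*                     # optional whitespace
    <[^<>@]+@[^<>@]+>       # an address in angle brackets with a single @
""", re.VERBOSE)

# Hypothetical addresses; '\b' here (non-raw string) is a backspace character.
sample = ('To: John Lennon <john@example.com> \b002; '
          'Paul McCartney <paul@example.com> \b002;')
names = NAME_PATTERN.findall(sample)
print(names)

Because the pattern has exactly one capture group, findall returns just the captured names, and a bracketed part without an @ is not treated as an address.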

See a regex demo and a Python demo.

For example

import re

# "\\b" is a regex word boundary; "\b" (non-raw) is a literal backspace char.
pattern = "(?:\\b[Tt]o:|\b002;)\\s*(.+?)\\s*<[^<>@]+@[^<>@]+>"
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
print(re.findall(pattern, mystring))

Output

['John Lennon', 'Paul McCartney']