I want to parse the following type of email address into three capture groups using a regex (Python):
johndoe @ gmail . com
The three capture groups are:
- the local part (johndoe);
- the whole domain without whitespace (gmail.com);
- the domain name without whitespace (gmail).
This is the regular expression I wrote:
^([\w\s\-\/.!#$%&'*+=?^_`{|}~]+)@(([\s\w+]+)\.[\w\s]{2,})$
Where:
- the first part, ([\w\s\-\/.!#$%&'*+=?^_`{|}~]+), captures the local part;
- the second part, (([\s\w+]+)\.[\w\s]{2,}), captures the whole domain and the domain name in two capture groups.
The expression works, but the problem is that both the 2nd and the 3rd capture groups still contain whitespace, i.e.:
- Group 1: johndoe;
- Group 2: gmail . com;
- Group 3: gmail.
Is there a way to trim whitespaces from nested capture groups?
CodePudding user response:
You can't avoid capturing the whitespace within group 2 if you want to capture group 3 inside of it.
So why not capture three separate groups with the whitespace on the outside, then join $2\.$3 as necessary?
^\s*?([\w\-\/.!#$%&'*+=?^_`{|}~]+)\s*?@\s*?([\w+]+)\s*?\.\s*?([\w]{2,})$
https://regex101.com/r/G1FvZV/1
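As a rough illustration of the "capture three groups, then join" idea, here is a minimal Python sketch. The function name split_email and the returned tuple layout are assumptions made for this example; the pattern is the one above written as a raw string.

```python
import re

# Whitespace is matched outside the three capture groups, so none of
# the groups can contain it.
PATTERN = re.compile(
    r"^\s*?([\w\-\/.!#$%&'*+=?^_`{|}~]+)\s*?@\s*?([\w+]+)\s*?\.\s*?([\w]{2,})$"
)


def split_email(address):
    """Return (local part, whole domain, domain name), or None if no match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.match(address)
    if match is None:
        return None
    local, name, suffix = match.groups()
    # Rebuild the whole domain from groups 2 and 3, without whitespace.
    return local, f"{name}.{suffix}", name


print(split_email("johndoe @ gmail . com"))
```

The join happens in ordinary Python code rather than inside the regex, which is what lets each group stay free of whitespace.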
CodePudding user response:
I suggest you keep \s outside your character classes to match the 3 parts separately, and then concatenate the 2nd and 3rd capture groups in your Python code if desired.
^\s*([\w/.!#$%&'*+=?^`{|}~-]+)\s*@\s*([\w-]+)\s*\.\s*(\w{2,})\s*$
This will give 3 capture groups separately:
johndoe
gmail
com
RegEx Breakup:
- ^ : Start
- \s* : Match 0 or more whitespace characters
- ([\w/.!#$%&'*+=?^`{|}~-]+) : Match 1+ of these characters in capture group #1
- \s* : Match 0 or more whitespace characters
- @ : Match a @
- \s* : Match 0 or more whitespace characters
- ([\w-]+) : Match 1+ word characters or hyphens in capture group #2
- \s* : Match 0 or more whitespace characters
- \. : Match a dot
- \s* : Match 0 or more whitespace characters
- (\w{2,}) : Match 2+ word characters in capture group #3
- \s* : Match 0 or more whitespace characters
- $ : End
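The breakup above maps fairly directly onto a commented pattern using re.VERBOSE. The following is only a sketch: the names EMAIL_RE and parse_email are made up for this example, and the concatenation of groups 2 and 3 is done in plain Python as suggested.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each piece can be
# commented. A '#' inside a character class is taken literally, so the
# class in group 1 is unaffected by verbose mode.
EMAIL_RE = re.compile(
    r"""
    ^\s*
    ([\w/.!#$%&'*+=?^`{|}~-]+)   # group 1: local part
    \s*@\s*
    ([\w-]+)                     # group 2: domain name
    \s*\.\s*
    (\w{2,})                     # group 3: part after the dot
    \s*$
    """,
    re.VERBOSE,
)


def parse_email(address):
    """Return (local part, whole domain, domain name), or None if no match.

    Hypothetical helper for illustration only.
    """
    match = EMAIL_RE.match(address)
    if match is None:
        return None
    local, name, suffix = match.groups()
    # Concatenate the 2nd and 3rd groups to get the whole domain.
    return local, name + "." + suffix, name


print(parse_email("  johndoe @ gmail . com "))
```

Because \s* also appears at both ends of the pattern, leading and trailing whitespace around the whole address is tolerated as well.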