Regular expression conditional to extract domain name of email address

I want to extract the domain name of an email address, but only the part immediately before the top-level domain.

@([\w]+)

With an email address like "[email protected]", I can extract "telenet". However, when the email address is "[email protected]", I get "invoice" instead of "telenet".

I tried with a condition, but I can't get it working.

CodePudding user response：

According to RFC 5322, which defines the format of a valid email address, the local part of an address (the part that precedes @domain_name) can contain a '@' character if it is within a quoted string. So you want to be sure that you start scanning after the final '@'.
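To see why the scan has to start at the final '@', here is a minimal sketch comparing the pattern from the question with one anchored to the end of the string. The function names and the addresses used below are hypothetical stand-ins, not taken from the question.

```python
import re

def after_first_at(address):
    # Pattern from the question: word characters after the first usable '@'.
    m = re.search(r'@(\w+)', address)
    return m.group(1) if m else None

def after_last_at(address):
    # Anchoring with '$' and excluding '@' forces the match to begin
    # right after the final '@' in the string.
    m = re.search(r'@([^@]+)$', address)
    return m.group(1) if m else None

# Hypothetical address with a quoted local part that contains '@'.
quoted = '"odd@local"@telenet.be'
print(after_first_at(quoted))
print(after_last_at(quoted))
```

With the quoted address, the first function picks up text from inside the local part, while the second returns only what follows the final '@'.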

The following regex is more specific and scans the next-to-last level of the domain name into capture group 1:

[@.]([^.@]+)\.([^.@]+)$

[@.] - Matches either a '@' or a '.'. This matches the start of a new domain level. The rest of the regex will guarantee that there are no '@' characters in the remaining characters to be scanned.
([^.@]+) - Scans into capture group 1 one or more characters that are neither '.' nor '@'.
\. - Matches a literal '.'.
([^.@]+) - Matches one or more characters that are neither '.' nor '@' (capture group 2).
$ - Matches the end of string.

See Regex demo.
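As a rough illustration, the regex above could be used from Python like this. The helper name `second_level_label` and the test addresses are assumptions made for the sketch.

```python
import re

# Same pattern as above: a '@' or '.', then two dot-separated labels
# that run to the end of the string.
SECOND_LEVEL = re.compile(r'[@.]([^.@]+)\.([^.@]+)$')

def second_level_label(address):
    # Hypothetical helper: returns capture group 1, or None if no match.
    m = SECOND_LEVEL.search(address)
    return m.group(1) if m else None

# Hypothetical addresses standing in for the ones in the question.
print(second_level_label('user@telenet.be'))
print(second_level_label('user@invoice.telenet.be'))
```

Both calls should yield the same label, because the match is forced to end at the end of the string and therefore always covers the last two levels of the domain.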

A second approach uses a simpler regex to first scan whatever follows the final '@' to capture the full domain:

(?<=@)[^@]+$

(?<=@) - A positive lookbehind assertion stating that the preceding character is a '@'.
[^@]+ - Matches 1 or more non-'@' characters.
$ - Matches the end of the string.

See Regex Demo
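A short sketch of the lookbehind version in Python, whose `re` module supports fixed-width lookbehind. The helper name `full_domain` and the addresses are hypothetical.

```python
import re

def full_domain(address):
    # The lookbehind checks for '@' without consuming it, so the whole
    # match (group 0) is the domain itself.
    m = re.search(r'(?<=@)[^@]+$', address)
    return m.group(0) if m else None

# Hypothetical address standing in for the one in the question.
print(full_domain('user@invoice.telenet.be'))
```

Because the '@' is not part of the match, no capture group is needed here.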

If your regex engine does not support lookbehind assertions, use the following regex instead; the domain will then be in capture group 1:

@([^@]+)$

Then you can split the scanned domain on the . character and select any N parts of the domain as follows (the code is Python):

import re

email = "[email protected]"

m = re.search(r'@([^@]+)$', email)
if m:
    # We have a match
    # group(1) is the domain without the leading '@'
    domain = m.group(1)
    domain_parts = domain.split('.')
    # the penultimate part: 'telenet'
    print(domain_parts[-2])
    # the last 2 parts: 'telenet.be'
    print('.'.join(domain_parts[-2:]))

Prints:

telenet
telenet.be

See Python demo
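If you need this in more than one place, the split-and-select step could be wrapped in a small function. This is only a sketch: `last_domain_parts` is a hypothetical name, the address is a stand-in, and `n` is assumed to be at least 1.

```python
import re

def last_domain_parts(address, n):
    # Capture everything after the final '@'.
    m = re.search(r'@([^@]+)$', address)
    if not m:
        return None
    parts = m.group(1).split('.')
    # Assumes n >= 1; if n exceeds the number of parts, the whole
    # domain is returned.
    return '.'.join(parts[-n:])

# Hypothetical address standing in for the one in the question.
print(last_domain_parts('user@invoice.telenet.be', 2))
```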