re.sub() exception for email addresses

I have a question. With the following re.findall() call I am able to extract all email addresses from a *.txt file.

emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file)

Now I'd like to remove all punctuation marks from this *.txt file, because it also contains ordinary text.

I have removed the punctuation marks with

output = re.sub(r'[^\w\s]', '', file)

but this call also removes the punctuation marks from the email addresses in the text. How do I write an exception for the email addresses in this re.sub()?

Thank you.

CodePudding user response：

You can use

re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]", r"\1", file)

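Here, the email pattern is captured into Group 1, and the \1 backreference in the replacement pattern restores the email text in the resulting string. When the punctuation alternative matches instead, Group 1 does not participate in the match, so \1 expands to an empty string and the character is removed.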

Note that [^\w\s] matches any character other than word and whitespace characters. Since \w includes the underscore, [^\w\s] does not match an underscore. If you want to remove underscores too, add it as an alternative:

re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]|_", r"\1", file)
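To see the capture-and-restore idea end to end, here is a minimal sketch. It assumes the email pattern uses `+` quantifiers, case-insensitive matching, and Python 3.5 or later (where an unmatched group in the replacement becomes an empty string). The names `EMAIL`, `strip_punctuation`, `drop_underscores`, and the sample strings are made up for illustration and are not part of the question.

```python
import re

# Assumed email pattern: the one from the question, with "+" quantifiers.
EMAIL = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"

def strip_punctuation(text, drop_underscores=False):
    """Remove punctuation from text but keep email addresses intact."""
    # The email alternative comes first, so it wins wherever an email starts.
    junk = r"[^\w\s]|_" if drop_underscores else r"[^\w\s]"
    pattern = re.compile(rf"({EMAIL})|{junk}", re.IGNORECASE)
    # \1 puts a matched email back; for punctuation, Group 1 is unmatched
    # and \1 expands to an empty string (Python 3.5+).
    return pattern.sub(r"\1", text)

sample = "Hello, world! Write to john.doe@example.com, please."
cleaned = strip_punctuation(sample)
print(cleaned)

sample_underscore = "snake_case and a_b@c.org!"
cleaned_underscore = strip_punctuation(sample_underscore, drop_underscores=True)
print(cleaned_underscore)
```

The comma after the address is still removed, because the email pattern backtracks to the last `.` followed by letters and leaves trailing punctuation to the second alternative.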

CodePudding user response：

Assuming that email addresses are composed only of word characters and an @ character, the following regex should work:

(\w*@\w*([^\w\s])|([^\w\s])\w*@\w*)