I have a question. With the following re.findall() call I am able to extract all email addresses from a *.txt file.
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file)
Now I'd like to remove all punctuation marks from this *.txt file, because it also contains ordinary text.
I removed the punctuation marks with
output = re.sub(r'[^\w\s]', '', file)
but this call also removes the punctuation marks from the email addresses in the text. How do I write an exception for the email addresses in this re.sub call?
Thank you.
CodePudding user response:
You can use
re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]", r"\1", file)
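Here, the email pattern is captured into Group 1, and the \1
backreference in the replacement pattern restores the email text in the resulting string. When the second alternative matches a punctuation mark instead, Group 1 is empty, so that match is replaced with nothing.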
Note that [^\w\s]
matches any character other than word and whitespace characters. Since the underscore is a word character, it is not matched. If you want to remove underscores too, add _ as an alternative:
re.sub(r"([a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+)|[^\w\s]|_", r"\1", file)
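For reference, here is a minimal runnable sketch of the capture-and-restore approach described in this answer. The sample text, the EMAIL constant, and the strip_punctuation helper are hypothetical names made up for illustration. The sketch assumes Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string, and it keeps the lowercase-only email pattern from the question (passing re.IGNORECASE would be one way to accept uppercase addresses).

```python
import re

# Hypothetical sample text standing in for the contents of the *.txt file.
text = "Hello, world! Mail john.doe@example.com or jane_d@mail.example.org; thanks."

# Email pattern from the question (lowercase only, as in the original).
EMAIL = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"


def strip_punctuation(s, drop_underscores=False):
    """Remove punctuation from s while keeping email addresses intact."""
    # The second alternative matches what should be deleted.
    junk = r"[^\w\s]|_" if drop_underscores else r"[^\w\s]"
    # Emails are tried first and captured in group 1; \1 puts them back.
    # For punctuation matches group 1 is unset, so \1 expands to "".
    return re.sub(rf"({EMAIL})|{junk}", r"\1", s)


print(strip_punctuation(text))
print(strip_punctuation("snake_case, a_b@x.org!", drop_underscores=True))
```

Because the email alternative comes first, the regex engine tries it at every position before it falls back to the punctuation alternative, which is why dots and underscores inside an address survive.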
CodePudding user response:
Assuming that email addresses are composed of word characters and an @
character, the following regex should work:
(\w*@\w*([^\w\s])|([^\w\s])\w*@\w*)