I have the following column which consists of email subject headers:
Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations
What I want to achieve is:
Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations
and for this I am using the below code which only takes into account the first regular expression that I am passing and ignoring the rest
def clean_subject_prelim(text):
text = re.sub(r'^EXT \|\| $' , '' , text)
text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
text = re.sub(r'EXT \|\| FW:', '' , text)
text = re.sub(r'^SV: \[EXTERNAL]$' , '' , text)
return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
Why this is not working, what am I missing here?
CodePudding user response:
You can use
pattern = r"""(?mx) # MULTILINE mode on
^ # start of string
(?: # non-capturing group start
EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
| # or
SV:\s*\[EXTERNAL]# SV: [EXTERNAL]
) # non-capturing group end
\s* # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern', '', regex=True)
See the regex demo.
Since the re.X
((?x)
) is used, you should escape literal spaces and #
chars, or just use \s*
or \s
to match zero/one or more whitespaces.
CodePudding user response:
Get rid of the $
sign in the first expression and switch some of regex expressions from place. Like this:
import pandas as pd
import re
def clean_subject_prelim(text):
text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
text = re.sub(r'EXT \|\| FW:', '' , text)
text = re.sub(r'^EXT \|\|' , '' , text)
text = re.sub(r'^SV: \[EXTERNAL]' , '' , text)
return text
data = {"Subject": [
"EXT || Transport enquiry",
"EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
"EXT || FW: Model - Jan",
"SV: [EXTERNAL] Calculations"]}
df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))