I have a string that looks something like this:
******************************************************************************** This e-mail and any files transmitted with it, are confidential and are intended solely for the use of the individual or entity to whom they are addressed SY.******************************************************************************** This e-mail and any files transmitted with it, are confidential and are intended solely for the use of the individual or entity to whom they are addressed SY.******************************************************************************** This e-mail and any files transmitted with it, are confidential and are intended solely for the use of the individual or entity to whom they are addressed SY.
The string starts with ****** and ends with SY.
The regex I am using to remove these recurring parts is ^\*.*.*.SY. (the trailing dot is part of the pattern), which I hope is correct.
This block appears multiple times in the email data I am working with, and I need to remove every instance of it wherever it appears. For that I am using the following code:
import re

def clean_desc(text):
    text = re.sub(r"^\*.*.*.SY.","",text)
    return text
df['clean'] = df['raw'].apply(lambda x: clean_desc(x))
Not sure why this is not working. Where am I going wrong here?
CodePudding user response:
df['clean'] = df['raw'].str.replace(r'\*.*?SY', '', regex=True)
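The regexp matches from a * to the first SY after it. The ? after .* makes it non-greedy, so it won't match across multiple occurrences. Your original pattern fails for two reasons: the ^ anchor only lets it match at the very start of the string, and the greedy .* runs to the last SY, swallowing everything between the first and last disclaimer. Note that this pattern leaves the period after SY in place; use \*.*?SY\. if you want to remove that as well.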
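To see how the two patterns behave, here is a minimal sketch using only the re module. The sample text, the names ORIGINAL and LAZY, and the whitespace clean-up step are assumptions made for illustration; the lazy pattern is a slight variant of the one above that also consumes the period after SY.

```python
import re

# Hypothetical sample modeled on the question: a shortened disclaimer
# that appears twice, with real content before and between the copies.
BANNER = "*" * 80
DISCLAIMER = BANNER + " This e-mail and any files transmitted with it, are confidential SY."
raw = "Hello team. " + DISCLAIMER + " See the attached report. " + DISCLAIMER

ORIGINAL = re.compile(r"^\*.*.*.SY.")  # pattern from the question (anchored, greedy)
LAZY = re.compile(r"\*+.*?SY\.")       # unanchored, non-greedy, escaped final dot


def clean_desc(text):
    """Remove every disclaimer block, then collapse leftover whitespace."""
    text = LAZY.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


# The anchored pattern does nothing when the text does not start with '*'.
print(ORIGINAL.sub("", raw) == raw)

# When the text does start with '*', the greedy pattern runs to the last "SY."
# and removes the content between the two disclaimers as well.
print(repr(ORIGINAL.sub("", DISCLAIMER + " keep me " + DISCLAIMER)))

# The lazy pattern removes each disclaimer separately and keeps the rest.
print(clean_desc(raw))
```

With a DataFrame like the one in the question, the same function could be applied as df['raw'].apply(clean_desc). If a disclaimer can span several lines, passing flags=re.DOTALL when compiling LAZY would let . match newlines too.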