How to remove text between two specific words in a dataframe column by python-CodePudding

I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.

I am currently trying

   a = df['textcol']
   import re
   df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)

this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it

CodePudding user response：

You need to use Series.str.replace directly:

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) stands for re.DOTALL / re.S in order not to have to import re, it is their inline modifier version. The .*? matches any zero or more chars, as few as possible.

If Original and Subject need to be passed as variables containing literal text, do not forget about re.escape:

start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)