I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.
I am currently trying
a = df['textcol']
import re
df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)
this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it
CodePudding user response:
You need to use Series.str.replace
directly:
df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)
Here, (?s)
stands for re.DOTALL
/ re.S
in order not to have to import re
, it is their inline modifier version. The .*?
matches any zero or more chars, as few as possible.
If Original
and Subject
need to be passed as variables containing literal text, do not forget about re.escape
:
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)