Let's say I want to remove the word "tree" in every string in a Pandas dataframe column.
I would specify the substring(s) I want removed in a list. And then use replace
and join
on the column, as per below:
remove_list = ['\tree\s']
df['column'] = df['column'].str.replace('|'.join(remove_list ), '', regex=True).str.strip()
The reason I add a \s
to tree is because there may be words like treehouse or backstreet. So I want to replace the word only if it ends with a space, so that I don't end up with words like "house" or "backst".
However I noticed that when I run this code, it misses "tree"s that are at the end of the string, because there is no space after it. Hence, it doesn't get removed. Any idea on how I can account for those?
CodePudding user response:
Actually, I think the logic you want here is:
remove_list = ['tree']
terms = r'\s*\b(?:' '|'.join(remove_list) r')\b\s*'
df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()
Note that the regex pattern used above is, for a one word term list, \s*\b(?:tree)\b\s*
. This will match only the exact word tree
and not when tree
appears as a substring of another word. We also attempt to grab any spaces on either side of the word. Then, we replace with just a single space, and trim the column to make sure there are no stray spaces at the start or end.
Edit:
To address the edge case put forth by @user2357112, consider the following input:
apple tree tree squirrel
In this case, the above solution would leave behind two spaces in between apple
and squirrel
. We can get around this by expanding our regex pattern to allow for multiple consecutive keyword matches:
terms = r'\s*\b(?:' '|'.join(remove_list) r')\b(?: \b(?:' '|'.join(remove_list) r'))*\b\s*'
df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()
Here we are using the following regex pattern:
\s*\b(?:tree)\b(?: \b(?:tree))*\b\s*