Remove substring only if followed by space or nothing (i.e. end of string)

Let's say I want to remove the word "tree" in every string in a Pandas dataframe column.

I would specify the substring(s) I want removed in a list. And then use replace and join on the column, as per below:

remove_list = ['\tree\s']

df['column'] = df['column'].str.replace('|'.join(remove_list ), '', regex=True).str.strip()

The reason I add a \s to tree is because there may be words like treehouse or backstreet. So I want to replace the word only if it ends with a space, so that I don't end up with words like "house" or "backst".

However I noticed that when I run this code, it misses "tree"s that are at the end of the string, because there is no space after it. Hence, it doesn't get removed. Any idea on how I can account for those?

CodePudding user response：

Actually, I think the logic you want here is:

remove_list = ['tree']
terms = r'\s*\b(?:'   '|'.join(remove_list)   r')\b\s*'

df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()

Note that the regex pattern used above is, for a one word term list, \s*\b(?:tree)\b\s*. This will match only the exact word tree and not when tree appears as a substring of another word. We also attempt to grab any spaces on either side of the word. Then, we replace with just a single space, and trim the column to make sure there are no stray spaces at the start or end.

Edit:

To address the edge case put forth by @user2357112, consider the following input:

apple tree tree squirrel

In this case, the above solution would leave behind two spaces in between apple and squirrel. We can get around this by expanding our regex pattern to allow for multiple consecutive keyword matches:

terms = r'\s*\b(?:'   '|'.join(remove_list)   r')\b(?: \b(?:'   '|'.join(remove_list)   r'))*\b\s*'
df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()

Here we are using the following regex pattern:

\s*\b(?:tree)\b(?: \b(?:tree))*\b\s*