I understand that to filter a dataframe if a word is not in a dataframe, you employ the below:
df[~df['Job Name'].isin(remover_rows)]
The issue is in my case, the dataframe may contain multiple words like so:
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Finance and Office Administrator',
'Office/Accounts Administrator',
'Office Administrator',
'Temporary Office Administrator',
'Accounts and Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Admin - Customer Support',
'Office Administrator',
'Office Administrator - London - Spanish Speaking'
The above solution only works if the specific words are indicated in the list of words to be excluded so for example, in the list above, if "Office Administrator" is in the remover_rows
variable, then those rows will not show. But what if I want to remove rows in a Job Name
column that contains ['Finance', 'Spanish']
even if it contains other words too? For example, Id be expecting rows containing 'Finance and Office Administrator' and 'Office Administrator - London - Spanish Speaking'
in the Job Name
column to not show up.
CodePudding user response:
You can use Series.str.contains
withe regular expression:
import re
words = ["Finance", "Spanish"]
x = df["Job Name"].str.contains(
"|".join(map(re.escape, words)), flags=re.IGNORECASE
)
print(df[~x])
Prints:
Job Name
0 Office Administrator
1 Office Administrator
2 Office Administrator
4 Office/Accounts Administrator
5 Office Administrator
6 Temporary Office Administrator
7 Accounts and Office Administrator
8 Office Administrator
9 Office Administrator
10 Office Administrator
11 Office Administrator
12 Office Admin - Customer Support
13 Office Administrator
Or without re
:
words = ["Finance", "Spanish"]
x = df["Job Name"].apply(lambda x: any(w in x for w in words))
print(df[~x])