Home > Mobile >  Filter a dataframe's rows if a list of words or phrases are not included in the column even if
Filter a dataframe's rows if a list of words or phrases are not included in the column even if

Time:04-09

I understand that to filter a dataframe if a word is not in a dataframe, you employ the below:

df[~df['Job Name'].isin(remover_rows)]

The issue is in my case, the dataframe may contain multiple words like so:

 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Finance and Office Administrator',
 'Office/Accounts Administrator',
 'Office Administrator',
 'Temporary Office Administrator',
 'Accounts and Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Admin - Customer Support',
 'Office Administrator',
 'Office Administrator - London - Spanish Speaking'

The above solution only works if the specific words are indicated in the list of words to be excluded so for example, in the list above, if "Office Administrator" is in the remover_rows variable, then those rows will not show. But what if I want to remove rows in a Job Name column that contains ['Finance', 'Spanish'] even if it contains other words too? For example, Id be expecting rows containing 'Finance and Office Administrator' and 'Office Administrator - London - Spanish Speaking' in the Job Name column to not show up.

CodePudding user response:

You can use Series.str.contains withe regular expression:

import re

words = ["Finance", "Spanish"]

x = df["Job Name"].str.contains(
    "|".join(map(re.escape, words)), flags=re.IGNORECASE
)
print(df[~x])

Prints:

                             Job Name
0                Office Administrator
1                Office Administrator
2                Office Administrator
4       Office/Accounts Administrator
5                Office Administrator
6      Temporary Office Administrator
7   Accounts and Office Administrator
8                Office Administrator
9                Office Administrator
10               Office Administrator
11               Office Administrator
12    Office Admin - Customer Support
13               Office Administrator

Or without re:

words = ["Finance", "Spanish"]

x = df["Job Name"].apply(lambda x: any(w in x for w in words))
print(df[~x])
  • Related