I have a data frame with a number of columns. One of the columns ("BLOCKED_YES") contains country-code strings, e.g. 'CA CH HK EU UK DE'. The country codes can be in any order.
I want to isolate all rows which contain ALL of the elements of a ban list. For example if the ban list = ['CA','US'], I want to capture all the df rows with both these two elements present (regardless of order) in the BLOCKED_YES column.
I currently have the following
df = df[['STATUS','BLOCKED_YES','NAME']]
ban_list= ['US','CA']
df2 = df[df['BLOCKED_YES'].isin(ban_list)]
In the above case I get a df where the BLOCKED_YES column contains strings with ANY of the elements in the ban_list. I want rows which contain ALL of the elements in the list, not just one.
CodePudding user response:
I think you're best off using sets for this instead of lists, as sets are conceptually unordered to begin with.
Let's assume your raw data is:
import pandas as pd
df = pd.DataFrame([
(1, 'CA CH HK EU UK DE'),
(2, 'CA CH HK EU UK DE US')
], columns=['id', 'countries'])
And your ban list is actually a set instead of a list:
ban_list = {'US', 'CA'}
You can turn every cell content into a set as well like so:
df['countries'].str.split(' ').apply(set)
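To see what that intermediate step produces (using the sample frame from above), a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([
    (1, 'CA CH HK EU UK DE'),
    (2, 'CA CH HK EU UK DE US')
], columns=['id', 'countries'])

# Split each space-separated string into a list, then turn each list into a set
sets = df['countries'].str.split(' ').apply(set)
print(sets.iloc[0])  # a set of the six codes; display order will vary
```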
and then use set.issubset to check whether all elements are present (i.e. whether the ban list is a subset of each entry). Chained together with apply, it can be very concise:
df[df['countries'].str.split(' ').apply(set).apply(ban_list.issubset)]
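Put back in terms of the question's own columns (sample STATUS/BLOCKED_YES/NAME rows below are made up for illustration), the whole thing looks like this. Note that issubset accepts any iterable, so you can even skip the intermediate apply(set):

```python
import pandas as pd

df = pd.DataFrame({
    'STATUS': ['open', 'open'],
    'BLOCKED_YES': ['CA CH HK EU UK DE', 'US DE CA'],
    'NAME': ['alpha', 'beta'],
})
ban_list = {'US', 'CA'}

# Keep only rows whose BLOCKED_YES contains every code in ban_list
df2 = df[df['BLOCKED_YES'].str.split(' ').apply(ban_list.issubset)]
print(df2)  # only the 'US DE CA' row survives
```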