I have a data frame with a number of columns. One of the columns ("BLOCKED_YES") contains country-code strings, e.g. 'CA CH HK EU UK DE'. The country codes can be in any order.
I want to isolate all rows which contain ALL of the elements of a ban list. For example if the ban list = ['CA','US'], I want to capture all the df rows with both these two elements present (regardless of order) in the BLOCKED_YES column.
I currently have the following
df = df[['STATUS','BLOCKED_YES','NAME']]
ban_list= ['US','CA']
df2 = df[df['BLOCKED_YES'].isin(ban_list)]
In the above case I get a df where the BLOCKED_YES column contains strings with ANY of the elements in the ban_list. I want rows which contain ALL of the elements in the list, not just one.
CodePudding user response:
I think you're best off using sets for this instead of lists, as sets are conceptually unordered to begin with.
Let's assume your raw data is:
import pandas as pd
df = pd.DataFrame([
(1, 'CA CH HK EU UK DE'),
(2, 'CA CH HK EU UK DE US')
], columns=['id', 'countries'])
And your ban list is actually a set instead of a list:
ban_list = {'US', 'CA'}
You can turn every cell content into a set as well like so:
df['countries'].str.split(' ').apply(set)
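To see what that intermediate step produces (using the sample frame from above), a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([
    (1, 'CA CH HK EU UK DE'),
    (2, 'CA CH HK EU UK DE US')
], columns=['id', 'countries'])

# Split each space-separated string into a list, then turn each list into a set
sets = df['countries'].str.split(' ').apply(set)
print(sets.iloc[0])  # a set of the six codes; display order will vary
```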
and then use set.issubset to check whether all elements are present (i.e. whether the ban list is a subset of each entry). Chained together with apply, it can be very concise:
df[df['countries'].str.split(' ').apply(set).apply(ban_list.issubset)]
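Put back in terms of the question's own columns (sample STATUS/BLOCKED_YES/NAME rows below are made up for illustration), the whole thing looks like this. Note that issubset accepts any iterable, so you can even skip the intermediate apply(set):

```python
import pandas as pd

df = pd.DataFrame({
    'STATUS': ['open', 'open'],
    'BLOCKED_YES': ['CA CH HK EU UK DE', 'US DE CA'],
    'NAME': ['alpha', 'beta'],
})
ban_list = {'US', 'CA'}

# Keep only rows whose BLOCKED_YES contains every code in ban_list
df2 = df[df['BLOCKED_YES'].str.split(' ').apply(ban_list.issubset)]
print(df2)  # only the 'US DE CA' row survives
```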