Home > Blockchain >  How to filter a dataframe with a list of tuples
How to filter a dataframe with a list of tuples

Time:11-01

I have a list of tuples like this one (PRODUCT_ID, COUNTRY_CODE):

[(1111, 'CO'),
 (2222, 'CO'),
 (1111, 'BR')]

and a dataframe like this one:

df = pd.DataFrame({
'COUNTRY_CODE': ['CO','CO','CO','BR','BR','BR','CO'], 
'VERTICAL_GROUP_ID': [2,2,3,2,3,3,3], 
'SUB_VERTICAL': ['SUPER','SUPER','HOME','LICOR','SPORTS','HOME','TECH'], 
'PRODUCT_ID': [1111,3333,1111,4444,1111,2222,2222], 
'SHOWN': [7,8,12,14,16,1,13], 
})

How can I filter the dataframe so that I get a resulting dataframe like this, filtered with only the values from the list of tuples by PRODUCT_ID and COUNTRY_CODE?

COUNTRY_CODE VERTICAL_GROUP_ID SUB_VERTICAL PRODUCT_ID SHOWN
CO 2 SUPER 1111 7
CO 3 HOME 1111 12
BR 3 SPORTS 1111 16
CO 3 TECH 2222 13

CodePudding user response:

Lets use Multiindex.isin

df[df.set_index(['PRODUCT_ID', 'COUNTRY_CODE']).index.isin(tup)]

Alternatively you can create a new dataframe from tuples then inner merge with the given dataframe

filters = pd.DataFrame(tup, columns=['PRODUCT_ID', 'COUNTRY_CODE'])
df.merge(filters)

  COUNTRY_CODE  VERTICAL_GROUP_ID SUB_VERTICAL  PRODUCT_ID  SHOWN
0           CO                  2        SUPER        1111      7
1           CO                  3         HOME        1111     12
2           BR                  3       SPORTS        1111     16
3           CO                  3         TECH        2222     13

CodePudding user response:

You can zip two columns : PRODUCT_ID and COUNTRY_CODE and check whether the values are in list of tuples or not and return desired df.

lst = [(1111, 'CO'),(2222, 'CO'),(1111, 'BR')]
m = [tpl in lst for tpl in zip(df['PRODUCT_ID'], df['COUNTRY_CODE'])]
# m -> [True, False, True, False, True, False, True]

df_new = df[m]
print(df_new)

Output:

  COUNTRY_CODE  VERTICAL_GROUP_ID SUB_VERTICAL  PRODUCT_ID  SHOWN
0           CO                  2        SUPER        1111      7
2           CO                  3         HOME        1111     12
4           BR                  3       SPORTS        1111     16
6           CO                  3         TECH        2222     13

Explanation:

>>> tuple(zip(df['PRODUCT_ID'], df['COUNTRY_CODE']))
((1111, 'CO'), # True  : in [(1111, 'CO'),(2222, 'CO'),(1111, 'BR')]
 (3333, 'CO'), # False : Not in [(1111, 'CO'),(2222, 'CO'),(1111, 'BR')]
 (1111, 'CO'),
 (4444, 'BR'),
 (1111, 'BR'),
 (2222, 'BR'),
 (2222, 'CO'))
  • Related