Home > Net >  How to join dataframes on columns of lists with 'contains' conditions in python
How to join dataframes on columns of lists with 'contains' conditions in python

Time:09-21

I have two dataframes that look like this:

Dataframe 1:

antecedents consequents
0   (20679) (15056BL)
1   (20675) (20676)
2   (20675) (20677)
3   (20723) (20724)
4   (22356) (20724)
... ... ...
178 (22355, 20724, 22356)   (20719)
179 (20724, 22356, 20719)   (22355)
180 (21212, 84991, 84992)   (21977)
181 (21212, 21977, 84992)   (84991)
182 (84991, 21977, 84992)   (21212)

Dataframe 2:

Invoice Customer ID StockCode
0   489434  13085.0 [85048, 79323P, 79323W, 22041, 21232, 22064, 2...
1   489435  13085.0 [22350, 22349, 22195, 22353]
2   489436  13078.0 [48173C, 21755, 21754, 84879, 22119, 22142, 22...
3   489437  15362.0 [22143, 22145, 22130, 21364, 21360, 21351, 213...
4   489438  18102.0 [21329, 21252, 21100, 21033, 20711, 21410, 214...

DataFrame 1 sample:

pd.DataFrame( {'antecedents' : [('20679'), ('85048'), ('22143'), ('22065','22138'), ('20754','21035','22041')],
               'consequents' : [('20676'), ('20719'), ('22355'), ('20724'), ('212212')]
})

DataFrame 2 sample:

pd.DataFrame( {'Customer ID' : [13085, 13078, 15362, 18102, 12682, 18087, 18087, 13635, 14110],
               'StockCode' : [
                           ['85048', '79323P', '79323W', '22041', '21232', '22064'], ['22350', '22349', '22195', '22353'], ['48173C', '21755', '21754', '84879', '22119', '22142'],
                             ['22143', '22145', '22130', '21364', '21360', '21351'], ['21329', '21252', '21100', '21033', '20711', '21410'], ['22065', '22138', '22139', '22352', '85014A', '85014B'],
                              ['22321', '22138', '84029E', '22111'], ['21955', '22111', '22296', '84899E', '22271', '22272'],['20754', '21035', '22041', '82001S', '82580', '85150']
                             ]  
})

I need to join them on the following condition: all elements in antecedents (df1) are present in StockCode (df2).

So if customerId 13085 has products 29679, 20675 and 20723, consequents 15056BL, 20676, 20677 and 20724 should show for them.

I am able to reach the result with a for loop after I convert antecedents to a list of lists (ant_list):

for list2 in ant_list:
    for item, row in df_grouped.iterrows():
        col_name = str(list2)
        if all(elem in row['StockCode'] for elem in list2):
            df_grouped.loc[item,col_name] = 1

To get something like this:

Invoice Customer ID StockCode ['20971'] ['22064'] ['79323P'] ['84032A']
489434 13085.0. [85048,... NaN. 1.0. 1.0. NaN.

However I need to find a vectorized way, or any other better way, so that I can escalate the solution to bigger datasets and I'm stuck.

CodePudding user response:

try:

df = df1.explode('antecedents').merge(df2.explode('StockCode'), right_on='StockCode', left_on='antecedents', how='left')

df
    antecedents consequents Customer ID StockCode
0   20679       20676       NaN         NaN
1   85048       20719       13085.0     85048
2   22143       22355       18102.0     22143
3   22065       20724       18087.0     22065
4   22138       20724       18087.0     22138
5   22138       20724       18087.0     22138
6   20754       212212      14110.0     20754
7   21035       212212      14110.0     21035
8   22041       212212      13085.0     22041
9   22041       212212      14110.0     22041

a = df.groupby('Customer ID')['StockCode'].apply(list).reset_index()
b= df.groupby('Customer ID')['consequents'].apply(list).reset_index()
c= df.groupby('Customer ID')['antecedents'].apply(list).reset_index()
a.merge(b).merge(c)

    Customer ID StockCode               consequents                 antecedents
0   13085.0     [85048, 22041]          [20719, 212212]             [85048, 22041]
1   14110.0     [20754, 21035, 22041]   [212212, 212212, 212212]    [20754, 21035, 22041]
2   18087.0     [22065, 22138, 22138]   [20724, 20724, 20724]       [22065, 22138, 22138]
3   18102.0     [22143]                 [22355]                     [22143]
  • Related