I have two dataframes that look like this:
Dataframe 1:
antecedents consequents
0 (20679) (15056BL)
1 (20675) (20676)
2 (20675) (20677)
3 (20723) (20724)
4 (22356) (20724)
... ... ...
178 (22355, 20724, 22356) (20719)
179 (20724, 22356, 20719) (22355)
180 (21212, 84991, 84992) (21977)
181 (21212, 21977, 84992) (84991)
182 (84991, 21977, 84992) (21212)
Dataframe 2:
Invoice Customer ID StockCode
0 489434 13085.0 [85048, 79323P, 79323W, 22041, 21232, 22064, 2...
1 489435 13085.0 [22350, 22349, 22195, 22353]
2 489436 13078.0 [48173C, 21755, 21754, 84879, 22119, 22142, 22...
3 489437 15362.0 [22143, 22145, 22130, 21364, 21360, 21351, 213...
4 489438 18102.0 [21329, 21252, 21100, 21033, 20711, 21410, 214...
DataFrame 1 sample:
pd.DataFrame( {'antecedents' : [('20679'), ('85048'), ('22143'), ('22065','22138'), ('20754','21035','22041')],
'consequents' : [('20676'), ('20719'), ('22355'), ('20724'), ('212212')]
})
DataFrame 2 sample:
pd.DataFrame( {'Customer ID' : [13085, 13078, 15362, 18102, 12682, 18087, 18087, 13635, 14110],
'StockCode' : [
['85048', '79323P', '79323W', '22041', '21232', '22064'], ['22350', '22349', '22195', '22353'], ['48173C', '21755', '21754', '84879', '22119', '22142'],
['22143', '22145', '22130', '21364', '21360', '21351'], ['21329', '21252', '21100', '21033', '20711', '21410'], ['22065', '22138', '22139', '22352', '85014A', '85014B'],
['22321', '22138', '84029E', '22111'], ['21955', '22111', '22296', '84899E', '22271', '22272'],['20754', '21035', '22041', '82001S', '82580', '85150']
]
})
I need to join them on the following condition: all elements in antecedents (df1) are present in StockCode (df2).
So if customerId 13085 has products 29679, 20675 and 20723, consequents 15056BL, 20676, 20677 and 20724 should show for them.
I am able to reach the result with a for loop after I convert antecedents to a list of lists (ant_list):
for list2 in ant_list:
for item, row in df_grouped.iterrows():
col_name = str(list2)
if all(elem in row['StockCode'] for elem in list2):
df_grouped.loc[item,col_name] = 1
To get something like this:
Invoice | Customer ID | StockCode | ['20971'] | ['22064'] | ['79323P'] | ['84032A'] |
---|---|---|---|---|---|---|
489434 | 13085.0. | [85048,... | NaN. | 1.0. | 1.0. | NaN. |
However I need to find a vectorized way, or any other better way, so that I can escalate the solution to bigger datasets and I'm stuck.
CodePudding user response:
try:
df = df1.explode('antecedents').merge(df2.explode('StockCode'), right_on='StockCode', left_on='antecedents', how='left')
df
antecedents consequents Customer ID StockCode
0 20679 20676 NaN NaN
1 85048 20719 13085.0 85048
2 22143 22355 18102.0 22143
3 22065 20724 18087.0 22065
4 22138 20724 18087.0 22138
5 22138 20724 18087.0 22138
6 20754 212212 14110.0 20754
7 21035 212212 14110.0 21035
8 22041 212212 13085.0 22041
9 22041 212212 14110.0 22041
a = df.groupby('Customer ID')['StockCode'].apply(list).reset_index()
b= df.groupby('Customer ID')['consequents'].apply(list).reset_index()
c= df.groupby('Customer ID')['antecedents'].apply(list).reset_index()
a.merge(b).merge(c)
Customer ID StockCode consequents antecedents
0 13085.0 [85048, 22041] [20719, 212212] [85048, 22041]
1 14110.0 [20754, 21035, 22041] [212212, 212212, 212212] [20754, 21035, 22041]
2 18087.0 [22065, 22138, 22138] [20724, 20724, 20724] [22065, 22138, 22138]
3 18102.0 [22143] [22355] [22143]