pandas merge using list columns and contains operation-CodePudding

I have two dataframes that are given below

multi_df = pd.DataFrame({'multi_project_ID': ["Combo_1","Combo_2","Combo_3","Combo_4"],
               'multi_items':[['Chips','Biscuits','Chocolates'],['Alcoholic Drinks','Juices','Fruits'],['Plants','Veggies','Chips'],['Cars']],
                'multi_labels':[[1,2,3],[4,5,6],[8,9,10],[11]]})


single_df = pd.DataFrame({'single_project_ID': ["ABC_1","DEF_2","JKL_3","MNO_3"],
                      'single_items':[['Chips'],['Alcoholic Drinks'],['Biscuits'],['Smoking']],
                      'single_labels':[[1],[4],[8],[9]]})

I would like to do the below

a) Check whether items of single_items list is present under the items of multi_items list.

b) If yes, then extract the multi_labels and multi_project_id for the corresponding matching item

c) If no item is present/matching, then put NA

So, I tried the below but it doesn't work. I don't know how and where to start.

print(single_df.groupby('single_labels').sum()['single_items'].apply(lambda x: list(set(x))).reset_index())

I expect my output to be like as below

CodePudding user response：

Create helper column tmp and use DataFrame.explode for remove lists, then aggregate lists with custom function - only for repeated values (length is greater like 1):

f = lambda x: list(x) if len(x) > 1 else x
df = multi_df.assign(tmp=multi_df['multi_items']).explode('tmp').groupby('tmp').agg(f)

print (df)

                    multi_project_ID  \
tmp                                    
Alcoholic Drinks             Combo_2   
Biscuits                     Combo_1   
Cars                         Combo_4   
Chips             [Combo_1, Combo_3]   
Chocolates                   Combo_1   
Fruits                       Combo_2   
Juices                       Combo_2   
Plants                       Combo_3   
Veggies                      Combo_3   

                                                        multi_items  \
tmp                                                                   
Alcoholic Drinks                 [Alcoholic Drinks, Juices, Fruits]   
Biscuits                              [Chips, Biscuits, Chocolates]   
Cars                                                         [Cars]   
Chips             [[Chips, Biscuits, Chocolates], [Plants, Veggi...   
Chocolates                            [Chips, Biscuits, Chocolates]   
Fruits                           [Alcoholic Drinks, Juices, Fruits]   
Juices                           [Alcoholic Drinks, Juices, Fruits]   
Plants                                     [Plants, Veggies, Chips]   
Veggies                                    [Plants, Veggies, Chips]   

                             multi_labels  
tmp                                        
Alcoholic Drinks                [4, 5, 6]  
Biscuits                        [1, 2, 3]  
Cars                                 [11]  
Chips             [[1, 2, 3], [8, 9, 10]]  
Chocolates                      [1, 2, 3]  
Fruits                          [4, 5, 6]  
Juices                          [4, 5, 6]  
Plants                         [8, 9, 10]  
Veggies                        [8, 9, 10]

Then convert single_items to scalars by str[0] and assign to index, so possible use DataFrame.join:

out = single_df.set_index(single_df['single_items'].str[0]).join(df).reset_index(drop=True)
print (out)
  single_project_ID        single_items single_labels    multi_project_ID  \
0             ABC_1             [Chips]           [1]  [Combo_1, Combo_3]   
1             DEF_2  [Alcoholic Drinks]           [4]             Combo_2   
2             JKL_3          [Biscuits]           [8]             Combo_1   
3             MNO_3           [Smoking]           [9]                 NaN   

                                         multi_items             multi_labels  
0  [[Chips, Biscuits, Chocolates], [Plants, Veggi...  [[1, 2, 3], [8, 9, 10]]  
1                 [Alcoholic Drinks, Juices, Fruits]                [4, 5, 6]  
2                      [Chips, Biscuits, Chocolates]                [1, 2, 3]  
3                                                NaN                      NaN