Combine two columns into one dictionary column after a groupby operation

Time:09-19

I have a data frame df, shown below:

cust_id      products              rec_product          conf      sup
1            ['phone', 'tv']       ball                 0.68      0.12
1            ['phone', 'tv']       bat                  0.21      0.34
1            ['phone', 'tv']       book                 0.02      0.25
2            ['bat']               ball                 0.97      0.18 
2            ['bat']               book                 0.65      0.65
2            ['bat']               phone                0.23      0.36
2            ['bat']               tv                   0.03      0.48

I want to combine the rec_product and conf columns into a dictionary after a groupby, and do the same for rec_product and sup.

Expected output:

cust_id      products             prod_conf                                            prod_sup    
1            ['phone', 'tv']      {'ball':0.68, 'bat':0.21, 'book':0.02}               {'ball':0.12, 'bat':0.34, 'book':0.25}
2            ['bat']              {'ball':0.97, 'book':0.65, 'phone':0.23, 'tv':0.03}  {'ball':0.18, 'book':0.65, 'phone':0.36, 'tv':0.48}

I tried the code below and it works, but I would like to know whether there is a method that runs faster and uses less memory.

Combine rec_product and conf into one column

prod_conf_df = (
    df.sort_values(['cust_id', 'conf'], ascending=[True, False])
      .set_index('rec_product')
      .groupby(['cust_id', 'products'])
      .apply(lambda x: x['conf'].to_dict())
      .reset_index(name='prod_conf')
)

Combine rec_product and sup into one column

prod_sup_df = (
    df.sort_values(['cust_id', 'conf'], ascending=[True, False])
      .set_index('rec_product')
      .groupby(['cust_id'])
      .apply(lambda x: x['sup'].to_dict())
      .reset_index(name='prod_sup')
)

Combine both of the above data frames into one

combined_df = pd.merge(prod_conf_df, prod_sup_df, on='cust_id', how='inner')
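
For reference, the three steps above can be put together as a self-contained sketch. It assumes the products column holds plain strings (the question does not say whether they are strings or Python lists), and it sorts and indexes the frame once instead of twice; the name indexed is only illustrative.

```python
import pandas as pd

# Sample data from the question; 'products' is stored as strings here (an assumption).
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 2, 2, 2, 2],
    'products': ["['phone', 'tv']"] * 3 + ["['bat']"] * 4,
    'rec_product': ['ball', 'bat', 'book', 'ball', 'book', 'phone', 'tv'],
    'conf': [0.68, 0.21, 0.02, 0.97, 0.65, 0.23, 0.03],
    'sup': [0.12, 0.34, 0.25, 0.18, 0.65, 0.36, 0.48],
})

# Sort and set the index once, then reuse the result for both dictionaries.
indexed = (
    df.sort_values(['cust_id', 'conf'], ascending=[True, False])
      .set_index('rec_product')
)

# One dictionary per group: rec_product -> conf
prod_conf_df = (
    indexed.groupby(['cust_id', 'products'])[['conf']]
           .apply(lambda x: x['conf'].to_dict())
           .reset_index(name='prod_conf')
)

# One dictionary per group: rec_product -> sup
prod_sup_df = (
    indexed.groupby('cust_id')[['sup']]
           .apply(lambda x: x['sup'].to_dict())
           .reset_index(name='prod_sup')
)

combined_df = pd.merge(prod_conf_df, prod_sup_df, on='cust_id', how='inner')
print(combined_df)
```

This still runs two groupby passes and a merge, so it only removes the duplicated sort.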

CodePudding user response:

Instead of calling groupby and apply several times, I would suggest doing all the aggregations with a single groupby inside a comprehension.

def dictify(k, g):
    return {
        'cust_id': k,
        'products' : g['products'].iat[0],
        'prod_conf': dict(zip(g['rec_product'], g['conf'])),
        'prod_sup' : dict(zip(g['rec_product'], g['sup']))
    }    

s = df.sort_values(['cust_id', 'conf'], ascending=[True, False])
s = pd.DataFrame(dictify(k, g) for k, g in s.groupby('cust_id', sort=False))

Result

   cust_id         products                                                prod_conf                                                 prod_sup
0        1  ['phone', 'tv']                {'ball': 0.68, 'bat': 0.21, 'book': 0.02}                {'ball': 0.12, 'bat': 0.34, 'book': 0.25}
1        2          ['bat']  {'ball': 0.97, 'book': 0.65, 'phone': 0.23, 'tv': 0.03}  {'ball': 0.18, 'book': 0.65, 'phone': 0.36, 'tv': 0.48}
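
To try this end to end, here is a minimal sketch that rebuilds the sample data and runs the same single-groupby idea. It assumes products holds real Python lists; since lists are unhashable, grouping on that column would generally fail, whereas grouping on cust_id alone and taking the first products value per group sidesteps the issue. The name result and the list comprehension (in place of the generator) are choices made for this sketch.

```python
import pandas as pd

# Sample data from the question; 'products' holds Python lists here (an assumption).
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 2, 2, 2, 2],
    'products': [['phone', 'tv']] * 3 + [['bat']] * 4,
    'rec_product': ['ball', 'bat', 'book', 'ball', 'book', 'phone', 'tv'],
    'conf': [0.68, 0.21, 0.02, 0.97, 0.65, 0.23, 0.03],
    'sup': [0.12, 0.34, 0.25, 0.18, 0.65, 0.36, 0.48],
})

def dictify(k, g):
    # Build one output row from a single cust_id group.
    return {
        'cust_id': k,
        'products': g['products'].iat[0],  # same value on every row of the group
        'prod_conf': dict(zip(g['rec_product'], g['conf'])),
        'prod_sup': dict(zip(g['rec_product'], g['sup'])),
    }

# Sort once so each dictionary is ordered by descending conf within a customer.
s = df.sort_values(['cust_id', 'conf'], ascending=[True, False])

# sort=False keeps the group order produced by the sort above.
result = pd.DataFrame([dictify(k, g) for k, g in s.groupby('cust_id', sort=False)])
print(result)
```

Whether this is actually faster than the two-groupby version depends on the data, so it is worth timing both on a realistic sample.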