Home > database >  dataframe group, sum and concatenate
dataframe group, sum and concatenate

Time:11-25

I have a dataframe dfsorted :

dfsorted = df.sort_values(["sku"], ascending=[True])
print(dfsorted.head())
id sku bill qty_left
186 01-04 50469 0
16 01-20 50262 15
267 01-20 50460 1
18 01-20 50262 5
17 01-20 50262 5

How can I group / aggregate the dfsorted into this desired result:

sku bill qty_left
01-04 50469 0
01-20 50262, 50460 26

So :

  • group the dataframe by 'sku'
  • for each 'sku', concatenate the 'bill' values (these are already formatted as strings, I don't care if there are duplicates but unique values would be nice too)
  • for each 'sku', sum the 'qty_left' values.

Thanks!

CodePudding user response:

Use agg, where you can apply both custom (lambda) functions as standard (such as sum) functions:

df.groupby('sku').agg({'bill': lambda x: set(x), 'qty_left':'sum'})

set makes sure they are unique values, using list makes them just concatenated.

result:

        bill            qty_left
sku     
01-04   {50469}         0
01-20   {50460, 50262}  26

If you want a string instead of a set for bill you can use:

df2.bill.apply(lambda s: ', '.join(list(map(str, s))))

Where df2 is the result of the groupby.agg function above.

CodePudding user response:

Use GroupBy.agg with lambda function for remove duplicates in original ordering:

df1 = (df.groupby('sku', as_index=False)
         .agg({'bill': lambda x:','.join(dict.fromkeys(x)), 
               'qty_left':'sum'}))
print (df1)
     sku         bill  qty_left
0  01-04        50469         0
1  01-20  50262,50460        26

If bfill column are strings use:

df1 = (df.astype({'bill':str})
         .groupby('sku', as_index=False)
         .agg({'bill': lambda x:','.join(dict.fromkeys(x)), 
               'qty_left':'sum'}))
print (df1)
     sku         bill  qty_left
0  01-04        50469         0
1  01-20  50262,50460        26
  • Related