I have a dataframe `dfsorted`:
dfsorted = df.sort_values(["sku"], ascending=[True])
print(dfsorted.head())
| id  | sku   | bill  | qty_left |
|-----|-------|-------|----------|
| 186 | 01-04 | 50469 | 0        |
| 16  | 01-20 | 50262 | 15       |
| 267 | 01-20 | 50460 | 1        |
| 18  | 01-20 | 50262 | 5        |
| 17  | 01-20 | 50262 | 5        |
How can I group / aggregate the dfsorted into this desired result:
| sku   | bill         | qty_left |
|-------|--------------|----------|
| 01-04 | 50469        | 0        |
| 01-20 | 50262, 50460 | 26       |
So:
- group the dataframe by 'sku'
- for each 'sku', concatenate the 'bill' values (they are already strings; I don't mind duplicates, but unique values would be even nicer)
- for each 'sku', sum the 'qty_left' values.
Thanks!
CodePudding user response:
Use `agg`, where you can apply both custom (`lambda`) functions and standard functions (such as `sum`):
df.groupby('sku').agg({'bill': lambda x: set(x), 'qty_left':'sum'})
`set` ensures the values are unique; using `list` instead keeps them all, duplicates included.
Result:
bill qty_left
sku
01-04 {50469} 0
01-20 {50460, 50262} 26
If you want a string instead of a set for `bill`, you can use:
df2.bill.apply(lambda s: ', '.join(list(map(str, s))))
where `df2` is the result of the `groupby.agg` call above.
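Putting the two steps above together, a minimal runnable sketch (the dataframe is rebuilt from the question's sample rows; `sorted` is added so the joined string is deterministic, since sets have no defined order):

```python
import pandas as pd

df = pd.DataFrame({
    "sku": ["01-04", "01-20", "01-20", "01-20", "01-20"],
    "bill": ["50469", "50262", "50460", "50262", "50262"],
    "qty_left": [0, 15, 1, 5, 5],
})

# Aggregate: collect unique bills into a set, sum qty_left
df2 = df.groupby("sku").agg({"bill": lambda x: set(x), "qty_left": "sum"})

# Turn each set of bills into a comma-separated string
df2["bill"] = df2["bill"].apply(lambda s: ", ".join(sorted(map(str, s))))
print(df2)
```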
CodePudding user response:
Use `GroupBy.agg` with a lambda function to remove duplicates while preserving the original ordering:
df1 = (df.groupby('sku', as_index=False)
.agg({'bill': lambda x:','.join(dict.fromkeys(x)),
'qty_left':'sum'}))
print (df1)
sku bill qty_left
0 01-04 50469 0
1 01-20 50262,50460 26
If the `bill` values are not strings (e.g. numbers), cast them first:
df1 = (df.astype({'bill':str})
.groupby('sku', as_index=False)
.agg({'bill': lambda x:','.join(dict.fromkeys(x)),
'qty_left':'sum'}))
print (df1)
sku bill qty_left
0 01-04 50469 0
1 01-20 50262,50460 26