Ok so this is more of a question about how to properly use the groupby method, since I am struggling to work with the DataFrameGroupBy object itself. Basically I have a big DataFrame with the following structure:
DATE | PRODUCT | PRICE | CAPACITY |
---|---|---|---|
01.07.2022 | NEG_00_04 | 3,7 | 7 |
01.07.2022 | NEG_00_04 | 1,7 | 3 |
01.07.2022 | NEG_00_04 | 2,4 | 5 |
01.07.2022 | NEG_00_04 | 2,2 | 7 |
01.07.2022 | POS_00_04 | 3,7 | 2 |
01.07.2022 | POS_00_04 | 3,2 | 5 |
01.07.2022 | POS_00_04 | 1,5 | 2 |
01.07.2022 | POS_00_04 | 2,4 | 3 |
My goal is to group by the 'DATE' and 'PRODUCT' columns and get a cumulative capacity in ascending price order. So basically the order of operations is: group by the two columns, then sort each group by the 'PRICE' column and calculate the cumulative capacity. The end result, based on the sample table, should look like this:
DATE | PRODUCT | PRICE | CAPACITY | CUMULATIVE |
---|---|---|---|---|
01.07.2022 | NEG_00_04 | 1,7 | 3 | 3 |
01.07.2022 | NEG_00_04 | 2,2 | 7 | 10 |
01.07.2022 | NEG_00_04 | 2,4 | 5 | 15 |
01.07.2022 | NEG_00_04 | 3,7 | 7 | 22 |
01.07.2022 | POS_00_04 | 1,5 | 2 | 2 |
01.07.2022 | POS_00_04 | 2,4 | 3 | 5 |
01.07.2022 | POS_00_04 | 3,2 | 5 | 10 |
01.07.2022 | POS_00_04 | 3,7 | 2 | 12 |
I already have a solution that works, but I was wondering if there isn't a better way to work with DataFrameGroupBy objects, since I always just iterate through them with a for loop and it just doesn't seem right. This is how I did it:
df_result = pd.DataFrame()
for i, group in df.groupby(by=['DATE', 'PRODUCT']):
    group.sort_values('PRICE', inplace=True)
    group['CUMULATIVE'] = group['CAPACITY'].cumsum()
    df_result = pd.concat([df_result, group], ignore_index=True)
I would appreciate any suggestions for improvement :)
CodePudding user response:
Use:
df = df.sort_values('PRICE')
df['CUMULATIVE'] = df.groupby(by=['DATE', 'PRODUCT'])['CAPACITY'].cumsum()
Or, to also keep the rows of each group together in the output:
df = df.sort_values(['PRICE','DATE', 'PRODUCT'])
df['CUMULATIVE'] = df.groupby(by=['DATE', 'PRODUCT'])['CAPACITY'].cumsum()
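For reference, here is a self-contained sketch of the second approach, with sample data reconstructed from the table in the question (note the PRICE values use decimal points here; if your source data uses comma decimals like "3,7", convert them to floats first, e.g. with str.replace(',', '.').astype(float)):

```python
import pandas as pd

# Sample data mirroring the question's table
df = pd.DataFrame({
    'DATE': ['01.07.2022'] * 8,
    'PRODUCT': ['NEG_00_04'] * 4 + ['POS_00_04'] * 4,
    'PRICE': [3.7, 1.7, 2.4, 2.2, 3.7, 3.2, 1.5, 2.4],
    'CAPACITY': [7, 3, 5, 7, 2, 5, 2, 3],
})

# Sort once up front, then let groupby compute the per-group running total;
# cumsum respects the row order of the sorted frame within each group
df = df.sort_values(['DATE', 'PRODUCT', 'PRICE'], ignore_index=True)
df['CUMULATIVE'] = df.groupby(['DATE', 'PRODUCT'])['CAPACITY'].cumsum()
print(df)
```

This avoids the explicit loop and the repeated pd.concat entirely: a single sort plus one vectorized groupby cumsum produces the CUMULATIVE column for all groups at once.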