my mind is blown so hoping you can help here:) I have data that I need to 'cumsum' by month and year and cannot figure out how to do it. My data looks like this - testdf:
month year power_kwh
0 1 1990 416491241.4
1 2 1990 343033068.2
2 3 1990 301451921.2
3 4 1990 275719811.1
4 5 1990 292287261.0
5 6 1990 227289410.6
6 7 1990 189335818.5
7 8 1990 115649657.7
8 9 1990 191111685.6
9 10 1990 319101975.4
10 11 1990 366723962.5
11 12 1990 392145037.3
12 1 1991 333036371.8
13 2 1991 325317995.1
14 3 1991 370866553.8
15 4 1991 300957433.0
16 5 1991 223408949.5
17 6 1991 165998110.1
18 7 1991 147475868.2
19 8 1991 158274746.0
20 9 1991 181327778.7
21 10 1991 284857327.5
22 11 1991 298116425.1
23 12 1991 363653602.6
I need my data to look like this below so that the column (power_kwh) is summed by month and each unique year.
month year power_kwh cumsum_power_kwh
0 1 1990 416491241.4 416491241
1 2 1990 343033068.2 759524310
2 3 1990 301451921.2 1060976231
3 4 1990 275719811.1 1336696042
4 5 1990 292287261 1628983303
5 6 1990 227289410.6 1856272714
6 7 1990 189335818.5 2045608532
7 8 1990 115649657.7 2161258190
8 9 1990 191111685.6 2352369875
9 10 1990 319101975.4 2671471851
10 11 1990 366723962.5 3038195813
11 12 1990 392145037.3 3430340851
12 1 1991 333036371.8 333036372
13 2 1991 325317995.1 4088695217
14 3 1991 370866553.8 4459561771
15 4 1991 300957433 4760519204
16 5 1991 223408949.5 4983928154
17 6 1991 165998110.1 5149926264
18 7 1991 147475868.2 5297402132
19 8 1991 158274746 5455676878
20 9 1991 181327778.7 5637004657
21 10 1991 284857327.5 5921861984
22 11 1991 298116425.1 6219978409
23 12 1991 363653602.6 6583632012
I have tried this code below and also using the apply and lambda function and I keep getting an answer but the cumsum is not being calculated as you can see here. thank you for suggestions!
testdf['results'] = testdf.groupby(['year','month'])['power_kwh'].cumsum()
This is what I keep getting -
month year power_kwh results
0 1 1990 416491241.4 416491241.4
1 2 1990 343033068.2 343033068.2
2 3 1990 301451921.2 301451921.2
3 4 1990 275719811.1 275719811.1
4 5 1990 292287261.0 292287261.0
5 6 1990 227289410.6 227289410.6
6 7 1990 189335818.5 189335818.5
7 8 1990 115649657.7 115649657.7
8 9 1990 191111685.6 191111685.6
9 10 1990 319101975.4 319101975.4
10 11 1990 366723962.5 366723962.5
11 12 1990 392145037.3 392145037.3
12 1 1991 333036371.8 333036371.8
13 2 1991 325317995.1 325317995.1
14 3 1991 370866553.8 370866553.8
15 4 1991 300957433.0 300957433.0
16 5 1991 223408949.5 223408949.5
17 6 1991 165998110.1 165998110.1
18 7 1991 147475868.2 147475868.2
19 8 1991 158274746.0 158274746.0
20 9 1991 181327778.7 181327778.7
21 10 1991 284857327.5 284857327.5
22 11 1991 298116425.1 298116425.1
23 12 1991 363653602.6 363653602.6
CodePudding user response:
Your expected output can be computed by grouping by year only:
df['cumsum_power_kwh'] = df.groupby('year')['power_kwh'].cumsum()
if your dataframe is not guaranteed to be in order, you can sort it first:
df = df.sort_values(["year", "month"])
The problem you are encountering is because cumsum
is applying to each group individually, so for each unique (year
, month
) pair the cumsum
is reset to 0.