Home > front end >  Problem with Cumsum Single Column Groupby
Problem with Cumsum Single Column Groupby

Time:09-17

my mind is blown so hoping you can help here:) I have data that I need to 'cumsum' by month and year and cannot figure out how to do it. My data looks like this - testdf:

     month  year    power_kwh
0        1  1990  416491241.4
1        2  1990  343033068.2
2        3  1990  301451921.2
3        4  1990  275719811.1
4        5  1990  292287261.0
5        6  1990  227289410.6
6        7  1990  189335818.5
7        8  1990  115649657.7
8        9  1990  191111685.6
9       10  1990  319101975.4
10      11  1990  366723962.5
11      12  1990  392145037.3
12       1  1991  333036371.8
13       2  1991  325317995.1
14       3  1991  370866553.8
15       4  1991  300957433.0
16       5  1991  223408949.5
17       6  1991  165998110.1
18       7  1991  147475868.2
19       8  1991  158274746.0
20       9  1991  181327778.7
21      10  1991  284857327.5
22      11  1991  298116425.1
23      12  1991  363653602.6

I need my data to look like this below so that the column (power_kwh) is summed by month and each unique year.

    month   year    power_kwh   cumsum_power_kwh
0   1       1990    416491241.4 416491241
1   2       1990    343033068.2 759524310
2   3       1990    301451921.2 1060976231
3   4       1990    275719811.1 1336696042
4   5       1990    292287261   1628983303
5   6       1990    227289410.6 1856272714
6   7       1990    189335818.5 2045608532
7   8       1990    115649657.7 2161258190
8   9       1990    191111685.6 2352369875
9   10      1990    319101975.4 2671471851
10  11      1990    366723962.5 3038195813
11  12      1990    392145037.3 3430340851
12  1       1991    333036371.8 333036372
13  2       1991    325317995.1 4088695217
14  3       1991    370866553.8 4459561771
15  4       1991    300957433   4760519204
16  5       1991    223408949.5 4983928154
17  6       1991    165998110.1 5149926264
18  7       1991    147475868.2 5297402132
19  8       1991    158274746   5455676878
20  9       1991    181327778.7 5637004657
21  10      1991    284857327.5 5921861984
22  11      1991    298116425.1 6219978409
23  12      1991    363653602.6 6583632012

I have tried this code below and also using the apply and lambda function and I keep getting an answer but the cumsum is not being calculated as you can see here. thank you for suggestions!

testdf['results'] = testdf.groupby(['year','month'])['power_kwh'].cumsum()

This is what I keep getting -

     month  year    power_kwh      results
0        1  1990  416491241.4  416491241.4
1        2  1990  343033068.2  343033068.2
2        3  1990  301451921.2  301451921.2
3        4  1990  275719811.1  275719811.1
4        5  1990  292287261.0  292287261.0
5        6  1990  227289410.6  227289410.6
6        7  1990  189335818.5  189335818.5
7        8  1990  115649657.7  115649657.7
8        9  1990  191111685.6  191111685.6
9       10  1990  319101975.4  319101975.4
10      11  1990  366723962.5  366723962.5
11      12  1990  392145037.3  392145037.3
12       1  1991  333036371.8  333036371.8
13       2  1991  325317995.1  325317995.1
14       3  1991  370866553.8  370866553.8
15       4  1991  300957433.0  300957433.0
16       5  1991  223408949.5  223408949.5
17       6  1991  165998110.1  165998110.1
18       7  1991  147475868.2  147475868.2
19       8  1991  158274746.0  158274746.0
20       9  1991  181327778.7  181327778.7
21      10  1991  284857327.5  284857327.5
22      11  1991  298116425.1  298116425.1
23      12  1991  363653602.6  363653602.6

CodePudding user response:

Your expected output can be computed by grouping by year only:

df['cumsum_power_kwh'] = df.groupby('year')['power_kwh'].cumsum()

if your dataframe is not guaranteed to be in order, you can sort it first:

df = df.sort_values(["year", "month"])

The problem you are encountering is because cumsum is applying to each group individually, so for each unique (year, month) pair the cumsum is reset to 0.

  • Related