How do I compute the cumulative sum of a column while skipping the first two rows?

Time:11-17

I have a pandas dataframe that looks like:

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         4.638107      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
CES - Storage          Hydro        4.277016       0.000000         37.916419
                       Solar        9.774715       0.000000         72.916419
                       Wind        11.881869       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         2.101907      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
Reference - No Storage Hydro        4.277016       0.000000         35.380219
                       Solar       14.289311       0.000000         70.380219
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         18.566327
                       NGCC         4.533380      25.109150         29.001897
                       NGGT        17.224408      32.703513         30.145397
                       Overflow    35.000000     169.679554         34.678777
Reference - Storage    Hydro        4.277016       0.000000         51.903185
                       Solar       14.894274       0.000000         86.903185
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         19.171290
                       NGCC         4.533380      25.109150         29.606860
                       NGGT        14.524706      32.703513         30.750360
                       Overflow    35.000000     169.679554         35.283740

I have created the chained_capacity variable by using:

stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)

But this is not the result I want. As you can see, the sum still starts from the first value of the column. I want the sum to start at the 3rd value. So the expected output would be:
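To make the desired behaviour concrete on a toy single-group series (numbers made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# What I want: skip the first two values entirely, then start summing,
# i.e. NaN, NaN, 3.0, 3.0 + 4.0 = 7.0
wanted = pd.Series([np.nan, np.nan, 3.0, 7.0])

# What my cumsum().shift(2) attempt gives instead: NaN, NaN, 1.0, 3.0
# (the shifted sum still begins at the first value)
attempt = s.cumsum().shift(2)
```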

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000         11.881870
                       Nuclear      5.242805      12.689066         17.124674
                       NGCC         2.101907      25.109150         19.226581
                       NGGT         4.638107      32.703513         ...
                       Overflow    35.000000     169.679554         ...
...

Here is df.to_dict() so you can reproduce the data completely:

{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, 
('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}
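For anyone rebuilding the frame from that dictionary: the index level names (Case, Category) are not stored in the dict, so they have to be set again after construction. A minimal sketch on a two-row subset of the data above:

```python
import pandas as pd

# A small subset of the to_dict() output above, just to show the round-trip.
d = {
    'capacity_gw': {
        ('CES - No Storage', 'Hydro'): 4.277016,
        ('CES - No Storage', 'Solar'): 9.774714739869358,
    },
    'marginal_cost': {
        ('CES - No Storage', 'Hydro'): 0.0,
        ('CES - No Storage', 'Solar'): 0.0,
    },
}

# Tuple keys become a MultiIndex automatically; restore the level names.
stack = pd.DataFrame(d)
stack.index.names = ['Case', 'Category']
```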

CodePudding user response:

Using tricks from this answer, either

# transform with a Python-level lambda; may be slow for large dataframes.
# Drop the first two rows of each group, cumsum the rest, and reindex so
# the skipped rows come back as NaN.
stack['chained_capacity'] = \
    stack.groupby('Case')['capacity_gw'].transform(
        lambda x: x.iloc[2:].cumsum().reindex(x.index))

or

# creates a temporary column; vectorized, so it should be fast/scalable
# for large df. Mask the first two rows of each group as NaN, then take a
# grouped cumsum (NaN rows stay NaN and are excluded from the sum).
stack['temp'] = stack['capacity_gw'].mask(stack.groupby('Case').cumcount() < 2)
stack['chained_capacity'] = stack.groupby('Case')['temp'].cumsum()
stack = stack.drop(columns=['temp'])
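As a sanity check on a toy two-group frame (made-up numbers, not the data above), masking the first two rows of each group before a grouped cumulative sum restarts the sum at each group's third row, matching the pattern in the expected output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'capacity_gw': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 'w'), ('A', 'x'), ('A', 'y'), ('A', 'z'),
         ('B', 'w'), ('B', 'x'), ('B', 'y')],
        names=['Case', 'Category']),
)

# NaN out the first two rows of each Case, then cumsum within each Case.
# Group A: NaN, NaN, 3.0, 7.0   Group B: NaN, NaN, 30.0
masked = df['capacity_gw'].mask(df.groupby('Case').cumcount() < 2)
df['chained_capacity'] = masked.groupby('Case').cumsum()
```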