I have a pandas dataframe that looks like:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 4.638107 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
CES - Storage Hydro 4.277016 0.000000 37.916419
Solar 9.774715 0.000000 72.916419
Wind 11.881869 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 2.101907 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
Reference - No Storage Hydro 4.277016 0.000000 35.380219
Solar 14.289311 0.000000 70.380219
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 18.566327
NGCC 4.533380 25.109150 29.001897
NGGT 17.224408 32.703513 30.145397
Overflow 35.000000 169.679554 34.678777
Reference - Storage Hydro 4.277016 0.000000 51.903185
Solar 14.894274 0.000000 86.903185
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 19.171290
NGCC 4.533380 25.109150 29.606860
NGGT 14.524706 32.703513 30.750360
Overflow 35.000000 169.679554 35.283740
I have created the chained_capacity
variable by using:
stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)
But this is not the result I want. As you can see it's still starting the sum with the first initial value in the column. I want the sum to start with on the 3rd value. So the expected output would be:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 11.881870
Nuclear 5.242805 12.689066 17.124674
NGCC 2.101907 25.109150 17.12 2.10
NGGT 4.638107 32.703513 ...
Overflow 35.000000 169.679554 ...
...
Here is df.to_dict()
to be able to reproduce the data completely:
{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, ('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}
CodePudding user response:
Using tricks from this answer, either
# this transform may be slow for large dataframes
stack['chained_capacity'] = \
stack.groupby('Case')['capacity_gw'].transform(lambda x: x.cumsum().shift(2))
or
# creates a temporary column; should be fast/scalable for large df
stack['temp'] = stack.groupby('Case')['capacity_gw'].cumsum()
stack['chained_capacity'] = stack.groupby('Case')['temp'].shift(2)
stack = stack.drop(columns=['temp'])