I am trying to create a clipped cumsum. However, the clipping is based on the value of the previous row.
Below is an example of the data that I am trying to generate.
╔══════╦═══════════════════╦═════════════════════╦════════════╦═══════════╦══════════════════╗
║ Idx ║ datetime ║ energy_consumption ║ energy_pv ║ spare_pv ║ cumsum_spare_pv ║
╠══════╬═══════════════════╬═════════════════════╬════════════╬═══════════╬══════════════════╣
║ 1 ║ 18/09/2021 04:00 ║ 15.5 ║ 0 ║ -15.5 ║ 42.546 ║
║ 2 ║ 18/09/2021 05:00 ║ 15.2 ║ 0 ║ -15.2 ║ 27.346 ║
║ 3 ║ 18/09/2021 06:00 ║ 15.5 ║ 0.667 ║ -14.833 ║ 12.513 ║
║ 4 ║ 18/09/2021 07:00 ║ 15.1 ║ 5.598 ║ -9.502 ║ 3.011 ║
║ 5 ║ 18/09/2021 08:00 ║ 17.1 ║ 10.638 ║ -6.462 ║ 0 ║
║ 6 ║ 18/09/2021 09:00 ║ 17.4 ║ 15.146 ║ -2.254 ║ 0 ║
║ 7 ║ 18/09/2021 10:00 ║ 17.5 ║ 19.44 ║ 1.94 ║ 1.94 ║
║ 8 ║ 18/09/2021 11:00 ║ 18.1 ║ 23.226 ║ 5.126 ║ 7.066 ║
║ 9 ║ 18/09/2021 12:00 ║ 19.5 ║ 26.246 ║ 6.746 ║ 13.812 ║
║ 10 ║ 18/09/2021 13:00 ║ 20.5 ║ 26.864 ║ 6.364 ║ 20.176 ║
║ 11 ║ 18/09/2021 14:00 ║ 19.6 ║ 24.792 ║ 5.192 ║ 25.368 ║
║ 12 ║ 18/09/2021 15:00 ║ 18.4 ║ 18.869 ║ 0.469 ║ 25.837 ║
║ 13 ║ 18/09/2021 16:00 ║ 18.3 ║ 12.319 ║ -5.981 ║ 19.856 ║
║ 14 ║ 18/09/2021 17:00 ║ 19.4 ║ 4.637 ║ -14.763 ║ 5.093 ║
║ 15 ║ 18/09/2021 18:00 ║ 18.7 ║ 0 ║ -18.7 ║ 0 ║
║ 16 ║ 18/09/2021 19:00 ║ 18.8 ║ 0 ║ -18.8 ║ 0 ║
╚══════╩═══════════════════╩═════════════════════╩════════════╩═══════════╩══════════════════╝
I have the energy_consumption and energy_pv columns. spare_pv
is easy to generate using:
df['energy_pv'] - df['energy_consumption']
However, I cannot work out how to create the cumsum column. It cannot be done with pd.Series.cumsum()
, since the running total must be clipped at zero whenever adding the current row would make it negative
. I currently generate the column with an Excel formula. For this example, the columns are:
A: Idx, B: datetime, C: energy_consumption, D: energy_pv, E: spare_pv, F: cumsum_spare_pv
Row 1 would be initially calculated, then row 2 would have the formula:
=IF((E2+F1)<=0, 0, E2+F1)
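In other words, each row applies this recurrence (a sketch in plain Python; the function name `step` is just for illustration):

```python
def step(prev_cumsum, spare):
    """One row of the Excel formula: =IF((E+F_prev)<=0, 0, E+F_prev)."""
    total = spare + prev_cumsum
    return total if total > 0 else 0.0

# Row 2 of the table: 42.546 carried in from row 1, spare_pv of -15.2
# gives the table's 27.346; a negative total would instead clip to 0.
row2 = step(42.546, -15.2)
```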
I could do this by building a list and looping over the DataFrame to append values (similar to the approach suggested here), but since this dataset is hundreds of thousands of rows, that would be very slow.
CodePudding user response:
As the cumsum is conditional, I don't think there is a way around looping.
Instead of iterating with iloc (which is slow), it is better to run the loop over a NumPy array:
D = df['spare_pv'].to_numpy().copy()  # copy so the original column is not modified in place
# D = np.random.random(5_000_000) - 0.5  # for timing
start_value = 42.546 + 15.5  # running total carried in from before the first row
new_val = start_value + D[0]
D[0] = new_val if new_val > 0 else 0
for i in range(1, len(D)):
    new_val = D[i] + D[i - 1]
    D[i] = new_val if new_val > 0 else 0
df['cumsum_spare_pv'] = D
I've tried this operation on 5 million random values as well and it completes in about 3 seconds, so it should still be fine.
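For reference, here is the same loop wrapped in a small self-contained function (the name `clipped_cumsum` is mine) and checked against the first few rows of the table above:

```python
import numpy as np

def clipped_cumsum(spare_pv, start_value=0.0):
    """Running sum of spare_pv where the total is clipped at zero after every step."""
    out = np.asarray(spare_pv, dtype=float).copy()
    out[0] = max(start_value + out[0], 0.0)
    for i in range(1, len(out)):
        out[i] = max(out[i] + out[i - 1], 0.0)
    return out

# First five spare_pv values from the table; 58.046 (= 42.546 + 15.5) is the
# total carried in from before row 1, so the results should match the table's
# cumsum_spare_pv column: 42.546, 27.346, 12.513, 3.011, 0.
spare = [-15.5, -15.2, -14.833, -9.502, -6.462]
result = clipped_cumsum(spare, start_value=42.546 + 15.5)
```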