I have a dataframe with one column containing numbers (quantity). Every row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that holds the cumulative sum of the quantity column, but if at any point the cumulative sum is greater than 0, the next row should start counting the cumulative sum from 0.
I solved this problem using iterrows(), but I read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution is below:
import pandas as pd
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1, 15,-1,-1,-1], columns=['quantity'])
for index, row in df.iterrows():
    if index == 0:
        # first row has no previous outcome, so it starts from its own quantity
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        if previous_outcome > 0:
            # previous cumulative sum went above 0, so restart from 0
            previous_outcome = 0
        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
Is there a faster (more optimized) way to calculate this?
I'm also not sure whether the "if index == 0" block is the best solution, or whether this can be solved in a more elegant way. Without this block there is an error, since the first row has no "previous row" to use in the calculation.
CodePudding user response:
Looking at your DataFrame as a numpy array:
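import numpy as np
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total  # store the running total before any reset
    total = total if total < 0 else 0  # a positive total makes the next row start from 0
df['outcome'] = y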
CodePudding user response:
I think numba is the way to go for loops like this if performance is important:
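import numpy as np
from numba import njit

@njit
def f(x, lim):
    # np.int was removed from recent NumPy versions, so use an explicit integer dtype
    result = np.empty(len(x), dtype=np.int64)
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            # previous cumulative sum exceeded the limit, so restart from 0
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)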
quantity outcome outcome1
0 -1 -1.0 -1
1 -1 -2.0 -2
2 -1 -3.0 -3
3 -1 -4.0 -4
4 15 11.0 11
5 -1 -1.0 -1
6 -1 -2.0 -2
7 -1 -3.0 -3
8 -1 -4.0 -4
9 5 1.0 1
10 -1 -1.0 -1
11 15 14.0 14
12 -1 -1.0 -1
13 -1 -2.0 -2
14 -1 -3.0 -3