Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative s

Time:10-28

I have a DataFrame with one column of numbers (quantity). Each row represents one day, so the whole DataFrame should be treated as sequential data. I want to add a second column that holds the cumulative sum of the quantity column, with one condition: if the cumulative sum is greater than 0 at any row, the next row should start counting from 0 again.

I solved this with iterrows(), but I have read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution is below:

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1, 15,-1,-1,-1], columns=['quantity'])


for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome'] 
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)

#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0

Is there a faster (more optimized) way to calculate this?

I'm also not sure whether the "if index == 0" block is the best solution, or whether this can be solved more elegantly. Without this block there is an error, since the first row has no "previous row" to use in the calculation.

CodePudding user response:

Looking at your DataFrame as a numpy array:

import numpy as np

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
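    # store the running total first, so positive rows keep their value (e.g. 11)
    y[i] = total
    # once the total is no longer negative, the next row starts again from 0
    total = total if total < 0 else 0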

df['outcome'] = y
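
On the "if index == 0" part of the question: the same recurrence can be written without special-casing the first row by using itertools.accumulate from the standard library, which yields the first element unchanged and then feeds each result into the next step. The following is only a minimal sketch under that idea; the function name reset_cumsum and the limit parameter are made up for illustration, and it is still a Python-level loop, so it is not claimed to be faster than the numpy loop above.

```python
from itertools import accumulate


def reset_cumsum(quantities, limit=0):
    """Running sum that restarts from 0 after any total above `limit`.

    `reset_cumsum` and `limit` are hypothetical names for this sketch.
    """
    def step(previous_outcome, quantity):
        # If the previous running total exceeded the limit, start over from 0.
        if previous_outcome > limit:
            previous_outcome = 0
        return previous_outcome + quantity

    # accumulate() emits the first element as-is, so no index == 0 branch is needed.
    return list(accumulate(quantities, step))


quantity = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
outcome = reset_cumsum(quantity)
print(outcome)
# [-1, -2, -3, -4, 11, -1, -2, -3, -4, 1, -1, 14, -1, -2, -3]
```

With a DataFrame, this could be applied as df['outcome'] = reset_cumsum(df['quantity'].tolist()), assuming the column fits in memory as a Python list.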

CodePudding user response:

I think numba is the way to go for loops like this if performance is important:

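import numpy as np
from numba import njit

@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24, so use an explicit integer dtype
    result = np.empty(len(x), dtype=np.int64)
    result[0] = x[0]

    for i in range(1, len(x)):
        previous_outcome = result[i-1]
        # restart from 0 when the previous running total exceeded the limit
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result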

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
    quantity  outcome  outcome1
0         -1     -1.0        -1
1         -1     -2.0        -2
2         -1     -3.0        -3
3         -1     -4.0        -4
4         15     11.0        11
5         -1     -1.0        -1
6         -1     -2.0        -2
7         -1     -3.0        -3
8         -1     -4.0        -4
9          5      1.0         1
10        -1     -1.0        -1
11        15     14.0        14
12        -1     -1.0        -1
13        -1     -2.0        -2
14        -1     -3.0        -3