How to Optimize calculation for pandas using computed value for previous row-CodePudding

Let's say we have a dataframe::

d = {'date':[01-01-2022,01-02-2022,01-03-2022,01-04-2022], 
      'col1': [1, 2,3,4], 
     'col2': [3, 4,5,6], 
     'a': [4,1,3,1],
    }

df = pd.DataFrame(data=d)

from this data frame I want to calculate col3 which is defined as the following:

((Col1 - Sum of col3 for previous dates) / Col2) * a

So col 3 in the first row would be: ((1 - 0 [no previous data]) / 3) * 4 = 1.333

col 3 in the 2nd row would be ((2 - 1.33) / 4) * 1 = .1675

col 3 in the 3rd row would be: ((3 - (1.33 .1675) / 5) * 3 = .9015

So the value of col 3 is dependent on all values of column 3 in the rows above it.

Given this situation is there a way to calculate column 3 that does not involves iterating through the data frame rows? Iterating through the rows does not scale well and I need to run a similar calculation for a data frame with 100K rows.

My current approach for this would be the following:

df['col3'] = 0
for index, row in df.iterrows():
    sum_previous_rows = df[df['date'] <row['date']]['col3'].sum()
    df[index,'col3'] = row['col1']   row['col2']   sum_previous_rows

CodePudding user response：

You can use an [expanding sum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html):

import pandas as pd

d = {'date':['01-01-2022','01-02-2022','01-03-2022','01-04-2022'], 
      'col1': [1, 2,3,4], 
     'col2': [3, 4,5,6]}
df = pd.DataFrame(data=d)

df['col1 2'] = df['col1']   df['col2']

df['col3'] = df['col1 2'].expanding(1).sum()

This will get you this:

-	date	col1	col2	col1 2	col3
0	01-01-2022	1	3	4	4.0
1	01-02-2022	2	4	6	10.0
2	01-03-2022	3	5	8	18.0
3	01-04-2022	4	6	10	28.0

CodePudding user response：

Mathematically you can create the equal situation like below

y1 = x1
y2 = y1   x2 = x1   x2
y3 = x3   y2   x1 = x3   x1   x1   x2 = x1 *2   x2   x3
y4 = x4   y1   y2   y3 = x4   x3   x1   x1   x2  x1  x2   x1
                       = x4   x3   x2 *2   x1 *3

s = df.col1   df.col2
m = np.tril(np.ones((len(s),len(s))),k=0)*np.arange(len(s))
m = np.tril(-(m-m.max(1)[:,None]),k=0)
s.dot(m.T) s
Out[20]: 
0     4.0
1    10.0
2    22.0
3    42.0
dtype: float64