Let's say we have a dataframe::
d = {'date':[01-01-2022,01-02-2022,01-03-2022,01-04-2022],
'col1': [1, 2,3,4],
'col2': [3, 4,5,6],
'a': [4,1,3,1],
}
df = pd.DataFrame(data=d)
from this data frame I want to calculate col3 which is defined as the following:
((Col1 - Sum of col3 for previous dates) / Col2) * a
So col 3 in the first row would be: ((1 - 0 [no previous data]) / 3) * 4 = 1.333
col 3 in the 2nd row would be ((2 - 1.33) / 4) * 1 = .1675
col 3 in the 3rd row would be: ((3 - (1.33 .1675) / 5) * 3 = .9015
So the value of col 3 is dependent on all values of column 3 in the rows above it.
Given this situation is there a way to calculate column 3 that does not involves iterating through the data frame rows? Iterating through the rows does not scale well and I need to run a similar calculation for a data frame with 100K rows.
My current approach for this would be the following:
df['col3'] = 0
for index, row in df.iterrows():
sum_previous_rows = df[df['date'] <row['date']]['col3'].sum()
df[index,'col3'] = row['col1'] row['col2'] sum_previous_rows
CodePudding user response:
You can use an [expanding sum]
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html):
import pandas as pd
d = {'date':['01-01-2022','01-02-2022','01-03-2022','01-04-2022'],
'col1': [1, 2,3,4],
'col2': [3, 4,5,6]}
df = pd.DataFrame(data=d)
df['col1 2'] = df['col1'] df['col2']
df['col3'] = df['col1 2'].expanding(1).sum()
This will get you this:
- | date | col1 | col2 | col1 2 | col3 |
---|---|---|---|---|---|
0 | 01-01-2022 | 1 | 3 | 4 | 4.0 |
1 | 01-02-2022 | 2 | 4 | 6 | 10.0 |
2 | 01-03-2022 | 3 | 5 | 8 | 18.0 |
3 | 01-04-2022 | 4 | 6 | 10 | 28.0 |
CodePudding user response:
Mathematically you can create the equal situation like below
y1 = x1
y2 = y1 x2 = x1 x2
y3 = x3 y2 x1 = x3 x1 x1 x2 = x1 *2 x2 x3
y4 = x4 y1 y2 y3 = x4 x3 x1 x1 x2 x1 x2 x1
= x4 x3 x2 *2 x1 *3
s = df.col1 df.col2
m = np.tril(np.ones((len(s),len(s))),k=0)*np.arange(len(s))
m = np.tril(-(m-m.max(1)[:,None]),k=0)
s.dot(m.T) s
Out[20]:
0 4.0
1 10.0
2 22.0
3 42.0
dtype: float64