The problem is the following: how to vectorize situations where a value on the next line of a dataframe depends on a previous one? I want to avoid the for loop.
The row logic I need for the desired column is, given an "open balance" different from 0 or NaN in row[0]:
(1) row_t.open_balance = row_t-1.close_balance
(2) row_t.close_balance = row_t.open_balance row_t.additions - row_t.subtractions
Besides being slow, the infamous for loop leads me to all kinds of SettingWithCopyWarning
situations.
Can anyone help here? I've tried "helper columns" to no avail because of equation #2 above.
Thanks in advance!
CodePudding user response:
First, had to create reproducible example so I can try to answer this question. Remember to do this next time you ask a question.
Here I simulate additions and subtractions.
import pandas as pd
import numpy as np
additions = np.random.randint(0, 100, 10)
subtractions = np.random.randint(0, 50, 10)
net_change = pd.DataFrame({'balance': list(additions - subtractions)})
net_change
balance
0 44
1 13
2 70
3 -36
4 73
5 -6
6 54
7 30
8 39
9 80
Create DataFrame with starting balance of 0.
df = pd.DataFrame({'balance': [0]})
df
balance
0 0
Concatenate the DataFrame with net_change.
df = pd.concat([df, net_change], ignore_index=True)
df
balance
0 0
1 44
2 13
3 70
4 -36
5 73
6 -6
7 54
8 30
9 39
10 80
Then use pandas.DataFrame.cumsum to get the cumulative close balance which is what I assume is what you're are to compute.
df.cumsum()
balance
0 0
1 44
2 57
3 127
4 91
5 164
6 158
7 212
8 242
9 281
10 361
CodePudding user response:
When dealing with neighbor rows and vectorization, a good tool is shift.
Create a data frame with the mentioned column names and some values:
df = pd.DataFrame([[0, 10, 2], [0, 40, 5], [0, 70, 8]], columns=['open_balance', 'additions', 'subtractions'])
'open_balance' relies pretty heavy on additions and subtractions. We use shift() to compute it based on the previous day:
df['open_balance'] = (df['additions'] - df['subtractions']).shift(1)
We clean the data frame from Nans:
df.fillna(0, inplace=True)
Now it is easy to compute 'close_balance':
df['close_balance'] = df['open_balance'] df['additions'] - df['subtractions']