I've been wrestling with this for a couple of days now, despite a lot of searching. I've come across a number of similar problems, but I've not been able to make any of the solutions work for me.
Here's my starting dataframe:
data = {
"account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
"data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment", "payment", "payment", "payment", "initial_balance", "payment"],
"transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13", "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
"amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],}
Which, once loaded into pandas with df = pd.DataFrame(data), becomes:
  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14     -20
2       1002          payment       2022-03-01     -30
3       1002  initial_balance       2022-04-02     200
4       1002          payment       2022-04-13     -20
5       1002          payment       2022-05-01     -20
6       1002          payment       2022-05-03     -20
7       1003          payment       2022-03-13     -10
8       1003  initial_balance       2022-04-10     150
9       1003          payment       2022-04-20     -50
I'm looking to group by account_id and, within each group, remove any entry that comes before the entry with data_type = "initial_balance". Once I've got this, I can then cumsum over the remaining group rows to get the current balance. So the desired outcome, including the cumsum "account_balance" column, is:
  account_id        data_type transaction_date  amount  account_balance
0       1001  initial_balance       2022-04-01     100              100
1       1001          payment       2022-04-14     -20               80
3       1002  initial_balance       2022-04-02     200              200
4       1002          payment       2022-04-13     -20              180
5       1002          payment       2022-05-01     -20              160
6       1002          payment       2022-05-03     -20              140
8       1003  initial_balance       2022-04-10     150              150
9       1003          payment       2022-04-20     -50              100
I've tried the following:
df.groupby("account_id").filter(lambda x:x["transaction_date"]>=x[x["data_type"]=="initial_balance"]["transaction_date"])
but this just produces the error: ValueError: Can only compare identically-labelled Series objects
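For what it's worth, the error itself can be reproduced in isolation by comparing two Series with different indexes, so I assume that's what's happening inside the lambda. This snippet is just my own illustration, not the original data:
import pandas as pd

# comparing two Series whose index labels don't match raises the same error
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([2], index=[1])
try:
    left >= right
except ValueError as err:
    print(err)  # prints the same "identically-labelled" message as above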
I hope I've provided enough information for someone to help. Thanks very much in advance.
CodePudding user response:
You can groupby account_id, keep only the rows from the first initial_balance onward, then cumsum() the amount column:
# keep only the rows from the first initial_balance onward in each group
out = df.groupby('account_id').apply(lambda g: g[g['data_type'].eq('initial_balance').cumsum().eq(1)]).reset_index(drop=True)
# overwrite amount with the running balance per account
out['amount'] = out.groupby('account_id')['amount'].cumsum()
print(out)
  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14      80
2       1002  initial_balance       2022-04-02     200
3       1002          payment       2022-04-13     180
4       1002          payment       2022-05-01     160
5       1002          payment       2022-05-03     140
6       1003  initial_balance       2022-04-10     150
7       1003          payment       2022-04-20     100
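If you'd rather avoid groupby.apply(), the same filtering can be done with a plain boolean mask built column-wise. This is a sketch of an equivalent approach, not part of the answer above; the astype(int) cast is there because a grouped cumsum on a raw boolean column is not supported in some pandas versions:
# count how many initial_balance markers have been seen so far per account;
# a row is kept when exactly one has been seen, i.e. the marker row and
# everything after it
mask = (df['data_type'].eq('initial_balance')
          .astype(int)
          .groupby(df['account_id'])
          .cumsum()
          .eq(1))
out = df[mask].copy()
out['amount'] = out.groupby('account_id')['amount'].cumsum()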
CodePudding user response:
This does the job.
grouped_df = df.groupby("account_id")
groups = []
for group in df["account_id"].unique():
    group_df = grouped_df.get_group(group)
    # slice from the first initial_balance row onward; copy to avoid a
    # SettingWithCopyWarning on the assignment below
    group_df = group_df.loc[group_df[group_df["data_type"] == "initial_balance"].index[0]:, :].copy()
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)
df = pd.concat(groups)
Output -
| | account_id | data_type | transaction_date | amount |
|---|---|---|---|---|
| 0 | 1001 | initial_balance | 2022-04-01 | 100 |
| 1 | 1001 | payment | 2022-04-14 | 80 |
| 3 | 1002 | initial_balance | 2022-04-02 | 200 |
| 4 | 1002 | payment | 2022-04-13 | 180 |
| 5 | 1002 | payment | 2022-05-01 | 160 |
| 6 | 1002 | payment | 2022-05-03 | 140 |
| 8 | 1003 | initial_balance | 2022-04-10 | 150 |
| 9 | 1003 | payment | 2022-04-20 | 100 |
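As a side note, the same loop can be written by iterating the groupby object directly instead of calling get_group() for each key. This is only a stylistic variant of the code above and should produce the same output:
groups = []
for account_id, group_df in df.groupby("account_id"):
    # find the index label of the first initial_balance row and slice from there
    start = group_df.index[group_df["data_type"] == "initial_balance"][0]
    group_df = group_df.loc[start:].copy()
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)
df = pd.concat(groups)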
CodePudding user response:
You can do it this way:
m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()
df_out = df[m].groupby('account_id')['amount'].cumsum()\
              .reset_index(name='account_balance')\
              .merge(df, left_on='index', right_index=True)
df_out
Output:
   index  account_balance account_id        data_type transaction_date  amount
0      0              100       1001  initial_balance       2022-04-01     100
1      1               80       1001          payment       2022-04-14     -20
2      3              200       1002  initial_balance       2022-04-02     200
3      4              180       1002          payment       2022-04-13     -20
4      5              160       1002          payment       2022-05-01     -20
5      6              140       1002          payment       2022-05-03     -20
6      8              150       1003  initial_balance       2022-04-10     150
7      9              100       1003          payment       2022-04-20     -50
Details: first create a boolean Series that is True where data_type equals "initial_balance", then group that Series by account_id and take its cummax() to build a mask over the original dataframe that keeps the initial_balance row and every record after it, per account_id.
Next, group the filtered dataframe by account_id and take the cumsum of amount, then merge this back to the original dataframe; the inner join drops the unwanted records.
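To make the intermediate step concrete, here is a self-contained check of the cummax mask; the expected values in the comments are inferred from the sample data, so treat them as an assumption:
import pandas as pd

df = pd.DataFrame({
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment",
                  "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13",
                         "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
})

# False until the first initial_balance within each account, True from then on
m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()
print(m.tolist())
# rows 2 and 7 are payments dated before their account's initial_balance,
# so they should be the only False entries:
# [True, True, False, True, True, True, True, False, True, True]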