I've been wrestling with this for a couple of days now, despite a lot of searching. I've come across a number of similar problems, but I've not been able to make any of the solutions work for me.
Here's my starting dataframe:
data = {
"account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
"data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment", "payment", "payment", "payment", "initial_balance", "payment"],
"transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13", "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
"amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],}
Which, once loaded into pandas with df = pd.DataFrame(data), becomes:
  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14     -20
2       1002          payment       2022-03-01     -30
3       1002  initial_balance       2022-04-02     200
4       1002          payment       2022-04-13     -20
5       1002          payment       2022-05-01     -20
6       1002          payment       2022-05-03     -20
7       1003          payment       2022-03-13     -10
8       1003  initial_balance       2022-04-10     150
9       1003          payment       2022-04-20     -50
I'm looking to group by account_id and, within each group, remove any entry that comes before the entry with data_type = "initial_balance". Once I've got this, I can then cumsum over the remaining group rows to get the current balance. So the desired outcome, including the cumsum "account_balance" column, is:
  account_id        data_type transaction_date  amount  account_balance
0       1001  initial_balance       2022-04-01     100              100
1       1001          payment       2022-04-14     -20               80
3       1002  initial_balance       2022-04-02     200              200
4       1002          payment       2022-04-13     -20              180
5       1002          payment       2022-05-01     -20              160
6       1002          payment       2022-05-03     -20              140
8       1003  initial_balance       2022-04-10     150              150
9       1003          payment       2022-04-20     -50              100
I've tried the following:
df.groupby("account_id").filter(lambda x:x["transaction_date"]>=x[x["data_type"]=="initial_balance"]["transaction_date"])
but this just produces the error: ValueError: Can only compare identically-labelled Series objects
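For what it's worth, the error itself can be reproduced in isolation by comparing two Series with different indexes, so I assume that's what's happening inside the lambda. This snippet is just my own illustration, not the original data:
import pandas as pd

# comparing two Series whose index labels don't match raises the same error
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([2], index=[1])
try:
    left >= right
except ValueError as err:
    print(err)  # prints the same "identically-labelled" message as above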
I hope I've provided enough information for someone to help. Thanks very much in advance.
CodePudding user response:
You can groupby account_id, keep only the rows from the first initial_balance onward, then cumsum() the amount column:
# keep only the rows from the first initial_balance onward in each group
out = df.groupby('account_id').apply(lambda g: g[g['data_type'].eq('initial_balance').cumsum().eq(1)]).reset_index(drop=True)
# overwrite amount with the running balance per account
out['amount'] = out.groupby('account_id')['amount'].cumsum()
print(out)
  account_id        data_type transaction_date  amount
0       1001  initial_balance       2022-04-01     100
1       1001          payment       2022-04-14      80
2       1002  initial_balance       2022-04-02     200
3       1002          payment       2022-04-13     180
4       1002          payment       2022-05-01     160
5       1002          payment       2022-05-03     140
6       1003  initial_balance       2022-04-10     150
7       1003          payment       2022-04-20     100
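If you'd rather avoid groupby.apply(), the same filtering can be done with a plain boolean mask built column-wise. This is a sketch of an equivalent approach, not part of the answer above; the astype(int) cast is there because a grouped cumsum on a raw boolean column is not supported in some pandas versions:
# count how many initial_balance markers have been seen so far per account;
# a row is kept when exactly one has been seen, i.e. the marker row and
# everything after it
mask = (df['data_type'].eq('initial_balance')
          .astype(int)
          .groupby(df['account_id'])
          .cumsum()
          .eq(1))
out = df[mask].copy()
out['amount'] = out.groupby('account_id')['amount'].cumsum()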
CodePudding user response:
This does the job.
grouped_df = df.groupby("account_id")
groups = []
for group in df["account_id"].unique():
    group_df = grouped_df.get_group(group)
    # slice from the first initial_balance row onward; copy to avoid a
    # SettingWithCopyWarning on the assignment below
    group_df = group_df.loc[group_df[group_df["data_type"] == "initial_balance"].index[0]:, :].copy()
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)
df = pd.concat(groups)
Output -
| | account_id | data_type | transaction_date | amount |
|---|---|---|---|---|
| 0 | 1001 | initial_balance | 2022-04-01 | 100 |
| 1 | 1001 | payment | 2022-04-14 | 80 |
| 3 | 1002 | initial_balance | 2022-04-02 | 200 |
| 4 | 1002 | payment | 2022-04-13 | 180 |
| 5 | 1002 | payment | 2022-05-01 | 160 |
| 6 | 1002 | payment | 2022-05-03 | 140 |
| 8 | 1003 | initial_balance | 2022-04-10 | 150 |
| 9 | 1003 | payment | 2022-04-20 | 100 |
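As a side note, the same loop can be written by iterating the groupby object directly instead of calling get_group() for each key. This is only a stylistic variant of the code above and should produce the same output:
groups = []
for account_id, group_df in df.groupby("account_id"):
    # find the index label of the first initial_balance row and slice from there
    start = group_df.index[group_df["data_type"] == "initial_balance"][0]
    group_df = group_df.loc[start:].copy()
    group_df["amount"] = group_df["amount"].cumsum()
    groups.append(group_df)
df = pd.concat(groups)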
CodePudding user response:
You can do it this way:
m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()
df_out = df[m].groupby('account_id')['amount'].cumsum()\
              .reset_index(name='account_balance')\
              .merge(df, left_on='index', right_index=True)
df_out
Output:
   index  account_balance account_id        data_type transaction_date  amount
0      0              100       1001  initial_balance       2022-04-01     100
1      1               80       1001          payment       2022-04-14     -20
2      3              200       1002  initial_balance       2022-04-02     200
3      4              180       1002          payment       2022-04-13     -20
4      5              160       1002          payment       2022-05-01     -20
5      6              140       1002          payment       2022-05-03     -20
6      8              150       1003  initial_balance       2022-04-10     150
7      9              100       1003          payment       2022-04-20     -50
Details: first create a boolean Series that is True where data_type equals "initial_balance", then group that Series by account_id and take its cummax() to build a mask over the original dataframe that keeps the initial_balance row and every record after it, per account_id.
Next, group the filtered dataframe by account_id and take the cumsum of amount, then merge this back to the original dataframe; the inner join drops the unwanted records.
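To make the intermediate step concrete, here is a self-contained check of the cummax mask; the expected values in the comments are inferred from the sample data, so treat them as an assumption:
import pandas as pd

df = pd.DataFrame({
    "account_id": ["1001", "1001", "1002", "1002", "1002", "1002", "1002", "1003", "1003", "1003"],
    "data_type": ["initial_balance", "payment", "payment", "initial_balance", "payment",
                  "payment", "payment", "payment", "initial_balance", "payment"],
    "transaction_date": ["2022-04-01", "2022-04-14", "2022-03-01", "2022-04-02", "2022-04-13",
                         "2022-05-01", "2022-05-03", "2022-03-13", "2022-04-10", "2022-04-20"],
    "amount": [100, -20, -30, 200, -20, -20, -20, -10, 150, -50],
})

# False until the first initial_balance within each account, True from then on
m = (df['data_type'] == "initial_balance").groupby(df['account_id']).cummax()
print(m.tolist())
# rows 2 and 7 are payments dated before their account's initial_balance,
# so they should be the only False entries:
# [True, True, False, True, True, True, True, False, True, True]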