Home > Software design >  Adding new column based on combined criteria in Pandas Groupby
Adding new column based on combined criteria in Pandas Groupby

Time:05-14

Following on from my previous question (thanks to those responding) I'm stuck again in achieving what I suspect is possible using a groupby in Pandas. Here's what I'm trying to achieve. With the following example dataframe:

data_initial = {
"account_id": ['1001', '1001', '1001', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1002', '1003', '1003', '1003', '1003', '1003', '1003',],
"data_type": ['payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'plan', 'plan', 'payment', 'payment', 'payment', 'payment', 'payment', 'plan', 'payment', 'payment', 'payment', 'payment',],
"transaction_date": ['2022-04-01', '2022-04-12', '2022-05-02', '2022-02-02', '2022-03-01', '2022-03-15', '2022-04-01', '2022-04-01', '2022-04-13', '2022-04-26', '2022-05-01', '2022-05-04', '2022-05-10', '2022-03-10', '2022-03-25', '2022-04-05', '2022-04-16', '2022-04-24', '2022-05-05',],
"amount": ['-50', '-40', '-60', '-30', '-25', '250', '-50', '200', '200', '-25', '-25', '-25', '-25', '-20', '100', '-25', '-25', '-25', '-25',],}

initial dataframe

I'm looking to, effectively, groupby the account_id and then apply the following logic:

  • IF data_type is "payment" AND {account_id has no data_type = "plan" OR the transaction_date of the record is BEFORE any data_type = "plan" record} then new column classification = "receipt_not_plan_related"

  • IF data_type is "payment" AND {account_id has a data_type = "plan" AND transaction_date is AFTER any data_type = "plan" record} then new column classification = "receipt_on_plan"

  • IF data_type is "plan" is the only instance of "plan" then new column classification = "only"

  • IF data_type is "plan" AND is the FIRST instance of "plan" then new column classification = "initial"

  • IF data_type is "plan" AND is NOT the FIRST and NOT the LAST instance of "plan" then new column classification = "expired"

  • IF data_type is "plan" AND is the LAST instance of "plan" then new column classification = "current"

The result, therefore, for the example dataframe would be as follows:

enter image description here

Thanks again, in advance, to anyone who can help out. Much appreciated.

CodePudding user response:

You can do it with np.select and a couple of helper columns:

import numpy as np

df['plans'] = df.groupby('account_id')['data_type'].transform(lambda x: x.eq('plan').cumsum())
df['n_plans'] = df.groupby('account_id')['plans'].transform('max')

is_payment = df['data_type'].eq('payment')
is_plan = df['data_type'].eq('plan')
df['classification'] = np.select([is_payment & df['plans'].eq(0),
                                  is_payment & df['plans'].gt(0),
                                  is_plan & df['n_plans'].eq(1),
                                  is_plan & df['plans'].eq(1),
                                  is_plan & df['plans'].gt(1) & df['plans'].lt(df['n_plans']),
                                  is_plan & df['plans'].eq(df['n_plans'])], 
                                ['receipt_not_plan_related',
                                 'receipt_on_plan',
                                 'only',
                                 'initial',
                                 'expired',
                                 'current'])

print(df.drop(columns=['plans', 'n_plans']))
   account_id data_type transaction_date amount            classification
0        1001   payment       2022-04-01    -50  receipt_not_plan_related
1        1001   payment       2022-04-12    -40  receipt_not_plan_related
2        1001   payment       2022-05-02    -60  receipt_not_plan_related
3        1002   payment       2022-02-02    -30  receipt_not_plan_related
4        1002   payment       2022-03-01    -25  receipt_not_plan_related
5        1002      plan       2022-03-15    250                   initial
6        1002   payment       2022-04-01    -50           receipt_on_plan
7        1002      plan       2022-04-01    200                   expired
8        1002      plan       2022-04-13    200                   current
9        1002   payment       2022-04-26    -25           receipt_on_plan
10       1002   payment       2022-05-01    -25           receipt_on_plan
11       1002   payment       2022-05-04    -25           receipt_on_plan
12       1002   payment       2022-05-10    -25           receipt_on_plan
13       1003   payment       2022-03-10    -20  receipt_not_plan_related
14       1003      plan       2022-03-25    100                      only
15       1003   payment       2022-04-05    -25           receipt_on_plan
16       1003   payment       2022-04-16    -25           receipt_on_plan
17       1003   payment       2022-04-24    -25           receipt_on_plan
18       1003   payment       2022-05-05    -25           receipt_on_plan

Note that the records need to be sorted by 'transaction_date' in ascending order within each 'account_id' for this to work, since the "before", "first" and "last" conditions are checked using GroupBy.transform() that calculates a cumulative sum.

  • Related