I have the following Python pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
    'Date': [202101, 202102, 203103 - 1000, 202104, 202105, 202106, 202107,
             202101, 202102, 202103, 202104, 202105, 202106],
    'ID': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'amnt': [300, 200, 100, 50, 0, 250, 100, 1000, 500, 200, 100, 0, 0],
    'trx': [100, 0, 0, 0, 0, 0, 100, 1000, 500, 0, 0, 0, 0],
})
Date ID amnt trx
0 202101 1 300 100
1 202102 1 200 0
2 202103 1 100 0
3 202104 1 50 0
4 202105 1 0 0
5 202106 1 250 0
6 202107 1 100 100
7 202101 2 1000 1000
8 202102 2 500 500
9 202103 2 200 0
10 202104 2 100 0
11 202105 2 0 0
12 202106 2 0 0
I would like to add a status column, computed per ID, with the following rule: if amnt = 0 and trx = 0 for the last 3 months, then status = No active; otherwise status = active. The dataframe has about 10,000,000 rows. Expected result:
Date ID amnt trx status
0 202101 1 300 100 active
1 202102 1 200 0 active
2 202103 1 100 0 active
3 202104 1 50 0 active
4 202105 1 0 0 No active
5 202106 1 250 0 active
6 202107 1 100 100 active
7 202101 2 1000 1000 active
8 202102 2 500 500 active
9 202103 2 200 0 active
10 202104 2 100 0 active
11 202105 2 0 0 active
12 202106 2 0 0 No active
I would be grateful for any advice or ideas. Thank you.
CodePudding user response:
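IIUC, build two boolean masks and combine them (this assumes the rows are already sorted by Date within each ID):
# m1: amnt is zero in the current row
m1 = df['amnt'].eq(0)
# m2: trx sums to zero over the current row and the 3 previous rows of the same ID;
# droplevel(0) removes the ID level so the index lines up with df again
m2 = df.groupby('ID')['trx'].rolling(4).sum().eq(0).droplevel(0)
df['status'] = (m1 & m2).replace({True: 'No active', False: 'active'})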
print(df)
# Output:
Date ID amnt trx status
0 202101 1 300 100 active
1 202102 1 200 0 active
2 202103 1 100 0 active
3 202104 1 50 0 active
4 202105 1 0 0 No active
5 202106 1 250 0 active
6 202107 1 100 100 active
7 202101 2 1000 1000 active
8 202102 2 500 500 active
9 202103 2 200 0 active
10 202104 2 100 0 active
11 202105 2 0 0 active
12 202106 2 0 0 No active
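For reference, here is a minimal self-contained sketch of the same idea that can be run as is. It rests on two assumptions not stated in the question: the rows can be sorted by ID and Date, and every ID has exactly one row per month with no gaps, so that a window of 4 rows means "this month plus the previous 3". It also swaps in np.where for the replace call above, which is a stylistic choice here and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': [202101, 202102, 202103, 202104, 202105, 202106, 202107,
             202101, 202102, 202103, 202104, 202105, 202106],
    'ID': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'amnt': [300, 200, 100, 50, 0, 250, 100, 1000, 500, 200, 100, 0, 0],
    'trx': [100, 0, 0, 0, 0, 0, 100, 1000, 500, 0, 0, 0, 0],
})

# Assumption: one row per ID per month, so sorting makes row order match month order.
df = df.sort_values(['ID', 'Date']).reset_index(drop=True)

# Rolling sum of trx over the current row and the 3 rows before it, within each ID.
# The first 3 rows of every ID have an incomplete window and come out as NaN.
trx_window = df.groupby('ID')['trx'].rolling(4).sum().droplevel(0)

# NaN never equals 0, so rows with an incomplete window stay 'active'.
inactive = df['amnt'].eq(0) & trx_window.eq(0)
df['status'] = np.where(inactive, 'No active', 'active')

print(df)
```

On the sample data this marks only the 202105 row of ID 1 and the 202106 row of ID 2 as No active. If some months can be missing for an ID, a row-count window no longer corresponds to calendar months and the data would need to be reindexed to a full monthly range first.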