My data consist of rows recording what stage an ID is in on a certain date. Ideally, each ID should proceed to stage 4 directionally, though it need not go through every stage. For example ID "A" below goes throgh stages 1->2->4. Sometimes (for a variety of possible reasons) an ID is returned to a previous stage. The data are sorted by date within each ID:
df = pd.DataFrame( {"ID": ["A","A","A","B","B","B","B","B","B","C","C","C","C","C","C"], "Stage":[4,2,1,4,3,4,3,2,1,4,3,2,1,2,1],\
"Date":['2022-09-18','2022-09-17','2022-09-16','2022-09-20','2022-09-19','2022-09-18','2022-09-17','2022-09-16','2022-09-15',\
'2022-09-20','2022-09-19','2022-09-18','2022-09-17','2022-09-16','2022-09-15']} )
print(df)
ID Stage Date
0 A 4 2022-09-18
1 A 2 2022-09-17
2 A 1 2022-09-16
3 B 4 2022-09-20
4 B 3 2022-09-19
5 B 4 2022-09-18
6 B 3 2022-09-17
7 B 2 2022-09-16
8 B 1 2022-09-15
9 C 4 2022-09-20
10 C 3 2022-09-19
11 C 2 2022-09-18
12 C 1 2022-09-17
13 C 2 2022-09-16
14 C 1 2022-09-15
I would like to drop all rows of data for each ID that occur before the most recent time that it is sent back to a previous stage. I would get something like this:
print(df_filtered)
ID Stage Date
0 A 4 2022-09-18
1 A 2 2022-09-17
2 A 1 2022-09-16
3 B 4 2022-09-20
4 B 3 2022-09-19
5 C 4 2022-09-20
6 C 3 2022-09-19
7 C 2 2022-09-18
8 C 1 2022-09-17
Notice that A has not changed as it moved directionally through the process, B has dropped all data before it was sent back to Stage 3 on 2022-09-19, and C has dropped the two rows before it was returned to Stage 1 on 2022-09-17. This is a greatly simplified case , but in the true data an ID might be returned to a previous stage several times, including within the same stage. For example, an ID might be sent from stage 2 back to stage 2.
Is there a clean way to get from df to df_filtered without using for loops?
CodePudding user response:
IIUC, sort the dates, then per group, check if there is an increase in Stage and drop the values afterwards:
m = (df
# ensure the dates are in decreasing order
# optional if the dates are already in descending order
.sort_values(by=['ID', 'Date'], ascending=[True, False])
# for each group, if Stage increases, flag this and the successive rows
.groupby('ID')['Stage'].apply(lambda x: x.diff().gt(0).cummax())
)
# select the non-flagged rows
out = df[~m]
output:
ID Stage Date
0 A 4 2022-09-18
1 A 2 2022-09-17
2 A 1 2022-09-16
3 B 4 2022-09-20
4 B 3 2022-09-19
9 C 4 2022-09-20
10 C 3 2022-09-19
11 C 2 2022-09-18
12 C 1 2022-09-17
intermediates:
ID Stage Date diff gt(0) cummax
0 A 4 2022-09-18 NaN False False
1 A 2 2022-09-17 -2.0 False False
2 A 1 2022-09-16 -1.0 False False
3 B 4 2022-09-20 NaN False False
4 B 3 2022-09-19 -1.0 False False
5 B 4 2022-09-18 1.0 True True
6 B 3 2022-09-17 -1.0 False True
7 B 2 2022-09-16 -1.0 False True
8 B 1 2022-09-15 -1.0 False True
9 C 4 2022-09-20 NaN False False
10 C 3 2022-09-19 -1.0 False False
11 C 2 2022-09-18 -1.0 False False
12 C 1 2022-09-17 -1.0 False False
13 C 2 2022-09-16 1.0 True True
14 C 1 2022-09-15 -1.0 False True