Filter grouped data based on stage ordering-CodePudding

My data consist of rows recording what stage an ID is in on a certain date. Ideally, each ID should proceed to stage 4 directionally, though it need not go through every stage. For example ID "A" below goes throgh stages 1->2->4. Sometimes (for a variety of possible reasons) an ID is returned to a previous stage. The data are sorted by date within each ID:

df = pd.DataFrame( {"ID": ["A","A","A","B","B","B","B","B","B","C","C","C","C","C","C"], "Stage":[4,2,1,4,3,4,3,2,1,4,3,2,1,2,1],\
        "Date":['2022-09-18','2022-09-17','2022-09-16','2022-09-20','2022-09-19','2022-09-18','2022-09-17','2022-09-16','2022-09-15',\
            '2022-09-20','2022-09-19','2022-09-18','2022-09-17','2022-09-16','2022-09-15']} )
    print(df)
       ID  Stage        Date
    0   A      4  2022-09-18
    1   A      2  2022-09-17
    2   A      1  2022-09-16
    3   B      4  2022-09-20
    4   B      3  2022-09-19
    5   B      4  2022-09-18
    6   B      3  2022-09-17
    7   B      2  2022-09-16
    8   B      1  2022-09-15
    9   C      4  2022-09-20
    10  C      3  2022-09-19
    11  C      2  2022-09-18
    12  C      1  2022-09-17
    13  C      2  2022-09-16
    14  C      1  2022-09-15

I would like to drop all rows of data for each ID that occur before the most recent time that it is sent back to a previous stage. I would get something like this:

print(df_filtered)
  ID  Stage        Date
0  A      4  2022-09-18
1  A      2  2022-09-17
2  A      1  2022-09-16
3  B      4  2022-09-20
4  B      3  2022-09-19
5  C      4  2022-09-20
6  C      3  2022-09-19
7  C      2  2022-09-18
8  C      1  2022-09-17

Notice that A has not changed as it moved directionally through the process, B has dropped all data before it was sent back to Stage 3 on 2022-09-19, and C has dropped the two rows before it was returned to Stage 1 on 2022-09-17. This is a greatly simplified case , but in the true data an ID might be returned to a previous stage several times, including within the same stage. For example, an ID might be sent from stage 2 back to stage 2.

Is there a clean way to get from df to df_filtered without using for loops?

CodePudding user response：

IIUC, sort the dates, then per group, check if there is an increase in Stage and drop the values afterwards:

m = (df
  # ensure the dates are in decreasing order
  # optional if the dates are already in descending order
  .sort_values(by=['ID', 'Date'], ascending=[True, False])
  # for each group, if Stage increases, flag this and the successive rows
  .groupby('ID')['Stage'].apply(lambda x: x.diff().gt(0).cummax())
)

# select the non-flagged rows
out = df[~m]

output:

   ID  Stage        Date
0   A      4  2022-09-18
1   A      2  2022-09-17
2   A      1  2022-09-16
3   B      4  2022-09-20
4   B      3  2022-09-19
9   C      4  2022-09-20
10  C      3  2022-09-19
11  C      2  2022-09-18
12  C      1  2022-09-17

intermediates:

   ID  Stage        Date  diff  gt(0)  cummax
0   A      4  2022-09-18   NaN  False   False
1   A      2  2022-09-17  -2.0  False   False
2   A      1  2022-09-16  -1.0  False   False
3   B      4  2022-09-20   NaN  False   False
4   B      3  2022-09-19  -1.0  False   False
5   B      4  2022-09-18   1.0   True    True
6   B      3  2022-09-17  -1.0  False    True
7   B      2  2022-09-16  -1.0  False    True
8   B      1  2022-09-15  -1.0  False    True
9   C      4  2022-09-20   NaN  False   False
10  C      3  2022-09-19  -1.0  False   False
11  C      2  2022-09-18  -1.0  False   False
12  C      1  2022-09-17  -1.0  False   False
13  C      2  2022-09-16   1.0   True    True
14  C      1  2022-09-15  -1.0  False    True