Home > Back-end >  For each unique item in a data frame column, get the last rows based on a looking/scanning window
For each unique item in a data frame column, get the last rows based on a looking/scanning window

Time:11-29

patient_id alert_id alert_timestamp
3 xyz 2022-10-10
1 anp 2022-10-12
1 gfe 2022-10-10
2 fgy 2022-10-02
2 gpl 2022-10-03
1 gdf 2022-10-13
2 mkd 2022-10-23
1 liu 2022-10-01

I have a data frame (see simplified version above) where for each patient_id, I want to only keep the latest alert (i.e. last one) that was sent out in a given window period e.g. window_size = 7.

Note, the window size needs to look at consecutive days i.e. between day 1 -> day 1 window_size. The ranges of alert_timestamp for each patient_id varies and is usually well beyond the window_size range.

Note, that the data frame example given above, is a very simple example and will have many more patient_id's and will be in a mixed order in terms alert_timestamp and alert_id.

The approach is to start from the last alert_timstamp for a given patient_id and work back using the window_size to select the alert that was the last one in that window time frame.

Please note the idea is to have a scanning/looking window, example window_size = 7 days to move across the timestamps of each patient

The end result I want, is a data frame with the filtered out alerts

Expected output for (this example) window_size = 7:

patient_id alert_id alert_timestamp
1 liu 2022-10-01
1 gdf 2022-10-13
2 gpl 2022-10-03
2 mkd 2022-10-23
3 xyz 2022-10-10

What's the most efficient way to solve for this?

CodePudding user response:

You can use:

# ensure datetime
df['alert_timestamp'] = pd.to_datetime(df['alert_timestamp'])

# get last date per patient
# and compute bins of 7 days
group = (df['alert_timestamp']
         .rsub(df.groupby('patient_id')['alert_timestamp']
                 .transform('max'))
         .floordiv(pd.Timedelta('7d')) # or .dt.days.floordiv(7)
        )

# keep only most recent of each alert per group
out = (df
       .sort_values(by=['patient_id', 'alert_timestamp'])
       .groupby(['patient_id', group], as_index=False, sort=False)
       .last()
      )

Output:

   patient_id alert_id alert_timestamp
0           1      liu      2022-10-01
1           1      gdf      2022-10-13
2           2      gpl      2022-10-02
3           2      mkd      2022-10-23
4           3      xyz      2022-10-10
  • Related