patient_id | alert_id | alert_timestamp |
---|---|---|
3 | xyz | 2022-10-10 |
1 | anp | 2022-10-12 |
1 | gfe | 2022-10-10 |
2 | fgy | 2022-10-02 |
2 | gpl | 2022-10-03 |
1 | gdf | 2022-10-13 |
2 | mkd | 2022-10-23 |
1 | liu | 2022-10-01 |
I have a data frame (see simplified version above) where for each patient_id
, I want to only keep the latest alert (i.e. last one) that was sent out in a given window period e.g. window_size = 7
.
Note, the window size needs to look at consecutive days i.e. between day 1
-> day 1 window_size
. The ranges of alert_timestamp
for each patient_id
varies and is usually well beyond the window_size
range.
Note, that the data frame example given above, is a very simple example and will have many more patient_id
's and will be in a mixed order in terms alert_timestamp
and alert_id
.
The approach is to start from the last alert_timstamp
for a given patient_id
and work back using the window_size
to select the alert that was the last one in that window time frame.
Please note the idea is to have a scanning/looking window, example window_size = 7 days
to move across the timestamps of each patient
The end result I want, is a data frame with the filtered out alerts
Expected output for (this example) window_size = 7
:
patient_id | alert_id | alert_timestamp |
---|---|---|
1 | liu | 2022-10-01 |
1 | gdf | 2022-10-13 |
2 | gpl | 2022-10-03 |
2 | mkd | 2022-10-23 |
3 | xyz | 2022-10-10 |
What's the most efficient way to solve for this?
CodePudding user response:
You can use:
# ensure datetime
df['alert_timestamp'] = pd.to_datetime(df['alert_timestamp'])
# get last date per patient
# and compute bins of 7 days
group = (df['alert_timestamp']
.rsub(df.groupby('patient_id')['alert_timestamp']
.transform('max'))
.floordiv(pd.Timedelta('7d')) # or .dt.days.floordiv(7)
)
# keep only most recent of each alert per group
out = (df
.sort_values(by=['patient_id', 'alert_timestamp'])
.groupby(['patient_id', group], as_index=False, sort=False)
.last()
)
Output:
patient_id alert_id alert_timestamp
0 1 liu 2022-10-01
1 1 gdf 2022-10-13
2 2 gpl 2022-10-02
3 2 mkd 2022-10-23
4 3 xyz 2022-10-10