I have the following dataframe, ordered by ID and Date :
df :
ID Date
A 2019-01-02
A 2019-02-09
A 2019-02-20
A 2019-02-27
A 2019-03-16
A 2019-03-29
A 2019-04-17
A 2019-04-19
A 2019-05-26
A 2019-06-12
B 2017-01-01
B ...
What I want is, when I encountered a row, to remove all rows with a date in the 40 days following, and repeat starting at the next (non-deleted) row I encounter. I want it for each ID.
For example with ID=A :
- I encounter the first row, date is
2019-01-02
. 40 days after is2019-02-11
, so I remove all rows with Date strictly between those 2 dates. So second row is removed (Date =2019-02-09
) - Then, I look for the next row with
Date >= 2019-02-11
(2019-02-11
being the end of the previous 40-days period). I find the 3rd row, with Date =2019-02-20
. This is now the reference date taken, and 40 days after is2019-04-01
. So 4th, 5th, 6th rows are removed - ...
Note that my 40-days periods are starting when I encounter a new row and no 40-days period is currently active. Thus, those periods are not the same for all IDs.
This gives the following output dataset
ID Date
A 2019-01-02
A 2019-02-20
A 2019-04-17
A 2019-06-12
B 2017-01-01
B ...
Thanks in advance
CodePudding user response:
You can loop though the dataframe to create a filter
import pandas as pd
import datetime
# create your dataframe "df" here...
# with columns "ID" and "Date"
# make sure your Date is a datetime field!
df['Date'] = pd.to_datetime(df['Date'])
prev_id = None
prev_date = None
keep_rows = []
for _, row in df.iterrows():
if row['ID'] == prev_id:
delete = (row['Date'] - prev_date) < datetime.timedelta(days=40)
if delete:
keep_rows.append(False)
continue
keep_rows.append(True)
prev_id = row['ID']
prev_date = row['Date']
And filter your dataframe like this
df[keep_rows]