Home > Back-end >  Once a row is encountered, remove all rows within 40 days, then repeat
Once a row is encountered, remove all rows within 40 days, then repeat

Time:10-06

I have the following dataframe, ordered by ID and Date :

df :

ID     Date
A      2019-01-02
A      2019-02-09
A      2019-02-20
A      2019-02-27
A      2019-03-16
A      2019-03-29
A      2019-04-17
A      2019-04-19
A      2019-05-26
A      2019-06-12
B      2017-01-01
B      ...

What I want is, when I encountered a row, to remove all rows with a date in the 40 days following, and repeat starting at the next (non-deleted) row I encounter. I want it for each ID.

For example with ID=A :

  • I encounter the first row, date is 2019-01-02. 40 days after is 2019-02-11, so I remove all rows with Date strictly between those 2 dates. So second row is removed (Date = 2019-02-09)
  • Then, I look for the next row with Date >= 2019-02-11 (2019-02-11 being the end of the previous 40-days period). I find the 3rd row, with Date = 2019-02-20. This is now the reference date taken, and 40 days after is 2019-04-01. So 4th, 5th, 6th rows are removed
  • ...

Note that my 40-days periods are starting when I encounter a new row and no 40-days period is currently active. Thus, those periods are not the same for all IDs.

This gives the following output dataset

ID     Date
A      2019-01-02
A      2019-02-20
A      2019-04-17
A      2019-06-12
B      2017-01-01
B      ...

Thanks in advance

CodePudding user response:

You can loop though the dataframe to create a filter

import pandas as pd
import datetime 

# create your dataframe "df" here...
# with columns "ID" and "Date"

# make sure your Date is a datetime field!
df['Date'] = pd.to_datetime(df['Date'])

prev_id = None
prev_date = None
keep_rows = []

for _, row in df.iterrows():

    if row['ID'] == prev_id:
        delete = (row['Date'] - prev_date) < datetime.timedelta(days=40) 
        if delete:
            keep_rows.append(False)    
            continue
        
    keep_rows.append(True)
    prev_id = row['ID']
    prev_date = row['Date']

And filter your dataframe like this

df[keep_rows]
  • Related