I have a dataframe of dates
from datetime import datetime as dtm
df=pd.DataFrame([dtm(2022, 1, 1),dtm(2022,1,2),dtm(2022,1,5),dtm(2022,2,28),dtm(2022,3,1),dtm(2022,5,1)])
0
0 2022-01-01
1 2022-01-02
2 2022-01-05
3 2022-02-28
4 2022-03-01
5 2022-05-01
I want to start from the first date, and get rid of any date(s) that's within 20 days of the referenced date.
The desired df would be
pd.DataFrame([dtm(2022, 1, 1),dtm(2022,2,28),dtm(2022,5,1)])
0 2022-01-01
1 2022-02-28
2 2022-05-01
I tried writing a loop, but the referenced dataframe would need to update as loop is run. I'm not sure how to deal with that.
#Edit, this this example using 3 days
df=pd.DataFrame([dtm(2022, 1, 1),dtm(2022,1,2),dtm(2022,2,27),dtm(2022,2,28),dtm(2022,3,3),dtm(2022,3,5),dtm(2022,5,1)])
0
0 2022-01-01
1 2022-01-02
2 2022-02-27
3 2022-02-28
4 2022-03-03
5 2022-03-05
6 2022-05-01
The desired output is
pd.DataFrame([dtm(2022, 1, 1),dtm(2022,2,27),dtm(2022,3,3),dtm(2022,5,1)])
0
0 2022-01-01
1 2022-02-27
2 2022-03-03
3 2022-05-01
CodePudding user response:
IIUC, you can use a custom function to sequentially remove the values and update the reference:
def cum_drop(s, delta='3d'):
if s.empty:
return s
ref = s[0]
delta = pd.Timedelta(delta)
idx = [s.index[0]]
for i, d in s.items():
if d-ref>=delta:
idx.append(i)
ref = d
return s.loc[idx]
df.apply(cum_drop)
output:
0
0 2022-01-01
2 2022-02-27
4 2022-03-03
6 2022-03-07
7 2022-05-01
used input:
0
0 2022-01-01
1 2022-01-02
2 2022-02-27
3 2022-02-28
4 2022-03-03
5 2022-03-05
6 2022-03-07
7 2022-05-01