How to remove misaligned time series rows in pandas dataframe?-CodePudding

I have dataframe in pandas containing project number and processing times columns for a time series. Below is a made-up example. For each projectNo I'd like to only include rows with a 1 month processingTime interval difference to another row (with the same projectNo). So in the example below I'd like to remove the rows in index 5,6 and 7. How could this be done given the above logic?

import pandas as pd

projectNo = [1,1,1,2,2,2,3,3]
processingTime = ['2021-11','2021-12','2022-01','2022-03','2022-04','2022-06','2020-05','2022-06']
#create new df 
df = pd.DataFrame({'projectNo':projectNo, 'processingTime':processingTime })

index	projectNo	processingTime
0	1	2021-11
1	1	2021-12
2	1	2022-01
3	2	2022-03
4	2	2022-04
5	2	2022-06
6	3	2020-05
7	3	2020-07

CodePudding user response：

You can try

df['processingTime'] = pd.to_datetime(df['processingTime'])

out = (df.groupby('projectNo', as_index=False, group_keys=False)
       # 1 month difference at least have 29 days, at most have 31 days
       # if the difference is greater than 31 days, it means more than 1 month
       # Fill the 1st row's difference with the 1st row and the 2nd
       .apply(lambda g: g[~g['processingTime'].diff().bfill().gt(pd.Timedelta(days=31))]))

print(out)

   projectNo processingTime
0          1     2021-11-01
1          1     2021-12-01
2          1     2022-01-01
3          2     2022-03-01
4          2     2022-04-01