I have dataframe in pandas containing project number and processing times columns for a time series. Below is a made-up example. For each projectNo I'd like to only include rows with a 1 month processingTime interval difference to another row (with the same projectNo). So in the example below I'd like to remove the rows in index 5,6 and 7. How could this be done given the above logic?
import pandas as pd
projectNo = [1,1,1,2,2,2,3,3]
processingTime = ['2021-11','2021-12','2022-01','2022-03','2022-04','2022-06','2020-05','2022-06']
#create new df
df = pd.DataFrame({'projectNo':projectNo, 'processingTime':processingTime })
index | projectNo | processingTime |
---|---|---|
0 | 1 | 2021-11 |
1 | 1 | 2021-12 |
2 | 1 | 2022-01 |
3 | 2 | 2022-03 |
4 | 2 | 2022-04 |
5 | 2 | 2022-06 |
6 | 3 | 2020-05 |
7 | 3 | 2020-07 |
CodePudding user response:
You can try
df['processingTime'] = pd.to_datetime(df['processingTime'])
out = (df.groupby('projectNo', as_index=False, group_keys=False)
# 1 month difference at least have 29 days, at most have 31 days
# if the difference is greater than 31 days, it means more than 1 month
# Fill the 1st row's difference with the 1st row and the 2nd
.apply(lambda g: g[~g['processingTime'].diff().bfill().gt(pd.Timedelta(days=31))]))
print(out)
projectNo processingTime
0 1 2021-11-01
1 1 2021-12-01
2 1 2022-01-01
3 2 2022-03-01
4 2 2022-04-01