Pandas: Drop duplicates that appear within a time interval pandas-CodePudding

We have a dataframe containing an 'ID' and 'DAY' columns, which shows when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened 30 days apart, tops. Please see the example below:

Current Dataset:

   ID        DAY           
0   1  22.03.2020       
1   1  18.04.2020       
2   2  10.05.2020       
3   2  13.01.2020       
4   3  30.03.2020       
5   3  31.03.2020       
6   3  24.02.2021

Goal:

   ID     DAY           
0   1  22.03.2020       
1   2  10.05.2020       
2   2  13.01.2020       
3   3  30.03.2020       
4   3  24.02.2021

Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...

CodePudding user response：

You can try group by ID column and diff the DAY column in each group

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

from datetime import timedelta

m = timedelta(days=30)

out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)

print(out)

   ID        DAY
0   1 2020-03-22
1   2 2020-05-10
2   2 2020-01-13
3   3 2020-03-30
4   3 2021-02-24

To convert to original date format, you can use dt.strftime

out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')

print(out)

   ID         DAY
0   1  22.03.2020
1   2  10.05.2020
2   2  13.01.2020
3   3  30.03.2020
4   3  24.02.2021

CodePudding user response：

You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
       )

df[~mask]

NB. the potential drawback of this approach is that if the customer makes a new complaint within the 30days, this restarts the threshold for the next complaint

output:

   ID        DAY
0   1 2020-03-22
2   2 2020-10-05
3   2 2020-01-13
4   3 2020-03-30
6   3 2021-02-24

Thus another approach might be to resample the data per group to 30days:

(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)

output:

   ID        DAY
0   1 2020-03-22
1   2 2020-01-13
2   2 2020-10-05
3   3 2020-03-30
4   3 2021-02-24