Home > Software design >  Pandas: Drop duplicates that appear within a time interval pandas
Pandas: Drop duplicates that appear within a time interval pandas

Time:04-29

We have a dataframe containing an 'ID' and 'DAY' columns, which shows when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened 30 days apart, tops. Please see the example below:

Current Dataset:

   ID        DAY           
0   1  22.03.2020       
1   1  18.04.2020       
2   2  10.05.2020       
3   2  13.01.2020       
4   3  30.03.2020       
5   3  31.03.2020       
6   3  24.02.2021 

Goal:

   ID     DAY           
0   1  22.03.2020       
1   2  10.05.2020       
2   2  13.01.2020       
3   3  30.03.2020       
4   3  24.02.2021      

Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...

CodePudding user response:

You can try group by ID column and diff the DAY column in each group

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

from datetime import timedelta

m = timedelta(days=30)

out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)

   ID        DAY
0   1 2020-03-22
1   2 2020-05-10
2   2 2020-01-13
3   3 2020-03-30
4   3 2021-02-24

To convert to original date format, you can use dt.strftime

out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)

   ID         DAY
0   1  22.03.2020
1   2  10.05.2020
2   2  13.01.2020
3   3  30.03.2020
4   3  24.02.2021

CodePudding user response:

You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
       )

df[~mask]

NB. the potential drawback of this approach is that if the customer makes a new complaint within the 30days, this restarts the threshold for the next complaint

output:

   ID        DAY
0   1 2020-03-22
2   2 2020-10-05
3   2 2020-01-13
4   3 2020-03-30
6   3 2021-02-24

Thus another approach might be to resample the data per group to 30days:

(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)

output:

   ID        DAY
0   1 2020-03-22
1   2 2020-01-13
2   2 2020-10-05
3   3 2020-03-30
4   3 2021-02-24
  • Related