Pandas: Group records every N days and mark duplicates


I'd like to mark duplicate records within my data.

A duplicate is defined per person_key: a record is a duplicate if it falls within 25 days of the first record of its group. After 25 days, the count resets and the next record starts a new group.

So if I had records for a person_key on Day 0, Day 20, Day 26, and Day 30, the first and third would be kept, as the third is more than 25 days after the first. The second and fourth are marked as duplicates, since each is within 25 days of the "first of its group".

In other words, I think I need to identify groups of 25-day blocks and then "dedupe" within those blocks. I'm struggling with how to create these initial groups in the first place.

I will eventually have to apply this to 5 million records, so I'm trying to steer clear of pd.DataFrame.apply.

person_key  date        duplicate
A           2019-01-01  False
B           2019-02-01  False
B           2019-02-12  True
C           2019-03-01  False
A           2019-01-10  True
A           2019-01-26  False
A           2019-01-28  True
A           2019-02-10  True
A           2019-04-01  False
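For reference, the sample frame above can be built like this (a sketch; the duplicate column in the table is the desired output, so only person_key and date are actual inputs):

```python
import pandas as pd

# Sample input in the same row order as the table above
df = pd.DataFrame({
    "person_key": ["A", "B", "B", "C", "A", "A", "A", "A", "A"],
    "date": ["2019-01-01", "2019-02-01", "2019-02-12", "2019-03-01",
             "2019-01-10", "2019-01-26", "2019-01-28", "2019-02-10",
             "2019-04-01"],
})
df["date"] = pd.to_datetime(df["date"])
```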

Thanks for your help!

CodePudding user response:

Let us create a custom grouper with a frequency of 25 days, then group the given dataframe by person_key and this grouper. Use cumcount to create a sequential counter per group, then compare this counter with 0 to identify the duplicate rows:

import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# Bin dates into fixed 25-day blocks, anchored at the first date in the column
grouper = pd.Grouper(key='date', freq='25D', origin='start')

# Every row after the first within each (person_key, 25-day block) is a duplicate
df['duplicate'] = df.groupby(['person_key', grouper]).cumcount().ne(0)

Result

print(df)

  person_key       date  duplicate
0          A 2019-01-01      False
1          B 2019-02-01      False
2          B 2019-02-12       True
3          C 2019-03-01      False
4          A 2019-01-10       True
5          A 2019-01-26      False
6          A 2019-01-28       True
7          A 2019-02-10       True
8          A 2019-04-01      False
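As a self-contained check of the approach above (the input and expected values are taken from the question; origin='start' requires pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({
    "person_key": ["A", "B", "B", "C", "A", "A", "A", "A", "A"],
    "date": pd.to_datetime([
        "2019-01-01", "2019-02-01", "2019-02-12", "2019-03-01",
        "2019-01-10", "2019-01-26", "2019-01-28", "2019-02-10",
        "2019-04-01",
    ]),
})

# Bin dates into fixed 25-day blocks anchored at the first date in the data,
# then flag every row after the first within each (person, block) pair
grouper = pd.Grouper(key="date", freq="25D", origin="start")
df["duplicate"] = df.groupby(["person_key", grouper]).cumcount().ne(0)

print(df["duplicate"].tolist())
# [False, False, True, False, True, False, True, True, False]
```

One caveat worth noting: origin='start' anchors the 25-day bins at the first date in the whole column, not at each person_key's first record, so the bin edges are shared across all persons. That happens to line up with the expected output for this sample.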