I'd like to mark duplicate records within my data.
A duplicate is defined per person_key: a record is a duplicate if it falls within 25 days of the first record of its group. After 25 days, the count is reset and a new group begins.
So if a person_key had records on Day 0, Day 20, Day 26, and Day 30, the first and third would be kept, since the third is more than 25 days after the first. The second and fourth are marked as duplicates, since each is within 25 days of the first record of its group.
In other words, I think I need to identify groups of 25-day blocks and then dedupe within those blocks. I'm struggling with how to create these initial groups.
I will eventually have to apply this to ~5m records, so I am trying to steer clear of pd.DataFrame.apply.
person_key date duplicate
A 2019-01-01 False
B 2019-02-01 False
B 2019-02-12 True
C 2019-03-01 False
A 2019-01-10 True
A 2019-01-26 False
A 2019-01-28 True
A 2019-02-10 True
A 2019-04-01 False
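Here is a minimal frame matching the table above (without the duplicate column, which is the desired output):

```python
import pandas as pd

# Sample records in the same row order as the table above;
# dates are parsed to datetime64 so date arithmetic works.
df = pd.DataFrame(
    {
        "person_key": ["A", "B", "B", "C", "A", "A", "A", "A", "A"],
        "date": [
            "2019-01-01", "2019-02-01", "2019-02-12", "2019-03-01",
            "2019-01-10", "2019-01-26", "2019-01-28", "2019-02-10",
            "2019-04-01",
        ],
    }
)
df["date"] = pd.to_datetime(df["date"])
```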
Thanks for your help!
CodePudding user response:
Let us create a custom grouper with a frequency of 25 days, then group the given dataframe by person_key and this grouper, and use cumcount to create a sequential counter per group. Comparing this counter with 0 identifies the duplicate rows:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])  # ensure datetime dtype
grouper = pd.Grouper(key='date', freq='25D', origin='start')  # 25-day bins
df['duplicate'] = df.groupby(['person_key', grouper]).cumcount().ne(0)
Result
print(df)
person_key date duplicate
0 A 2019-01-01 False
1 B 2019-02-01 False
2 B 2019-02-12 True
3 C 2019-03-01 False
4 A 2019-01-10 True
5 A 2019-01-26 False
6 A 2019-01-28 True
7 A 2019-02-10 True
8 A 2019-04-01 False
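One caveat: origin='start' anchors the 25-day bins at the earliest date in the whole frame, not at each person's first record, so a person whose first record falls mid-bin could have records split across bins in a way the spec doesn't intend. If you need each person's blocks anchored to their own first date, a sketch of an equivalent vectorized approach (using transform and integer division rather than pd.Grouper; no apply involved):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "person_key": ["A", "B", "B", "C", "A", "A", "A", "A", "A"],
        "date": pd.to_datetime(
            ["2019-01-01", "2019-02-01", "2019-02-12", "2019-03-01",
             "2019-01-10", "2019-01-26", "2019-01-28", "2019-02-10",
             "2019-04-01"]
        ),
    }
)

# Each person's first date anchors their own 25-day blocks:
# block 0 covers days 0-24 from that first date, block 1 days 25-49, etc.
first = df.groupby("person_key")["date"].transform("min")
block = (df["date"] - first).dt.days // 25
df["duplicate"] = df.groupby(["person_key", block]).cumcount().ne(0)
```

On this sample the output matches the result above; the two approaches only diverge when a person's first record does not coincide with a global bin boundary.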