Drop only a specified number of duplicates in pandas

pandas' drop_duplicates function accepts keep="first", keep="last", or keep=False. Instead of keeping just one duplicate (with "first" or "last") or none (with False), I want to keep a certain number N of the duplicates.

Any help is appreciated!

CodePudding user response：

Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:

n = 3
df.groupby('drop_dup_col').head(n)

This keeps the first three rows for each value of the column, counting from the top (head) of the DataFrame. If you want to count from the bottom of the df instead, use .tail(n).

Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are using to dedup your df.
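As a minimal sketch of the above, assuming a small made-up DataFrame (the sample values and the extra 'value' column are illustrative only, not from the question):

```python
import pandas as pd

# Hypothetical sample data: 'drop_dup_col' contains repeated values.
df = pd.DataFrame({
    "drop_dup_col": ["a", "a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

n = 2
# Keep at most the first n rows for each distinct value of 'drop_dup_col'.
kept = df.groupby("drop_dup_col").head(n)
print(kept)
```

Groups with fewer than n rows pass through unchanged, and the surviving rows keep their original order and index.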

Multiple columns can be specified in groupby using:

df.groupby(['col1','col5'])
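A short sketch of the multi-column case, assuming hypothetical data (only the column names 'col1' and 'col5' come from the snippet above; the 'other' column and all values are made up):

```python
import pandas as pd

# Hypothetical data; rows are duplicates only when BOTH key columns match.
df = pd.DataFrame({
    "col1": ["x", "x", "x", "x", "y"],
    "col5": [1, 1, 1, 2, 1],
    "other": [10, 20, 30, 40, 50],
})

n = 2
# Keep at most n rows per distinct (col1, col5) combination.
kept = df.groupby(["col1", "col5"]).head(n)
print(kept)
```

Here only the ("x", 1) combination occurs more than n times, so only its third row is dropped.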

Regarding the question in your comment:

It's a bit harder to implement: if you want to delete, say, 3 duplicates, each group should also contain at least 3 duplicates. Otherwise, when only 2 duplicates occur, they are all deleted from the data and no row is kept.

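# Assumes `import pandas as pd` and an existing DataFrame `df`.
n = 3
# Count how many rows share each value of 'drop_dup_col'.
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows that belong to groups with at least n members.
df2 = df.loc[df['dup_count'] >= n]
# Rows present in both frames cancel out under keep=False, so only groups
# with fewer than n members remain (fully identical rows are dropped too).
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)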
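One possible reading of "delete n duplicates" is: remove the first n rows of each group, but leave any group with n rows or fewer untouched so it does not vanish entirely. The following is a rough sketch of that interpretation only, using groupby(...).cumcount() as a swapped-in technique rather than the concat approach above; the sample data and variable names are assumptions:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "drop_dup_col": ["a", "a", "a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6, 7, 8],
})

n = 3
grouped = df.groupby("drop_dup_col")
# Number of rows in each row's group, aligned to df's index.
group_size = grouped["drop_dup_col"].transform("size")
# 0-based position of each row within its group.
position = grouped.cumcount()

# Drop a row only if its group has more than n rows
# and the row is among the first n of that group.
to_drop = (group_size > n) & (position < n)
result = df.loc[~to_drop]
print(result)
```

With this data, group "a" loses its first three rows, while "b" and "c" are kept whole because they have no more than n rows.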

CodePudding user response：

I believe a combination of groupby and tail(N) should work for this. For example, if you want to keep 4 duplicates in df['myColumnDuplicates']:

df.groupby('myColumnDuplicates').tail(4)

To be more precise, and to complement @Stijn's answer: tail(n) keeps the last n duplicated rows found, while head(n) keeps the first n.
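To make the head/tail difference concrete, here is a small sketch on made-up data (the 'order' column and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: three rows per value of 'myColumnDuplicates'.
df = pd.DataFrame({
    "myColumnDuplicates": ["a", "a", "a", "b", "b", "b"],
    "order": [1, 2, 3, 4, 5, 6],
})

# First two rows of each group vs. last two rows of each group.
first_two = df.groupby("myColumnDuplicates").head(2)
last_two = df.groupby("myColumnDuplicates").tail(2)

print(first_two)
print(last_two)
```

Both calls return rows in their original order; they differ only in which end of each group is kept.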