Drop only specified amount of duplicates pandas

Time:04-05

The keep argument of pandas' drop_duplicates function accepts "first", "last", or False. Instead of keeping just one duplicate (with "first" or "last") or none (with False), I want to keep a specified number N of the duplicates.

Any help is appreciated!

CodePudding user response:

Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:

n = 3
df.groupby('drop_dup_col').head(n)

This keeps the first three rows for each value of the column, counting from the top (head) of the DataFrame. To count from the bottom of the df instead, use .tail(n).

Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are using to dedup your df.
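As a minimal, self-contained sketch of the snippet above (the sample data and the 'value' column are made up for illustration), this keeps at most n rows per key:

```python
import pandas as pd

# Hypothetical sample data: key 'a' occurs 4 times, 'b' twice, 'c' once
df = pd.DataFrame({
    'drop_dup_col': ['a', 'a', 'b', 'a', 'b', 'a', 'c'],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

n = 3
# Keep the first n rows of every group; the original index labels are kept
kept = df.groupby('drop_dup_col').head(n)
print(kept)
```

Here only the fourth 'a' row is dropped, because 'b' and 'c' occur fewer than n times.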

Multiple columns can be specified in groupby using:

df.groupby(['col1','col5'])
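A short sketch of the multi-column case, assuming made-up data in which 'col1' and 'col5' together form the duplicate key and 'other' is an unrelated column:

```python
import pandas as pd

# Hypothetical data: the pair ('x', 1) occurs three times
df = pd.DataFrame({
    'col1': ['x', 'x', 'x', 'x', 'y'],
    'col5': [1, 1, 1, 2, 1],
    'other': [10, 20, 30, 40, 50],
})

n = 2
# A row counts as a duplicate only if both col1 and col5 match
kept = df.groupby(['col1', 'col5']).head(n)
print(kept)
```

Only the third ('x', 1) row is removed; ('x', 2) and ('y', 1) are separate groups and stay.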

Regarding the question in your comment:

This is a bit harder to implement: if you want to delete, say, 3 duplicates, a group must contain at least 3 of them. Otherwise, a group with only 2 duplicates would be deleted from the data entirely and no row would be kept.

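import pandas as pd

n = 3
# Number of rows sharing each key, broadcast back to every row
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows whose key occurs at least n times
df2 = df.loc[df['dup_count'] >= n]
# Stacking df and df2 makes those rows appear twice
df3 = pd.concat([df, df2])
# keep=False drops every row that occurs more than once, i.e. all rows of
# groups with >= n members (rows identical in every column are dropped too)
df3 = df3.drop_duplicates(keep=False)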
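If the goal is to delete exactly n rows from each group that has at least n rows, and leave smaller groups alone, one possible sketch uses cumcount. This is an assumption about what the comment asked for, and the sample data is made up:

```python
import pandas as pd

# Hypothetical data: 'a' occurs 4 times, 'b' twice, 'c' once
df = pd.DataFrame({
    'drop_dup_col': ['a', 'a', 'a', 'a', 'b', 'b', 'c'],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

n = 3
grouped = df.groupby('drop_dup_col')
group_size = grouped['drop_dup_col'].transform('size')  # rows per key
position = grouped.cumcount()  # 0, 1, 2, ... within each key

# Drop the first n rows of a group only if the group has at least n rows;
# smaller groups are left untouched
result = df[(group_size < n) | (position >= n)]
print(result)
```

With this data, three of the four 'a' rows are removed, while 'b' and 'c' are kept as they are. Note that a group with exactly n rows disappears completely.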

CodePudding user response:

I believe a combination of groupby and tail(N) should work for this. For example, if you want to keep 4 duplicates in df['myColumnDuplicates']:

df.groupby('myColumnDuplicates').tail(4)

To be more precise, and to complement @Stijn's answer: tail(n) keeps the last n rows found for each duplicated value, while head(n) keeps the first n.
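A small side-by-side sketch of the difference, using made-up data and a smaller n than in the example above:

```python
import pandas as pd

# Hypothetical data: 'a' occurs 4 times, 'b' twice
df = pd.DataFrame({
    'myColumnDuplicates': ['a', 'a', 'a', 'b', 'b', 'a'],
    'value': [1, 2, 3, 4, 5, 6],
})

n = 2
first_n = df.groupby('myColumnDuplicates').head(n)  # first n rows per value
last_n = df.groupby('myColumnDuplicates').tail(n)   # last n rows per value
print(first_n)
print(last_n)
```

For 'a', head(n) keeps the rows with value 1 and 2, while tail(n) keeps the rows with value 3 and 6. Both keep the two 'b' rows, since that group has only n rows.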
