pandas' drop_duplicates function accepts keep="first", keep="last", or keep=False, but I want to be able to keep N duplicates. Instead of keeping just one (with "first" or "last") or none (with False), I want to keep a certain number of the duplicates.
Any help is appreciated!
CodePudding user response:
Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:
n = 3
df.groupby('drop_dup_col').head(n)
This keeps the first three rows for each value of the column, counting from the top (head) of the dataframe. To count from the bottom of the df instead, use .tail(n).
Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are deduplicating on.
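As a minimal sketch of the approach above on made-up data (the sample frame and the 'value' column are assumptions for illustration; 'drop_dup_col' is the placeholder name used above):

```python
import pandas as pd

# Hypothetical sample data: 'a' occurs 4 times, 'b' twice, 'c' once
df = pd.DataFrame({
    'drop_dup_col': ['a', 'a', 'b', 'a', 'a', 'b', 'c'],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

n = 3
# Keep at most the first n rows of every group; the original row order is preserved
kept = df.groupby('drop_dup_col').head(n)
print(kept)
```

Only the fourth 'a' row is dropped here; groups with fewer than n rows are returned unchanged.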
Multiple columns can be specified in groupby using:
df.groupby(['col1','col5'])
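A small sketch of the multi-column case, again on invented data ('col1' and 'col5' are the placeholder names from above, 'other' is an assumed extra column):

```python
import pandas as pd

# Hypothetical data: three rows share ('x', 1); the other key pairs occur once
df = pd.DataFrame({
    'col1': ['x', 'x', 'x', 'x', 'y'],
    'col5': [1, 1, 1, 2, 1],
    'other': [10, 20, 30, 40, 50],
})

n = 2
# A row counts as a duplicate only if both 'col1' and 'col5' match
kept_multi = df.groupby(['col1', 'col5']).head(n)
print(kept_multi)
```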
Regarding the question in your comment:
It's a bit harder to implement: if you want to delete, say, 3 duplicates, a group must contain at least 3 duplicates. Otherwise, when only 2 duplicates occur, both are deleted from the data and no row of that group is kept.
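import pandas as pd

n = 3
# Count how many rows share each value of 'drop_dup_col'
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows belonging to groups with at least n members
df2 = df.loc[df['dup_count'] >= n]
# After the concat those rows appear twice, so keep=False drops them
# (rows that are identical in every column are dropped as well)
df3 = pd.concat([df, df2])
df3 = df3.drop_duplicates(keep=False)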
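To see which rows the count-and-filter steps above leave behind, here is a sketch on made-up data. The sample frame and the 'value' column are assumptions, and the column is selected before calling transform so that the result is a single Series:

```python
import pandas as pd

# Hypothetical data: 'a' occurs 3 times, 'b' twice, 'c' once
df = pd.DataFrame({
    'drop_dup_col': ['a', 'a', 'a', 'b', 'b', 'c'],
    'value': [1, 2, 3, 4, 5, 6],
})

n = 3
# Size of the group each row belongs to
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows from groups with at least n members
big_groups = df.loc[df['dup_count'] >= n]
# Those rows occur twice after the concat, so keep=False removes them
remaining = pd.concat([df, big_groups]).drop_duplicates(keep=False)
print(remaining)
```

With this data only the 'b' and 'c' rows remain, since the 'a' group reaches the threshold of n rows.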
CodePudding user response:
I believe a combination of groupby and tail(N) should work for this. For example, if you want to keep 4 duplicates per value of df['myColumnDuplicates']:
df.groupby('myColumnDuplicates').tail(4)
To be more precise, and to complement @Stijn's answer: tail(n) keeps the last n duplicated rows found, while head(n) keeps the first n.
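The difference between the two can be sketched side by side on invented data ('myColumnDuplicates' is the column name from the answer, 'order' is an assumed helper column that records the row position):

```python
import pandas as pd

# Hypothetical data: 'a' occurs 4 times, 'b' twice
df = pd.DataFrame({
    'myColumnDuplicates': ['a', 'a', 'a', 'b', 'a', 'b'],
    'order': [0, 1, 2, 3, 4, 5],
})

# First two rows of each group, counted from the top
first_two = df.groupby('myColumnDuplicates').head(2)
# Last two rows of each group, counted from the bottom
last_two = df.groupby('myColumnDuplicates').tail(2)

print(first_two)
print(last_two)
```

Both calls return rows in their original order and with their original index, so the result can be compared directly against the input frame.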