Home > front end >  Find duplicates, keep first and replace rest
Find duplicates, keep first and replace rest

Time:01-04

I have a dataframe that contains duplicate values in columns, I need to identify the duplicates, keep the first and replace the others, could be zero or NaN

Here is an example of the df

id Tables Industry Logistic Comercial Feedback Return Finished
1 166 325158.0 NaN None 2140957.0 NaN NaN
2 379 161616.0 417296.0 None 1808454.0 NaN NaN
3 136 1729.0. 417296.0 None 1734326.0 NaN 217
4 1173 174533.0 417296.0 None 1734188.0 NaN 217
5 111 1531.0 406413.0 None 1714706.0 NaN 217

Expected result:

id Tables Industry Logistic Comercial Feedback Return Finished
1 166 325158.0 NaN None 2140957.0 NaN NaN
2 379 161616.0 417296.0 None 1808454.0 NaN NaN
3 136 1729.0. None None 1734326.0 NaN 217
4 1173 174533.0 None None 1734188.0 NaN None
5 111 1531.0 406413.0 None 1714706.0 NaN None

CodePudding user response:

To, for example, replace the duplicates in the Logistic column, first find the duplicates:

df.Logistic.duplicated()

By default, keep = ‘first’ (more info about pd.duplicated)

So when you localize al duplicates using .loc, you can easily replace them:

df.loc[df.Logistic.duplicated(), 'Logistic'] = None

A for-loop can do this trick for all your columns:

for col_name in df.columns:
    df.loc[df[col_name].duplicated(), col_name] = None

If anyone has a more elegant way to do it for all columns, would love to see it :)

  • Related