I have a dataframe that contains duplicate values in columns, I need to identify the duplicates, keep the first and replace the others, could be zero or NaN
Here is an example of the df
id | Tables | Industry | Logistic | Comercial | Feedback | Return | Finished |
---|---|---|---|---|---|---|---|
1 | 166 | 325158.0 | NaN | None | 2140957.0 | NaN | NaN |
2 | 379 | 161616.0 | 417296.0 | None | 1808454.0 | NaN | NaN |
3 | 136 | 1729.0. | 417296.0 | None | 1734326.0 | NaN | 217 |
4 | 1173 | 174533.0 | 417296.0 | None | 1734188.0 | NaN | 217 |
5 | 111 | 1531.0 | 406413.0 | None | 1714706.0 | NaN | 217 |
Expected result:
id | Tables | Industry | Logistic | Comercial | Feedback | Return | Finished |
---|---|---|---|---|---|---|---|
1 | 166 | 325158.0 | NaN | None | 2140957.0 | NaN | NaN |
2 | 379 | 161616.0 | 417296.0 | None | 1808454.0 | NaN | NaN |
3 | 136 | 1729.0. | None | None | 1734326.0 | NaN | 217 |
4 | 1173 | 174533.0 | None | None | 1734188.0 | NaN | None |
5 | 111 | 1531.0 | 406413.0 | None | 1714706.0 | NaN | None |
CodePudding user response:
To, for example, replace the duplicates in the Logistic column, first find the duplicates:
df.Logistic.duplicated()
By default, keep = ‘first’
(more info about pd.duplicated)
So when you localize al duplicates using .loc, you can easily replace them:
df.loc[df.Logistic.duplicated(), 'Logistic'] = None
A for-loop can do this trick for all your columns:
for col_name in df.columns:
df.loc[df[col_name].duplicated(), col_name] = None
If anyone has a more elegant way to do it for all columns, would love to see it :)