I'd like to flag whether a row is duplicated, and record whether it belongs to the 1st, 2nd, 3rd, etc. group of duplicated rows in a Pandas DataFrame.
More visually, I'd like to go from:
id | Country | City |
---|---|---|
1 | France | Paris |
2 | France | Paris |
3 | France | Lyon |
4 | France | Lyon |
5 | France | Lyon |
to
id | Country | City | duplicated_flag |
---|---|---|---|
1 | France | Paris | 1 |
2 | France | Paris | 1 |
3 | France | Lyon | 2 |
4 | France | Lyon | 2 |
5 | France | Lyon | 2 |
Note that id is not taken into account when determining whether a row is duplicated.
CodePudding user response:
Two options:
First, if you have lots of columns that you need to compare, you can use:
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
We drop the columns that aren't needed in the comparison. Then we check whether each row differs from the one above it using .shift() and .any(axis=1). Finally, .cumsum() turns those "new group" markers into the duplicated_flag group numbers.
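For reference, here's a minimal, self-contained version of that first option, assuming the sample data from the question:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Country": ["France", "France", "France", "France", "France"],
    "City": ["Paris", "Paris", "Lyon", "Lyon", "Lyon"],
})

# Compare every column except "id"
comparison_df = df.drop("id", axis=1)

# A row that differs from the one above it starts a new group;
# .cumsum() turns those markers into 1-based group numbers
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()

print(df)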
But if you only have two columns to compare (or if, for some reason, you have lots of columns that you would need to drop), you can compare the columns one at a time and then use .cumsum() to get the value of duplicated_flag for each row. It's a bit more verbose, so I'm not super happy with this option, but I'm leaving it here for completeness in case it suits your use case better:
country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)
Both options output:
id Country City duplicated_flag
0 1 France Paris 1
1 2 France Paris 1
2 3 France Lyon 2
3 4 France Lyon 2
4 5 France Lyon 2
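One assumption worth flagging: both snippets only compare each row to the one directly above it, so they rely on duplicated rows being adjacent (as in the sample data). If the rows might not be sorted that way, a sketch like the following, using groupby(...).ngroup() on the example's Country and City columns, numbers the groups by order of first appearance instead:

# Number groups of identical (Country, City) pairs by first appearance,
# which also works when duplicated rows are not adjacent
df["duplicated_flag"] = df.groupby(["Country", "City"], sort=False).ngroup() + 1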