pandas finding duplicate rows with different label-CodePudding

I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but different label. These found cluster of disagreeing labels should then be numbered and put into a new dataframe. This isn't hard but I am wondering what the most elegant solution for this is. Here an example:

import pandas as pd

df = pd.DataFrame({
    "feature_1" : [0,0,0,4,4,2],
    "feature_2" : [0,5,5,1,1,3],
    "label" : ["A","A","B","B","D","A"]
})

result_df = pd.DataFrame({
    "cluster_index" : [0,0,1,1],
    "feature_1" : [0,0,4,4],
    "feature_2" : [5,5,1,1],
    "label" : ["A","B","B","D"]
})

CodePudding user response：

In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:

g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup()) # get group name
   .loc[g.transform('size').gt(1)]   # filter the non-duplicates
   # line below only to have a nice cluster_index range (0,1…)
   .assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)

output:

   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1

CodePudding user response：

First get all duplicated values per feature columns and then if necessary remove duplciated by all columns (here in sample data not necessary), last add GroupBy.ngroup for groups indices:

df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()

df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1