Remove advanced duplicate records (distinct across more than one column in a DataFrame)

I want to remove the duplicated rows in my pandas DataFrame.

My DataFrame looks like this:

#  col_1  col_2  col_3  col_4
1    a      a      1      1    # unwanted
2    a      b     0.7    0.5
3    a      c     0.5    0.3
4    b      a     0.7    0.5   # Duplicated
5    b      b      1      1    # unwanted
6    b      c     0.8    0.6
7    c      a     0.5    0.3   # Duplicated
8    c      b     0.8    0.6   # Duplicated
9    c      c      1      1    # unwanted

How can we clean up this DataFrame and remove the unwanted and duplicated rows?

You might think this DataFrame is like a square matrix and that we could use np.tril, but that is not the case here, because we need to calculate the rank of col_3 and col_4.

CodePudding user response:

In your case, apply np.sort to the two key columns first, then call drop_duplicates:

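import numpy as np

# Sort each row's (col_1, col_2) pair so that (b, a) becomes (a, b)
df[['col_1','col_2']] = np.sort(df[['col_1','col_2']].values, axis=1)
# Keep the first row of each pair, then drop rows where both columns are equal
out = df.drop_duplicates(['col_1','col_2']).query('col_1 != col_2')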
out
Out[118]: 
  col_1 col_2  col_3  col_4
2     a     b    0.7    0.5
3     a     c    0.5    0.3
6     b     c    0.8    0.6
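
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same idea. It rebuilds the sample DataFrame from the question (the 1..9 index is assumed from the table) and wraps the sort-then-drop steps in a helper. The name drop_symmetric_pairs is made up for illustration, and unlike the snippet above it works on a copy instead of overwriting df.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the 1..9 index is an assumption
df = pd.DataFrame(
    {
        "col_1": list("aaabbbccc"),
        "col_2": list("abcabcabc"),
        "col_3": [1, 0.7, 0.5, 0.7, 1, 0.8, 0.5, 0.8, 1],
        "col_4": [1, 0.5, 0.3, 0.5, 1, 0.6, 0.3, 0.6, 1],
    },
    index=range(1, 10),
)


def drop_symmetric_pairs(frame, a="col_1", b="col_2"):
    """Hypothetical helper: keep one row per unordered (a, b) pair."""
    result = frame.copy()
    # Sort each pair within its row so (b, a) and (a, b) compare equal
    result[[a, b]] = np.sort(result[[a, b]].to_numpy(), axis=1)
    # Keep the first occurrence of each pair
    result = result.drop_duplicates([a, b])
    # Drop the "unwanted" rows where both columns hold the same value
    return result[result[a] != result[b]]


out = drop_symmetric_pairs(df)
print(out)
```

Working on a copy keeps the original col_1/col_2 order available, which may matter if the ranking of col_3 and col_4 is computed afterwards.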

CodePudding user response:

Use .drop_duplicates:

df.drop_duplicates(subset=['col_1', 'col_2'],
                   keep=False, inplace=True)
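
It may help to check what keep=False does before relying on it. The sketch below uses toy data (not the question's table) to compare it with the default keep='first': keep=False removes every row that belongs to a duplicated group, while the default leaves one representative. Note also that in the question's data the (col_1, col_2) pairs only repeat once their order is ignored, so drop_duplicates on those two columns alone would likely remove nothing unless each pair is sorted first.

```python
import pandas as pd

# Toy data for illustration only (not from the question)
toy = pd.DataFrame(
    {
        "col_1": ["a", "a", "b"],
        "col_2": ["x", "x", "y"],
        "val": [1, 2, 3],
    }
)

# Default keep='first': one row per (col_1, col_2) pair survives
first = toy.drop_duplicates(subset=["col_1", "col_2"])

# keep=False: every row that has a duplicate is removed entirely
none_kept = toy.drop_duplicates(subset=["col_1", "col_2"], keep=False)

print(first)
print(none_kept)
```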