Python data frame: drop duplicate rows based on pairwise columns

How do I remove the duplicate rows based on the pairwise columns (Col1, Col2) and (Col3, Col4)?

import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})


Col1    Col2    Col3    Col4
 A        B      C       D
 A        B      A       B
 C        D      C       D
 A        B      B       A
 C        D      D       C

The desired output is:

Col1    Col2    Col3    Col4
 A        B      C       D
 A        B      B       A
 C        D      D       C

Rows two and three are dropped because, within each of them, the pair (Col1, Col2) equals the pair (Col3, Col4):

A B = A B and C D = C D

I tried something like

df.drop_duplicates(subset=[['Col1', 'Col2'],['Col3', 'Col4']])

but this is not right.
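
For context, `drop_duplicates` compares each row against the other rows on the columns given in `subset`; it does not compare one group of columns with another group inside the same row. A minimal sketch on the same data, using a flat `subset` (the variable name `first_pairs` is only for illustration), shows the difference:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})

# Rows count as duplicates of each other when their (Col1, Col2) values repeat,
# so only the first occurrence of each (Col1, Col2) combination survives.
first_pairs = df.drop_duplicates(subset=['Col1', 'Col2'])
print(first_pairs)
```

This keeps the rows with index 0 and 2, which is not the desired output, so a comparison between the column pairs within each row is needed instead.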

CodePudding user response:

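Compare the underlying values of the two column pairs and keep the rows where they differ in at least one position:

import numpy as np
out = df[np.any(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values, axis=1)]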
Out[298]: 
  Col1 Col2 Col3 Col4
0    A    B    C    D
3    A    B    B    A
4    C    D    D    C
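
If the comparison is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the function name `drop_matching_pairs` and its parameters are made up for illustration, and it assumes a row should be dropped only when every column in the first group equals the corresponding column in the second group.

```python
import numpy as np
import pandas as pd


def drop_matching_pairs(df, left_cols, right_cols):
    """Drop rows whose left_cols values equal their right_cols values.

    Hypothetical helper, not part of pandas. The two column lists are
    compared position by position and must have the same length.
    """
    left = df[left_cols].to_numpy()
    right = df[right_cols].to_numpy()
    # Keep a row if at least one position differs between the two groups.
    keep = np.any(left != right, axis=1)
    return df[keep]


df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})

result = drop_matching_pairs(df, ['Col1', 'Col2'], ['Col3', 'Col4'])
print(result)
```

On the sample data this keeps the rows with index 0, 3 and 4.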

CodePudding user response:

You could compare the columns directly and keep the rows where at least one of the two comparisons differs:

new_df = df[(df['Col1'] != df['Col3']) | (df['Col2'] != df['Col4'])]
print(new_df)

Output:

  Col1 Col2 Col3 Col4
0    A    B    C    D
3    A    B    B    A
4    C    D    D    C
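
One case the sample data does not cover is a row where only one column of the pair matches, for example (A, B, A, C). Assuming such a row should be kept, because (A, B) is not equal to (A, C), the two conditions need to be combined with `|` rather than `&`. A small check on made-up data (the frame `df2` and the mask names are only for illustration):

```python
import pandas as pd

# Made-up data: row 1 is a full pair match, row 2 matches only in the first column.
df2 = pd.DataFrame({'Col1': ['A', 'A', 'A'],
                    'Col2': ['B', 'B', 'B'],
                    'Col3': ['C', 'A', 'A'],
                    'Col4': ['D', 'B', 'C']})

# Keep a row unless both columns match their counterparts.
keep_any = (df2['Col1'] != df2['Col3']) | (df2['Col2'] != df2['Col4'])
# Stricter variant: keep a row only if both columns differ.
keep_all = (df2['Col1'] != df2['Col3']) & (df2['Col2'] != df2['Col4'])

print(df2[keep_any].index.tolist())
print(df2[keep_all].index.tolist())
```

With `|` the partially matching row 2 is kept; with `&` it is dropped along with the full match in row 1.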

CodePudding user response:

I used the following approach:

For the DataFrame:

import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})

Compare the values of (Col1, Col2) with (Col3, Col4), keep the rows where at least one position differs, and drop any repeated rows:

mask = (df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values).any(axis=1)
desired_output = df[mask].drop_duplicates()
desired_output

Output:

    Col1    Col2    Col3    Col4
0    A       B       C        D
3    A       B       B        A
4    C       D       D        C