How do I remove rows where the column pair (Col1, Col2) duplicates the pair (Col3, Col4)?
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})
Col1 Col2 Col3 Col4
A    B    C    D
A    B    A    B
C    D    C    D
A    B    B    A
C    D    D    C
The desired output is:
Col1 Col2 Col3 Col4
A    B    C    D
A    B    B    A
C    D    D    C
Rows two and three are dropped because in each of them (Col1, Col2) equals (Col3, Col4):
A B = A B and C D = C D
I tried something like
df.drop_duplicates(subset=[['Col1', 'Col2'],['Col3', 'Col4']])
but this is not right.
CodePudding user response:
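Let us try comparing the underlying values:
import numpy as np
# keep a row if at least one element of (Col1, Col2) differs from (Col3, Col4)
out = df[np.any(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values, axis=1)]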
Out[298]:
  Col1 Col2 Col3 Col4
0    A    B    C    D
3    A    B    B    A
4    C    D    D    C
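Whether the row-wise reduction uses any or all only matters once a row matches in just one column of the pair. A minimal sketch of the difference, assuming a hypothetical third row ('A', 'B', 'A', 'C') that is not in the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row matches only in the first column of each pair.
df = pd.DataFrame({'Col1': ['A', 'A', 'A'],
                   'Col2': ['B', 'B', 'B'],
                   'Col3': ['C', 'A', 'A'],
                   'Col4': ['D', 'B', 'C']})

left = df[['Col1', 'Col2']].to_numpy()
right = df[['Col3', 'Col4']].to_numpy()

# Keep rows where at least one element of the pair differs.
keep_any = df[np.any(left != right, axis=1)]
# Keep rows only where every element of the pair differs.
keep_all = df[np.all(left != right, axis=1)]

print(keep_any.index.tolist())  # rows whose pair is not an exact duplicate
print(keep_all.index.tolist())  # rows with no matching element at all
```

With any, only the exact duplicate (row 1) is removed; with all, the partially matching row 2 is removed as well. On the question's sample data both give the same result, so pick the one that matches what "duplicate" should mean for your data.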
CodePudding user response:
You could try comparing columns like this.
# keep a row if either column of the pair differs (| rather than &, so partial matches are kept)
new_df = df[(df['Col1'] != df['Col3']) | (df['Col2'] != df['Col4'])]
print(new_df)
Output:
  Col1 Col2 Col3 Col4
0    A    B    C    D
3    A    B    B    A
4    C    D    D    C
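An equivalent way to write the column comparison is to build the mask for "the whole pair is equal" first and then negate it, which may be easier to read. A sketch using the question's data; the name pair_equal is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})

# True where the pair (Col1, Col2) equals the pair (Col3, Col4).
pair_equal = (df['Col1'] == df['Col3']) & (df['Col2'] == df['Col4'])

# Negating "both equal" keeps every row where at least one column differs.
new_df = df[~pair_equal]
print(new_df)
```

Here pair_equal is True only for rows 1 and 2, so rows 0, 3 and 4 remain.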
CodePudding user response:
I used the following approach.
Given the DataFrame:
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})
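Compare the values of (Col1, Col2) with (Col3, Col4) row by row and keep only the rows where the pairs differ:
# .any(axis=1) reduces the comparison to one boolean per row: True if at least one element differs
desired_output = df[(df[['Col1', 'Col2']].values != df[['Col3', 'Col4']].values).any(axis=1)]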
desired_output
Output:
  Col1 Col2 Col3 Col4
0    A    B    C    D
3    A    B    B    A
4    C    D    D    C
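If the column pairs should not be hard-coded, the comparison can be wrapped in a small helper. This is only a sketch: the function name drop_rows_with_equal_pairs and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

def drop_rows_with_equal_pairs(df, left_cols, right_cols):
    """Return the rows of df whose left_cols values differ from their right_cols values."""
    if len(left_cols) != len(right_cols):
        raise ValueError("left_cols and right_cols must have the same length")
    # Compare the raw arrays so differing column labels do not get in the way.
    left = df[list(left_cols)].to_numpy()
    right = df[list(right_cols)].to_numpy()
    # One boolean per row: True if at least one position differs.
    differs = (left != right).any(axis=1)
    return df[differs]

df = pd.DataFrame({'Col1': ['A', 'A', 'C', 'A', 'C'],
                   'Col2': ['B', 'B', 'D', 'B', 'D'],
                   'Col3': ['C', 'A', 'C', 'B', 'D'],
                   'Col4': ['D', 'B', 'D', 'A', 'C']})

result = drop_rows_with_equal_pairs(df, ['Col1', 'Col2'], ['Col3', 'Col4'])
print(result)
```

Note that the comparison is positional, so (A, B) and (B, A) count as different pairs, which matches the desired output in the question.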