Home > Software design >  Python Dataframe Duplicates Removal based on combination of two columns
Python Dataframe Duplicates Removal based on combination of two columns

Time:09-27

I have a pandas dataframe that contains duplicates, but not regular duplicates you can remove by using simple df.drop_duplicates. For example when combining columns 1 and 2, for the purpose of my work, AB and BA are the same, and the 5th row should be eliminated.

1 2
---
A B ===> AB
C D
E F
G H
B A ===> BA
I J
K L

I want to remove duplicates keeping only one of the first rows. This would lead to:

1 2
---
A B
C D
E F
G H
I J
K L

cannot figure out how to do that. any help would be appreciated.

Update: (adding another column with numbers)

1 2  3
-------
A B 0.2 ===> AB
C D 0.1
E F 0.0
G H 0.5
B A 0.2 ===> BA
I J 0.3
K L 0.6

CodePudding user response:

One solution could be:

import pandas as pd

data = {'1': {0: 'A', 1: 'C', 2: 'E', 3: 'G', 4: 'B', 5: 'I', 6: 'K'}, 
        '2': {0: 'B', 1: 'D', 2: 'F', 3: 'H', 4: 'A', 5: 'J', 6: 'L'},
        '3': {0: 0.2, 1: 0.1, 2: 0.0, 3: 0.5, 4: 0.2, 5: 0.3, 6: 0.6}
        }

df = pd.DataFrame(data)

out = df.loc[~df[['1','2']].apply(sorted, axis=1).duplicated()]
print(out)

   1  2    3
0  A  B  0.2
1  C  D  0.1
2  E  F  0.0
3  G  H  0.5
5  I  J  0.3
6  K  L  0.6

Above solution assumes that you want to get rid of "duplicates" based on column 1 and 2. However, if you want to look for "duplicates" on the entire df including col 3, you'll need to convert all values to the same dtype (i.e. strings) first. So, in that case, you might do:

out = df.loc[~df.astype(str).apply(sorted, axis=1).duplicated()]
print(out)

   1  2    3
0  A  B  0.2
1  C  D  0.1
2  E  F  0.0
3  G  H  0.5
5  I  J  0.3
6  K  L  0.6

N.B. This will not affect the dtype of col 3 in the actual result of course; it still contains float values.

  • Related