I have a pandas DataFrame that contains duplicates, but not regular duplicates that you can remove with a simple df.drop_duplicates(). For example, when combining columns 1 and 2, AB and BA count as the same for the purposes of my work, so the 5th row should be eliminated.
1 2
---
A B ===> AB
C D
E F
G H
B A ===> BA
I J
K L
I want to remove such duplicates, keeping only the first occurrence. This would lead to:
1 2
---
A B
C D
E F
G H
I J
K L
I cannot figure out how to do that. Any help would be appreciated.
Update: (adding another column with numbers)
1 2 3
-------
A B 0.2 ===> AB
C D 0.1
E F 0.0
G H 0.5
B A 0.2 ===> BA
I J 0.3
K L 0.6
CodePudding user response:
One solution could be:
import pandas as pd
data = {'1': {0: 'A', 1: 'C', 2: 'E', 3: 'G', 4: 'B', 5: 'I', 6: 'K'},
'2': {0: 'B', 1: 'D', 2: 'F', 3: 'H', 4: 'A', 5: 'J', 6: 'L'},
'3': {0: 0.2, 1: 0.1, 2: 0.0, 3: 0.5, 4: 0.2, 5: 0.3, 6: 0.6}
}
df = pd.DataFrame(data)
# Build an order-insensitive key per row; a tuple is hashable, so
# Series.duplicated can compare the keys.
out = df.loc[~df[['1', '2']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]
print(out)
1 2 3
0 A B 0.2
1 C D 0.1
2 E F 0.0
3 G H 0.5
5 I J 0.3
6 K L 0.6
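To see why this works, it can help to look at the intermediate key Series on its own (just an illustration on the same df as above): rows 0 and 4 both map to the key ('A', 'B'), so the second of the two gets flagged.
# Order-insensitive key per row: both "A B" and "B A" become ('A', 'B').
keys = df[['1', '2']].apply(lambda row: tuple(sorted(row)), axis=1)
mask = keys.duplicated()   # True only for row 4, the "B A" row
out = df.loc[~mask]        # keep every row that is not a repeat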
The above solution assumes that you want to get rid of "duplicates" based on columns 1 and 2 only. However, if you want to look for "duplicates" across the entire df, including column 3, you'll need to convert all values to the same dtype (e.g. strings) first, since sorted cannot compare strings with floats. So, in that case, you might do:
out = df.loc[~df.astype(str).apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]
print(out)
1 2 3
0 A B 0.2
1 C D 0.1
2 E F 0.0
3 G H 0.5
5 I J 0.3
6 K L 0.6
N.B. This will not affect the dtype of column 3 in the actual result, of course; it still contains float values, because astype(str) only converts the temporary copy used to build the mask, while the rows themselves are selected from the original df.
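If the row-wise apply turns out to be slow on a large frame, a variant that avoids it (just a sketch, assuming only columns 1 and 2 define the pair) is to sort the two values with numpy and let DataFrame.duplicated do the rest:
import numpy as np
# Sort each (1, 2) pair so that "A B" and "B A" yield the same key row,
# then drop rows whose key has already been seen.
key = pd.DataFrame(np.sort(df[['1', '2']].to_numpy(), axis=1), index=df.index)
out = df.loc[~key.duplicated()]
print(out)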