I have a DataFrame like this:
Col1 | Col2 | Score |
---|---|---|
A | B | 0.6 |
A | B | 0.6 |
B | A | 0.6 |
A | C | 0.8 |
C | A | 0.8 |
D | E | 0.9 |
I want to remove all duplicates, even when the values are swapped between Col1 and Col2, and count how many times each pair occurs.
I know exact duplicates can be removed with
df.drop_duplicates()
For the cases where the values are swapped but the pair is effectively the same, I currently create a temporary column holding [Col1, Col2], sort that list so the order no longer matters, and then drop duplicates on it with the method above.
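In case it helps, a minimal sketch of that workaround might look like this (the helper column name temp is my own placeholder; this removes the swapped duplicates but does not yet count them):

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'A', 'C', 'D'],
                   'Col2': ['B', 'B', 'A', 'C', 'A', 'E'],
                   'Score': [0.6, 0.6, 0.6, 0.8, 0.8, 0.9]})

# temporary order-independent key: sort each (Col1, Col2) pair
df['temp'] = df[['Col1', 'Col2']].apply(lambda r: tuple(sorted(r)), axis=1)

# drop the swapped duplicates, then discard the helper column
deduped = df.drop_duplicates(subset='temp').drop(columns='temp')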
I also need to count these duplicates. Is there a better way to do this? The resulting DataFrame should look like this:
Col1 | Col2 | Duplicates | Score |
---|---|---|---|
A | B | 3 | 0.6 |
A | C | 2 | 0.8 |
D | E | 1 | 0.9 |
CodePudding user response:
IIUC, you could use a frozenset as the grouper:
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

(df
 .groupby(group, as_index=False)  # you can also group by [group, 'Score']
 .agg(**{c: (c, 'first') for c in df},
      Duplicates=('Score', 'count'),
      )
)
Output:
  Col1 Col2  Score  Duplicates
0    A    B    0.6           3
1    A    C    0.8           2
2    D    E    0.9           1
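The frozenset makes the key order-insensitive, since frozenset(['A', 'B']) == frozenset(['B', 'A']), and unlike a plain set it is hashable, so the resulting Series can be used directly as a groupby key.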
CodePudding user response:
Here is an alternative way using np.sort:
import numpy as np

# sort each (Col1, Col2) pair so swapped values become identical rows
df[['Col1', 'Col2']] = np.sort(df[['Col1', 'Col2']].to_numpy(), axis=1)

(df.groupby(['Col1', 'Col2'])
   .agg(Count=('Score', 'count'),
        Score=('Score', 'first'))
   .reset_index())
Output:
  Col1 Col2  Count  Score
0    A    B      3    0.6
1    A    C      2    0.8
2    D    E      1    0.9
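Note that np.sort(..., axis=1) sorts each row, so after the assignment the smaller value always sits in Col1 and swapped pairs collapse into the same group; it also overwrites Col1/Col2, so keep a copy if you still need the original order.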