Home > OS >  Remove duplicates when values are swapped in columns and give a count
Remove duplicates when values are swapped in columns and give a count

Time:04-09

I have a DataFrame like this:

Col1 Col2 Score
A B 0.6
A B 0.6
B A 0.6
A C 0.8
C A 0.8
D E 0.9

I want to remove all duplicates even if the values are swapped in col1 and col2 and count such occurrences.

I know we can remove it with

df.drop_duplicates()

For cases when values are swapped but the meaning is same, then I am creating a temp column which is [col1,col2] and re-arranging the list in ascending order to delete it again using the duplicates method above.

I need to count all these duplicates too. Is there a better way to do this? The resultant DataFrame should be like this:

Col1 Col2 Duplicates Score
A B 3 0.6
A C 2 0.8
D E 1 0.9

CodePudding user response:

IIUC, you could use a frozenset as grouper:

group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

(df
 .groupby(group, as_index=False)  # you can also group by [group, 'Score']
 .agg(**{c: (c, 'first') for c in df},
      Duplicates=('Score', 'count'),
     )
)

output:

  Col1 Col2  Score  Duplicates
0    A    B    0.6           3
1    A    C    0.8           2
2    D    E    0.9           1

CodePudding user response:

Here is an alternative way using np.sort

df[['Col1','Col2']] = np.sort(df[['Col1','Col2']].to_numpy(),axis=1)

(df.groupby(['Col1','Col2']).agg(
    Count = ('Score','count'),
    Score = ('Score','first'))
 .reset_index())

Output:

  Col1 Col2  Count  Score
0    A    B      3    0.6
1    A    C      2    0.8
2    D    E      1    0.9
  • Related