How to quickly identify duplicate rows in a python dataframe where columns may be flipped


I have a table with three columns, two of which may be flipped:

row A    B    Time
001 C1   DQ   300
002 C1   EG   355
003 DQ   C1   300
004 DQ   EG   400
005 IX   Q2   410
006 Q2   IX   410

I'd like to first identify the duplicate rows, and then randomly keep one entry from each duplicate pair.

For the first step, clearly rows 001 / 003 are a pair, as are 005 / 006. To avoid bias, however, I want a 50% probability of keeping row 001 versus row 003, and likewise for the second pair.

My intuition is to loop row by row and build an AB key, sort by time and a random value, and then select the rows where AB differs from the previous row's key (lAB). Looking at the rows below, I would then end up with 4 unique combinations, randomly keeping the appropriate A or B entry.

row AB      lAB     Time  rand_val
001 C1DQ    null    300   0.00031
003 C1DQ    C1DQ    300   0.37647
002 C1EG    C1DQ    355   0.78345
004 DQEG    C1EG    400   0.63145
006 IXQ2    DQEG    410   0.21364
005 IXQ2    IXQ2    410   0.91462

Unfortunately I have a very long dataset, and extracting the values of A and B in a loop to create AB would mean multiple passes plus a sort. Is there a more idiomatic "pandas" approach to this?

CodePudding user response:

Create a sorted tuple from columns A and B, group the rows by this key, and finally keep one row per group at random:

# Order-insensitive key: (C1, DQ) and (DQ, C1) map to the same tuple
groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)
# Keep one randomly chosen row per key
out = df.groupby(groups).sample(n=1)
print(out)

# Output:
   row   A   B  Time
2  003  DQ  C1   300
1  002  C1  EG   355
3  004  DQ  EG   400
5  006  Q2  IX   410
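For reproducibility, here is a minimal self-contained sketch of the same approach; the `random_state` argument is an addition (not part of the original answer) that makes the sampling repeatable:

```python
import pandas as pd

df = pd.DataFrame({
    'row':  ['001', '002', '003', '004', '005', '006'],
    'A':    ['C1', 'C1', 'DQ', 'DQ', 'IX', 'Q2'],
    'B':    ['DQ', 'EG', 'C1', 'EG', 'Q2', 'IX'],
    'Time': [300, 355, 300, 400, 410, 410],
})

# Order-insensitive key: (C1, DQ) and (DQ, C1) collapse to the same tuple
groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)

# One randomly chosen row per key; drop random_state if you want
# a different draw on each run
out = df.groupby(groups).sample(n=1, random_state=0)
print(out)
```

Each of the two swapped pairs contributes exactly one row to `out`, so the result always has 4 rows regardless of which member of each pair is drawn.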

About groups:

>>> pd.concat([df[['A', 'B']], groups.rename('AB')], axis=1)

    A   B        AB
0  C1  DQ  (C1, DQ)
1  C1  EG  (C1, EG)
2  DQ  C1  (C1, DQ)  # swap here
3  DQ  EG  (DQ, EG)
4  IX  Q2  (IX, Q2)
5  Q2  IX  (IX, Q2)  # swap here

Note: @mozway suggests using frozenset instead of tuple(sorted(...)), so replace:

groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)

by:

groups = df[['A', 'B']].apply(frozenset, axis=1)
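On very long frames, the row-wise `apply` can become the bottleneck. A vectorized alternative (a sketch, not part of the original answer) sorts the two columns with NumPy instead, which avoids the Python-level per-row lambda:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A':    ['C1', 'C1', 'DQ', 'DQ', 'IX', 'Q2'],
    'B':    ['DQ', 'EG', 'C1', 'EG', 'Q2', 'IX'],
    'Time': [300, 355, 300, 400, 410, 410],
})

# np.sort along axis=1 orders each (A, B) pair lexicographically,
# so swapped pairs produce identical rows and hence identical keys
keys = pd.Series(
    map(tuple, np.sort(df[['A', 'B']].to_numpy(), axis=1)),
    index=df.index,
)
out = df.groupby(keys).sample(n=1)
```

This yields the same grouping as `tuple(sorted(...))`, with the sorting done in one vectorized pass.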