I have a table with three columns, two of which may be flipped:
row A B Time
001 C1 DQ 300
002 C1 EG 355
003 DQ C1 300
004 DQ EG 400
005 IX Q2 410
006 Q2 IX 410
I'd like to first identify the duplicate rows, and then randomly keep one entry from each duplicate pair.
For the first step, rows 001 / 003 are clearly a pair, as are 005 / 006. To avoid bias, however, I want a 50% probability of keeping row 001 or row 003, and likewise for the second pairing.
My intuition is to loop row by row and build a combined AB key plus a lagged key lAB as follows, sorted by the time and a random value, and then select rows where AB &lt;&gt; lAB. Looking at the rows below, I would then end up with 4 unique combinations, randomly keeping the appropriate A or B entry.
row AB lAB Time rand_val
001 C1DQ null 300 0.00031
003 C1DQ C1DQ 300 0.37647
002 C1EG C1DQ 355 0.78345
004 DQEG C1EG 400 0.63145
006 IXQ2 DQEG 410 0.21364
005 IXQ2 IXQ2 410 0.91462
Unfortunately I have a very long dataset and extracting the values of A and B as I loop to create AB is going to mean multiple passes and a sort. Is there a more optimal "pandas" approach to this?
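For reference, the AB/lAB idea described above can be written without an explicit row-by-row loop, though it still needs a sort. This is only a sketch of that approach (it sorts by the combined key and the random value, rather than by Time, so duplicates land adjacently):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row':  ['001', '002', '003', '004', '005', '006'],
    'A':    ['C1', 'C1', 'DQ', 'DQ', 'IX', 'Q2'],
    'B':    ['DQ', 'EG', 'C1', 'EG', 'Q2', 'IX'],
    'Time': [300, 355, 300, 400, 410, 410],
})

rng = np.random.default_rng()
work = df.copy()

# Order-independent AB key: smaller value first, larger second
lo = work['A'].where(work['A'] < work['B'], work['B'])
hi = work['B'].where(work['A'] < work['B'], work['A'])
work['AB'] = lo + hi

# Random tiebreaker decides which row of a duplicate pair sorts first
work['rand_val'] = rng.random(len(work))
work = work.sort_values(['AB', 'rand_val'])

# Keep a row only when its key differs from the previous row's key
work['lAB'] = work['AB'].shift()
out = work[work['AB'] != work['lAB']]
```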
CodePudding user response:
Create a sorted tuple from columns A and B for each row, then group rows by these keys, and finally keep one row per group at random:
# Order-independent key per row: (A, B) and (B, A) map to the same tuple
groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)
# Keep one random row from each key group
out = df.groupby(groups).sample(n=1)
print(out)
# Output:
row A B Time
2 003 DQ C1 300
1 002 C1 EG 355
3 004 DQ EG 400
5 006 Q2 IX 410
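Note that sample draws randomly, so the rows kept will differ between runs. If you need a reproducible (but still unbiased within each pair) choice, GroupBy.sample accepts a random_state; a minimal sketch using the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'row':  ['001', '002', '003', '004', '005', '006'],
    'A':    ['C1', 'C1', 'DQ', 'DQ', 'IX', 'Q2'],
    'B':    ['DQ', 'EG', 'C1', 'EG', 'Q2', 'IX'],
    'Time': [300, 355, 300, 400, 410, 410],
})

# Order-independent key per row
groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)

# random_state fixes the draw, so reruns keep the same rows
out = df.groupby(groups).sample(n=1, random_state=0)
```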
About groups:
>>> pd.concat([df[['A', 'B']], groups.rename('AB')], axis=1)
A B AB
0 C1 DQ (C1, DQ)
1 C1 EG (C1, EG)
2 DQ C1 (C1, DQ) # swap here
3 DQ EG (DQ, EG)
4 IX Q2 (IX, Q2)
5 Q2 IX (IX, Q2) # swap here
Note: @mozway suggests using frozenset instead of tuple(sorted(...)), so replace:
groups = df[['A', 'B']].apply(lambda x: tuple(sorted(x)), axis=1)
by:
groups = df[['A', 'B']].apply(frozenset, axis=1)
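Either way, apply(..., axis=1) runs a Python-level function per row, which can be slow on a very long dataset. One possibly faster variant sorts the two columns with NumPy in a single vectorized call; a sketch, not benchmarked here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row':  ['001', '002', '003', '004', '005', '006'],
    'A':    ['C1', 'C1', 'DQ', 'DQ', 'IX', 'Q2'],
    'B':    ['DQ', 'EG', 'C1', 'EG', 'Q2', 'IX'],
    'Time': [300, 355, 300, 400, 410, 410],
})

# Sort A/B within each row in one vectorized call, avoiding the
# Python-level loop that apply(..., axis=1) performs
keys = np.sort(df[['A', 'B']].to_numpy(dtype=str), axis=1)

# Group on the two sorted key columns and keep one row per group
out = df.groupby([keys[:, 0], keys[:, 1]]).sample(n=1)
```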