I have a dataframe similar to this:
import pandas as pd
colA = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
colB = [(21,1,2), (0,1,21), (2,1,21), (1,12,5), (21,1,0), (12,5,6), (18,7,14), (7,5,12), (14,7,18), (12,7,11), (11,7,12), (3,5,7)]
df = pd.DataFrame(list(zip(colA, colB)), columns = ['colA', 'colB'])
display(df)
output:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (2, 1, 21)
3 a (1, 12, 5)
4 b (21, 1, 0)
5 b (12, 5, 6)
6 b (18, 7, 14)
7 b (7, 5, 12)
8 c (14, 7, 18)
9 c (12, 7, 11)
10 c (11, 7, 12)
11 c (3, 5, 7)
I'd need to drop (or filter out) all the rows where, within the same value of colA
, a value of colB
in a row is equal to the reverse value of colB
in another row.
In the example provided:
within
colA='a'
row 2 hascolB=(2,1,21)
which is the reverse of row 0colB=(21,1,2)
and thus should be droppedcolA='b'
row 4 hascolB=(21,1,0)
which is the reverse of row 1colB=(0,1,21)
but that'scolA='a'
so nothing to drop herewithin
colA='c'
row 10 hascolB=(11,7,12)
which is the reverse of row 9colB=(12,7,11)
and thus should be dropped
Final results would something like:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (1, 12, 5)
3 b (21, 1, 0)
4 b (12, 5, 6)
5 b (18, 7, 14)
6 b (7, 5, 12)
7 c (14, 7, 18)
8 c (12, 7, 11)
9 c (3, 5, 7)
Observations:
Preferable to drop row on a duplicated dataframe and keep the original
Very important: my real dataframe has shape (3millions, 11), so I am looking for an efficient way to do this, like .apply, lambda etc..I did this in the past with df.iterrows, it was already not the best way, my bad..now it's completely unfeasible
Current df.iterrows solution:
unique_df = df.copy()
seen_a_b = set()
for i, row in df.iterrows():
val_a = row['colA']
val_b = row['colB']
a_b = (val_a, val_b)
a_revb = (val_a, val_b[::-1])
if a_b in seen_a_b:
unique_df.drop(i, inplace=True)
continue
seen_a_b.add(a_b)
seen_a_b.add(a_revb)
CodePudding user response:
Try this:
df.groupby(['colA',df['colB'].map(lambda x: frozenset((x[0],x[-1])))],as_index=False).first()
This solution creates a frozenset, or an immutable set that can be used as a groupby key. This, along with colA is used to get the first value of each group. We are only using the first and last value in colB, as the middle value is the same forward and backwards.
Output:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (1, 12, 5)
3 b (21, 1, 0)
4 b (12, 5, 6)
5 b (18, 7, 14)
6 b (7, 5, 12)
7 c (14, 7, 18)
8 c (12, 7, 11)
9 c (3, 5, 7)