Avoid df.iterrows to drop dataframe rows under certain conditions


I have a dataframe similar to this:

import pandas as pd

colA = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
colB = [(21,1,2), (0,1,21), (2,1,21), (1,12,5), (21,1,0), (12,5,6), (18,7,14), (7,5,12), (14,7,18), (12,7,11), (11,7,12), (3,5,7)]
df = pd.DataFrame(list(zip(colA, colB)), columns = ['colA', 'colB'])
display(df)

output:

    colA    colB
0   a   (21, 1, 2)
1   a   (0, 1, 21)
2   a   (2, 1, 21)
3   a   (1, 12, 5)
4   b   (21, 1, 0)
5   b   (12, 5, 6)
6   b   (18, 7, 14)
7   b   (7, 5, 12)
8   c   (14, 7, 18)
9   c   (12, 7, 11)
10  c   (11, 7, 12)
11  c   (3, 5, 7)

I need to drop (or filter out) all rows where, within the same colA value, the colB tuple of one row is the reverse of the colB tuple of another row.
In the example provided:

  • within colA='a' row 2 has colB=(2,1,21) which is the reverse of row 0 colB=(21,1,2) and thus should be dropped

  • colA='b' row 4 has colB=(21,1,0) which is the reverse of row 1 colB=(0,1,21) but that's colA='a' so nothing to drop here

  • within colA='c' row 10 has colB=(11,7,12) which is the reverse of row 9 colB=(12,7,11) and thus should be dropped

The final result would be something like:

    colA    colB
0   a   (21, 1, 2)
1   a   (0, 1, 21)
2   a   (1, 12, 5)
3   b   (21, 1, 0)
4   b   (12, 5, 6)
5   b   (18, 7, 14)
6   b   (7, 5, 12)
7   c   (14, 7, 18)
8   c   (12, 7, 11)
9   c   (3, 5, 7)

Observations:

  1. Preferably, drop the rows from a copy of the dataframe and keep the original intact

  2. Very important: my real dataframe has shape (3 million, 11), so I am looking for an efficient way to do this (e.g. .apply, lambda, vectorized operations). I did this in the past with df.iterrows; it was already not the best approach, my bad. Now it's completely unfeasible.

Current df.iterrows solution:

unique_df = df.copy()
seen_a_b = set()
for i, row in df.iterrows():
    val_a = row['colA']
    val_b = row['colB']
    a_b = (val_a, val_b)
    a_revb = (val_a, val_b[::-1])
    # Drop the row if this (colA, colB) pair, or its reverse, was already seen
    if a_b in seen_a_b:
        unique_df.drop(i, inplace=True)
        continue
    # Record both orientations so a later reversed duplicate is caught
    seen_a_b.add(a_b)
    seen_a_b.add(a_revb)

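A possible vectorized alternative to the loop above, sketched here as an assumption rather than a definitive answer: compute one canonical key per row (the lexicographically smaller of the tuple and its reverse, so a tuple and its reverse get the same key) and let pandas' duplicated do the dedup. The column name 'key' is arbitrary.

```python
import pandas as pd

colA = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
colB = [(21, 1, 2), (0, 1, 21), (2, 1, 21), (1, 12, 5), (21, 1, 0), (12, 5, 6),
        (18, 7, 14), (7, 5, 12), (14, 7, 18), (12, 7, 11), (11, 7, 12), (3, 5, 7)]
df = pd.DataFrame({'colA': colA, 'colB': colB})

# A tuple and its reverse share the same canonical form: the lexicographically
# smaller of the two orientations.
key = df['colB'].map(lambda t: min(t, t[::-1]))

# Keep the first row of every (colA, key) pair; the original df is untouched.
mask = df.assign(key=key).duplicated(subset=['colA', 'key'])
unique_df = df[~mask]
```

This keeps the original row order and leaves df unmodified, matching observation 1.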
CodePudding user response:

Try this:

df.groupby(['colA', df['colB'].map(lambda x: frozenset((x[0], x[-1])))], as_index=False).first()

This maps each colB tuple to a frozenset of its first and last elements. A frozenset is an immutable, hashable set, so it can serve as a groupby key, and a tuple and its reverse produce the same set. Grouping on this key together with colA and taking the first row of each group keeps one representative per pair. Only the first and last values of colB are used, since the middle value stays in place when a length-3 tuple is reversed.

Output:

  colA         colB
0    a   (21, 1, 2)
1    a   (0, 1, 21)
2    a   (1, 12, 5)
3    b   (21, 1, 0)
4    b   (12, 5, 6)
5    b  (18, 7, 14)
6    b   (7, 5, 12)
7    c  (14, 7, 18)
8    c  (12, 7, 11)
9    c    (3, 5, 7)
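One caveat worth hedging: a key built from only the first and last elements merges any two tuples that share endpoints, even when their middle values differ, in which case they are not reverses of each other. A small sketch with made-up data that triggers the edge case:

```python
import pandas as pd

# (1, 9, 3) and (3, 2, 1) share endpoints {1, 3} but are not reverses of each other.
df = pd.DataFrame({'colA': ['a', 'a'], 'colB': [(1, 9, 3), (3, 2, 1)]})

# The endpoints-only key collides: both rows fall into the same group,
# so .first() would silently drop the second row.
endpoint_key = df['colB'].map(lambda x: frozenset((x[0], x[-1])))
print(endpoint_key.iloc[0] == endpoint_key.iloc[1])  # True: same key for both rows

# Keying on a canonical orientation of the whole tuple keeps them apart.
full_key = df['colB'].map(lambda t: min(t, t[::-1]))
print(full_key.iloc[0] == full_key.iloc[1])  # False: distinct keys, no false merge
```

If the real data guarantees that tuples with equal endpoints are always true reverses, the endpoints-only key is fine; otherwise a whole-tuple canonical key is safer.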