Check for duplicate rows for a subset of columns in a Pandas DataFrameGroupBy object

Time:05-27

Suppose I have a groupby object (grouped on Col1) like below:

Col1     Col2     Col3     Col4     Col5
----------------------------------------
AAA       001      456      846      239     row1
          002      374      238      904     row2
          003      456      846      239     row3

BBB       001      923      222      398     row1
          002      923      222      398     row2
          003      755      656      949     row3

CCC       001      324      454      565     row1
          002      744      345      336     row2
          003      567      355      756     row3

Is there a way to check for duplicate rows based on [Col3, Col4, Col5] within each group? In the example above, for group AAA, row 1 and row 3 match since their Col3, Col4, and Col5 values are the same. Likewise, in group BBB, row 1 and row 2 match. In group CCC, by the same logic, there are no duplicate rows.

Can we create a list with 'n' boolean elements (where 'n' is the number of groups), set to True if the group contains any duplicate, else False? For the above example, the output would be:

[True, True, False]

CodePudding user response:

You can group by column Col1, then use duplicated() on Col3 through Col5 to check whether any rows repeat within each group:

# Within each Col1 group, flag whether any row repeats on Col3-Col5
out = (df.groupby('Col1')
       .apply(lambda g: g[['Col3', 'Col4', 'Col5']].duplicated().any())
       .tolist())
print(out)

[True, True, False]
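For reference, here is a self-contained sketch that reconstructs the sample data from the question and applies the approach above (the DataFrame literal is an assumption based on the table shown):

```python
import pandas as pd

# Reconstruct the sample data from the question (values copied from the table).
df = pd.DataFrame({
    'Col1': ['AAA'] * 3 + ['BBB'] * 3 + ['CCC'] * 3,
    'Col2': ['001', '002', '003'] * 3,
    'Col3': [456, 374, 456, 923, 923, 755, 324, 744, 567],
    'Col4': [846, 238, 846, 222, 222, 656, 454, 345, 355],
    'Col5': [239, 904, 239, 398, 398, 949, 565, 336, 756],
})

# Per group, check whether any row repeats on (Col3, Col4, Col5).
# Selecting the columns before apply() keeps the grouping key out of
# the applied frame.
out = (df.groupby('Col1')[['Col3', 'Col4', 'Col5']]
         .apply(lambda g: g.duplicated().any())
         .tolist())
print(out)  # [True, True, False]
```

Note that duplicated() with its default keep='first' marks only the later occurrences as duplicates, which is fine here since any() only asks whether at least one duplicate exists per group.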