Home > Net >  Python Pandas drop row duplicates on a column if no duplicate on other column
Python Pandas drop row duplicates on a column if no duplicate on other column

Time:12-07

I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...

Date From Subject Source
12/06/21 Sender1 Test123 Inbox
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender1 Test123 Sent
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox

Practically from the table above the rows with subject = 'Test123' should be dropped.

Date From Subject Source
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox

CodePudding user response:

You can use set to determine for each sender if there are multiple source. If yes, drop the row.

>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]

       Date     From  Subject Source
1  12/06/21  Sender2  Confirm  Inbox
3  12/06/21  Sender3  Test_on  Inbox
4  12/06/21  Sender3  Test_on  Inbox

CodePudding user response:

duplicated_subject = df.duplicated('Subject', keep=False)
duplicated_subject_and_source = df.duplicated(['Subject', 'Source'], keep=False)
df[~duplicated_subject | duplicated_subject_and_source]

eliminate all duplicates where "Subject is the same AND Source is different"

is equivalent to

keep where "Subject is not duplicated OR Subject is duplicated and Source is the same"

  • Related