Python Pandas drop row duplicates on a column if no duplicate on other column-CodePudding

I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...

Date	From	Subject	Source
12/06/21	Sender1	Test123	Inbox
12/06/21	Sender2	Confirm	Inbox
12/06/21	Sender1	Test123	Sent
12/06/21	Sender3	Test_on	Inbox
12/06/21	Sender3	Test_on	Inbox

Practically from the table above the rows with subject = 'Test123' should be dropped.

Date	From	Subject	Source
12/06/21	Sender2	Confirm	Inbox
12/06/21	Sender3	Test_on	Inbox
12/06/21	Sender3	Test_on	Inbox

CodePudding user response：

You can use set to determine for each sender if there are multiple source. If yes, drop the row.

>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]

       Date     From  Subject Source
1  12/06/21  Sender2  Confirm  Inbox
3  12/06/21  Sender3  Test_on  Inbox
4  12/06/21  Sender3  Test_on  Inbox

CodePudding user response：

duplicated_subject = df.duplicated('Subject', keep=False)
duplicated_subject_and_source = df.duplicated(['Subject', 'Source'], keep=False)
df[~duplicated_subject | duplicated_subject_and_source]

eliminate all duplicates where "Subject is the same AND Source is different"

is equivalent to

keep where "Subject is not duplicated OR Subject is duplicated and Source is the same"