I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...
Date | From | Subject | Source |
---|---|---|---|
12/06/21 | Sender1 | Test123 | Inbox |
12/06/21 | Sender2 | Confirm | Inbox |
12/06/21 | Sender1 | Test123 | Sent |
12/06/21 | Sender3 | Test_on | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
Practically from the table above the rows with subject = 'Test123' should be dropped.
Date | From | Subject | Source |
---|---|---|---|
12/06/21 | Sender2 | Confirm | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
CodePudding user response:
You can use set
to determine for each sender if there are multiple source. If yes, drop the row.
>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]
Date From Subject Source
1 12/06/21 Sender2 Confirm Inbox
3 12/06/21 Sender3 Test_on Inbox
4 12/06/21 Sender3 Test_on Inbox
CodePudding user response:
duplicated_subject = df.duplicated('Subject', keep=False)
duplicated_subject_and_source = df.duplicated(['Subject', 'Source'], keep=False)
df[~duplicated_subject | duplicated_subject_and_source]
eliminate all duplicates where "Subject is the same AND Source is different"
is equivalent to
keep where "Subject is not duplicated OR Subject is duplicated and Source is the same"