drop duplicate rows based on column values-CodePudding

email	date	name	fruits
[email protected]	01-07-2022 12:40:00	james	kiwi
[email protected]	01-07-2022 12:10:00	james	apple
[email protected]	01-07-2022 12:10:00	james	orange
[email protected]	01-07-2022 12:30:00	james	lemon
[email protected]	02-07-2022 08:04:00	clark	apple
[email protected]	02-07-2022 08:04:00	clark	orange
[email protected]	02-07-2022 08:13:00	clark	kiwi
[email protected]	02-07-2022 08:28:00	clark	lemon
[email protected]	02-06-2022 08:43:00	beth	kiwi
[email protected]	02-06-2022 08:48:00	beth	orange

In the above dataframe, i want to drop rows associated with name/email-id and retain the earliest time prioritizing 'apple' over 'oranges' in the fruit column i.e., if two rows have the same time stamp and one row has the value 'apple' in the fruit column and the same time stamp but 'orange' in the fruit column, then i want to only retain the row relating to apple.

Output

email	date	name	completed
[email protected]	01-07-2022 12:10:00	james	apple
[email protected]	02-07-2022 08:04:00	clark	apple
[email protected]	02-06-2022 08:48:00	beth	orange

This is what i have tried;

res = (
    df.sort_values(['completed', 'date'], 
                   key=lambda col: col.eq('apple') if col.name == 'completed' else col)
      .groupby(['email', 'name'], as_index=False)
      .first()
)

Basically, i want to get the earliest 'date' for each name and email. Both 'name' and 'email' columns have duplicate values, the only differentiating factor is the values in the 'date' column. Out of all the values in the 'date' column for a particular person, i want to only retain the earliest date when fruits = ['orange', 'apple'] and drop all the other rows. when there are two rows with the same date but different fruits, i want to retain apple over orange.

CodePudding user response：

You can try this, create psuedo "sortkey" column, sort and then you can drop after filtering:

df.assign(sortkey=df['fruits'].isin(["apple","orange"]))\
  .sort_values('sortkey', ascending=False)\
  .drop_duplicates('email')

Output:

         email                 date   name  fruits  sortkey
1  [email protected]  01-07-2022 12:10:00  james   apple     True
4  [email protected]  02-07-2022 08:04:00  clark   apple     True
9  [email protected]  02-06-2022 08:48:00   beth  orange     True

CodePudding user response：

Like this:

df.drop_duplicates(subset=['email'], keep='first', inplace=True)