date | name | fruits | |
---|---|---|---|
[email protected] | 01-07-2022 12:40:00 | james | kiwi |
[email protected] | 01-07-2022 12:10:00 | james | apple |
[email protected] | 01-07-2022 12:10:00 | james | orange |
[email protected] | 01-07-2022 12:30:00 | james | lemon |
[email protected] | 02-07-2022 08:04:00 | clark | apple |
[email protected] | 02-07-2022 08:04:00 | clark | orange |
[email protected] | 02-07-2022 08:13:00 | clark | kiwi |
[email protected] | 02-07-2022 08:28:00 | clark | lemon |
[email protected] | 02-06-2022 08:43:00 | beth | kiwi |
[email protected] | 02-06-2022 08:48:00 | beth | orange |
In the above dataframe, i want to drop rows associated with name/email-id and retain the earliest time prioritizing 'apple' over 'oranges' in the fruit column i.e., if two rows have the same time stamp and one row has the value 'apple' in the fruit column and the same time stamp but 'orange' in the fruit column, then i want to only retain the row relating to apple.
Output
date | name | completed | |
---|---|---|---|
[email protected] | 01-07-2022 12:10:00 | james | apple |
[email protected] | 02-07-2022 08:04:00 | clark | apple |
[email protected] | 02-06-2022 08:48:00 | beth | orange |
This is what i have tried;
res = (
df.sort_values(['completed', 'date'],
key=lambda col: col.eq('apple') if col.name == 'completed' else col)
.groupby(['email', 'name'], as_index=False)
.first()
)
Basically, i want to get the earliest 'date' for each name and email. Both 'name' and 'email' columns have duplicate values, the only differentiating factor is the values in the 'date' column. Out of all the values in the 'date' column for a particular person, i want to only retain the earliest date when fruits = ['orange', 'apple'] and drop all the other rows. when there are two rows with the same date but different fruits, i want to retain apple over orange.
CodePudding user response:
You can try this, create psuedo "sortkey" column, sort and then you can drop after filtering:
df.assign(sortkey=df['fruits'].isin(["apple","orange"]))\
.sort_values('sortkey', ascending=False)\
.drop_duplicates('email')
Output:
email date name fruits sortkey
1 [email protected] 01-07-2022 12:10:00 james apple True
4 [email protected] 02-07-2022 08:04:00 clark apple True
9 [email protected] 02-06-2022 08:48:00 beth orange True
CodePudding user response:
Like this:
df.drop_duplicates(subset=['email'], keep='first', inplace=True)