Home > Blockchain >  drop duplicate rows based on column values
drop duplicate rows based on column values

Time:08-26

email date name fruits
[email protected] 01-07-2022 12:40:00 james kiwi
[email protected] 01-07-2022 12:10:00 james apple
[email protected] 01-07-2022 12:10:00 james orange
[email protected] 01-07-2022 12:30:00 james lemon
[email protected] 02-07-2022 08:04:00 clark apple
[email protected] 02-07-2022 08:04:00 clark orange
[email protected] 02-07-2022 08:13:00 clark kiwi
[email protected] 02-07-2022 08:28:00 clark lemon
[email protected] 02-06-2022 08:43:00 beth kiwi
[email protected] 02-06-2022 08:48:00 beth orange

In the above dataframe, i want to drop rows associated with name/email-id and retain the earliest time prioritizing 'apple' over 'oranges' in the fruit column i.e., if two rows have the same time stamp and one row has the value 'apple' in the fruit column and the same time stamp but 'orange' in the fruit column, then i want to only retain the row relating to apple.

Output

email date name completed
[email protected] 01-07-2022 12:10:00 james apple
[email protected] 02-07-2022 08:04:00 clark apple
[email protected] 02-06-2022 08:48:00 beth orange

This is what i have tried;

res = (
    df.sort_values(['completed', 'date'], 
                   key=lambda col: col.eq('apple') if col.name == 'completed' else col)
      .groupby(['email', 'name'], as_index=False)
      .first()
)

Basically, i want to get the earliest 'date' for each name and email. Both 'name' and 'email' columns have duplicate values, the only differentiating factor is the values in the 'date' column. Out of all the values in the 'date' column for a particular person, i want to only retain the earliest date when fruits = ['orange', 'apple'] and drop all the other rows. when there are two rows with the same date but different fruits, i want to retain apple over orange.

CodePudding user response:

You can try this, create psuedo "sortkey" column, sort and then you can drop after filtering:

df.assign(sortkey=df['fruits'].isin(["apple","orange"]))\
  .sort_values('sortkey', ascending=False)\
  .drop_duplicates('email')

Output:

         email                 date   name  fruits  sortkey
1  [email protected]  01-07-2022 12:10:00  james   apple     True
4  [email protected]  02-07-2022 08:04:00  clark   apple     True
9  [email protected]  02-06-2022 08:48:00   beth  orange     True

CodePudding user response:

Like this:

df.drop_duplicates(subset=['email'], keep='first', inplace=True)
  • Related