Drop duplicate rows from a dataframe based on values in multiple columns

Time:08-24

email date name completed
[email protected] 01-07-2022 12:40:00 james no
[email protected] 01-07-2022 12:10:00 james yes
[email protected] 01-07-2022 12:19:00 james yes
[email protected] 01-07-2022 12:30:00 james no
[email protected] 02-07-2022 08:04:00 clark yes
[email protected] 02-07-2022 08:08:00 clark yes
[email protected] 02-07-2022 08:13:00 clark no
[email protected] 02-07-2022 08:28:00 clark no

In the above dataframe, I want to drop the duplicate rows for each name/email pair and retain only the row with the earliest time.

Output

email date name completed
[email protected] 01-07-2022 12:10:00 james yes
[email protected] 02-07-2022 08:04:00 clark yes

This is what I have tried:

df = df.sort_values('date')
df = df.groupby('date').first()

This ends up turning 'date' into the index. I don't want the column I am sorting on to become the index; I want it to stay a regular column. Basically, I want to get the earliest 'date' for each name and email. Both the 'name' and 'email' columns contain duplicate values, and the only differentiating factor is the value in the 'date' column. Out of all the 'date' values for a particular person, I want to retain only the earliest date where completed = 'yes' and drop all the other rows.

CodePudding user response:

df.sort_values('date').groupby(['email', 'name']).first().reset_index()

gives

index email name date completed
0 [email protected] james 01-07-2022 12:10:00 yes
1 [email protected] clark 02-07-2022 08:04:00 yes
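
One caveat, as a minimal sketch: if 'date' is still stored as text, sort_values compares the strings character by character, which need not match chronological order. The example below assumes the column is day-month-year (the format string is my assumption, not something the question states) and uses a placeholder address and made-up dates.

```python
import pandas as pd

# Hypothetical rows; the address is a placeholder, not from the question.
df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com"],
    "date": ["01-08-2022 09:00:00", "02-07-2022 09:00:00"],
    "name": ["james", "james"],
    "completed": ["yes", "yes"],
})

# As text, "01-08-..." sorts before "02-07-...", even though
# 2 July is earlier than 1 August when read as day-month-year.
by_string = (
    df.sort_values("date")
      .groupby(["email", "name"], as_index=False)
      .first()
)

# Assumption: day-month-year. Adjust the format if the data is month-first.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y %H:%M:%S")
by_datetime = (
    df.sort_values("date")
      .groupby(["email", "name"], as_index=False)
      .first()
)
```

With the strings, the 1 August row is kept; after conversion, the 2 July row is kept.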

CodePudding user response:

The other answers do not necessarily satisfy the requirement:

i want to only retain the earliest date when completed = 'yes' and drop all the other rows.

They only work for the example provided given the correct order of the rows. I wouldn't rely on that.

To do so, we first have to select only the rows with completed == 'yes'. For instance:

res = (
    df[df.completed.eq('yes')]
      .sort_values('date')
      .groupby(['email', 'name'], as_index=False)
      .first()
)

However, there may be (name, email) groups that contain no rows with completed == 'yes' but should still not be deleted. For such groups only the earliest-date requirement applies. To handle this, the sort_values logic should be changed:

# sort first by (df.completed == 'no') so that 'yes' rows come first,
# then by df.date in ascending order;
# and then select the first row for each group (name, email)
res = (
    df.sort_values(['completed', 'date'], 
                   key=lambda col: col.eq('no') if col.name == 'completed' else col)
      .groupby(['email', 'name'], as_index=False)
      .first()
)

Output:

>>> res

         email   name                 date completed
0  [email protected]  james  01-07-2022 12:10:00       yes
1  [email protected]  clark  02-07-2022 08:04:00       yes
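
To illustrate the fallback, here is a small self-contained sketch. The data is made up (placeholder addresses, and a hypothetical person "lois" who has no completed == 'yes' row), and the dates are already parsed as datetimes.

```python
import pandas as pd

# Hypothetical data; addresses are placeholders.
# "lois" has no row with completed == 'yes'.
df = pd.DataFrame({
    "email": ["j@example.com"] * 3 + ["l@example.com"] * 2,
    "name": ["james"] * 3 + ["lois"] * 2,
    "date": pd.to_datetime([
        "2022-07-01 12:05:00", "2022-07-01 12:10:00", "2022-07-01 12:19:00",
        "2022-07-03 09:30:00", "2022-07-03 09:00:00",
    ]),
    "completed": ["no", "yes", "yes", "no", "no"],
})

res = (
    df.sort_values(["completed", "date"],
                   key=lambda col: col.eq("no") if col.name == "completed" else col)
      .groupby(["email", "name"], as_index=False)
      .first()
)
print(res)
```

For james the earliest 'yes' row (12:10) is kept even though a 'no' row is earlier; for lois, who has no 'yes' row, the earliest 'no' row (09:00) is kept instead of the group being dropped.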

CodePudding user response:

Try .groupby followed by .idxmin():

yes = df[df.completed == "yes"]  # filter first, so idxmin can only return labels of 'yes' rows
print(yes.loc[yes.groupby(["email", "name"])["date"].idxmin()])

Prints:

         email                date   name completed
1  [email protected] 2022-01-07 12:10:00  james       yes
4  [email protected] 2022-02-07 08:04:00  clark       yes

Or: sort values by date and then .drop_duplicates:

df = df[df.completed == "yes"].sort_values("date").drop_duplicates(["email", "name"], keep="first")
print(df)
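
A possible advantage of the drop_duplicates route, sketched on made-up data (placeholder addresses, dates already parsed): it keeps the original column order and the original row labels, so nothing is moved into the index. If consecutive labels are wanted, reset_index(drop=True) can be appended.

```python
import pandas as pd

# Hypothetical data; addresses are placeholders.
df = pd.DataFrame({
    "email": ["j@example.com"] * 2 + ["c@example.com"] * 2,
    "date": pd.to_datetime([
        "2022-07-01 12:40:00", "2022-07-01 12:10:00",
        "2022-07-02 08:13:00", "2022-07-02 08:04:00",
    ]),
    "name": ["james", "james", "clark", "clark"],
    "completed": ["no", "yes", "no", "yes"],
})

out = (
    df[df.completed == "yes"]
      .sort_values("date")
      .drop_duplicates(["email", "name"], keep="first")
)

# out still carries the original row labels and column order;
# drop=True discards the old labels instead of adding an 'index' column.
tidy = out.reset_index(drop=True)
```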

EDIT: Added df[df.completed == "yes"] part.
