email | date | name | completed |
---|---|---|---|
[email protected] | 01-07-2022 12:40:00 | james | no |
[email protected] | 01-07-2022 12:10:00 | james | yes |
[email protected] | 01-07-2022 12:19:00 | james | yes |
[email protected] | 01-07-2022 12:30:00 | james | no |
[email protected] | 02-07-2022 08:04:00 | clark | yes |
[email protected] | 02-07-2022 08:08:00 | clark | yes |
[email protected] | 02-07-2022 08:13:00 | clark | no |
[email protected] | 02-07-2022 08:28:00 | clark | no |
In the above dataframe, I want to drop the duplicate rows associated with each name/email and retain only the row with the earliest time.
Expected output:
email | date | name | completed |
---|---|---|---|
[email protected] | 01-07-2022 12:10:00 | james | yes |
[email protected] | 02-07-2022 08:04:00 | clark | yes |
This is what I have tried:
df = df.sort_values('date')
df = df.groupby('date').first()
This turns date into the index. I don't want the column I am sorting on to become the index; I want to sort in place. Basically, I want to get the earliest 'date' for each name and email. Both the 'name' and 'email' columns have duplicate values; the only differentiating factor is the value in the 'date' column. Out of all the values in the 'date' column for a particular person, I want to only retain the earliest date when completed = 'yes' and drop all the other rows.
CodePudding user response:
df.sort_values('date').groupby(['email', 'name']).first().reset_index()
gives
index | email | name | date | completed |
---|---|---|---|---|
0 | [email protected] | james | 01-07-2022 12:10:00 | yes |
1 | [email protected] | clark | 02-07-2022 08:04:00 | yes |
CodePudding user response:
The other answers do not necessarily satisfy the requirement:
"I want to only retain the earliest date when completed = 'yes' and drop all the other rows."
They only work for the example provided because the earliest row of each group happens to have completed == 'yes'. I wouldn't rely on that.
To satisfy the requirement, we first have to select only the rows with completed == 'yes'. For instance:
res = (
    df[df.completed.eq('yes')]
    .sort_values('date')
    .groupby(['email', 'name'], as_index=False)
    .first()
)
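As a self-contained sketch of the filter-then-sort approach above, the snippet below rebuilds the question's data and runs the same chain. The e-mail addresses in the question are redacted, so the example.com addresses are placeholders, and the day-month-year date format is an assumption. Parsing the date column first matters because strings in this layout do not necessarily sort chronologically.

```python
import pandas as pd

# Placeholder addresses: the real ones are redacted in the question.
df = pd.DataFrame({
    "email": ["james@example.com"] * 4 + ["clark@example.com"] * 4,
    "date": [
        "01-07-2022 12:40:00", "01-07-2022 12:10:00",
        "01-07-2022 12:19:00", "01-07-2022 12:30:00",
        "02-07-2022 08:04:00", "02-07-2022 08:08:00",
        "02-07-2022 08:13:00", "02-07-2022 08:28:00",
    ],
    "name": ["james"] * 4 + ["clark"] * 4,
    "completed": ["no", "yes", "yes", "no", "yes", "yes", "no", "no"],
})

# Assumption: dates are day-month-year. Convert so sorting is chronological.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y %H:%M:%S")

res = (
    df[df["completed"].eq("yes")]                  # keep completed rows only
    .sort_values("date")                           # earliest first
    .groupby(["email", "name"], as_index=False)    # keys stay regular columns
    .first()                                       # earliest row per group
)
print(res)
```

Note that .first() returns the first non-null value of each column, so if other columns can contain missing values, .head(1) or drop_duplicates may be a safer way to keep whole rows.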
However, there may be groups (name, email) that contain no rows with completed == 'yes' but should still be kept (i.e. for those groups only the earliest-date requirement applies). In that case the sort_values logic should be changed:
# sort first by df.completed == 'no', then by df.date in ascending order;
# and then select the first row for each group (name, email)
res = (
    df.sort_values(['completed', 'date'],
                   key=lambda col: col.eq('no') if col.name == 'completed' else col)
    .groupby(['email', 'name'], as_index=False)
    .first()
)
Output:
>>> res
email name date completed
0 [email protected] james 01-07-2022 12:10:00 yes
1 [email protected] clark 02-07-2022 08:04:00 yes
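To see the fallback behaviour, here is a minimal sketch on made-up data. The second person, "lois", and both example.com addresses are hypothetical; she has no completed == 'yes' row, so her earliest row overall is kept, while for james the 'yes' row wins even though a 'no' row is earlier.

```python
import pandas as pd

# Hypothetical data: "lois" never completed, james has an earlier 'no' row.
df = pd.DataFrame({
    "email": ["james@example.com", "james@example.com",
              "lois@example.com", "lois@example.com"],
    "name": ["james", "james", "lois", "lois"],
    "date": pd.to_datetime([
        "2022-07-01 12:05:00", "2022-07-01 12:10:00",
        "2022-07-03 09:30:00", "2022-07-03 09:00:00",
    ]),
    "completed": ["no", "yes", "no", "no"],
})

res = (
    df.sort_values(
        ["completed", "date"],
        # 'yes' maps to False and sorts before 'no' (True); dates are unchanged
        key=lambda col: col.eq("no") if col.name == "completed" else col,
    )
    .groupby(["email", "name"], as_index=False)
    .first()
)
print(res)
```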
CodePudding user response:
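Try .groupby followed by .idxmin(). Compute the index labels on the already-filtered rows, so that every label returned exists in the frame being indexed (this assumes date has been converted with pd.to_datetime):
done = df[df.completed == "yes"]
print(done.loc[done.groupby(["email", "name"])["date"].idxmin()])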
Prints:
email date name completed
1 [email protected] 2022-01-07 12:10:00 james yes
4 [email protected] 2022-02-07 08:04:00 clark yes
Or sort the values by date and then call .drop_duplicates:
df = df[df.completed == "yes"].sort_values("date").drop_duplicates(["email", "name"], keep="first")
print(df)
EDIT: Added the df[df.completed == "yes"] part.
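A small sketch of the .idxmin() variant on made-up rows (the example.com addresses are placeholders). Here james has a 'no' row that is earlier than his first 'yes' row, which is the case where the order of filtering and grouping matters: the labels are looked up in the filtered frame, so they have to be computed from it as well.

```python
import pandas as pd

# Placeholder data: row 0 is an earlier 'no' row for james.
df = pd.DataFrame({
    "email": ["james@example.com", "james@example.com",
              "clark@example.com", "clark@example.com"],
    "date": pd.to_datetime([
        "2022-07-01 12:05:00", "2022-07-01 12:10:00",
        "2022-07-02 08:04:00", "2022-07-02 08:08:00",
    ]),
    "name": ["james", "james", "clark", "clark"],
    "completed": ["no", "yes", "yes", "yes"],
})

# Filter first, then find the label of the earliest date within each group.
done = df[df["completed"] == "yes"]
earliest = done.loc[done.groupby(["email", "name"])["date"].idxmin()]
print(earliest)
```

Unlike the groupby/first approach, this keeps the original row labels, which can be handy for tracing the result back to the source rows.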