My dataframe looks like this (the time column's dtype is datetime64[ns, UTC]):
name job feedback question time
a j1 False q1 2021-09-06 09:25:03.659000+00:00
a j1 True q1 2021-09-06 09:35:03.659000+00:00
a j1 True q2 2021-09-06 09:24:03.659000+00:00
b j1 False q1 2021-09-06 10:25:03.659000+00:00
b j1 True q1 2021-09-06 09:25:04.659000+00:00
c j1 True q3 2021-09-06 11:25:03.659000+00:00
c j1 False q3 2021-09-06 09:00:03.659000+00:00
d j1 False q2 2021-09-06 08:25:03.659000+00:00
d j1 True q2 2021-09-06 11:24:05.659000+00:00
I want to only look at the feedbacks that were given the first time a question was attempted by each name. So:
name job feedback question time
a j1 False q1 2021-09-06 09:25:03.659000+00:00
a j1 True q2 2021-09-06 09:24:03.659000+00:00
b j1 True q1 2021-09-06 09:25:04.659000+00:00
c j1 False q3 2021-09-06 09:00:03.659000+00:00
d j1 False q2 2021-09-06 08:25:03.659000+00:00
I tried:
deduped = df.drop_duplicates(subset=["name", "job", "question"], keep="first")
But it doesn't always work, because the records sometimes arrive in a random order.
Is it possible to sort by time first and then keep my one-liner, as above? Is there any other safe way to do it?
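For reference, the sample frame above can be reconstructed with something like this (a sketch; the values are copied from the table, and the timestamps are assumed to be tz-aware UTC, matching the datetime64[ns, UTC] dtype):
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        "job": ["j1"] * 9,
        "feedback": [False, True, True, False, True, True, False, False, True],
        "question": ["q1", "q1", "q2", "q1", "q1", "q3", "q3", "q2", "q2"],
        "time": pd.to_datetime(
            [
                "2021-09-06 09:25:03.659",
                "2021-09-06 09:35:03.659",
                "2021-09-06 09:24:03.659",
                "2021-09-06 10:25:03.659",
                "2021-09-06 09:25:04.659",
                "2021-09-06 11:25:03.659",
                "2021-09-06 09:00:03.659",
                "2021-09-06 08:25:03.659",
                "2021-09-06 11:24:05.659",
            ],
            utc=True,  # gives the datetime64[ns, UTC] dtype
        ),
    }
)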
CodePudding user response:
Your intuition was correct, you can use sort_values first:
# sort chronologically so that keep="first" retains the earliest attempt per group
deduped = (df.sort_values(by='time')
             .drop_duplicates(subset=["name", "job", "question"], keep="first")
           )
output:
name job feedback question time
7 d j1 False q2 2021-09-06 08:25:03.659000+00:00
6 c j1 False q3 2021-09-06 09:00:03.659000+00:00
2 a j1 True q2 2021-09-06 09:24:03.659000+00:00
0 a j1 False q1 2021-09-06 09:25:03.659000+00:00
4 b j1 True q1 2021-09-06 09:25:04.659000+00:00
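As for another safe way: if you would rather not depend on row order at all, one option (a sketch reusing the same df and column names) is to pick, for each group, the row whose time is the minimum:
# idxmin() returns, per (name, job, question) group, the index label of the
# row with the earliest time; .loc then pulls those full rows back out
first_attempts = df.loc[df.groupby(["name", "job", "question"])["time"].idxmin()]

# optional: restore the original row order
first_attempts = first_attempts.sort_index()
This gives the same rows as the sort_values + drop_duplicates approach, without relying on the order in which the records happen to arrive.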