Home > Software design >  Dropping duplicates in a dataframe while keeping the oldest record
Dropping duplicates in a dataframe while keeping the oldest record

Time:11-04

My dataframe looks like this (it says the time format is datetime64[ns, UTC]):

name     job      feedback    question    time
a        j1       False       q1          2021-09-06 09:25:03.659000 00:00
a        j1       True        q1          2021-09-06 09:35:03.659000 00:00
a        j1       True        q2          2021-09-06 09:24:03.659000 00:00
b        j1       False       q1          2021-09-06 10:25:03.659000 00:00
b        j1       True        q1          2021-09-06 09:25:04.659000 00:00
c        j1       True        q3          2021-09-06 11:25:03.659000 00:00
c        j1       False       q3          2021-09-06 09:00:03.659000 00:00
d        j1       False       q2          2021-09-06 08:25:03.659000 00:00
d        j1       True        q2          2021-09-06 11:24:05.659000 00:00

I want to only look at the feedbacks that were given the first time a question was attempted by each name. So:

    name     job      feedback    question    time
    a        j1       False       q1          2021-09-06 09:25:03.659000 00:00
    a        j1       True        q2          2021-09-06 09:24:03.659000 00:00
    b        j1       True        q1          2021-09-06 09:25:04.659000 00:00
    c        j1       False       q3          2021-09-06 09:00:03.659000 00:00
    d        j1       False       q2          2021-09-06 08:25:03.659000 00:00

I tried:

deduped = df.drop_duplicates(subset=["name", "job", "question"], keep="first") 

But it doesn't always work since the records are in random orders sometimes.

Is it possible to sort by time first and then keep my one-liner, as above? Is there any other safe way to do it?

CodePudding user response:

Your intuition was correct, you can sort_values:

deduped = (df.sort_values(by='time')
             .drop_duplicates(subset=["name", "job", "question"], keep="first")
          )

output:

  name job  feedback question                              time
7    d  j1     False       q2  2021-09-06 08:25:03.659000 00:00
6    c  j1     False       q3  2021-09-06 09:00:03.659000 00:00
2    a  j1      True       q2  2021-09-06 09:24:03.659000 00:00
0    a  j1     False       q1  2021-09-06 09:25:03.659000 00:00
4    b  j1      True       q1  2021-09-06 09:25:04.659000 00:00
  • Related