Deleting rows where two column values appeared before

Time:10-14

I have a dataframe that looks like this:

user    institution      group       result      quiz
a       Company          1a          True        zzz
a       Company          1a          False       yyy
a       Company          1a          True        yyy
a       Company          1a          False       www
b       Company          1a          False       www
c       Company          1b          False       yyy
c       Company          1b          True        yyy
d       Company          1c          True        qqq
d       Company          1c          True        zzz
d       Company          1c          True        yyy
e       Company          1c          False       zzz
e       Company          1c          False       yyy
e       Company          1c          False       yyy

The institution is the same (in the original dataset I have multiple institutions, but I split the dataset into multiple dataframes like the one above). The users can answer multiple quizzes multiple times. Each user can only be in one group.

How can I keep only the first answer each user gave to each quiz they answered? The expected output is:

    user    institution      group       FirstResult      quiz
    a       Company          1a          True             zzz
    a       Company          1a          False            yyy
    a       Company          1a          False            www
    b       Company          1a          False            www
    c       Company          1b          False            yyy
    d       Company          1c          True             qqq
    d       Company          1c          True             zzz
    d       Company          1c          True             yyy
    e       Company          1c          False            zzz
    e       Company          1c          False            yyy

I naively tried to delete the rows manually using df.drop, but this is impossible when the dataframe is big. The condition would basically be that the same user and quiz combination occurs a second, third, or nth time.

CodePudding user response:

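pandas has the DataFrame.drop_duplicates() method, which with keep="first" retains only the first row for each combination of the columns listed in subset: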

deduped = df.drop_duplicates(subset=["user", "institution", "group", "quiz"], keep="first")
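
To check this against the sample data, here is a minimal sketch. It assumes the rows are already in chronological order, since the question shows no timestamp column and "first" can only mean "first row encountered". The subset is narrowed to user and quiz, which should be equivalent here because each user belongs to one group and the institution is constant. The rename to FirstResult is only there to match the desired output.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    ("a", "Company", "1a", True, "zzz"),
    ("a", "Company", "1a", False, "yyy"),
    ("a", "Company", "1a", True, "yyy"),
    ("a", "Company", "1a", False, "www"),
    ("b", "Company", "1a", False, "www"),
    ("c", "Company", "1b", False, "yyy"),
    ("c", "Company", "1b", True, "yyy"),
    ("d", "Company", "1c", True, "qqq"),
    ("d", "Company", "1c", True, "zzz"),
    ("d", "Company", "1c", True, "yyy"),
    ("e", "Company", "1c", False, "zzz"),
    ("e", "Company", "1c", False, "yyy"),
    ("e", "Company", "1c", False, "yyy"),
]
df = pd.DataFrame(rows, columns=["user", "institution", "group", "result", "quiz"])

# Keep the first row seen for each (user, quiz) pair.
# Row order decides what "first" means, so sort beforehand if needed.
first_answers = (
    df.drop_duplicates(subset=["user", "quiz"], keep="first")
      .rename(columns={"result": "FirstResult"})
      .reset_index(drop=True)
)

print(first_answers)
```

If you only want to inspect which rows would be dropped, df.duplicated(subset=["user", "quiz"]) returns the corresponding boolean mask.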