I have a dataframe that looks like this:
user institution group result quiz
a Company 1a True zzz
a Company 1a False yyy
a Company 1a True yyy
a Company 1a False www
b Company 1a False www
c Company 1b False yyy
c Company 1b True yyy
d Company 1c True qqq
d Company 1c True zzz
d Company 1c True yyy
e Company 1c False zzz
e Company 1c False yyy
e Company 1c False yyy
The institution is the same (in the original dataset I have multiple institutions, but I split the dataset into multiple dataframes like the one above). The users can answer multiple quizzes multiple times. Each user can only be in one group.
How can I keep only the first answer each user gave to each quiz they answered? The expected output is:
user institution group FirstResult quiz
a Company 1a True zzz
a Company 1a False yyy
a Company 1a False www
b Company 1a False www
c Company 1b False yyy
d Company 1c True qqq
d Company 1c True zzz
d Company 1c True yyy
e Company 1c False zzz
e Company 1c False yyy
I naively tried to delete the rows manually using df.drop, but this is impossible when the dataframe is big. The condition is basically that the same user and quiz combination occurs a second, third, or nth time.
CodePudding user response:
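pandas has the DataFrame.drop_duplicates() method, which keeps the first row for each combination of the columns listed in subset (keep="first" is the default, so it is spelled out here only for clarity):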
deduped = df.drop_duplicates(subset=["user", "institution", "group", "quiz"], keep="first")
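For a version that can be run end to end, here is a minimal sketch that rebuilds the sample dataframe from the question and applies the same idea. It assumes the row order reflects the order in which the answers were given (if there is a timestamp column, sort by it first). The names rows and first_answers are made up for illustration, and the subset is reduced to user and quiz, which should be enough here because each user belongs to a single group and institution.

```python
import pandas as pd

# Sample data copied from the question; row order is assumed to be answer order.
rows = [
    ("a", "Company", "1a", True, "zzz"),
    ("a", "Company", "1a", False, "yyy"),
    ("a", "Company", "1a", True, "yyy"),
    ("a", "Company", "1a", False, "www"),
    ("b", "Company", "1a", False, "www"),
    ("c", "Company", "1b", False, "yyy"),
    ("c", "Company", "1b", True, "yyy"),
    ("d", "Company", "1c", True, "qqq"),
    ("d", "Company", "1c", True, "zzz"),
    ("d", "Company", "1c", True, "yyy"),
    ("e", "Company", "1c", False, "zzz"),
    ("e", "Company", "1c", False, "yyy"),
    ("e", "Company", "1c", False, "yyy"),
]
df = pd.DataFrame(rows, columns=["user", "institution", "group", "result", "quiz"])

# Keep the first row seen for every (user, quiz) pair, then rename the
# result column to match the desired output and renumber the index.
first_answers = (
    df.drop_duplicates(subset=["user", "quiz"], keep="first")
      .rename(columns={"result": "FirstResult"})
      .reset_index(drop=True)
)

print(first_answers)
```

If you only need to flag the repeated attempts rather than drop them, df.duplicated(subset=["user", "quiz"]) returns a boolean mask that is True for the second and later occurrences.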