Checking unique enteries and Removing Duplicates in pandas dataframe of user id interactions-CodePudding

I have this dataframe which looks like this:

user_id : Represents user

question_id : Represent question number

user_answer : which option user has opted for the specific question from (A,B,C,D)

correct_answer: What is correct answer for that specific question

correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right

elapsed_time : it represents time in minutes user took to answer that question

user_iD	question_id	user_answer	correct answer	correct	elapsed_time	solving_id	bundle_id
1	1	A	D	0.0	5.00	1	1
1	1	B	D	0.0	3.00	1	1
1	1	D	D	1.0	7.00	1	1
2	10	C	B	0.0	5.00	10	5
2	10	B	B	1.0	15.0	10	5
2	10	B	B	1.0	2.00	10	5
3	25	C	A	0.0	5.00	25	7
3	25	A	A	1.0	1.00	25	7

What I want another dataframe to look like is this

user_iD	question_id	user_answer	correct answer	correct	elapsed_time	solving_id	bundle_id
1	1	D	D	1.0	7.00	1	1
2	10	B	B	1.0	15.0	10	5
3	25	A	A	1.0	1.00	25	7

It should remove the duplicate questions attempted by a user and keeps the one which user corrected the very first time.

What I have done so far , this gives me an error

df = df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])

ERROR : AttributeError: 'DataFrame' object has no attribute 'drop_duplicated'

CodePudding user response：

drop_duplicated wrong typo. use drop_duplicates.

Also, check keep parameter to preserve row you want.

CodePudding user response：

For your boolean mask, you need duplicated instead of drop_duplicated. So instead of

df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])

you want

mask = df['correct'].eq(1) & ~df.duplicated(['user_iD','question_id','correct'])

This will give you a boolean mask. Then your desired outcome is obtained using this mask on df:

out = df[mask]

You can get the same outcome by filtering for correct first and then using drop_duplicates method:

out = df[df['correct']==1].drop_duplicates('correct answer'))

Output:

   user_iD  question_id user_answer correct answer  correct  elapsed_time  \
2        1            1           D              D      1.0           7.0   
4        2           10           B              B      1.0          15.0   
7        3           25           A              A      1.0           1.0   

   solving_id  bundle_id  
2           1          1  
4          10          5  
7          25          7

CodePudding user response：

groupby and agg are good at conditionally isolating. In this case, filter the ones and find the max in each group of the user_iD

df[df['correct'].eq(1.0)].groupby(df['user_iD']).agg('max')

         
                  user_iD  question_id user_answer correct answer  

correct  \
user_iD                                                             
1              1            1           D              D      1.0   
2              2           10           B              B      1.0   
3              3           25           A              A      1.0   

         elapsed_time  solving_id  bundle_id  
user_iD                                       
1                 7.0           1          1  
2                15.0          10          5  
3                 1.0          25          7

CodePudding user response：

Use Series.eq with Groupby.idxmax and df.loc:

In [174]: ix = df[df.correct.eq(1)].groupby(['user_iD', 'question_id'])['elapsed_time'].idxmax()

In [175]: df.loc[ix]
Out[175]: 
   user_iD  question_id user_answer correct answer  correct  elapsed_time  solving_id  bundle_id
2        1            1           D              D      1.0           7.0           1          1
4        2           10           B              B      1.0          15.0          10          5
7        3           25           A              A      1.0           1.0          25          7