Home > Software design >  Checking unique enteries and Removing Duplicates in pandas dataframe of user id interactions
Checking unique enteries and Removing Duplicates in pandas dataframe of user id interactions

Time:01-03

I have this dataframe which looks like this:

user_id : Represents user

question_id : Represent question number

user_answer : which option user has opted for the specific question from (A,B,C,D)

correct_answer: What is correct answer for that specific question

correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right

elapsed_time : it represents time in minutes user took to answer that question

user_iD question_id user_answer correct answer correct elapsed_time solving_id bundle_id
1 1 A D 0.0 5.00 1 1
1 1 B D 0.0 3.00 1 1
1 1 D D 1.0 7.00 1 1
2 10 C B 0.0 5.00 10 5
2 10 B B 1.0 15.0 10 5
2 10 B B 1.0 2.00 10 5
3 25 C A 0.0 5.00 25 7
3 25 A A 1.0 1.00 25 7

What I want another dataframe to look like is this

user_iD question_id user_answer correct answer correct elapsed_time solving_id bundle_id
1 1 D D 1.0 7.00 1 1
2 10 B B 1.0 15.0 10 5
3 25 A A 1.0 1.00 25 7

It should remove the duplicate questions attempted by a user and keeps the one which user corrected the very first time.

What I have done so far , this gives me an error

df = df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])

ERROR : AttributeError: 'DataFrame' object has no attribute 'drop_duplicated'

CodePudding user response:

drop_duplicated wrong typo. use drop_duplicates.

Also, check keep parameter to preserve row you want.

CodePudding user response:

For your boolean mask, you need duplicated instead of drop_duplicated. So instead of

df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])

you want

mask = df['correct'].eq(1) & ~df.duplicated(['user_iD','question_id','correct'])

This will give you a boolean mask. Then your desired outcome is obtained using this mask on df:

out = df[mask]

You can get the same outcome by filtering for correct first and then using drop_duplicates method:

out = df[df['correct']==1].drop_duplicates('correct answer'))

Output:

   user_iD  question_id user_answer correct answer  correct  elapsed_time  \
2        1            1           D              D      1.0           7.0   
4        2           10           B              B      1.0          15.0   
7        3           25           A              A      1.0           1.0   

   solving_id  bundle_id  
2           1          1  
4          10          5  
7          25          7  

CodePudding user response:

groupby and agg are good at conditionally isolating. In this case, filter the ones and find the max in each group of the user_iD

df[df['correct'].eq(1.0)].groupby(df['user_iD']).agg('max')

         
                  user_iD  question_id user_answer correct answer  

correct  \
user_iD                                                             
1              1            1           D              D      1.0   
2              2           10           B              B      1.0   
3              3           25           A              A      1.0   

         elapsed_time  solving_id  bundle_id  
user_iD                                       
1                 7.0           1          1  
2                15.0          10          5  
3                 1.0          25          7 

CodePudding user response:

Use Series.eq with Groupby.idxmax and df.loc:

In [174]: ix = df[df.correct.eq(1)].groupby(['user_iD', 'question_id'])['elapsed_time'].idxmax()

In [175]: df.loc[ix]
Out[175]: 
   user_iD  question_id user_answer correct answer  correct  elapsed_time  solving_id  bundle_id
2        1            1           D              D      1.0           7.0           1          1
4        2           10           B              B      1.0          15.0          10          5
7        3           25           A              A      1.0           1.0          25          7
  • Related