Checking unique enteries in pandas dataframe of user id interactions-CodePudding

I have this dataframe which looks like this:

user_id : Represents user

question_id : Represent question number

user_answer : which option user has opted for the specific question from (A,B,C,D)

correct_answer: What is correct answer for that specific question

correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right

elapsed_time : it represents time in minutes user took to answer that question

user_id	question_id	user_answer	correct_answer	correct	elapsed_time
1	130	A	B	0.0	2.00
1	130	B	B	1.0	5.00
1	130	B	B	1.0	2.00
2	10	C	D	0.0	7.00
2	10	A	D	0.0	9.00
2	10	B	D	0.0	13.00
2	10	D	D	1.0	4.00
2	10	D	D	1.0	1.50

What I want another dataframe to look like is this

user_id	Questions Attempted	Unique Questions Attempted	Percentage
1	9000	6000	=(6000/13169)*100
2	5000	4800	=(5000/13169)*100
5	12000	10000	=(10000/13169)*100
15	1000	30	=(30/13169)*100
23	255	255	=(255/13169)*100

WHAT I HAVE DONE SO FAR IS THIS

df_total_questions_attempted = df.groupby(['user_iD'], as_index=False , sort=False)['question_id'].count()

df_total_questions_attempted = df_total_questions_attempted.rename(columns={'question_id': 'Total Questions Attempted'})

This gives me the Questions Attempted column , How do i find the Unique Questions Attempted column?

and for Unique Questions Attempted column , I need that unique entry when the user has corrected the answer.

For example :

user_id	question_id	user_answer	correct_answer	correct	elapsed_time
1	130	A	B	0.0	2.00
1	130	B	B	1.0	5.00
1	130	B	B	1.0	2.00
2	10	C	D	0.0	7.00
2	10	A	D	0.0	9.00
2	10	D	D	1.0	4.00
2	10	D	D	1.0	1.50

In this dataframe : the unique question attempt should consider when the user has corrected the question for the very first time , as in this case user 1 has corrected the question in 2nd and 3rd attempt. it should consider the 2nd attempt. same goes for user 2 as well , user 2 has corrected the question in 3rd and 4th attempt , it should consider the 3rd attempt

CodePudding user response：

Create helper column for first correct answer per user_id and question_id by chain mask for test == by Series.eq with DataFrame.duplicated:

df['new'] = df['correct'].eq(1) & ~df.duplicated(['user_id','question_id','correct'])
print (df)
   user_id  question_id user_answer correct_answer  correct  elapsed_time  \
0        1          130           A              B      0.0           2.0   
1        1          130           B              B      1.0           5.0   
2        1          130           B              B      1.0           2.0   
3        2           10           C              D      0.0           7.0   
4        2           10           A              D      0.0           9.0   
5        2           10           B              D      0.0          13.0   
6        2           10           D              D      1.0           4.0   
7        2           10           D              D      1.0           1.5   

     new  
0  False  
1   True  
2  False  
3  False  
4  False  
5  False  
6   True  
7  False

And then for count Trues aggregate sum and for count unique questions use DataFrameGroupBy.nunique:

df1 = (df.groupby(['user_id'])
        .agg(**{'Questions Attempted':('question_id','nunique'),
                'Unique Questions Attempted':('new','sum')})
        .reset_index())

Last seems for percentage is necessary divide by number of unique questions:

no_uniq_q = df['correct'].nunique()
df1['Percentage'] = df1['Unique Questions Attempted'].div(no_uniq_q).mul(100)
print (df1)
   user_id  Questions Attempted  Unique Questions Attempted  Percentage
0        1                    1                           1        50.0
1        2                    1                           1        50.0