I have this dataframe which looks like this:
user_id : Represents user
question_id : Represent question number
user_answer : which option user has opted for the specific question from (A,B,C,D)
correct_answer: What is correct answer for that specific question
correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right
elapsed_time : it represents time in minutes user took to answer that question
user_id | question_id | user_answer | correct_answer | correct | elapsed_time |
---|---|---|---|---|---|
1 | 130 | A | B | 0.0 | 2.00 |
1 | 130 | B | B | 1.0 | 5.00 |
1 | 130 | B | B | 1.0 | 2.00 |
2 | 10 | C | D | 0.0 | 7.00 |
2 | 10 | A | D | 0.0 | 9.00 |
2 | 10 | B | D | 0.0 | 13.00 |
2 | 10 | D | D | 1.0 | 4.00 |
2 | 10 | D | D | 1.0 | 1.50 |
What I want another dataframe to look like is this
user_id | Questions Attempted | Unique Questions Attempted | Percentage |
---|---|---|---|
1 | 9000 | 6000 | =(6000/13169)*100 |
2 | 5000 | 4800 | =(5000/13169)*100 |
5 | 12000 | 10000 | =(10000/13169)*100 |
15 | 1000 | 30 | =(30/13169)*100 |
23 | 255 | 255 | =(255/13169)*100 |
WHAT I HAVE DONE SO FAR IS THIS
df_total_questions_attempted = df.groupby(['user_iD'], as_index=False , sort=False)['question_id'].count()
df_total_questions_attempted = df_total_questions_attempted.rename(columns={'question_id': 'Total Questions Attempted'})
This gives me the Questions Attempted column , How do i find the Unique Questions Attempted column?
and for Unique Questions Attempted column , I need that unique entry when the user has corrected the answer.
For example :
user_id | question_id | user_answer | correct_answer | correct | elapsed_time |
---|---|---|---|---|---|
1 | 130 | A | B | 0.0 | 2.00 |
1 | 130 | B | B | 1.0 | 5.00 |
1 | 130 | B | B | 1.0 | 2.00 |
2 | 10 | C | D | 0.0 | 7.00 |
2 | 10 | A | D | 0.0 | 9.00 |
2 | 10 | D | D | 1.0 | 4.00 |
2 | 10 | D | D | 1.0 | 1.50 |
In this dataframe : the unique question attempt should consider when the user has corrected the question for the very first time , as in this case user 1 has corrected the question in 2nd and 3rd attempt. it should consider the 2nd attempt. same goes for user 2 as well , user 2 has corrected the question in 3rd and 4th attempt , it should consider the 3rd attempt
CodePudding user response:
Create helper column for first correct answer per user_id
and question_id
by chain mask for test ==
by Series.eq
with DataFrame.duplicated
:
df['new'] = df['correct'].eq(1) & ~df.duplicated(['user_id','question_id','correct'])
print (df)
user_id question_id user_answer correct_answer correct elapsed_time \
0 1 130 A B 0.0 2.0
1 1 130 B B 1.0 5.0
2 1 130 B B 1.0 2.0
3 2 10 C D 0.0 7.0
4 2 10 A D 0.0 9.0
5 2 10 B D 0.0 13.0
6 2 10 D D 1.0 4.0
7 2 10 D D 1.0 1.5
new
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
And then for count True
s aggregate sum
and for count unique questions
use DataFrameGroupBy.nunique
:
df1 = (df.groupby(['user_id'])
.agg(**{'Questions Attempted':('question_id','nunique'),
'Unique Questions Attempted':('new','sum')})
.reset_index())
Last seems for percentage is necessary divide by number of unique questions:
no_uniq_q = df['correct'].nunique()
df1['Percentage'] = df1['Unique Questions Attempted'].div(no_uniq_q).mul(100)
print (df1)
user_id Questions Attempted Unique Questions Attempted Percentage
0 1 1 1 50.0
1 2 1 1 50.0