I have this dataframe which looks like this:
user_id : Represents user
question_id : Represent question number
user_answer : which option user has opted for the specific question from (A,B,C,D)
correct_answer: What is correct answer for that specific question
correct : if 0.0 it means users answer was not right , if 1.0 it means user answer is right
elapsed_time : it represents time in minutes user took to answer that question
user_iD | question_id | user_answer | correct answer | correct | elapsed_time | solving_id | bundle_id |
---|---|---|---|---|---|---|---|
1 | 1 | A | D | 0.0 | 5.00 | 1 | 1 |
1 | 1 | B | D | 0.0 | 3.00 | 1 | 1 |
1 | 1 | D | D | 1.0 | 7.00 | 1 | 1 |
2 | 10 | C | B | 0.0 | 5.00 | 10 | 5 |
2 | 10 | B | B | 1.0 | 15.0 | 10 | 5 |
2 | 10 | B | B | 1.0 | 2.00 | 10 | 5 |
3 | 25 | C | A | 0.0 | 5.00 | 25 | 7 |
3 | 25 | A | A | 1.0 | 1.00 | 25 | 7 |
What I want another dataframe to look like is this
user_iD | question_id | user_answer | correct answer | correct | elapsed_time | solving_id | bundle_id |
---|---|---|---|---|---|---|---|
1 | 1 | D | D | 1.0 | 7.00 | 1 | 1 |
2 | 10 | B | B | 1.0 | 15.0 | 10 | 5 |
3 | 25 | A | A | 1.0 | 1.00 | 25 | 7 |
It should remove the duplicate questions attempted by a user and keeps the one which user corrected the very first time.
What I have done so far , this gives me an error
df = df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])
ERROR : AttributeError: 'DataFrame' object has no attribute 'drop_duplicated'
CodePudding user response:
drop_duplicated
wrong typo. use drop_duplicates.
Also, check keep
parameter to preserve row you want.
CodePudding user response:
For your boolean mask, you need duplicated
instead of drop_duplicated
. So instead of
df['correct'].eq(1) & ~df.drop_duplicated(['user_iD','question_id','correct'])
you want
mask = df['correct'].eq(1) & ~df.duplicated(['user_iD','question_id','correct'])
This will give you a boolean mask. Then your desired outcome is obtained using this mask on df
:
out = df[mask]
You can get the same outcome by filtering for correct
first and then using drop_duplicates
method:
out = df[df['correct']==1].drop_duplicates('correct answer'))
Output:
user_iD question_id user_answer correct answer correct elapsed_time \
2 1 1 D D 1.0 7.0
4 2 10 B B 1.0 15.0
7 3 25 A A 1.0 1.0
solving_id bundle_id
2 1 1
4 10 5
7 25 7
CodePudding user response:
groupby and agg are good at conditionally isolating. In this case, filter the ones and find the max in each group of the user_iD
df[df['correct'].eq(1.0)].groupby(df['user_iD']).agg('max')
user_iD question_id user_answer correct answer
correct \
user_iD
1 1 1 D D 1.0
2 2 10 B B 1.0
3 3 25 A A 1.0
elapsed_time solving_id bundle_id
user_iD
1 7.0 1 1
2 15.0 10 5
3 1.0 25 7
CodePudding user response:
Use Series.eq
with Groupby.idxmax
and df.loc
:
In [174]: ix = df[df.correct.eq(1)].groupby(['user_iD', 'question_id'])['elapsed_time'].idxmax()
In [175]: df.loc[ix]
Out[175]:
user_iD question_id user_answer correct answer correct elapsed_time solving_id bundle_id
2 1 1 D D 1.0 7.0 1 1
4 2 10 B B 1.0 15.0 10 5
7 3 25 A A 1.0 1.0 25 7