I have a a dataframe representing students' activity on exams in certain study groups. A student can be part of multiple groups. They can also answer a question multiple times:
study_group student answer question
s1 a True q1
s1 a False q2
s1 a True q2
s1 b True q2
s1 b True q2
s2 a False q3
s2 a False q4
s2 a True q4
s2 c False q1
s2 d True q2
s3 e True q1
s3 e True q5
s3 c True q2
s3 a True q6
I made various statistics regarding performance of the study groups, comparing them in pairs and evaluating them only on common exams so that it is not biased. But is it possible to see if a student who is part of multiple study groups actually performed better in one than in the other?
I manually sliced the dataframe by student
and then again by study_group
and computed and average for every student in every group. But that was extremely inefficient. I would like to have an output like (I hope I didn't make any computation mistake) :
study_group student Average
s1 a 0.66
s2 a 0.33
s3 a 1
s1 b 1
s2 c 0
s3 c 1
s2 d 1
s3 e 1
Is there any quick way to do it?
CodePudding user response:
>>> g = df.groupby(['student','study_group'])
>>> g.mean().round(2)
student study_group
a s1 0.67
s2 0.33
s3 1.00
b s1 1.00
c s2 0.00
s3 1.00
d s2 1.00
e s3 1.00
Name: answer, dtype: float64
Or, if you want a one-liner:
>>> df.groupby(['student','study_group']).answer.mean().round(2).unstack()
study_group s1 s2 s3
student
a 0.67 0.33 1.0
b 1.00 NaN NaN
c NaN 0.00 1.0
d NaN 1.00 NaN
e NaN NaN 1.0
Incidentally, if you're interested in which students were in how many study groups:
>>> df.groupby(['student','study_group']).size().unstack().count(axis=1)
student
a 3
b 1
c 2
d 1
e 1
dtype: int64