Home > Net >  Double slicing a dataframe
Double slicing a dataframe

Time:10-28

I have a a dataframe representing students' activity on exams in certain study groups. A student can be part of multiple groups. They can also answer a question multiple times:

study_group     student    answer    question
s1              a          True      q1
s1              a          False     q2
s1              a          True      q2
s1              b          True      q2
s1              b          True      q2
s2              a          False     q3
s2              a          False     q4
s2              a          True      q4
s2              c          False     q1
s2              d          True      q2
s3              e          True      q1
s3              e          True      q5
s3              c          True      q2
s3              a          True      q6

I made various statistics regarding performance of the study groups, comparing them in pairs and evaluating them only on common exams so that it is not biased. But is it possible to see if a student who is part of multiple study groups actually performed better in one than in the other?

I manually sliced the dataframe by student and then again by study_group and computed and average for every student in every group. But that was extremely inefficient. I would like to have an output like (I hope I didn't make any computation mistake) :

study_group      student  Average
s1               a           0.66
s2               a           0.33
s3               a              1
s1               b              1
s2               c              0 
s3               c              1
s2               d              1
s3               e              1

Is there any quick way to do it?

CodePudding user response:

>>> g = df.groupby(['student','study_group'])
>>> g.mean().round(2)
student  study_group
a        s1             0.67
         s2             0.33
         s3             1.00
b        s1             1.00
c        s2             0.00
         s3             1.00
d        s2             1.00
e        s3             1.00
Name: answer, dtype: float64

Or, if you want a one-liner:

>>> df.groupby(['student','study_group']).answer.mean().round(2).unstack()
study_group    s1    s2   s3
student                     
a            0.67  0.33  1.0
b            1.00   NaN  NaN
c             NaN  0.00  1.0
d             NaN  1.00  NaN
e             NaN   NaN  1.0

Incidentally, if you're interested in which students were in how many study groups:

>>> df.groupby(['student','study_group']).size().unstack().count(axis=1)
student
a    3
b    1
c    2
d    1
e    1
dtype: int64
  • Related