Find standard deviation of a column based of values from another column and group by-CodePudding

I have a data frame looking like this:

classid  grade  haveTeacher
0        99     1
1        40     1
1        50     0
1        70     1
2        50     0
3        34     0

I'd like to find out what I could write in pandas to find out the standard deviation of "grade" across classid that have a teacher (1 means there is a teacher). I know we would have to groupby "classid", but I was wondering what would go inside the .apply and lambda function to fulfill all these conditionals?

CodePudding user response：

For improve performance first set missing values if no 1 in haveTeacher by Series.where and then aggregate std:

df = (df['grade'].where(df['haveTeacher'].eq(1))
                 .groupby(df['classid'])
                 .std()
                 .reset_index(name='std'))
print (df)
   classid        std
0        0        NaN
1        1  21.213203
2        2        NaN
3        3        NaN

Solution suggested in question should be slow if large DataFrame:

df = (df.groupby('classid')
        .apply(lambda x: x.loc[x['haveTeacher'].eq(1), 'grade'].std())
        .reset_index(name='std'))
print (df)
   classid        std
0        0        NaN
1        1  21.213203
2        2        NaN
3        3        NaN

CodePudding user response：

You might first want to get the dataframe with records having teacher - df[df['haveteacher'] == 1]. Once you get this you can do a groupby(classid) and use numpy.std (import numpy as np before that ) function to find the standard devitation of that group so you have -

>>> df[df['haveteacher'] == 1].groupby(['classid']).agg({'grade': np.std})

output is -

grade
classid           
0              NaN
1        21.213203