I have a data frame looking like this:
classid grade haveTeacher
0 99 1
1 40 1
1 50 0
1 70 1
2 50 0
3 34 0
I'd like to find out what I could write in pandas to find out the standard deviation of "grade" across classid that have a teacher (1 means there is a teacher). I know we would have to groupby "classid", but I was wondering what would go inside the .apply and lambda function to fulfill all these conditionals?
CodePudding user response:
For improve performance first set missing values if no 1
in haveTeacher
by Series.where
and then aggregate std
:
df = (df['grade'].where(df['haveTeacher'].eq(1))
.groupby(df['classid'])
.std()
.reset_index(name='std'))
print (df)
classid std
0 0 NaN
1 1 21.213203
2 2 NaN
3 3 NaN
Solution suggested in question should be slow if large DataFrame:
df = (df.groupby('classid')
.apply(lambda x: x.loc[x['haveTeacher'].eq(1), 'grade'].std())
.reset_index(name='std'))
print (df)
classid std
0 0 NaN
1 1 21.213203
2 2 NaN
3 3 NaN
CodePudding user response:
You might first want to get the dataframe with records having teacher - df[df['haveteacher'] == 1]
. Once you get this you can do a groupby(classid)
and use numpy.std (import numpy as np
before that ) function to find the standard devitation of that group
so you have -
>>> df[df['haveteacher'] == 1].groupby(['classid']).agg({'grade': np.std})
output is -
grade
classid
0 NaN
1 21.213203