Say I have this dummy pandas DataFrame:
  Feature1  Feature2
0        X         0
1        X         0
2        Y         0
3        Y         1
4        Y         1
5        X         1
6        Y         0
7        X         1
8        Y         1
9        X         0
How do I calculate the average of Feature2 only when the value of Feature1 is X, and the average of Feature2 again only when the value of Feature1 is Y? I figure it's by using groupby, but it's not working for me.
My attempt (making a function to find the difference in the two averages):
def diff_of_avg(df, column_name, groupby_var):
    groupby_var = df.groupby(groupby_var)
    avgs = groupby_var[column_name].mean()
    return avgs.loc['1'] - avgs.loc['0']
where groupby_var is Feature2 and column_name is Feature1.
Answer:
You can indeed use groupby():
df2 = df.groupby('Feature1').mean()
Output:
          Feature2
Feature1
X              0.4
Y              0.6
The docs for mean() give some examples as well.
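For reference, here is a minimal sketch that reproduces the result above. It assumes the DataFrame is built from the sample data in the question. It also selects the Feature2 column explicitly, which returns a Series and avoids averaging any other columns the real data might contain:

```python
import pandas as pd

# Sample data copied from the question (rows 0 to 9).
df = pd.DataFrame({
    "Feature1": list("XXYYYXYXYX"),
    "Feature2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Mean of Feature2 within each Feature1 group, as a Series indexed by X and Y.
means = df.groupby("Feature1")["Feature2"].mean()
print(means)
```

Individual group means can then be read with means["X"] and means["Y"].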
To find the difference in the averages of X and Y, you can do this:
diffOfAverages = df.groupby('Feature1').mean().diff().iloc[-1,-1]
Output:
0.19999999999999996
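The function in the question can be adapted along the same lines. The sketch below rests on two assumptions about the intent: that the grouping column should be Feature1 and the averaged column Feature2 (the question passes them the other way round), and that the group labels are 'X' and 'Y' rather than '1' and '0'. The label_a and label_b parameters are hypothetical additions, not part of the original function:

```python
import pandas as pd

def diff_of_avg(df, column_name, groupby_var, label_a, label_b):
    """Mean of column_name in group label_a minus its mean in group label_b."""
    avgs = df.groupby(groupby_var)[column_name].mean()
    return avgs.loc[label_a] - avgs.loc[label_b]

# Sample data copied from the question.
df = pd.DataFrame({
    "Feature1": list("XXYYYXYXYX"),
    "Feature2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Average of Feature2 where Feature1 is Y, minus the average where it is X.
result = diff_of_avg(df, "Feature2", "Feature1", "Y", "X")
print(result)
```

Note that .loc needs labels that actually exist in the grouped index, so the string labels '1' and '0' raise a KeyError when the group keys are 'X' and 'Y' or the integers 1 and 0.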