I want the ability to use custom functions in pandas groupby agg(). I Know there is the option of using apply but doing several aggregations is what I want. Below is my test code that I tried to get working for the weighted average.
Python Code
import pandas as pd
import numpy as np
def weighted_avg(df, values, weights):
'''To calculate a weighted average in Pandas. Demo see https://www.statology.org/pandas-weighted-average/
Example: df.groupby('Group Names').apply(w_avg, 'Results', 'AFY')'''
v = df[values]
w = df[weights]
return (v * w).sum() / w.sum()
# below creates a dataframe.
dfr = pd.DataFrame(np.random.randint(1,50,size=(4,4)), columns=list('ABCD'))
dfr['group'] = [1, 1, 0, 1]
print(dfr)
dfr = dfr.groupby('group').agg({'A':'mean', 'B':'sum',
'C': lambda x: weighted_avg(dfr, 'D', 'C')}).reset_index()
print(dfr)
Results - Output
A B C D group
0 5 2 17 38 1
1 35 30 22 32 1
2 15 18 16 11 0
3 46 6 20 34 1
group A B C
0 0 15.000000 18 29.413333
1 1 28.666667 38 29.413333
The problem: The weighted average is returning the value for the whole table and not the 'group' column. How can I get the weighted average by group working?
I did try placing the groupby inside the function like shown here but no success. Thank you for taking a look.
CodePudding user response:
You can use x
you have in lambda (specifically, use it's .index
to get values you want). For example:
import pandas as pd
import numpy as np
def weighted_avg(group_df, whole_df, values, weights):
v = whole_df.loc[group_df.index, values]
w = whole_df.loc[group_df.index, weights]
return (v * w).sum() / w.sum()
dfr = pd.DataFrame(np.random.randint(1, 50, size=(4, 4)), columns=list("ABCD"))
dfr["group"] = [1, 1, 0, 1]
print(dfr)
dfr = (
dfr.groupby("group")
.agg(
{"A": "mean", "B": "sum", "C": lambda x: weighted_avg(x, dfr, "D", "C")}
)
.reset_index()
)
print(dfr)
Prints:
A B C D group
0 32 2 34 29 1
1 33 32 15 49 1
2 4 43 41 10 0
3 39 33 7 31 1
group A B C
0 0 4.000000 43 10.000000
1 1 34.666667 67 34.607143
CodePudding user response:
Where you have written
lambda x: weighted_avg(dfr, 'D', 'C')
this will calculate the weighted average over dfr
, i.e. the whole table.
If you change it to
lambda group: weighted_avg(group, "D", "C")
then I think it may work.
(I've changed the name of the lambda variable to group
since x
is not very descriptive)