I'm trying to calculate a simple weighted average over multiple columns at once grouped by an index column with some of the column values being NaN.
Below is a sample dataset:
df = pd.DataFrame(np.random.choice([0,1,np.nan], size=(5,5)), columns=list('ABCDE'))
df['F'] = 'XYZ'
df['weight'] = 5 np.random.sample(5) * 5
A B C D E F weight
0.0 0.0 0.0 NaN NaN XYZ 7.754209
0.0 1.0 1.0 0.0 0.0 XYZ 5.811653
0.0 NaN 1.0 0.0 1.0 XYZ 7.858809
1.0 0.0 1.0 0.0 1.0 XYZ 7.690689
NaN 1.0 0.0 0.0 0.0 XYZ 5.092012
And below is my attempt to calculate the weighted average by excluding the NaN values from both the columns as well as weights.
def weighted_avg(df, index_col, weight_col):
cols = [c for c in df.columns if c not in ['weight_col', 'index_col']]
df.loc[:, cols] = df.loc[:, cols].mul(df[weight_col], axis=0)
agg = df.groupby(df[index_col]).sum()
agg.loc[:, cols] = agg.loc[:, cols].div(agg[weight_col], axis=0)
return agg
weighted_avg(df, 'F', 'weight')
However, I'm getting this error:
TypeError: can't multiply sequence by non-int of type 'float'
I checked the data types and all columns except F
are float
.
This is the expected output:
F A B C D E
XYZ 0.26414542 0.413823896 0.624460453 0 0.587812429
I'm relatively new to Python and did find similar problems while extensively searching SO but still couldn't make this simple solution work.
Any help would be much appreciated. TIA.
CodePudding user response:
Simplified solution
x = df.drop(columns=['F', 'weight']) # x values
w = x.notna().mul(df['weight'], axis=0) # weights excluding nulls
wx = w * x # weights * x values
avg = wx.groupby(df['F']).sum() / w.groupby(df['F']).sum() # sum(w * x) / sum(w)
Explained
Drop the index and weight columns to get x values
# x
A B C D E
0 0.0 0.0 0.0 NaN NaN
1 0.0 1.0 1.0 0.0 0.0
2 0.0 NaN 1.0 0.0 1.0
3 1.0 0.0 1.0 0.0 1.0
4 NaN 1.0 0.0 0.0 0.0
Create a boolean mask using notna
then multiply by weights along axis 0
to project the weights values to each column
# w
A B C D E
0 7.754209 7.754209 7.754209 0.000000 0.000000
1 5.811653 5.811653 5.811653 5.811653 5.811653
2 7.858809 0.000000 7.858809 7.858809 7.858809
3 7.690689 7.690689 7.690689 7.690689 7.690689
4 0.000000 5.092012 5.092012 5.092012 5.092012
Multiple the x
values by weights w
# wx
A B C D E
0 0.000000 0.000000 0.000000 NaN NaN
1 0.000000 5.811653 5.811653 0.0 0.000000
2 0.000000 NaN 7.858809 0.0 7.858809
3 7.690689 0.000000 7.690689 0.0 7.690689
4 NaN 5.092012 0.000000 0.0 0.000000
Group the wx
and w
dataframe by index column F
and aggregate with sum
# wx.groupby(df['F']).sum()
A B C D E
F
XYZ 7.690689 10.903665 21.361151 0.0 15.549498
# w.groupby(df['F']).sum()
A B C D E
F
XYZ 29.11536 26.348563 34.207372 26.453163 26.453163
Divide the aggregated sums to calculate weighted average
# avg
A B C D E
F
XYZ 0.264145 0.413824 0.62446 0.0 0.587812