Home > database >  Calculate weighted average for multiple columns with NaN values grouped by index in Python
Calculate weighted average for multiple columns with NaN values grouped by index in Python

Time:08-11

I'm trying to calculate a simple weighted average over multiple columns at once grouped by an index column with some of the column values being NaN.

Below is a sample dataset:

df = pd.DataFrame(np.random.choice([0,1,np.nan], size=(5,5)), columns=list('ABCDE'))
df['F'] = 'XYZ'
df['weight'] = 5   np.random.sample(5) * 5
A   B   C   D   E   F   weight
0.0 0.0 0.0 NaN NaN XYZ 7.754209
0.0 1.0 1.0 0.0 0.0 XYZ 5.811653
0.0 NaN 1.0 0.0 1.0 XYZ 7.858809
1.0 0.0 1.0 0.0 1.0 XYZ 7.690689
NaN 1.0 0.0 0.0 0.0 XYZ 5.092012

And below is my attempt to calculate the weighted average by excluding the NaN values from both the columns as well as weights.

def weighted_avg(df, index_col, weight_col):
    cols = [c for c in df.columns if c not in ['weight_col', 'index_col']]
    df.loc[:, cols] = df.loc[:, cols].mul(df[weight_col], axis=0)
    agg = df.groupby(df[index_col]).sum()
    agg.loc[:, cols] = agg.loc[:, cols].div(agg[weight_col], axis=0)
    return agg

weighted_avg(df, 'F', 'weight')

However, I'm getting this error:

TypeError: can't multiply sequence by non-int of type 'float'

I checked the data types and all columns except F are float.

This is the expected output:

F   A           B           C           D   E
XYZ 0.26414542  0.413823896 0.624460453 0   0.587812429

I'm relatively new to Python and did find similar problems while extensively searching SO but still couldn't make this simple solution work.

Any help would be much appreciated. TIA.

CodePudding user response:

Simplified solution

x = df.drop(columns=['F', 'weight']) # x values
w = x.notna().mul(df['weight'], axis=0) # weights excluding nulls

wx = w * x # weights * x values
avg = wx.groupby(df['F']).sum() / w.groupby(df['F']).sum() # sum(w * x) / sum(w)

Explained

Drop the index and weight columns to get x values

# x
     A    B    C    D    E
0  0.0  0.0  0.0  NaN  NaN
1  0.0  1.0  1.0  0.0  0.0
2  0.0  NaN  1.0  0.0  1.0
3  1.0  0.0  1.0  0.0  1.0
4  NaN  1.0  0.0  0.0  0.0

Create a boolean mask using notna then multiply by weights along axis 0 to project the weights values to each column

# w
          A         B         C         D         E
0  7.754209  7.754209  7.754209  0.000000  0.000000
1  5.811653  5.811653  5.811653  5.811653  5.811653
2  7.858809  0.000000  7.858809  7.858809  7.858809
3  7.690689  7.690689  7.690689  7.690689  7.690689
4  0.000000  5.092012  5.092012  5.092012  5.092012

Multiple the x values by weights w

# wx
          A         B         C    D         E
0  0.000000  0.000000  0.000000  NaN       NaN
1  0.000000  5.811653  5.811653  0.0  0.000000
2  0.000000       NaN  7.858809  0.0  7.858809
3  7.690689  0.000000  7.690689  0.0  7.690689
4       NaN  5.092012  0.000000  0.0  0.000000

Group the wx and w dataframe by index column F and aggregate with sum

# wx.groupby(df['F']).sum()
            A          B          C    D          E
F                                                  
XYZ  7.690689  10.903665  21.361151  0.0  15.549498


# w.groupby(df['F']).sum()
            A          B          C          D          E
F                                                        
XYZ  29.11536  26.348563  34.207372  26.453163  26.453163

Divide the aggregated sums to calculate weighted average

# avg
            A         B        C    D         E
F                                              
XYZ  0.264145  0.413824  0.62446  0.0  0.587812
  • Related