Home > Net >  python pandas weighted average with the use of groupby agg()
python pandas weighted average with the use of groupby agg()

Time:04-16

I want the ability to use custom functions in pandas groupby agg(). I Know there is the option of using apply but doing several aggregations is what I want. Below is my test code that I tried to get working for the weighted average.

Python Code

import pandas as pd
import numpy as np

def weighted_avg(df, values, weights):
    '''To calculate a weighted average in Pandas. Demo see https://www.statology.org/pandas-weighted-average/
    Example: df.groupby('Group Names').apply(w_avg, 'Results', 'AFY')'''
    v = df[values]
    w = df[weights]
    return (v * w).sum() / w.sum()

# below creates a dataframe.
dfr = pd.DataFrame(np.random.randint(1,50,size=(4,4)), columns=list('ABCD'))
dfr['group'] = [1, 1, 0, 1]

print(dfr)
dfr = dfr.groupby('group').agg({'A':'mean', 'B':'sum',
                    'C': lambda x: weighted_avg(dfr, 'D', 'C')}).reset_index()
print(dfr)

Results - Output

    A   B   C   D  group
0   5   2  17  38      1
1  35  30  22  32      1
2  15  18  16  11      0
3  46   6  20  34      1
    group     A      B       C
0      0  15.000000  18  29.413333
1      1  28.666667  38  29.413333

The problem: The weighted average is returning the value for the whole table and not the 'group' column. How can I get the weighted average by group working?

I did try placing the groupby inside the function like shown here but no success. Thank you for taking a look.

CodePudding user response:

You can use x you have in lambda (specifically, use it's .index to get values you want). For example:

import pandas as pd
import numpy as np


def weighted_avg(group_df, whole_df, values, weights):
    v = whole_df.loc[group_df.index, values]
    w = whole_df.loc[group_df.index, weights]
    return (v * w).sum() / w.sum()


dfr = pd.DataFrame(np.random.randint(1, 50, size=(4, 4)), columns=list("ABCD"))
dfr["group"] = [1, 1, 0, 1]

print(dfr)
dfr = (
    dfr.groupby("group")
    .agg(
        {"A": "mean", "B": "sum", "C": lambda x: weighted_avg(x, dfr, "D", "C")}
    )
    .reset_index()
)
print(dfr)

Prints:

    A   B   C   D  group
0  32   2  34  29      1
1  33  32  15  49      1
2   4  43  41  10      0
3  39  33   7  31      1

   group          A   B          C
0      0   4.000000  43  10.000000
1      1  34.666667  67  34.607143

CodePudding user response:

Where you have written lambda x: weighted_avg(dfr, 'D', 'C')

this will calculate the weighted average over dfr, i.e. the whole table.

If you change it to lambda group: weighted_avg(group, "D", "C")

then I think it may work.

(I've changed the name of the lambda variable to group since x is not very descriptive)

  • Related