I have a dataset,df
that looks as follows:
Date | Code | Population | QTY |
---|---|---|---|
2021-01-01 | 1000 | 10000 | 5000 |
2021-01-02 | 1000 | 10000 | 5013 |
2021-01-01 | 2200 | 33000 | 10022 |
2021-01-02 | 2200 | 33000 | 10002 |
I would like to take the daily weighted average of my dataframe, where my weight is my Population
, and store it in a new dataset, df_avg
that contains only Date
and WEIGHTED QTY
. How can I do that?
CodePudding user response:
There are two ways: 1) groupby.apply
with np.average
and 2) manually calculate with lazy groupby.
Path 1 is easier to code, but a little slower:
df.groupby('Date')\
.apply(lambda x: np.average(x['QTY'], weights=x['Population']))\
.reset_index(name='avg')
Path 2 is faster, but more verbose:
df['prods'] = df['QTY'] * df['Population']
groups = df.groupby('Date')
groups['prods'].sum().div(groups['Population'].sum()).reset_index(name='avg')
Either way gives you the output:
Date avg
0 2021-01-01 8854.093023
1 2021-01-02 8841.767442
Note, an alternative to Path 2 without adding new column to data involves groupby twice:
df['QTY'].mul(df['Population']).groupby(df['Date']).sum()\
.div(df.groupby('Date')['Population'].sum())\
.reset_index(name='avg')
Another option, involving the pipe method:
(df
.assign(weighted_qty = df.Population * df.QTY)
.groupby('Date')
.pipe(lambda df: df.weighted_qty.sum()/df.Population.sum())
)
Date
2021-01-01 8854.093023
2021-01-02 8841.767442
dtype: float64