How to take the sum (in the denominator) when calculating a group-by weighted average in a DataFrame?


I have a DataFrame that looks like this.

import pandas as pd
import numpy as np

data = [
  ['A',1,2,3,4],
  ['A',5,6,7,8],
  ['A',9,10,11,12],
  ['B',13,14,15,16],
  ['B',17,18,19,20],
  ['B',21,22,23,24],
  ['B',25,26,27,28],
  ['C',29,30,31,32],
  ['C',33,34,35,36],
  ['C',37,38,39,40],
  ['D',13,14,15,0],
  ['D',0,18,19,0],
  ['D',0,0,23,0],
  ['D',0,0,0,0],
  ['E',13,14,15,0],
  ['E',0,18,19,0],
  ['F',0,0,23,0],

]

df = pd.DataFrame(data, columns=['Name', 'num1', 'num2', 'num3', 'num4'])
df

Then I use the following code to calculate the group-by weighted average.

weights = [10, 20, 30, 40]

df = df.groupby('Name').agg(lambda g: sum(g * weights[:len(g)]) / sum(weights[:len(g)]))

The problem lies in sum(weights[:len(g)]), because the groups do not all have the same number of rows. As you can see above, group A has 3 rows, B has 4, C has 3, D has 4, E has 2 and F has 1, so the weight sum has to depend on the group size. At the moment the code calculates the weighted average as follows:

For group A, the first column is currently calculated as (1 X 10 + 5 X 20 + 9 X 30) / 60, but it should be calculated as (1 X 20 + 5 X 30 + 9 X 40) / 90.

For group E, the first column is currently calculated as (13 X 10 + 0 X 20) / 30, but it should be calculated as (13 X 30 + 0 X 40) / 70.
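
As a quick sanity check of the group A arithmetic with the tail-aligned weights (a minimal sketch that reuses the weights list defined above):

a_num1 = [1, 5, 9]                    # num1 values for group A
tail = weights[-len(a_num1):]         # last 3 weights -> [20, 30, 40]
print(sum(v * w for v, w in zip(a_num1, tail)) / sum(tail))   # 530 / 90 = 5.888...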

Current result

(screenshot of the DataFrame produced by the code above)

Expected result

(screenshot of the desired DataFrame; the values match the outputs shown in the answers below)

CodePudding user response:

I edited your code a little bit:

n = len(weights)
# slice the weights from the end so each group uses the last len(g) weights
df = df.groupby('Name').agg(lambda g: sum(g * weights[n-len(g):]) / sum(weights[n-len(g):]))

output(df):

    num1    num2    num3    num4
Name                
A   5.9     6.9     7.9     8.9
B   21.0    22.0    23.0    24.0
C   33.9    34.9    35.9    36.9
D   1.3     5.0     12.2    0.0
E   5.6     16.3    17.3    0.0
F   0.0     0.0     23.0    0.0
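
The same tail-aligned weighting can also be written with np.average, which accepts a weights argument (a minimal sketch applied to the original df, i.e. before it is overwritten by the groupby above; result is just an illustrative name):

n = len(weights)
# np.average computes sum(values * weights) / sum(weights) in one call
result = df.groupby('Name').agg(lambda g: np.average(g, weights=weights[n - len(g):]))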

CodePudding user response:

@PandaKim's solution suffices; for efficiency on larger data, you may want to take a longer, vectorized route:

n = len(weights)
# offset of the first weight to use for each group
# (smaller groups skip the leading weights)
pos = n - df.groupby('Name').size()
# build one weight per row by slicing the tail of weights for each group;
# this assumes the rows are contiguous and sorted by Name, as in the example
pos = [weights[posn : n] for posn in pos]
pos = np.concatenate(pos)
(df
.set_index('Name')
.mul(pos, axis=0)      # multiply every numeric column by its row weight
.assign(wt = pos)      # keep the weights so they can be summed per group
.groupby('Name')
.sum()
.pipe(lambda df: df.filter(like='num')
                   .div(df.wt, axis=0)   # weighted sums / sum of weights
     )
)

           num1       num2       num3       num4
Name
A      5.888889   6.888889   7.888889   8.888889
B     21.000000  22.000000  23.000000  24.000000
C     33.888889  34.888889  35.888889  36.888889
D      1.300000   5.000000  12.200000   0.000000
E      5.571429  16.285714  17.285714   0.000000
F      0.000000   0.000000  23.000000   0.000000
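
If you want to confirm that the two answers agree, one option is to assign the pipeline above to a variable (hypothetically named result_vec, not part of either answer) and compare it with the simpler groupby/agg result:

n = len(weights)
result_agg = df.groupby('Name').agg(
    lambda g: sum(g * weights[n - len(g):]) / sum(weights[n - len(g):])
)
# result_vec is assumed to hold the output of the pipeline above
pd.testing.assert_frame_equal(result_agg, result_vec, check_exact=False)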