I'm trying to find the weighted median of a column based on multiple columns. Here is an example:
Date | Item | BetterPrice | TotalCost | Location
-------------------------------------------------------------------------
2022-03-01 | AB | 0 | 200 | 3
2022-03-01 | AB | 0 | 200 | 2
2022-03-01 | AB | 1 | 300 | 3
2022-03-01 | AC | 1 | 400 | 2
2022-04-01 | AB | 1 | 400 | 1
2022-04-01 | AC | 1 | 100 | 3
2022-04-01 | AC | 0 | 50 | 1
I want to find weighted median of the Location
column and I want to use the TotalCost
column as weights. I want to use aggregate TOO since I want to find the sum of the BetterPrice column as well.
Originally I tried to used the package wquantiles
with the weighted.median
function, and to do the above tasks I tried something similar to the following code:
import wquantiles
wm = lambda x: weighted.median(x , TotalCost)
df2 = df.groupby(['Date', 'Item']).agg({'BetterPrice': 'sum', "Location": wm}).reset_index()
Unfortunately, this doesn't seem to work, so I thought I should try to ask how best to accomplish this task, thank you!
Also, here is a dataframe to replicate my above problem (hopefully it is detailed enough)
import pandas as pd
data={'Date':['2022-03-01','2022-03-01','2022-03-01','2022-03-01', '2022-04-01', '2022-04-01', '2022-04-01'],'Item':['AB','AB','AB','AC', 'AB', 'AC', 'AC'],'BetterPrice':[0,0,1,1, 1, 1, 0],'TotalCost':[200,200,300,400, 400, 100, 50],'Location':[3,2,3,2,1,3, 1]}
df=pd.DataFrame(data)
CodePudding user response:
Here is a way to do it. To get the weighted median, you can get the np.median
of the np.repeat
values in Location with TotalCost. Do this per group. Then concat
the result with the sum on the other column per group as well.
gr = df.groupby(['Date', 'Item'])
res = pd.concat(
[gr.apply(lambda x: np.median(np.repeat(x.Location, x.TotalCost))),
gr['BetterPrice'].sum()],
keys=['Loc_weigthed_median','BetterPrice_sum'],
axis=1
).reset_index()
print(res)
# Date Item Loc_weighted_median BetterPrice_sum
# 0 2022-03-01 AB 3.0 1
# 1 2022-03-01 AC 2.0 1
# 2 2022-04-01 AB 1.0 1
# 3 2022-04-01 AC 3.0 1