I have a dataframe df1
with Date
and ID
as index and the Weight
. I want to set an upper weight limit (30%) of the weights per date. The weights on each day add up to 100% and if I set an upper weight limit, it is the case that the next biggest weight is then bigger than the weight limit of 30%. Is there a way to account for that without doing several iterations? The remaining weight sum which are not bigger than the max weight add up to: 100% - number of max weights reached.
df1:
Date ID Weight
2023-01-30 A 0.45 <-- over max weight of 30%
2023-01-30 B 0.25
2023-01-30 C 0.15
2023-01-30 D 0.10
2023-01-30 E 0.05
2023-01-31 A 0.55
2023-01-31 B 0.25
2023-01-31 C 0.20
2023-01-31 D 0.00
2023-01-31 E 0.00
df1:
Date ID Weight Weight_upper
2023-01-30 A 0.45 0.300 <-- set to max weight
2023-01-30 B 0.25 0.318 <-- bigger than max weight
2023-01-30 C 0.15 0.191
2023-01-30 D 0.10 0.127 (ex calculation: 0.1 * (1 - 0.3)/(0.25 0.15 0.1 0.05)
2023-01-30 E 0.05 0.060
2023-01-31 A 0.55 0.300
2023-01-31 B 0.25 0.389
2023-01-31 C 0.20 0.311
2023-01-31 D 0.00 0.000
2023-01-31 E 0.00 0.000
For reproducibility:
df = pd.DataFrame({
'Date':['2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-31', '2023-01-31', '2023-01-31', '2023-01-31', '2023-01-31'],
'ID':['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
'Weight':[0.45, 0.25, 0.15, 0.1, 0.05, 0.55, 0.25, 0.2, 0, 0]})
df.set_index('Date')
Many thanks for your help!
CodePudding user response:
The logic is unclear, so I'll assume you want to allocate the fraction of the weights above the desired max (0.3) to the other weights in a way that doesn't cause any other weight to become above threshold.
I would compute the difference to the threshold, then split the values into above/below and allocate the extra weight proportionally to the available space for each weight below threshold:
max_weight = 0.3
df2 = df.assign(diff=df['Weight'].sub(max_weight),
mask=lambda d: d['diff'].gt(0),
above=lambda d: d['diff'].where(d['mask']),
below=lambda d: d['diff'].mask(d['mask']),
)
g = df2.groupby('Date')
df['Weight_upper'] = (df2['below']
.div(g['below'].transform('sum'))
.mul(g['above'].transform('sum'))
.add(df['Weight'])
.fillna(max_weight)
)
print(df)
Output:
Date ID Weight Weight_upper
0 2023-01-30 A 0.45 0.300000
1 2023-01-30 B 0.25 0.261538
2 2023-01-30 C 0.15 0.184615
3 2023-01-30 D 0.10 0.146154
4 2023-01-30 E 0.05 0.107692
5 2023-01-31 A 0.55 0.300000
6 2023-01-31 B 0.25 0.266667
7 2023-01-31 C 0.20 0.233333
8 2023-01-31 D 0.00 0.100000
9 2023-01-31 E 0.00 0.100000
Intermediates:
Date ID Weight diff mask above below Weight_upper
0 2023-01-30 A 0.45 0.15 True 0.15 NaN 0.300000
1 2023-01-30 B 0.25 -0.05 False NaN -0.05 0.261538
2 2023-01-30 C 0.15 -0.15 False NaN -0.15 0.184615
3 2023-01-30 D 0.10 -0.20 False NaN -0.20 0.146154
4 2023-01-30 E 0.05 -0.25 False NaN -0.25 0.107692
5 2023-01-31 A 0.55 0.25 True 0.25 NaN 0.300000
6 2023-01-31 B 0.25 -0.05 False NaN -0.05 0.266667
7 2023-01-31 C 0.20 -0.10 False NaN -0.10 0.233333
8 2023-01-31 D 0.00 -0.30 False NaN -0.30 0.100000
9 2023-01-31 E 0.00 -0.30 False NaN -0.30 0.100000