df1:
Variables left right
0 AUM -0.001 28.20
1 AUM 28.20 40.28
2 AUM 40.28 58.27
3 AUM 58.27 80.72
4 AUM 80.72 100.00
0 ABS -88.01 200.72
1 ABS 200.72 480.72
2 ABS 480.72 800.20
0 LS 10000 200000
1 LS 200000 400000
df2:
Pan_no ABS AUM LS
0 AAA 28 30 10001
2 CCC 500 98 390000
1 BBB 250 50 150000
3 DDD 100 60 380000
4 EEE 88 10 378347
Conditions:
Based on the left and right value in df1, a new column should be created in df2 and the values inside the column should be the index of df1 for that particular variable.
Example: In df2 if AUM value falls under this range (-0.001 - 28.20) then the new column will have the index value of df1 as the new value. i.e 0
Similarly,
In df2 if ABS value falls under this range (200.72 - 480.72) then the new column ABS_BIN will have the index value of df1 as the new value. i.e 1
WHat i have tried is:
binning_vars = ['ABS','AUM','LS']
def f(row):
for i in binning_vars:
for j in df1[df1['Variable'] == i].index:
if df1[i] >= df1['left'] & df1[i] >= df1['right']:
value = j
else:
pass
return value
df2[i,'_bin'] = df1.apply(f, axis=1)
but it throws an error as TypeError: unsupported operand type(s) for &: 'float' and 'float'. Any help will be much appreciated.
Expected Output:
with new columns in df2:
Pan_no ABS AUM LS ABS_BIN AUM_BIN LS_BIN
0 AAA 28 30 10001 0 1 0
1 BBB 250 50 150000 1 2 0
2 CCC 500 98 390000 2 4 1
3 DDD 100 60 380000 0 3 1
4 EEE 88 10 378347 0 0 1
CodePudding user response:
You can use merge_asof
to avoid using apply
:
out = df2.merge(
pd.merge_asof((df2.melt(id_vars='Pan_no')
.astype({'value': float})
.sort_values(by='value')
),
df1.reset_index().sort_values(by='left'),
left_by='variable', right_by='Variables',
left_on='value', right_on='left', direction='backward')
.pivot(index='Pan_no', columns='variable', values='index')
.add_suffix('_BIN'),
left_on='Pan_no', right_index=True
)
output:
Pan_no ABS AUM LS ABS_BIN AUM_BIN LS_BIN
0 AAA 28 30 10001 0 1 0
2 CCC 500 98 390000 2 4 1
1 BBB 250 50 150000 1 2 0
3 DDD 100 60 380000 0 3 1
4 EEE 88 10 378347 0 0 1
CodePudding user response:
You can use pd.cut
and avoid loops inside the binning
function:
def binning(sr):
df = df1.loc[df1['Variables'] == sr.name, ['left', 'right']]
bins = sorted(set(df.to_numpy().ravel()))
return pd.cut(sr, bins=bins, labels=df.index)
out = df2[binning_vars].apply(binning).add_suffix('_BIN')
df2 = pd.concat([df2, out], axis=1)
Output:
>>> df2
Pan_no ABS AUM LS ABS_BIN AUM_BIN LS_BIN
0 AAA 28 30 10001 0 1 0
2 CCC 500 98 390000 2 4 1
1 BBB 250 50 150000 1 2 0
3 DDD 100 60 380000 0 3 1
4 EEE 88 10 378347 0 0 1