I am trying to map a column of continuous float values to discrete (bucketed) values, based on the range into which each continuous value falls,
e.g.:
import pandas as pd

df_lookup = pd.DataFrame(data=[[0.0, 0.3, 10.1],
                               [0.3, 0.65, 30.3],
                               [0.65, 1.0, 50.5]],
                         columns=['start', 'end', 'mapped_value'])
# create intervals: left-closed, except the last one, which also includes 1.0
df_lookup['interval'] = df_lookup.apply(
    lambda x: pd.Interval(x['start'],
                          x['end'],
                          closed='both' if x['end'] == 1.0 else 'left'),
    axis=1)
df_lookup
Output:
| | start | end | mapped_value | interval |
|---|---|---|---|---|
| 0 | 0.00 | 0.30 | 10.1 | [0.0, 0.3) |
| 1 | 0.30 | 0.65 | 30.3 | [0.3, 0.65) |
| 2 | 0.65 | 1.00 | 50.5 | [0.65, 1.0] |
df_data = pd.DataFrame(data=[['A', 0.3],
                             ['B', 0.65],
                             ['C', 0.6],
                             ['D', 0.75],
                             ['E', 0.4]],
                       columns=['ID', 'original_value'])
df_data
| | ID | original_value |
|---|---|---|
| 0 | A | 0.30 |
| 1 | B | 0.65 |
| 2 | C | 0.60 |
| 3 | D | 0.75 |
| 4 | E | 0.40 |
At this point I use pandas.DataFrame.apply to get my lookup value:
df_data['mapped_value'] = df_data.apply(
    lambda x: df_lookup.loc[x['original_value'] in df_lookup['interval']]['mapped_value'],
    axis=1)
But this blows up with KeyError: 'False: boolean label can not be used without a boolean index'
Further investigation shows that the issue is the in check: it returns a single boolean rather than a list of booleans. E.g., for ID='A', where the original value is 0.3, I was hoping that x['original_value'] in df_lookup['interval'] would return [False, True, False], but in fact it returns False.
I'd appreciate a steer on how to implement this "lookup" mapping. Thanks.
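For what it's worth, the single boolean appears to come from the fact that `in` on a Series tests the index labels (here 0, 1, 2) rather than the values. Below is a minimal sketch of that behaviour and of an element-wise check; the `intervals`, `single` and `mask` names are made up for illustration, and the Series is rebuilt by hand rather than taken from df_lookup.

```python
import pandas as pd

# Rebuild the 'interval' column from the question as an object Series
intervals = pd.Series([pd.Interval(0.0, 0.3, closed='left'),
                       pd.Interval(0.3, 0.65, closed='left'),
                       pd.Interval(0.65, 1.0, closed='both')],
                      dtype=object)

# `in` on a Series looks at the index labels (0, 1, 2), not at the values
single = 0.3 in intervals

# Element-wise test: ask each Interval whether it contains the value
mask = intervals.apply(lambda iv: 0.3 in iv)

print(single)         # False
print(mask.tolist())  # [False, True, False]
```

A boolean Series like `mask` is the kind of indexer that .loc accepts, which is why the single False raised the KeyError.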
CodePudding user response:
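You can use pandas.merge_asof, with an optional check on the upper bound if the intervals are potentially disjoint. merge_asof needs the left frame sorted by the key and returns the rows in that sorted order with a new index, so the original index is carried along and restored before assigning back:
df_data['mapped_value'] = (pd
 .merge_asof(df_data.reset_index().sort_values(by='original_value'),
             df_lookup,
             left_on='original_value', right_on='start')
 # assign command below is only required if the intervals are disjoint
 .assign(mapped_value=lambda d: d['mapped_value']
                                 .mask(d['end'].lt(d['original_value'])))
 # restore the original row labels so the assignment aligns by index
 .set_index('index')['mapped_value']
)
output:
ID original_value mapped_value
0 A 0.30 30.3
1 B 0.65 50.5
2 C 0.60 30.3
3 D 0.75 50.5
4 E 0.40 30.3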
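To illustrate the disjoint-interval case mentioned above, here is a minimal sketch with a hypothetical lookup table that has a gap between 0.3 and 0.5. The `lookup` and `data` frames are made up for illustration and are not the frames from the question. The sketch also carries the original index through the merge, since merge_asof returns the rows in sorted order with a new index.

```python
import pandas as pd

# Hypothetical lookup with a gap: nothing covers values between 0.3 and 0.5
lookup = pd.DataFrame({'start': [0.0, 0.5],
                       'end': [0.3, 1.0],
                       'mapped_value': [10.1, 50.5]})
data = pd.DataFrame({'ID': ['A', 'B', 'C'],
                     'original_value': [0.7, 0.4, 0.1]})

# Keep the original row labels in an 'index' column, then sort for merge_asof
merged = pd.merge_asof(data.reset_index().sort_values('original_value'),
                       lookup,
                       left_on='original_value', right_on='start')

# Blank out matches whose interval ends before the value (the value is in a gap)
merged['mapped_value'] = merged['mapped_value'].mask(
    merged['end'].lt(merged['original_value']))

# Restore the original labels so the assignment aligns row by row
data['mapped_value'] = merged.set_index('index')['mapped_value']
print(data)  # A -> 50.5, B -> NaN (0.4 is in the gap), C -> 10.1
```

Without the mask step, B would pick up 10.1 from the interval starting at 0.0, because merge_asof only looks at the nearest start at or below the value.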
CodePudding user response:
Solution with Series.map
There is no need to create an intermediate interval column. Instead, create an IntervalIndex, which can then be used to look up the values in df_data:
i = pd.IntervalIndex.from_arrays(df_lookup.start, df_lookup.end, closed='left')
df_data['mapped_value'] = df_data['original_value'].map(df_lookup.set_index(i)['mapped_value'])
Result
print(df_data)
ID original_value mapped_value
0 A 0.30 30.3
1 B 0.65 50.5
2 C 0.60 30.3
3 D 0.75 50.5
4 E 0.40 30.3
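One caveat, as far as I can tell: with closed='left' a value of exactly 1.0 lies outside the last interval and maps to NaN, whereas the question wanted the last bucket closed on both sides. The sketch below shows this and one possible workaround, which is an assumption on my part rather than an established recipe: widen the last right edge to the next representable float using np.nextafter. The `values`, `plain` and `patched` names are made up for illustration.

```python
import numpy as np
import pandas as pd

df_lookup = pd.DataFrame({'start': [0.0, 0.3, 0.65],
                          'end': [0.3, 0.65, 1.0],
                          'mapped_value': [10.1, 30.3, 50.5]})
values = pd.Series([0.3, 0.65, 1.0])

# Plain left-closed intervals: 1.0 is not inside [0.65, 1.0)
left = pd.IntervalIndex.from_arrays(df_lookup['start'], df_lookup['end'],
                                    closed='left')
plain = values.map(df_lookup.set_index(left)['mapped_value'])

# Assumed workaround: move the last right edge to the next float above 1.0
end = df_lookup['end'].to_numpy(copy=True)
end[-1] = np.nextafter(end[-1], np.inf)
widened = pd.IntervalIndex.from_arrays(df_lookup['start'], end, closed='left')
patched = values.map(df_lookup.set_index(widened)['mapped_value'])

print(plain.tolist())    # [30.3, 50.5, nan]
print(patched.tolist())  # [30.3, 50.5, 50.5]
```

If the data can never reach exactly 1.0, the plain left-closed index from the answer is enough.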