pandas lookup value in column containing Intervals

I am trying to map a column of continuous float values to discrete (bucketed) values based on the range into which each continuous value falls.

e.g.

df_lookup = pd.DataFrame(data=[[0.0, 0.3, 10.1],
                               [0.3, 0.65, 30.3],
                               [0.65, 1.0, 50.5]], 
                        columns=['start', 'end', 'mapped_value'])
# create intervals
df_lookup['interval'] = df_lookup.apply(lambda x: 
                                             pd.Interval(x['start'], 
                                                         x['end'], 
                                                         closed='both' if x['end']==1.0 else 'left'), axis=1)

df_lookup

Output:

	start	end	mapped_value	interval
0	0.00	0.30	10.1	[0.0, 0.3)
1	0.30	0.65	30.3	[0.3, 0.65)
2	0.65	1.00	50.5	[0.65, 1.0]

df_data=pd.DataFrame(data=[['A', 0.3],
                           ['B', 0.65],
                           ['C', 0.6],
                           ['D', 0.75],
                           ['E', 0.4]], 
                     columns=['ID', 'original_value'])
df_data

	ID	original_value
0	A	0.30
1	B	0.65
2	C	0.60
3	D	0.75
4	E	0.40

At this point I use pandas.DataFrame.apply to get my lookup value:

df_data['mapped_value'] = df_data.apply(
        lambda x: df_lookup.loc[x['original_value'] in df_lookup['interval']]['mapped_value'], 
            axis=1)

But this blows up with KeyError: 'False: boolean label can not be used without a boolean index'

Further investigation shows that the in test returns a single boolean value rather than a list of booleans. For example, for ID='A', where the original value is 0.3, I was hoping that x['original_value'] in df_lookup['interval'] would return [False, True, False], but in fact it returns False.

I'd appreciate a little steer here on how to implement this "lookup" mapping. Thanks.
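
One likely explanation for the single boolean, offered as a sketch rather than a definitive diagnosis: the `in` operator applied to a Series tests membership in its index labels, not in its values. With a default integer index, `0.3 in df_lookup['interval']` therefore asks whether 0.3 is an index label. Testing the value against each Interval separately produces the list of booleans expected above. The names `intervals`, `label_hit`, `index_hit` and `mask` are illustrative only.

```python
import pandas as pd

# the same three buckets as in df_lookup['interval']
intervals = pd.Series([
    pd.Interval(0.0, 0.3, closed='left'),
    pd.Interval(0.3, 0.65, closed='left'),
    pd.Interval(0.65, 1.0, closed='both'),
])

# `in` on a Series checks the index labels (0, 1, 2), not the stored values
label_hit = 0.3 in intervals   # 0.3 is not an index label
index_hit = 1 in intervals     # 1 is an index label

# element-wise membership: test the value against each Interval in turn
mask = intervals.apply(lambda iv: 0.3 in iv)
print(label_hit, index_hit, mask.tolist())
```

A boolean mask like this one can be passed to `.loc`, which is what the failing expression was trying to do with a single `False`.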

CodePudding user response：

You can use pandas.merge_asof, with an optional check on the upper bound if the intervals are potentially disjoint. merge_asof needs the left key sorted and returns a frame with a fresh index, so keep the original row labels and realign on them before assigning back:

df_data['mapped_value'] = (pd
 .merge_asof(df_data.reset_index().sort_values(by='original_value'),
             df_lookup,
             left_on='original_value', right_on='start')
 # the assign step below is only required if the intervals are disjoint
 .assign(mapped_value=lambda d: d['mapped_value']
                                .mask(d['end'].lt(d['original_value'])))
 # restore the original row labels (stored in 'index' by reset_index)
 .set_index('index')['mapped_value']
)

output:

  ID  original_value  mapped_value
0  A            0.30          30.3
1  B            0.65          50.5
2  C            0.60          30.3
3  D            0.75          50.5
4  E            0.40          30.3
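
To illustrate when the upper-bound check matters, here is a small sketch using a hypothetical lookup table with a gap between 0.3 and 0.5. The names `gap_lookup`, `data` and `merged`, and all of the values, are invented for this example. `merge_asof` matches each value to the nearest `start` at or below it, so without the mask a value inside the gap would inherit the previous bucket.

```python
import pandas as pd

# hypothetical lookup with a gap: nothing covers the range (0.3, 0.5)
gap_lookup = pd.DataFrame({'start': [0.0, 0.5],
                           'end': [0.3, 1.0],
                           'mapped_value': [10.1, 50.5]})
data = pd.DataFrame({'ID': ['A', 'B', 'C'],
                     'original_value': [0.7, 0.4, 0.1]})

merged = pd.merge_asof(
    # keep the original row labels in an 'index' column, then sort the key
    data.reset_index().sort_values('original_value'),
    gap_lookup,
    left_on='original_value', right_on='start',
)

# blank out matches where the value lies beyond the end of the matched interval
merged['mapped_value'] = merged['mapped_value'].mask(
    merged['end'] < merged['original_value'])

# realign on the original row labels before assigning back
data['mapped_value'] = merged.set_index('index')['mapped_value']
print(data)
```

Here B (0.4) falls in the gap and ends up as NaN, while A and C keep the values of their buckets.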

CodePudding user response：

Solution with `Series.map`

There is no need to create an intermediate interval column. Instead, create an IntervalIndex, which can then be used to map the values in df_data:

i = pd.IntervalIndex.from_arrays(df_lookup.start, df_lookup.end, closed='left')
df_data['mapped_value'] = df_data['original_value'].map(df_lookup.set_index(i)['mapped_value'])

Result

print(df_data)

  ID  original_value  mapped_value
0  A            0.30          30.3
1  B            0.65          50.5
2  C            0.60          30.3
3  D            0.75          50.5
4  E            0.40          30.3
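
One caveat: an IntervalIndex uses a single `closed` setting for all of its intervals, so with `closed='left'` a value of exactly 1.0 is left unmatched, unlike the `closed='both'` last interval in the question. A minimal sketch of one possible workaround, assuming contiguous buckets, is to nudge the last right edge up by one floating-point step with `np.nextafter`. The names `values`, `end`, `idx` and `mapped` are illustrative only.

```python
import numpy as np
import pandas as pd

df_lookup = pd.DataFrame({'start': [0.0, 0.3, 0.65],
                          'end': [0.3, 0.65, 1.0],
                          'mapped_value': [10.1, 30.3, 50.5]})
values = pd.Series([0.3, 0.65, 1.0, 1.5])

# assumption: widen the last bucket by one float step so that the
# left-closed interval [0.65, 1.0 + eps) also contains exactly 1.0
end = df_lookup['end'].to_numpy(copy=True)
end[-1] = np.nextafter(end[-1], np.inf)

idx = pd.IntervalIndex.from_arrays(df_lookup['start'], end, closed='left')

# values outside every interval (here 1.5) map to NaN
mapped = values.map(df_lookup.set_index(idx)['mapped_value'])
print(mapped)
```

Whether widening the last bucket is acceptable depends on the data. Handling the upper bound as a separate special case is an equally valid choice.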
