pandas lookup value in column containing Intervals

Time:06-03

I am trying to map a column of continuous float values to discrete (bucketed) values, based on the range into which each value falls.

For example:

import pandas as pd

df_lookup = pd.DataFrame(data=[[0.0, 0.3, 10.1],
                               [0.3, 0.65, 30.3],
                               [0.65, 1.0, 50.5]],
                         columns=['start', 'end', 'mapped_value'])
# create intervals: left-closed, except the last one, which is closed on both sides
df_lookup['interval'] = df_lookup.apply(
    lambda x: pd.Interval(x['start'], x['end'],
                          closed='both' if x['end'] == 1.0 else 'left'),
    axis=1)

df_lookup

Output:

   start   end  mapped_value     interval
0   0.00  0.30          10.1   [0.0, 0.3)
1   0.30  0.65          30.3  [0.3, 0.65)
2   0.65  1.00          50.5  [0.65, 1.0]

df_data=pd.DataFrame(data=[['A', 0.3],
                           ['B', 0.65],
                           ['C', 0.6],
                           ['D', 0.75],
                           ['E', 0.4]], 
                     columns=['ID', 'original_value'])
df_data

Output:

  ID  original_value
0  A            0.30
1  B            0.65
2  C            0.60
3  D            0.75
4  E            0.40

At this point I use pandas.DataFrame.apply to get my lookup value:

df_data['mapped_value'] = df_data.apply(
        lambda x: df_lookup.loc[x['original_value'] in df_lookup['interval']]['mapped_value'], 
            axis=1)

But this fails with KeyError: 'False: boolean label can not be used without a boolean index'.

Further investigation shows that the in test returns a single boolean rather than a list of booleans. For example, for ID='A', where the original value is 0.3, I was hoping that x['original_value'] in df_lookup['interval'] would return [False, True, False], but in fact it returns False.

I'd appreciate a little steer on how to implement this "lookup" mapping. Thanks.
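
A note on the error itself: the in operator applied to a pandas Series tests the Series' index labels (0, 1, 2 here), not its values, which is why a single boolean comes back. The sketch below rebuilds the three intervals directly (the names intervals, mapped and value are only for illustration) and tests the value against each interval in turn. It is meant to show the behaviour, not to be the recommended lookup, and is likely to be slow on large frames.

```python
import pandas as pd

# The three intervals from the question, built directly for brevity.
intervals = pd.Series([
    pd.Interval(0.0, 0.3, closed='left'),
    pd.Interval(0.3, 0.65, closed='left'),
    pd.Interval(0.65, 1.0, closed='both'),
])
mapped = pd.Series([10.1, 30.3, 50.5])

value = 0.3

# `in` on a Series checks the index labels (0, 1, 2), not the values.
label_check = value in intervals

# Test the value against each Interval to get one boolean per row.
mask = intervals.apply(lambda iv: value in iv)

# A boolean mask is a valid argument for .loc, unlike a single bool.
result = mapped.loc[mask].iloc[0]
```

Here mask holds one boolean per lookup row, so it can be used for boolean indexing where the single False could not.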

CodePudding user response:

You can use pandas.merge_asof, with an optional check on the upper bound if the intervals are potentially disjoint. merge_asof returns its rows in sorted order with a new index, so the original index is carried along and restored before assigning back to df_data:

df_data['mapped_value'] = (pd
 .merge_asof(df_data.reset_index().sort_values(by='original_value'),
             df_lookup,
             left_on='original_value', right_on='start')
 # assign command below is only required if the intervals are disjoint
 .assign(mapped_value=lambda d: d['mapped_value']
                                .mask(d['end'].lt(d['original_value'])))
 # restore the original row labels so the assignment aligns correctly
 .set_index('index')
 ['mapped_value']
)

output:

  ID  original_value  mapped_value
0  A            0.30          30.3
1  B            0.65          50.5
2  C            0.60          30.3
3  D            0.75          50.5
4  E            0.40          30.3
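
The upper-bound check only matters when the lookup table has gaps. As a minimal sketch, assume a hypothetical table df_gap in which no interval covers 0.3 to 0.5 (df_gap and values are made-up names and data, not part of the question). Note that lt leaves a value equal to end unmasked, so the upper bound is treated as inclusive.

```python
import pandas as pd

# Hypothetical lookup table with a gap between 0.3 and 0.5.
df_gap = pd.DataFrame({'start': [0.0, 0.5],
                       'end': [0.3, 1.0],
                       'mapped_value': [10.1, 50.5]})
# Values to look up, already sorted as merge_asof requires.
values = pd.DataFrame({'original_value': [0.1, 0.4, 0.7]})

merged = pd.merge_asof(values, df_gap,
                       left_on='original_value', right_on='start')

# Without the check, 0.4 is matched to the row starting at 0.0.
unchecked = merged['mapped_value'].tolist()

# Mask rows whose matched 'end' lies below the value being looked up.
checked = merged['mapped_value'].mask(
    merged['end'].lt(merged['original_value']))
```

With the check in place, the value 0.4 that falls in the gap ends up as NaN instead of silently taking the mapped value of the preceding interval.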

CodePudding user response:

Solution with Series.map

There is no need to create an intermediate interval column. Instead, create an IntervalIndex, which can then be used to map the values in df_data:

i = pd.IntervalIndex.from_arrays(df_lookup.start, df_lookup.end, closed='left')
df_data['mapped_value'] = df_data['original_value'].map(df_lookup.set_index(i)['mapped_value'])

Result

print(df_data)

  ID  original_value  mapped_value
0  A            0.30          30.3
1  B            0.65          50.5
2  C            0.60          30.3
3  D            0.75          50.5
4  E            0.40          30.3
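
One caveat, offered as a sketch rather than as part of the solution above: an IntervalIndex uses a single closed setting for all of its intervals, so with closed='left' a value of exactly 1.0 matches nothing, whereas the question closes the last interval on both sides. The example below patches that endpoint after mapping. The sample values and the names lookup, raw and patched are assumptions for illustration, and it assumes the last row of df_lookup holds the topmost interval.

```python
import pandas as pd

df_lookup = pd.DataFrame({'start': [0.0, 0.3, 0.65],
                          'end': [0.3, 0.65, 1.0],
                          'mapped_value': [10.1, 30.3, 50.5]})

i = pd.IntervalIndex.from_arrays(df_lookup['start'], df_lookup['end'],
                                 closed='left')
lookup = df_lookup.set_index(i)['mapped_value']

# Sample values: inside a bucket, exactly on the top edge, out of range.
values = pd.Series([0.3, 1.0, 1.2])

# Values that fall in no interval come back as NaN.
raw = values.map(lookup)

# Treat the overall upper bound as inclusive, as the question does.
top = df_lookup['end'].max()
patched = raw.mask(values.eq(top), lookup.iloc[-1])
```

Values outside every interval, such as 1.2 here, stay NaN after the patch, which makes them easy to spot with isna().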