Home > Net >  Python Multi-Criteria Lookup From one of Many Columns
Python Multi-Criteria Lookup From one of Many Columns

Time:12-04

Trying to add a factor to a dataframe based on a lookup of multiple criteria in another dataframe. Code to create sample data:

import pandas as pd

df_RawData = pd.DataFrame({
    'Value' : [31000, 36000, 42000],
    'Type' : [0,1,5]
})

df_Lookup = pd.DataFrame({
    'Min Value' : [0,10000,20000,25000,30000,35000,40000,45000],
    'Max Value' : [9999,19999,24999,29999,34999,39999,44999,49999],
    'Type 0' : [.11,.21,.31,.41,.51,.61,.71,.81],
    'Type 1' : [.10,.20,.30,.40,.50,.60,.70,.80],
    'Type 2' : [.09,.19,.29,.39,.49,.59,.69,.79],
    'Type 3' : [.08,.18,.28,.38,.48,.58,.68,.78],
    'Type 4' : [.07,.17,.27,.37,.47,.57,.67,.77],
    'Type 5' : [.06,.16,.26,.36,.46,.56,.66,.76]
})

I need to add a column to the first data frame based on both the value being in range of the min and max value and returning only the factor from the matching type. Final desired output in this case would be:

Value Type Factor
31000 0 .51
36000 1 .60
42000 5 .66

RawData is a dataset with at least half a million rows.

I tried using IntervalIndex, but can't figure out how to return values from differing columns based on type. This, for example, would handle the min/max lookup and always return the factor from type 5:

v = df_Lookup.loc[:, 'Min Value':'Max Value'].apply(tuple, 1).tolist()
idxr = pd.IntervalIndex.from_tuples(v, closed='both')
df_RawData['Factor'] = df_Lookup.loc[idxr.get_indexer(df_RawData['Value']),['Type 5']].values

Alternately, I thought about using melt to rearrange the lookup dataframe, but am unsure on how to merge on type as well as being within the min/max range. If the dataset were smaller, I would use vlookup in Excel with an if statement in the return column portion of the formula, but that's not practical given the size of the dataset.

CodePudding user response:

Create the intervalindex:

intervals = pd.IntervalIndex.from_arrays(df_Lookup['Min Value'], 
                                         df_Lookup['Max Value'], 
                                         closed='neither')

Get the matching positions:

pos = intervals.get_indexer(df_RawData.Value)

Index the Type columns - fortunately they are sorted:

types = df_Lookup.filter(like='Type').to_numpy()
out = types[pos, df_RawData.Type]

Assign value:

df_RawData.assign(Factor = out)

   Value  Type  Factor
0  31000     0    0.51
1  36000     1    0.60
2  42000     5    0.66
  • Related