Consider a lookup table like this:
   lower_bound  upper_bound category
0            3            6        A
1           10           40        B
2           80          200        C
3          350          600        D
4          900         1500        E
Then, there is a DataFrame of elements that we need to classify based on the above conditions:
     id  value
0  id_1     20
1  id_2    500
2  id_3   1000
Each of these items should be classified according to its value, which falls within one of the ranges specified in the lookup table. So:
     id  value category
0  id_1     20        B
1  id_2    500        D
2  id_3   1000        E
I have tried the following, which works:
conditions = []
choices = []
for condition in lookup_df.to_dict('records'):
    conditions.append(
        df['value'].between(condition['lower_bound'], condition['upper_bound'])
    )
    choices.append(condition['category'])
And afterwards:
df['category'] = np.select(conditions, choices, default=np.nan)
Is there a way to generate the conditions without converting to dict and looping the lookup DataFrame?
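For reference, a minimal self-contained version of the setup and the loop-based approach described above (the DataFrame contents are taken from the sample data in the question):

```python
import numpy as np
import pandas as pd

# Lookup table of inclusive value ranges and their categories
lookup_df = pd.DataFrame({
    'lower_bound': [3, 10, 80, 350, 900],
    'upper_bound': [6, 40, 200, 600, 1500],
    'category': ['A', 'B', 'C', 'D', 'E'],
})

# Elements to classify by which range their value falls into
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3'],
                   'value': [20, 500, 1000]})

# Loop-based approach: one boolean condition per lookup row
conditions = [df['value'].between(row['lower_bound'], row['upper_bound'])
              for row in lookup_df.to_dict('records')]
choices = lookup_df['category'].tolist()
df['category'] = np.select(conditions, choices, default=np.nan)
print(df)
```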
CodePudding user response:
You can use np.dot:
vals = np.vstack(df['value'].values)
lb = lookup_df['lower_bound'].values <= vals
ub = vals <= lookup_df['upper_bound'].values
df['category'] = np.dot(lb & ub, lookup_df['category'])
Output:
     id  value category
0  id_1     20        B
1  id_2    500        D
2  id_3   1000        E
CodePudding user response:
You can use numpy broadcasting. The idea is to create a boolean mask that returns True for the range each "value" falls into in lookup_df. Then select the matching values using boolean indexing.
vals = df['value'].to_numpy()[:, None]
msk = (lookup_df['lower_bound'].to_numpy() <= vals) & (vals <= lookup_df['upper_bound'].to_numpy())
df['category'] = np.repeat(lookup_df[['category']].to_numpy().T, len(df), axis=0)[msk]
Output:
     id  value category
0  id_1     20        B
1  id_2    500        D
2  id_3   1000        E
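A self-contained sketch of this broadcasting approach, using the sample data from the question (the comparisons are inclusive to match the `.between` semantics of the original loop; `argmax` along each row picks the single matching lookup row):

```python
import numpy as np
import pandas as pd

lookup_df = pd.DataFrame({'lower_bound': [3, 10, 80, 350, 900],
                          'upper_bound': [6, 40, 200, 600, 1500],
                          'category': list('ABCDE')})
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3'],
                   'value': [20, 500, 1000]})

vals = df['value'].to_numpy()[:, None]   # shape (n_rows, 1), broadcasts against lookup columns
msk = (lookup_df['lower_bound'].to_numpy() <= vals) & (vals <= lookup_df['upper_bound'].to_numpy())

# Each row of msk has exactly one True; take the matching category per row
df['category'] = lookup_df['category'].to_numpy()[msk.argmax(axis=1)]
print(df)
```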
CodePudding user response:
You could use pd.merge_asof:
output = pd.merge_asof(
    df, lookup_df[["lower_bound", "category"]],
    left_on="value", right_on="lower_bound"
).drop("lower_bound", axis=1)
>>> output
id value category
0 id_1 20 B
1 id_2 500 D
2 id_3 1000 E
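One caveat with this approach: pd.merge_asof requires both frames to be sorted on their join keys, and it only matches on lower_bound, so a value that falls in a gap between ranges (e.g. 50, above B's upper bound but below C's lower bound) would still be assigned the previous category. A sketch that sorts first and then blanks out-of-range matches:

```python
import numpy as np
import pandas as pd

lookup_df = pd.DataFrame({'lower_bound': [3, 10, 80, 350, 900],
                          'upper_bound': [6, 40, 200, 600, 1500],
                          'category': list('ABCDE')})
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3'],
                   'value': [20, 500, 1000]})

# merge_asof matches the nearest lower_bound <= value; both sides must be sorted
output = pd.merge_asof(df.sort_values('value'), lookup_df,
                       left_on='value', right_on='lower_bound')

# Invalidate matches whose value exceeds the matched range's upper bound
output.loc[output['value'] > output['upper_bound'], 'category'] = np.nan
output = output.drop(columns=['lower_bound', 'upper_bound'])
print(output)
```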