Home > Software design >  Find the first range contains a number (Numpy, Pandas)
Find the first range contains a number (Numpy, Pandas)

Time:03-13

I have a numpy series of numbers:

arr = np.array([1147.8, 1067.2,  957.6,  826.4])

And a pandas DF, with two columns, 'right' and 'left', that describe a range, whereas each range is contained in the next one in the DF:

     right    left 
0    1090     1159.5
1    1080     1169.5
2    1057.5   1191.99

For each number in arr, I would like to get the index of the first range containing it. For the first number (1147.8), it's gonna be 0, since it's in the range (1090, 1159.5). For the second one, it's gonna be 2, since 1067.2 in (1057.5, 1191.99) but not in (1080, 1169.5) (and, of course, the other previous ranges)

I could iterate the DF for each number in arr, but I'm looking for a smarter way. Thanks

CodePudding user response:

Full cross-product between arr and df, then filter, then select first range. That's ok to do for small amounts of data. Ideally, you would do that all at once for all 2000 arrs. With around 2 million rows for the DataFrame after .merge(df_arr, how='cross'), the approach would still work in that case.

df_arr = pd.DataFrame({"arr": arr, 
                       "id_arr": range(len(arr))})

(df.reset_index()
     .merge(df_arr, how='cross')
     .query("right < arr < left")
     .groupby("id_arr")
     .first())

Produces:

        index   right     left     arr
id_arr                                
0           0  1090.0  1159.50  1147.8
1           2  1057.5  1191.99  1067.2

Where index is the index of the tightest range.

The id_arr is used for grouping in case you have duplicate values in arr and you expect duplicate values in the results. If that's not relevant, one could also group by arr directly.

  • Related