I have a numpy series of numbers:
arr = np.array([1147.8, 1067.2, 957.6, 826.4])
And a pandas DF, with two columns, 'right' and 'left', that describe a range, whereas each range is contained in the next one in the DF:
right left
0 1090 1159.5
1 1080 1169.5
2 1057.5 1191.99
For each number in arr, I would like to get the index of the first range containing it. For the first number (1147.8), it's gonna be 0, since it's in the range (1090, 1159.5). For the second one, it's gonna be 2, since 1067.2 in (1057.5, 1191.99) but not in (1080, 1169.5) (and, of course, the other previous ranges)
I could iterate the DF for each number in arr, but I'm looking for a smarter way. Thanks
CodePudding user response:
Full cross-product between arr and df, then filter, then select first range. That's ok to do for small amounts of data. Ideally, you would do that all at once for all 2000 arrs. With around 2 million rows for the DataFrame after .merge(df_arr, how='cross')
, the approach would still work in that case.
df_arr = pd.DataFrame({"arr": arr,
"id_arr": range(len(arr))})
(df.reset_index()
.merge(df_arr, how='cross')
.query("right < arr < left")
.groupby("id_arr")
.first())
Produces:
index right left arr
id_arr
0 0 1090.0 1159.50 1147.8
1 2 1057.5 1191.99 1067.2
Where index
is the index of the tightest range.
The id_arr
is used for grouping in case you have duplicate values in arr and you expect duplicate values in the results. If that's not relevant, one could also group by arr directly.