I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays:
- a boolean one whose entry i says whether A[i] appears in the first row of B (i.e. in B[0,:])
- another one whose entry i is B[1,j] (where j is the first index with B[0,j] == A[i]) if A[i] is in the first row of B, and np.nan otherwise
Whatever method I use needs to be super fast, as it'll be called a lot. I can do the first part using the approach from this post: Is there method faster than np.isin for large array?
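For reference, the boolean half on its own is a one-liner in plain NumPy (a minimal sketch; the sizes here just mirror the benchmark setup below):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100_000, 100_000)   # values to test
vals = rng.integers(0, 20_000, (2, 20_000))  # row 0: keys, row 1: mapped values

# Boolean membership test: is each element of arr present in vals[0]?
mask = np.isin(arr, vals[0])
```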
But I'm stumped on a good way to do the second part. Here's what I've got so far (adapting the code from the post above):
```python
import numpy as np
import numba as nb

@nb.jit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i])  ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
```
```python
N = int(1e5)
M = int(20e3)
num_arr = int(100e3)   # range of values in arr
num_vals = int(20e3)   # range of keys in vals[0]
num_types = 6          # range of values in vals[1]
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])
```
```python
%timeit result, result_vals = isinvals(arr, vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
The line I've marked above (list_vals.index(arr[i])) is the slow part. If I don't use it, I can make a super fast version:
```python
@nb.jit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0  ## TEMPORARILY SETTING TO 0 TO SHOW THE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
```
```python
%timeit result, result_vals = isinvals_cheating(arr, vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
i.e. that single line makes it roughly 40 times slower.
Any ideas? I've also tried using np.where() but it's even slower.
Answer:
I'm assuming the OP's solution gives the desired result, since the question is ambiguous when vals[0] contains duplicate keys that map to different values in vals[1]. A lookup table is faster, but it needs additional space covering the range of the values: the code below allocates len(arr) entries, which works here because all values in arr and vals[0] are below len(arr).
```python
@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    # Table indexed by key; valid because every value in arr is < len(arr)
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    # Assign in reverse so the FIRST occurrence of a duplicate key wins,
    # matching list.index in the question's version
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val
```
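The reversed slices matter: with NumPy fancy-index assignment the last write wins, so reversing makes the first occurrence of a duplicate key take effect, matching list.index. A tiny illustration with made-up numbers:

```python
import numpy as np

keys = np.array([5, 5, 2])          # key 5 is duplicated
mapped = np.array([1.0, 9.0, 3.0])  # 5 -> 1.0 should win (first occurrence)

lookup = np.full(10, np.nan)
lookup[keys[::-1]] = mapped[::-1]   # last write wins, so the first occurrence sticks
# lookup[5] is 1.0, not 9.0
```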
With the example data used in the question:

```python
res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop
```
Asserting equal results:

```python
np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
```
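For completeness, here is a pure-NumPy alternative that avoids both Numba and the size assumption of the lookup table, using a stable argsort plus np.searchsorted (a sketch; the function name is mine, and it is not benchmarked against the versions above):

```python
import numpy as np

def isin_vals_np(arr, vals):
    """Membership mask for arr against vals[0], plus mapped values from vals[1]
    (np.nan where there is no match). First occurrence wins for duplicate keys."""
    keys, mapped = vals[0], vals[1]
    order = np.argsort(keys, kind="stable")       # stable: first duplicate wins
    sorted_keys = keys[order]
    pos = np.searchsorted(sorted_keys, arr)       # leftmost insertion points
    pos = np.clip(pos, 0, len(sorted_keys) - 1)   # guard out-of-range indices
    found = sorted_keys[pos] == arr
    result_vals = np.full(len(arr), np.nan)
    result_vals[found] = mapped[order[pos[found]]]
    return found, result_vals

arr = np.array([3, 7, 42, 7])
vals = np.array([[7, 3, 7],       # keys (7 duplicated)
                 [10, 20, 30]])   # mapped values
mask, out = isin_vals_np(arr, vals)
# mask -> [True, True, False, True]
# out  -> [20., 10., nan, 10.]  (first occurrence of key 7 maps to 10)
```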