I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays:
- a boolean one whose entry i says whether A[i] appears in the first row of B (i.e. in B[0,:])
- another one whose entry i is B[1,j] (where j is the first index with B[0,j] == A[i]) if A[i] is in the first row of B, and np.nan otherwise
Whatever method I use needs to be super fast, as it'll be called a lot. I can do the first part using the approach from this post: Is there method faster than np.isin for large array?
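For reference, the boolean half on its own is a one-liner in plain NumPy (a minimal sketch; the sizes here just mirror the benchmark setup below):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100_000, 100_000)   # values to test
vals = rng.integers(0, 20_000, (2, 20_000))  # row 0: keys, row 1: mapped values

# Boolean membership test: is each element of arr present in vals[0]?
mask = np.isin(arr, vals[0])
```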
But I'm stumped on a good way to do the second part. Here's what I've got so far (adapting the code from the post above):
```python
import numpy as np
import numba as nb

@nb.jit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i])  ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
```
```python
N = int(1e5)
M = int(20e3)
num_arr = int(100e3)   # range of values in arr
num_vals = int(20e3)   # range of keys in vals[0]
num_types = 6          # range of values in vals[1]
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])
```
```python
%timeit result, result_vals = isinvals(arr, vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
The line I've marked above (list_vals.index(arr[i])) is the slow part. If I don't use it, I can make a super fast version:
```python
@nb.jit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0  ## TEMPORARILY SETTING TO 0 TO SHOW THE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
```
```python
%timeit result, result_vals = isinvals_cheating(arr, vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
i.e. that single line makes it roughly 40 times slower.
Any ideas? I've also tried using np.where() but it's even slower.
Answer:
I'm assuming the OP's solution gives the desired result, since the question is ambiguous when vals[0] contains duplicate keys that map to different values in vals[1]. A lookup table is faster, but it needs additional space covering the range of the values: the code below allocates len(arr) entries, which works here because all values in arr and vals[0] are below len(arr).
```python
@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    # Table indexed by key; valid because every value in arr is < len(arr)
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    # Assign in reverse so the FIRST occurrence of a duplicate key wins,
    # matching list.index in the question's version
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val
```
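The reversed slices matter: with NumPy fancy-index assignment the last write wins, so reversing makes the first occurrence of a duplicate key take effect, matching list.index. A tiny illustration with made-up numbers:

```python
import numpy as np

keys = np.array([5, 5, 2])          # key 5 is duplicated
mapped = np.array([1.0, 9.0, 3.0])  # 5 -> 1.0 should win (first occurrence)

lookup = np.full(10, np.nan)
lookup[keys[::-1]] = mapped[::-1]   # last write wins, so the first occurrence sticks
# lookup[5] is 1.0, not 9.0
```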
With the example data used in the question:

```python
res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop
```
Asserting equal results:

```python
np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
```
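For completeness, here is a pure-NumPy alternative that avoids both Numba and the size assumption of the lookup table, using a stable argsort plus np.searchsorted (a sketch; the function name is mine, and it is not benchmarked against the versions above):

```python
import numpy as np

def isin_vals_np(arr, vals):
    """Membership mask for arr against vals[0], plus mapped values from vals[1]
    (np.nan where there is no match). First occurrence wins for duplicate keys."""
    keys, mapped = vals[0], vals[1]
    order = np.argsort(keys, kind="stable")       # stable: first duplicate wins
    sorted_keys = keys[order]
    pos = np.searchsorted(sorted_keys, arr)       # leftmost insertion points
    pos = np.clip(pos, 0, len(sorted_keys) - 1)   # guard out-of-range indices
    found = sorted_keys[pos] == arr
    result_vals = np.full(len(arr), np.nan)
    result_vals[found] = mapped[order[pos[found]]]
    return found, result_vals

arr = np.array([3, 7, 42, 7])
vals = np.array([[7, 3, 7],       # keys (7 duplicated)
                 [10, 20, 30]])   # mapped values
mask, out = isin_vals_np(arr, vals)
# mask -> [True, True, False, True]
# out  -> [20., 10., nan, 10.]  (first occurrence of key 7 maps to 10)
```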