Is there a faster method than np.isin and np.where for large arrays?

Time: 04-29

I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays

  • a boolean one whose entry i says whether A[i] appears in the first row of B
  • another one whose entry i is B[1, j], where j is the first index with B[0, j] == A[i], and np.nan if A[i] is not in the first row of B
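For concreteness, here is a tiny pure-NumPy sketch of the two desired outputs (small made-up data, not the benchmark arrays below):

```python
import numpy as np

A = np.array([3, 7, 2, 9])
B = np.array([[7, 2, 5],
              [10, 20, 30]])

mask = np.isin(A, B[0])             # is A[i] among the keys in B's first row?
out = np.full(A.shape, np.nan)
for i in np.flatnonzero(mask):
    j = np.argmax(B[0] == A[i])     # first matching column of B
    out[i] = B[1, j]
# mask -> [False  True  True False], out -> [nan 10. 20. nan]
```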

Whatever method I use needs to be super fast as it'll be called a lot. I can do the first part using the approach from this question: Is there a method faster than np.isin for large arrays?

But I’m stumped on a good way to do the second part. Here’s what I’ve got so far (adapting the code in the post above):

import numpy as np
import numba as nb

@nb.jit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0,:])
    list_vals = list(vals[0,:])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i]) ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1,ind]
    return result, result_vals


N = int(1e5)
M = int(20e3)
num_arr = int(100e3)
num_vals = int(20e3)
num_types = 6
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])

%timeit result, result_vals = isinvals(arr,vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The line I've marked above (list_vals.index(arr[i])) is the slow part. If I leave it out, I can make a super fast version:

@nb.jit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0,:])
    list_vals = list(vals[0,:])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0 ## TEMPORARILY SETTING TO 0 TO INDICATE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1,ind]
    return result, result_vals

%timeit result, result_vals = isinvals_cheating(arr,vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

i.e. that single line makes it roughly 40 times slower.

Any ideas? I've also tried using np.where() but it's even slower.
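(Not from the post, but it shows why that line dominates: list.index is an O(M) scan per hit, whereas a dictionary built once maps each key to its value in O(1). A plain-Python sketch of that idea, without numba; numba's typed dict could play the same role inside a jitted function:)

```python
import numpy as np

def isinvals_dict(arr, vals):
    # build the map once; skip duplicate keys so the FIRST occurrence
    # wins, matching list_vals.index() in the original
    table = {}
    for k, v in zip(vals[0], vals[1]):
        if k not in table:
            table[k] = v
    result = np.zeros(len(arr), dtype=bool)
    result_vals = np.full(len(arr), np.nan)
    for i, a in enumerate(arr):
        v = table.get(a)
        if v is not None:
            result[i] = True
            result_vals[i] = v
    return result, result_vals
```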

CodePudding user response:

Assuming the OP's solution gives the desired result, since the question is ambiguous for non-unique keys in vals[0, idx] that have different corresponding values vals[1, idx]: a lookup table is faster, but it needs additional space covering the full range of values involved (here the maximum value is below len(arr), so len(arr) entries suffice).

@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    # assign in reversed order so that, for duplicate keys, the FIRST
    # occurrence wins -- matching list_vals.index() in the question
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val

With the example data used in the question

res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop

Asserting equal results

np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
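A small sanity check (not from the answer) of the reversed-assignment trick: with duplicate keys in vals[0], NumPy's fancy assignment keeps the last write, so reversing the order means the first occurrence's value survives, matching list.index in the question:

```python
import numpy as np

def lookup_demo(arr, vals, size):
    # size must exceed the largest value in both arr and vals[0]
    lookup = np.full(size, np.nan)
    lookup[vals[0, ::-1]] = vals[1, ::-1]   # last write = first occurrence
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val

arr = np.array([0, 2, 4])
vals = np.array([[2, 2, 3],     # key 2 appears twice...
                 [7, 9, 5]])    # ...with different values 7 and 9
res, res_val = lookup_demo(arr, vals, 5)
# res -> [False  True False], res_val -> [nan  7. nan]
```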