numpy isin for multi-dimensions

Time:12-22

I have a big array of integers and a second array of arrays. I want to create a boolean mask for the first array from each row of the second array. Preferably I would use numpy.isin, but its documentation clearly states for the test_elements parameter:

The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.

Do you know a performant way of doing this without a list comprehension?
For example, given these arrays:

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

I would like a result like:

np.array([
       [True, True, False, False, False, False, False, False, False, False],
       [False, False, True, True, False, False, False, False, False, False],
       [False, False, False, False, True, True, False, False, False, False],
       [False, False, False, False, False, False, True, True, False, False],
       [False, False, False, False, False, False, False, False, True, True]
])
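For reference, here is a minimal sketch of the two behaviours described in the question, using the example arrays. A single np.isin call flattens b and so cannot produce one mask per row, while the list-comprehension baseline does. The names flat_mask and row_masks are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# A single call flattens b, so the mask only answers
# "does this value occur anywhere in b?" and keeps the shape of a.
flat_mask = np.isin(a, b)  # shape (1, 10)

# Baseline: one np.isin call per row of b, stacked into a 2-D mask.
# a[0] is used so that each per-row result is 1-D.
row_masks = np.array([np.isin(a[0], row) for row in b])  # shape (5, 10)

print(flat_mask.shape, row_masks.shape)
```

The list comprehension is the behaviour to reproduce; the question is whether it can be done without the Python-level loop.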

CodePudding user response:

Try numpy.apply_along_axis together with numpy.isin:

np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) 

returns

array([[[ True,  True, False, False, False, False, False, False, False, False]],
       [[False, False,  True,  True, False, False, False, False, False, False]],
       [[False, False, False, False,  True,  True, False, False, False, False]],
       [[False, False, False, False, False, False,  True,  True, False, False]],
       [[False, False, False, False, False, False, False, False,  True,  True]]])
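Note that this result has an extra axis of length 1, because a itself is 2-D with shape (1, 10) and each np.isin call returns a mask of that shape. A small sketch of two ways to get the (5, 10) mask asked for in the question follows; the variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# Each call returns a (1, 10) mask, so the stacked result is (5, 1, 10).
stacked = np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b)

# Option 1: drop the length-1 axis afterwards.
mask = stacked.squeeze(axis=1)  # shape (5, 10)

# Option 2: flatten a first so each call returns a 1-D mask.
mask_alt = np.apply_along_axis(lambda x: np.isin(a.ravel(), x), axis=1, arr=b)

print(stacked.shape, mask.shape, mask_alt.shape)
```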

I will update with an edit comparing the runtime against a list comprehension.

EDIT:

Whelp, I tested the runtime, and wouldn't you know, the list comprehension is faster:

timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals()) 
0.37380070000654086

vs

timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601 

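The other answer to this post, by @mozway, avoids the Python-level loop entirely. Note that this run uses number=100, not number=10000, so the figure is not directly comparable with the two timings above:

timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())
0.007107900004484691

On arrays this small, all three should be re-timed with the same number before drawing conclusions, but the broadcasting answer should probably be accepted.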
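To put the three approaches on equal footing, something like the sketch below times each with the same number of runs. The value of runs and the candidates dictionary are assumptions made for illustration, not part of the original measurements, and results will vary by machine and array size.

```python
import timeit
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# The three expressions from this thread, wrapped as zero-argument callables.
candidates = {
    "list comprehension": lambda: [np.isin(a, x) for x in b],
    "apply_along_axis": lambda: np.apply_along_axis(
        lambda x: np.isin(a, x), axis=1, arr=b
    ),
    "broadcasting": lambda: (a == b[..., None]).any(-2),
}

runs = 1000  # assumed value; use the same count for every candidate

# timeit.timeit accepts a callable and returns the total time in seconds.
timings = {name: timeit.timeit(fn, number=runs) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds / runs * 1e6:.1f} us per call")
```

Reporting time per call rather than the total makes runs with different counts comparable.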

CodePudding user response:

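You can use broadcasting to avoid any loop. This is more memory-hungry, though: the comparison builds an intermediate boolean array of shape (5, 2, 10) here, which grows with the product of the sizes of both inputs: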

(a == b[...,None]).any(-2)

Output:

array([[ True,  True, False, False, False, False, False, False, False, False],
       [False, False,  True,  True, False, False, False, False, False, False],
       [False, False, False, False,  True,  True, False, False, False, False],
       [False, False, False, False, False, False,  True,  True, False, False],
       [False, False, False, False, False, False, False, False,  True,  True]])
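If the intermediate array becomes too large for big inputs, one possible compromise is to broadcast over a block of rows at a time. The sketch below assumes the values to test can be treated as 1-D; the function name rowwise_isin and its chunk_rows parameter are hypothetical, not part of NumPy.

```python
import numpy as np


def rowwise_isin(values, groups, chunk_rows=1024):
    """For each row of `groups`, mark which entries of `values` occur in it.

    Hypothetical helper: broadcasts over at most `chunk_rows` rows at a time,
    so the temporary array is (chunk_rows, groups.shape[1], values.size).
    """
    values = np.asarray(values).ravel()  # treat the values as 1-D
    groups = np.asarray(groups)
    out = np.empty((groups.shape[0], values.size), dtype=bool)
    for start in range(0, groups.shape[0], chunk_rows):
        block = groups[start:start + chunk_rows]  # shape (c, k)
        # (c, k, 1) == (n,) broadcasts to (c, k, n); reduce over k.
        out[start:start + chunk_rows] = (block[:, :, None] == values).any(axis=1)
    return out


a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

mask = rowwise_isin(a, b, chunk_rows=2)
print(mask.shape)
```

With chunk_rows at least as large as the number of rows in b, this reduces to the single broadcasting expression above; smaller values trade a short Python loop for a lower peak memory use.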