np.where issue above a certain value (#Numpy)

I'm facing two issues in the following snippet, which uses np.where to find the indices where A[:, 0] is identical to a value in B:

NumPy emits a warning when n is above a certain value (see the message below)
the code is quite slow

DeprecationWarning: elementwise comparison failed; this will raise an error in the future.

So I'm wondering what I'm missing and/or misunderstanding, how to fix it, and how to speed up the code. This is a basic example I've made to mimic my real code, where I'm dealing with arrays that have (dozens of) millions of rows.

Thanks for your support

Paul

import numpy as np
import time

n=100_000  # with n=10_000 it works, but it is quite slow
m=2_000_000



#matrix A
# A=np.random.random ((n, 4))
A = np.arange(1, 4*n + 1, dtype=np.uint64).reshape((n, 4), order='F')

#Matrix B
B=np.random.randint(1, m + 1, size=(m), dtype=np.uint64)
B=np.unique(B) # duplicate values are generally generated, so the real size remains lower than m

# use of np.where
t0=time.time()
ind=np.where(A[:, 0].reshape(-1, 1) == B)
# ind2=np.where(B == A[:, 0].reshape(-1, 1))
t1=time.time()
print(f"duration={t1-t0}")

CodePudding user response：

In your current implementation, A[:, 0] is just

np.arange(1, n + 1, dtype=np.uint64)

And if you are interested only in the row indices where A[:, 0] is in B, then you can get them like this (with first_col_of_A = A[:, 0]):

row_indices = np.where(np.isin(first_col_of_A, B))[0]
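To make this concrete, here is a minimal sketch on deliberately small sizes (n, m and the seed below are illustrative values, not the ones from the question). It checks that np.isin gives the same row indices as the broadcast comparison, and prints how large the broadcast intermediate is. That intermediate is an (n, len(B)) boolean array, so with n = 100_000 and a B of roughly a million unique values it would need on the order of 100 GB. That is a plausible reason for the comparison failing at large n, though I have not verified it on your machine.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n = 1_000    # small illustrative sizes, not the question's values
m = 20_000

# Same construction as in the question: column 0 holds 1..n
A = np.arange(1, 4 * n + 1, dtype=np.uint64).reshape((n, 4), order="F")
B = np.unique(rng.integers(1, m + 1, size=m, dtype=np.uint64))

first_col_of_A = A[:, 0]

# Broadcast comparison: materializes an (n, len(B)) boolean matrix
rows_broadcast = np.where(first_col_of_A.reshape(-1, 1) == B)[0]

# np.isin: one boolean per row of A, no 2-D intermediate
rows_isin = np.where(np.isin(first_col_of_A, B))[0]

# B is unique, so each row matches at most once and both results agree
assert np.array_equal(rows_broadcast, rows_isin)

# Rough size of the broadcast intermediate (1 byte per boolean)
broadcast_bytes = n * B.size
print(f"{rows_isin.size} matching rows, "
      f"broadcast mask needs about {broadcast_bytes / 1e6:.1f} MB")
```

The memory used by the np.isin result grows with n only, which is why it should scale to the array sizes you mention far better than the reshape-and-compare version.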

If you then want to select the rows of A with these indices, you don't even have to convert the boolean mask to index locations. You can just select the rows with the boolean mask: A[np.isin(first_col_of_A, B)]
There are better ways to select random elements from an array. For example, you could use numpy.random.Generator.choice with replace=False. See also the question "Numpy: Get random set of rows from 2D array".
I feel there is almost certainly a better way to do the whole thing that you are trying to do with these index locations. I recommend you study the Numpy User Guide and the Pandas User Guide to see what cool things are available there.
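The boolean-mask selection and the Generator.choice suggestion above can be sketched together as follows. The sizes, the seed and k (the number of distinct values to draw) are hypothetical values picked for illustration; they are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n = 1_000
m = 20_000
k = 5_000  # hypothetical number of distinct values wanted in B

# Same construction as in the question: column 0 holds 1..n
A = np.arange(1, 4 * n + 1, dtype=np.uint64).reshape((n, 4), order="F")

# Draw k distinct values from 1..m directly; no np.unique pass is needed
B = rng.choice(np.arange(1, m + 1, dtype=np.uint64), size=k, replace=False)

# Select the matching rows with the boolean mask, without np.where
mask = np.isin(A[:, 0], B)
selected_rows = A[mask]

print(selected_rows.shape)
```

Note that B drawn this way is not sorted, unlike the output of np.unique, so sort it explicitly if the rest of your code relies on that order.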

Honestly, with your current implementation you don't even need the first column of A at all, because the row indices are simply the elements of A[:, 0] minus 1 (row i holds the value i + 1). Here:

row_indices = B[B <= n] - 1
row_indices.sort()
print(row_indices)