I'm looking for an efficient NumPy implementation for the following:
My inputs are:
- X = 2d NumPy array of shape (m,3) (m can be anywhere from 10 to 10 Million)
- Q = 2d NumPy array with shape (n,p). Both n and p are small (<= 10)
It is easier to explain what I'm looking for using the following example.
Example:
- X = np.array([[2, 56, 20], [3, 194, 18], [4, 36, 6], [5, 168, 42]]) with m = 4
- Q = np.array([[3, 7, 11], [5, 13, 17]]) with n = 2 and p = 3
For each row j of X, I want to find all the rows of Q that contain an entry q such that X[j,1]%q == X[j,2]%q. For the above example, X[0,1]%q == X[0,2]%q for q = 3, which is in row 0 of Q. Similarly, X[2,1]%q == X[2,2]%q for q = 3 and q = 5, which are in rows 0 and 1 of Q. The output should be list of lists/NumPy arrays with matching rows of Q for each row of X. For the above example, the output should be: [[0], [0], [0,1], [0]]
I have written the following code, but am wondering if there is a more efficient approach.
result = []
for i in range(len(X)):
result.append(np.where(X[i,1] % Q == X[i,2] % Q)[0])
CodePudding user response:
Try this:
I used sets because for sets no duplicates allowed, but for lists they are allowed.
import numpy as np
X = np.array([[2, 56, 20], [3, 194, 18], [4, 36, 6], [5, 168, 42]])
Q = np.array([[3, 7, 11], [5, 13, 17]])
result = []
for i in range(len(X)):
temp = []
for j in range(len(Q)):
v1 = Q[j]
for k in range(len(Q[j])):
v2 = v1[k]
if X[i, 1] % v2 == X[i, 2] % v2:
temp.append(j)
else:
continue
result.append(set(temp))
result = np.array(result, dtype=object)
print(result)
CodePudding user response:
This computes all modulo combinations at once:
P = X[:, 1:, np.newaxis, np.newaxis] % Q
The output shape will be [m, 2, n, p]. Note: If you can, change the integer size to i4 or u4 as 32 bit modulo is faster than 64 bit; or go even smaller. It also limits memory consumption which may be significant here.
Now we can do all comparisons at once:
B = P[:, 0] == P[:, 1]
The output shape will be [m, n, p]
To find the rows where a suitable q exists, we use np.any
F = np.any(B, axis=-1)
The shape will be [m, n], for each combination of X and Q, a boolean whether any q in Q matches the row in X.
To get the indices, use np.nonzero
Xs, Qs = np.nonzero(F)
for x, q in zip(Xs, Qs):
print("X index %d matches Q index %d" % (x, q))