I have a large NumPy array A of shape (M, 3). The elements of each row are unique, non-negative integers ranging from 0 to N - 1; each row corresponds to a triangle in my finite element analysis.
For example, with M=4 and N=5, A might look like this:
array([[0, 1, 2],
       [0, 2, 3],
       [1, 2, 4],
       [3, 2, 4]])
Now, I need to construct another array B of size M*N, such that
B[m,n] = 1 if n is in A[m], or else 0
The corresponding B for the exemplary A above would be
1 1 1 0 0
1 0 1 1 0
0 1 1 0 1
0 0 1 1 1
A loop-based version would be
import numpy as np
B = np.zeros((M, N))
for m in range(M):
    for n in A[m]:  # iterate over the vertex indices stored in row m of A
        B[m, n] = 1
But since M and N are both large (on the order of 10^6 each), how can I use NumPy indexing techniques to speed this up? I also suspect that sparse matrix techniques are needed, since M * N entries of 1 byte each amount to about 10**12 bytes, i.e. roughly 1 TB.
More generally, I feel that using NumPy's vectorization techniques, such as indexing and broadcasting, looks like an ad-hoc, error-prone activity that relies on quite a bit of street smarts (or art, if you prefer). Are there any programming-language efforts that can systematically convert loop-based code into a high-performance vectorized version?
Answer:
You can directly create a sparse csr-matrix from your data
As you already mentioned in your question, a dense matrix of uint8 values would need about 1 TB. With a sparse matrix this can be reduced to approximately 19 MB, as shown in the example below.
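The 1 TB and 19 MB figures can be reproduced with a little arithmetic. The sketch below assumes the CSR layout of three arrays (data, indices, indptr), uint8 values, and 32-bit index arrays; the 32-bit assumption is mine and matches the 12 MB and 4 MB sizes quoted in this answer.

```python
import numpy as np

M = N = 10**6
nnz = 3 * M  # three vertex indices per triangle, so three non-zeros per row

# Dense storage: one uint8 per entry of the M x N matrix
dense_bytes = M * N * np.dtype(np.uint8).itemsize

# CSR storage (assumption: int32 index arrays)
data_bytes = nnz * np.dtype(np.uint8).itemsize       # one value per non-zero
indices_bytes = nnz * np.dtype(np.int32).itemsize    # one column index per non-zero
indptr_bytes = (M + 1) * np.dtype(np.int32).itemsize # one row offset per row, plus one
sparse_bytes = data_bytes + indices_bytes + indptr_bytes

print(f"dense : {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

If the index arrays were kept as 64-bit integers instead, the indices and indptr parts would double in size, which is still far below the dense requirement.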
Creating inputs of a relevant size
This should be included in the question, as it hints at the sparsity of the matrix.
from scipy import sparse
import numpy as np

M = int(1e6)
N = int(1e6)
# random test data; unlike real triangle data, a row may occasionally repeat an index
A = np.random.randint(low=0, high=N, size=(M, 3))
Creating a sparse csr-matrix
Have a look at the SciPy documentation; for a general overview, the wiki article on sparse matrices could also be useful.
# array of ones, one per non-zero value (3 MB as uint8)
data = np.ones(A.size, dtype=np.uint8)
# you already have the column indices; they are expected as a 1D array (12 MB)
indices = A.reshape(-1)
# a new row begins every A.shape[1] entries (4 MB)
indptr = np.arange(0, A.size + 1, A.shape[1])
B = sparse.csr_matrix((data, indices, indptr), shape=(M, N))
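As a sanity check, the same construction can be applied to the small 4 x 5 example from the question and compared with the expected B. The dense reference built with fancy indexing (B_dense) is my own addition for comparison and is only feasible when M * N is small.

```python
import numpy as np
from scipy import sparse

# Triangle connectivity from the question (M=4, N=5)
A = np.array([[0, 1, 2],
              [0, 2, 3],
              [1, 2, 4],
              [3, 2, 4]])
M, N = A.shape[0], 5

# Same CSR construction as above
data = np.ones(A.size, dtype=np.uint8)
indices = A.reshape(-1)
indptr = np.arange(0, A.size + 1, A.shape[1])
B = sparse.csr_matrix((data, indices, indptr), shape=(M, N))

# Dense reference: row index m is broadcast against the three columns in A[m]
B_dense = np.zeros((M, N), dtype=np.uint8)
B_dense[np.arange(M)[:, None], A] = 1

# Expected result as given in the question
expected = np.array([[1, 1, 1, 0, 0],
                     [1, 0, 1, 1, 0],
                     [0, 1, 1, 0, 1],
                     [0, 0, 1, 1, 1]], dtype=np.uint8)

print(np.array_equal(B.toarray(), expected))
print(np.array_equal(B_dense, expected))
```

Both comparisons should agree with the matrix in the question. Note that toarray() materializes the full M x N array, so it should only be called on small test cases like this one.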