- I have a NumPy array named data of shape (300, 300, 300, 300).
- I have a bool NumPy array named mask of shape (300, 300).

At runtime, values in the mask array are updated, and according to the indices of True entries in mask, I have to sum the corresponding submatrices from data. The following code achieves the expected result:
result = np.zeros((300, 300))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]
The above code is, however, very slow!
I am looking towards using NumPy's built-in slicing and indexing methods. Essentially the idea is as follows:

- Use np.where(mask == True) to select the indices of mask that are True.
- Extract a submatrix reducedData of shape (M, N, 300, 300) from data corresponding to the indices from step 1.
- Sum the reducedData array along axis 0 and axis 1 (np.sum(reducedData, axis=(0, 1))).

I'm not sure how to achieve step 2 above. Any help is appreciated!
CodePudding user response:
The trick is to use np.add.reduce with a where argument to avoid creating a giant temporary array. For example, on my machine, summing a (300, 300, 100, 100) array (which is the largest I'm able to allocate) takes less than a second this way:
np.add.reduce(data, axis=(0, 1), out=out, where=mask[..., None, None])
The only caveat is that you don't have the luxury of broadcasting the mask: you have to explicitly set the dimensions.
As it happens, your loop is not so slow: it runs ~2x faster than the vectorized operation on my machine.
np.random.seed(435)
data = np.random.uniform(size=(300, 300, 100, 100))
mask = np.random.randint(2, size=(300, 300), dtype=bool)
out = np.zeros((100, 100), float)
result = np.zeros((100, 100))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]
(result == out).all()  # True (after the np.add.reduce call below has filled out)
%timeit np.add.reduce(data, axis=(0, 1), out=out, where=mask[..., None, None])
715 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
result = np.zeros((100, 100))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]
442 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Either of these methods has a huge advantage over what you are proposing, in that they do not use a large amount of memory. Your three steps boil down to

data[mask].sum(0)

You can skip step 1, since NumPy allows indexing directly with a boolean mask. The mask can only produce a 1D output (the rows may contain different numbers of True elements), which is why a single sum suffices. The issue is that data[mask] simply kills my interpreter, because the temporary array it creates is too large.
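For illustration, here is a minimal sketch on reduced shapes (the smaller sizes are my assumption; the question uses (300, 300, 300, 300)) showing that the boolean-mask one-liner matches the explicit loop:

# Minimal sketch on reduced shapes, showing that boolean-mask indexing
# plus a single sum matches the explicit loop from the question.
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(size=(30, 30, 10, 10))
mask = rng.integers(2, size=(30, 30)).astype(bool)

# Boolean indexing flattens the two masked dimensions into one axis of
# length K = mask.sum(), giving a (K, 10, 10) temporary array.
direct = data[mask].sum(0)

# Reference: the explicit loop from the question.
loop = np.zeros((10, 10))
for i in range(30):
    for j in range(30):
        if mask[i, j]:
            loop = loop + data[i, j]

print(np.allclose(direct, loop))  # True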
CodePudding user response:
The loops in the question can be accelerated with Numba, depending on the array sizes. So, if we create some example data:
import numpy as np
import numba as nb

rng = np.random.default_rng(85)
number = 80

# Build a mask with exactly half True and half False entries, shuffled randomly
a = np.zeros((number // 2, number), dtype=bool)
b = np.ones((number // 2, number), dtype=bool)
con = np.concatenate((a, b)).ravel()
rng.shuffle(con)
mask = con.reshape(number, number)

Data = rng.random(number ** 4, dtype=np.float64).reshape(number, number, number, number)
result = np.zeros((number, number))
we can accelerate the code with Numba's parallelized jitting as follows:
@nb.njit("boolean[:,::1], float64[:,::1], float64[:,:,:,::1]", parallel=True)
def numba_(mask, result, Data):
for i in nb.prange(number):
for j in range(number):
if (mask[i, j]):
result = result Data[i, j]
return result
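As a quick usage sketch (the cross-check is my addition, assuming the mask, Data and number defined in the setup above):

# Call the jitted function with a fresh accumulator array.
out_numba = numba_(mask, np.zeros((number, number)), Data)

# Cross-check against the boolean-indexing one-liner from the other answer;
# at number = 80 the (K, 80, 80) temporary is still small enough to allocate.
print(np.allclose(out_numba, Data[mask].sum(0)))  # expected: True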
I compared this on my system in terms of performance. The code ran 3 to 4 times faster for smaller arrays, and with array shape (150, 150, 150, 150) it ran about 2 times faster. Larger arrays could not be tested due to memory limits (in my tests, the memory issue arose when creating Data with rng.random).
To work around the memory issues, my previous SO answer may be helpful. I tried to write code similar to that answer for this problem, but I could not finish it (debugging it was too time-consuming given my workload), so it may still need corrections. I include the incomplete code just as inspiration for how such chunking could be used:
@nb.njit("boolean[:,::1], float64[:,::1], float64[:,:,:,::1]", parallel=True)
def numba_(mask, result, Data):
chunk_val = 50 # it is arbitrary and must be chosen based on the system rams size
chunk = Data.shape[0] // chunk_val
chunk_res = Data.shape[0] % chunk_val
for i in range(chunk):
for j in range(number):
for k in range(i * chunk_val, (i 1) * chunk_val):
if mask[j, k]:
min_ind = i * chunk_val
max_ind = (i 1) * chunk_val
result[:, min_ind:max_ind] = result[:, min_ind:max_ind] Data[j, k][:, min_ind:max_ind]
if i == chunk - 1:
for k in range((i 1) * chunk_val, Data.shape[0]):
if mask[j, k]:
result[:, -chunk_res:] = result[:, -chunk_res:] Data[j, k][:, -chunk_res:]
return result
Note that, once completed, this code must still be evaluated in terms of memory consumption, but it should be comparable in speed to my proposed Numba answer above.
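For comparison, a pure-NumPy chunked variant is also conceivable. This is only a sketch of my own (not the method of either answer), combining the np.add.reduce trick from the other answer with slicing Data along its first axis so the masked reduction never touches the full array at once; chunked_masked_sum and chunk_val are names I introduce here:

# Pure-NumPy chunked sketch (my assumption, not the answer's method).
# Assumes numpy is imported as np, as in the setup above.
def chunked_masked_sum(Data, mask, chunk_val=50):
    n = Data.shape[0]
    out = np.zeros(Data.shape[2:], dtype=Data.dtype)
    for start in range(0, n, chunk_val):
        stop = min(start + chunk_val, n)
        # Reduce this slice over its first two axes, honoring the mask rows.
        out += np.add.reduce(Data[start:stop], axis=(0, 1),
                             where=mask[start:stop, :, None, None])
    return out

# Usage: chunked_masked_sum(Data, mask) should equal Data[mask].sum(0)
# while only ever holding one (chunk_val, number, number, number) slice's reduction.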
Benchmarks (Core i5 1st gen, 16 GB RAM):
shape = (70, 70, 70, 70) (~3x, much faster)
31247600 ns --> OP code
0 ns --> Numba accelerated
shape = (100, 100, 100, 100) (~x2.5)
78147400 ns --> OP code
31229900 ns --> Numba accelerated
shape = (150, 150, 150, 150) (~x2)
390626400 ns --> OP code
218730800 ns --> Numba accelerated
shape = (170, 170, 170, 170) (~x2)
625012500 ns --> OP code
328132000 ns --> Numba accelerated
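For reproducibility, a timing harness along these lines could produce comparable numbers (a sketch only; the answer does not state how its nanosecond figures were measured, so time.perf_counter_ns and the op_code helper are my assumptions):

# Hedged sketch of a possible timing harness.
# Assumes the first (non-chunked) numba_ definition, plus mask, Data and number.
import time

def op_code(mask, Data, number):
    result = np.zeros((number, number))
    for i in range(number):
        for j in range(number):
            if mask[i, j]:
                result = result + Data[i, j]
    return result

# With an explicit signature, numba_ is compiled eagerly at decoration time,
# so the timings below do not include compilation.
t0 = time.perf_counter_ns()
op_code(mask, Data, number)
t1 = time.perf_counter_ns()
numba_(mask, np.zeros((number, number)), Data)
t2 = time.perf_counter_ns()

print(t1 - t0, "ns --> OP code")
print(t2 - t1, "ns --> Numba accelerated")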