I would like to do addition and multiplication on top of array transposes.
Given an array A, the target is

sum = Σ_p factor(p) * p(A)

where p runs over the permutations/transposes and factor(p) is a list of prefactors. For example, if A is a 2-dimensional array, the target is

0.1*A + 0.1*transpose(A, (1, 0))
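In code, a minimal 2-dimensional version of what I want would be (just an illustration):

import numpy as np
import itertools as it

A = np.random.random((4, 4))
perms = tuple(it.permutations(range(2)))   # ((0, 1), (1, 0))
factors = [0.1, 0.1]

target = np.zeros_like(A)
for f, p in zip(factors, perms):
    target += f * np.transpose(A, p)       # 0.1*A + 0.1*A.T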
I found in my Python code that, when the higher-dimensional permutations are applied, the time for the array addition is far larger than for the transpose. Perhaps numpy.transpose only manipulates pointers/strides in C instead of moving the data. I want to know: is there any method to optimize the timing of the array addition part? numpy.add does not help much. Should I somehow only sum over the part of the transpose that actually affects the array, and handle the rest with a multiplication? For example, for the permutation (0, 1, 3, 2), the (0, 1) part just multiplies a common factor on top of the previous array (see the sketch below). Or should I use Cython to improve the performance?
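To illustrate the partial-transpose idea for a 4-dimensional array (a rough sketch only, not benchmarked; it requires the last two axis lengths to be equal):

import numpy as np

na = nb = nc = nd = 20
A4 = np.random.random((na, nb, nc, nd))
sum_A4 = np.zeros_like(A4)
factor = 0.1

# Permutation (0, 1, 3, 2): only the last two axes are swapped, so loop over
# the unchanged leading axes and add the transposed trailing blocks.
for i in range(na):
    for j in range(nb):
        sum_A4[i, j] += factor * A4[i, j].T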
Here is my Python code:
import numpy as np
import time
import itertools as it

ref_list = [0, 1, 2, 3, 4]
p = it.permutations(ref_list)
transpose_list = tuple(p)
print(type(transpose_list), type(transpose_list[0]), transpose_list[0])

n_loop = 2
na = nb = nc = nd = ne = 20
A = np.random.random((na, nb, nc, nd, ne))
sum_A = np.zeros((na, nb, nc, nd, ne))
factor_list = [i * 0.1 for i in range(120)]

time_transpose = 0
time_add = 0
time_multiply = 0

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        start = time.time()
        B = np.transpose(A, transpose_list[m])
        finish = time.time()
        time_transpose += finish - start

        start = time.time()
        B_p = B * factor_list[m]
        finish = time.time()
        time_multiply += finish - start

        start = time.time()
        sum_A += B_p
        finish = time.time()
        time_add += finish - start

print(time_transpose, time_multiply, time_add, time_multiply / time_transpose, time_add / time_transpose)
The output is
0.004961967468261719 1.1218750476837158 3.7830252647399902 226.09480107630213 762.404285988852
The addition time is about 700 times larger than the transpose time.
I tried to use the Numba transpose from "How to avoid huge overhead of single-threaded NumPy's transpose?" by adding
import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    for tmp in nb.prange((m + blockSize - 1) // blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i + blockSize, m)
            tjMin, tjMax = j, min(j + blockSize, n)
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
and use
B = transpose(A, transpose_list[m] )
received
Traceback (most recent call last):
File "transpose_test_v2.py", line 46, in <module>
B = transpose(A, transpose_list[m] )
File "/home/.../lib/python3.8/site-packages/numba/core/dispatcher.py", line 717, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 6d, C), UniTuple(int64 x 6)
or using
B = nb.transpose(A, transpose_list[m] )
and got
B = nb.transpose(A, transpose_list[m] )
AttributeError: 'int' object has no attribute 'transpose'
Answer:
np.transpose does not physically transpose the array in memory. It changes the strides of the array so that the resulting view is (virtually) transposed. The problem is that the memory layout of the view is very inefficient after the transposition, so operations on it are not efficient. You can perform a copy of the transposed view, or call np.ascontiguousarray on it, to perform the transposition eagerly. Be aware that the current NumPy implementation of copying transposed arrays is quite inefficient.
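For example, the eager transposition in the question's loop can be written as follows (either form physically reorders the data):

B = np.ascontiguousarray(np.transpose(A, transpose_list[m]))
# or equivalently
B = np.transpose(A, transpose_list[m]).copy()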
Here are results on my machine:
Initial code:
0.0030066967010498047 1.2316536903381348 1.971841812133789 409.6368249940528 655.8166679882642
With a copy of the transposed array:
2.483657121658325 1.221949577331543 0.6028566360473633 0.4919960837893974 0.2427294133277298
Besides this, preallocating the arrays and using the out parameter of np.add and np.multiply should help make the multiply and the add faster (see the sketch below). For even better performance, you can use parallel Numba code with explicit loops. Optimizing the transposition itself in your case is pretty hard due to the high number of dimensions.
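For example, something along these lines avoids allocating a new temporary array on every iteration (tmp is a scratch buffer introduced here just for illustration):

tmp = np.empty_like(A)                       # preallocated scratch buffer
for m, t in enumerate(transpose_list):
    B = np.ascontiguousarray(np.transpose(A, t))
    np.multiply(B, factor_list[m], out=tmp)  # writes into tmp, no new allocation
    np.add(sum_A, tmp, out=sum_A)            # in-place accumulation into sum_A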
Update:
Here is a faster Numba implementation that mostly removes the cost of the add/multiply and reduces the cost of the transposition by using multiple threads:
# Same setup as in the question
import numba

@numba.njit('(float64[:,:,:,:,::1], float64[:,:,:,:,::1], float64, int_, int_, int_, int_, int_)', parallel=True, fastmath=True)
def computeRound(A, sum_A, factor, i0, i1, i2, i3, i4):
    size = 20
    B = np.transpose(A, (i0, i1, i2, i3, i4))
    for j0 in numba.prange(size):
        for j1 in range(size):
            for j2 in range(size):
                for j3 in range(size):
                    for j4 in range(size):
                        sum_A[j0, j1, j2, j3, j4] += B[j0, j1, j2, j3, j4] * factor

for n in range(n_loop):
    for m, t in enumerate(transpose_list):
        i0, i1, i2, i3, i4 = transpose_list[m]
        computeRound(A, sum_A, factor_list[m], i0, i1, i2, i3, i4)
This code is 4 times faster than the initial one. It is almost fully bound by the bad access pattern of the transposition. However, the transposition itself is hard to optimize. One solution is to apply the transpositions in an order that maximizes cache locality relative to the previous transposition (fast, but especially hard to implement). A simpler (but slower) solution is to manually handle the (i0, i1, i2, i4, i3) case at the same time as the (i0, i1, i2, i3, i4) case, reusing the previously-provided optimized Numba transposition code.
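Here is a rough NumPy-only illustration of that pairing, just to show the structure (the fast version would do the same thing with the optimized Numba transposition):

done = set()
for m, t in enumerate(transpose_list):
    if t in done:
        continue
    t_pair = t[:3] + (t[4], t[3])        # same permutation with the last two axes swapped
    m_pair = transpose_list.index(t_pair)
    done.update((t, t_pair))

    B = np.ascontiguousarray(np.transpose(A, t))          # one expensive transposition...
    sum_A += factor_list[m] * B
    sum_A += factor_list[m_pair] * np.swapaxes(B, 3, 4)   # ...reused for the paired permutation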
Answer:
Your error message looks like it comes from the "(m + blockSize - 1) // blockSize" part, where the division is giving you a double-precision float. Instead of dividing by blockSize (256), you can right-shift by 8 and keep the result an integer.
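That is, the loop header of the Numba transpose above would become something like (assuming blockSize stays 256):

for tmp in nb.prange((m + blockSize - 1) >> 8):   # right-shift by 8 keeps the result an integer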