Improve the performance of broadcast multiplication and 1 - broadcast


How can I improve the running time of the last two "operations" in this code (minus1_power_x_t*x and 1-minus1_power_x)? They take 1.73 and 1.47 seconds respectively. I am asking about the last two operations only, because the other ones will be constants.

import time
from multiprocessing import Pool
import multiprocessing
import numpy as np


def create_ext_component_function(dim):
    x = np.arange(dim, dtype=float)
    t3 = time.time()
    y_shift = np.fromfunction(lambda i, j: i >> j, ((2**dim), dim), dtype=np.uint32)
    elapsed3 = time.time() - t3
    print('y_shift Elapsed: %s' % elapsed3)
    t3 = time.time()
    um = np.ones(((2**dim), dim), dtype=np.uint8)
    elapsed3 = time.time() - t3
    print('um Elapsed: %s' % elapsed3)
    t3 = time.time()
    and_list = np.bitwise_and(y_shift,um)
    elapsed3 = time.time() - t3
    print('and_list Elapsed: %s' % elapsed3)
    t3 = time.time()
    minus1_power_x_t = np.power(-1,and_list)
    elapsed3 = time.time() - t3
    print('minus1_power_x_t Elapsed: %s' % elapsed3)
    # I need to improve the last two operations
    t3 = time.time()
    minus1_power_x = minus1_power_x_t*x
    elapsed3 = time.time() - t3
    print('minus1_power*x Elapsed: %s' % elapsed3)
    t3 = time.time()
    um_minus_minus1_power = 1-minus1_power_x
    elapsed3 = time.time() - t3
    print('um_minus_minus1_power Elapsed: %s' % elapsed3)
    return um_minus_minus1_power


if __name__ == '__main__':
    dim = 24
    print(create_ext_component_function(dim))

EDIT: Take into account that minus1_power_x_t contains only the values -1 and 1.

CodePudding user response:

The problem with this code is that it creates many big temporary arrays for basic memory-bound operations. Big temporary arrays are slow to fill because of the speed of the RAM (unsurprisingly) and also because of page faults.

Indeed, the operating system (OS) performs the mapping between virtual memory pages and physical memory pages when a big temporary array is filled for the first time. This process is slow because the Numpy code is sequential (and so is the page-fault handling), because pages are generally small (typically 4 KiB on an x86 system) so there are a lot of pages to map, and because most OSs pre-fill pages with zeros for security reasons (so that a page is not filled with, say, your bank account details coming from a recently closed browser tab). Note that there are (transparent) solutions to reduce the number of pages (huge pages), but they are costly in this case too due to the pre-fill.
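
The first-touch cost is easy to observe: filling a freshly allocated array pays the page faults, while refilling the same buffer does not. A small illustrative sketch (timings will vary from one machine to another):

import time
import numpy as np

a = np.empty((2**24, 24))    # allocated, but pages are not mapped yet
t = time.time()
a[...] = 1.0                 # first touch: pays the page faults
print('first fill: %s' % (time.time() - t))
t = time.time()
a[...] = 1.0                 # pages already mapped: much faster
print('second fill: %s' % (time.time() - t))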

The best thing to do to solve this problem is to minimize the number of temporary buffers. This can be done thanks to the out argument of many Numpy functions (e.g. np.subtract). You can also compute the arrays in place for better performance. This solution also reduces the memory footprint, so the memory is not swapped to your slow storage device (swap memory). An alternative solution is to use a parallel implementation of Numpy, or to write parallel Numba/Cython code (Numba is probably the best option here; see the sketch below). Another solution is to use Numexpr.
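
For reference, here is a minimal Numba sketch of such a fused parallel kernel; the function name fused_kernel and the caller-provided out buffer are illustrative assumptions, not part of the original code:

import numba as nb
import numpy as np

# Fuse the multiply and the subtract into one parallel pass over the rows,
# writing into a caller-provided buffer so no temporary array is allocated.
@nb.njit(parallel=True)
def fused_kernel(minus1_power_x_t, x, out):
    n, m = minus1_power_x_t.shape
    for i in nb.prange(n):
        for j in range(m):
            out[i, j] = 1.0 - minus1_power_x_t[i, j] * x[j]

# Hypothetical usage:
# out = np.empty((2**dim, dim))
# fused_kernel(minus1_power_x_t, x, out)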

Note that using smaller data types also helps to improve the performance of the code (as the raw buffer in memory will be smaller and so faster to read/write/pre-fill). Using float32 is faster than float64, although it may not fit your needs. The same applies to integers (e.g. int8 vs int32 vs int64 for and_list and minus1_power_x_t).
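
For instance, switching the floating-point arrays to float32 only requires setting the dtype at creation time. A sketch (whether float32 precision is acceptable depends on your application):

x = np.arange(dim, dtype=np.float32)                    # instead of float64
buffer = np.full((2**dim, dim), 0.0, dtype=np.float32)  # float32 output buffer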

Here is an example:

# [...]

# The dtype parameter is important to reduce the size in memory
minus1_power_x_t = np.power(-1, and_list, dtype=np.int8)

# Pre-fill a buffer in memory with 0.0 (can be reused multiple times)
buffer = np.full(minus1_power_x_t.shape, 0.0)

# Note that minus1_power_x is overwritten once um_minus_minus1_power has been computed.
# If you need both, you can use 2 pre-filled buffers (only useful if they are reused multiple times, e.g. in a loop)
minus1_power_x = np.multiply(minus1_power_x_t, x, out=buffer)
um_minus_minus1_power = np.subtract(1.0, minus1_power_x, out=buffer)

With this method, the multiply is about 2.5 times faster on my (Intel Xeon) machine and the subtract is about 4 times faster.

Numexpr can be used to fuse the multiply and the subtract into a single pass. It also supports user-defined output buffers and can parallelize the computation. Here is an example:

import numexpr

um_minus_minus1_power = numexpr.evaluate('1.0 - minus1_power_x_t * x', out=buffer)

The Numexpr code is about 12.3 times faster on my machine. Note that using float32 arrays instead of float64 ones should be about 2 times faster.
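
Numexpr runs evaluate over an internal thread pool; if you need to control the parallelism, the pool size can be set with numexpr.set_num_threads (an existing Numexpr function):

numexpr.set_num_threads(4)  # use 4 threads for subsequent evaluate() calls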
