Speed up the initialization of 3D matrices in Numpy

I recently noticed a significant slowdown when working on freshly initialized Numpy arrays compared to already initialized ones. It seems logical that initializing the array takes some time, but I didn't expect such a big difference. The snippet below is a schematic part of a function I need; just the two statements that create and fill dim3 take about half of the function's total runtime.

import numpy as np

mask = np.where(np.random.rand(150,150) > 0.98)
very_important_data = np.random.rand(len(mask[0]), 1000)

dim3 = np.zeros((150,150,1000))

%timeit dim3[mask] = very_important_data    # --> 114 µs ± 5.24 µs per loop

%timeit dim3 = np.zeros((150,150,1000)); dim3[mask] = very_important_data   # --> 9.4 ms ± 585 µs per loop

Is there a more efficient way to pre-initialize the dim3-matrix? Or an efficient way to keep a matrix in memory that is set to zero before the new values are assigned?

Thanks!

CodePudding user response：

You can use np.empty() instead:

%timeit dim3 = np.zeros((150,150,1000)); dim3[mask] = very_important_data
%timeit dim3 = np.empty((150, 150, 1000)); dim3[mask] = very_important_data

Output (on macOS 12.4):

5.3 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
190 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note: this only gives the same result if you set all elements, see the docs: https://numpy.org/doc/stable/reference/generated/numpy.empty.html
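
As a rough sketch of why that note matters for the code in the question: the mask there selects only about 2% of the cells, so with np.empty the remaining cells hold whatever was in the buffer. The seeded generator `rng` and the names `covered` and `written_ok` are assumptions added here for reproducibility; they are not part of the original snippet.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
mask = np.where(rng.random((150, 150)) > 0.98)
very_important_data = rng.random((len(mask[0]), 1000))

# Fraction of the (150, 150) cells that `dim3[mask] = ...` writes to.
covered = len(mask[0]) / (150 * 150)

dim3 = np.empty((150, 150, 1000))
dim3[mask] = very_important_data

# Only the masked cells are guaranteed to hold meaningful values;
# every other cell contains whatever happened to be in the buffer.
written_ok = np.array_equal(dim3[mask], very_important_data)
print(f"{covered:.1%} of the cells were written")
```

If the rest of the function relies on the unmasked cells being zero, np.empty is not a drop-in replacement here.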

CodePudding user response：

To understand whether there is a more efficient way, one first needs to understand why the current code is slow.

Numpy functions that create arrays, such as np.zeros or np.empty (or any operation that creates a temporary array, such as a multiplication or an addition), request a memory buffer from the CPython allocator, which forwards the request to the default libc allocator (which is different from the one of the OS) or to a custom allocator if there is one. np.zeros requests a buffer pre-filled with zeros, while np.empty just requests a raw buffer.

The default allocator behaves differently depending on the platform (mainly the operating system). On Windows, it requests memory from the OS and systematically frees it for big buffers, while the default memory allocators on Mac and Linux tend to be more conservative: they keep fairly big local chunks of memory and try to reuse them as much as possible rather than releasing the space to the OS.

This default policy has a drastic impact on performance and memory usage. Indeed, when Numpy requests a zero-filled buffer and the buffer is recycled from previously allocated space (not yet released to the OS), the allocator has to set all the values to 0 itself. However, when zero-filled memory is requested directly from the OS, the OS can return a virtual memory buffer that is filled lazily, only when a first touch is performed on a given memory page. This means the allocation can be much faster for a huge array, but the overhead of filling the array with zeros is merely delayed. In the end, the overhead of filling the array is paid once all pages have been read or written (i.e. the array has been completely read or written with some values). In fact, because of the page faults, this lazy filling is more expensive than the filling done by the allocator on a recycled buffer. Some operating systems prefill memory chunks (possibly in separate threads) to speed up such zero-filled buffer requests. As a result, you should be very careful about the way you benchmark your application.
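
The lazy first-touch behaviour described above can be sketched with an anonymous memory mapping from the standard mmap module, viewed as a Numpy array. This is only an illustration under assumptions: the names `buf` and `lazy` are made up here, and the sketch relies on anonymous mappings being zero-filled by the OS, which is the case on mainstream platforms as far as I know.

```python
import mmap
import numpy as np

shape = (150, 150, 1000)
nbytes = int(np.prod(shape)) * np.dtype(np.float64).itemsize

# Anonymous mapping (fileno=-1): the OS reserves the address range and
# provides zero-filled pages only when they are first touched.
buf = mmap.mmap(-1, nbytes)

# View the mapping as a writable float64 array without copying it.
lazy = np.frombuffer(buf, dtype=np.float64).reshape(shape)

# Writing one row touches only the pages backing those 1000 values;
# the rest of the array still costs nothing until it is accessed.
lazy[3, 7] = 1.0
```

The zero-filling cost is not avoided, only spread over the first access to each page, so this helps mainly when a small part of the array is ever touched.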

In practice, memory requested from the OS is always filled with zeros on mainstream platforms (by default on Windows, Linux and Mac) for security reasons: memory previously allocated, filled and released by one process must not be accessible from another process, since the memory chunks can contain sensitive information (for example, your browser can store passwords in memory and you do not expect a Numpy Python script to be able to read them without any privileges). This zero-filling is generally done at page-fault time. Thus, calling np.empty or np.zeros gives the same result when the array is requested from the OS. However, when the array is recycled by the allocator, np.empty can be much faster and there is (generally) no page-fault overhead to pay (page faults happen once per page, as long as the memory pages are not moved somewhere else, such as to swap when you run out of memory).

In short, there is no way (from Python alone) to speed up the creation of an array as long as you request a new array and you read or write the whole target array. Using a custom system allocator does not help much since the array has to be filled anyway. If it is OK for you to pay the overhead progressively, then you need to use a manual memmap. Otherwise, you can preallocate some buffers and recycle them yourself. This can be faster because you may not need to fully reset them to zero and you will not pay the cost of page faults. There is no free lunch.
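
The buffer-recycling idea could look roughly like the sketch below. `ZeroedBuffer` is a hypothetical helper written for this answer, not a Numpy API: it allocates the array once and, before each new assignment, resets only the cells written by the previous one instead of zeroing the whole array.

```python
import numpy as np

class ZeroedBuffer:
    """Hypothetical helper: keeps one array alive and re-zeroes only what was written."""

    def __init__(self, shape):
        self.data = np.zeros(shape)  # allocation and page faults are paid once
        self._dirty = None           # index of the cells written by the last fill

    def fill(self, mask, values):
        if self._dirty is not None:
            self.data[self._dirty] = 0.0  # undo only the previous writes
        self.data[mask] = values
        self._dirty = mask
        return self.data

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
buffer = ZeroedBuffer((150, 150, 1000))
for _ in range(3):
    mask = np.where(rng.random((150, 150)) > 0.98)
    dim3 = buffer.fill(mask, rng.random((len(mask[0]), 1000)))
```

The returned array is the shared buffer itself, so this only works if the result of one call is no longer needed (or has been copied) by the time of the next call.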