I have a working algorithm to analyse experimental datasets. The algorithm is made of two main functions. The first one takes a large array as one of its inputs and returns an intermediate 3D, complex-valued array that often does not fit in memory. For this reason, I use NumPy's memmap to save this array to disk. With larger datasets, the second function starts to take a lot more time, and it appears to be linked to memory access. In some instances, if a computation lasts 20 minutes, then increasing n2 by 50% results in a computation that takes almost 24 hours.
After stripping about 99% of the whole program, a minimal working example looks like this:
import numpy as np
n1, n2 = 10, 100
Nb = 12
Nf = int(n2 / Nb)
X = np.random.rand(n1, n2)
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1)
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:,:,k]  # <- Problematic step
    # do things with Y...
The values of n1, n2, Nb and Nf are typically much larger.
As you can see, in function 1, X2 is populated using its first index, which according to this post is the most efficient way to do it in terms of speed. My problem occurs in function 2, where I need to work on X2 by slicing it on its third dimension. This is required by the nature of the algorithm.
I would like to find a way to refactor these functions to reduce the execution time. I have already improved some parts of the algorithm. For example, since I know X contains only real values, I can cut the size of X2 in half by neglecting its conjugate values. However, the slicing of X2 remains problematic. (Actually, I just learned about np.fft.rfft while writing this, which will definitely help me.)
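For reference, a minimal sketch of how function 1 could look with np.fft.rfft (Nh is just a name I use here for the halved frequency axis; only the non-negative frequency bins are stored):
Nh = Nf // 2 + 1  # length of the rfft output for a real input of length Nf
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nh), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.rfft(Xn, axis=1)  # conjugate half dropped, so the file is roughly half the size
X2.flush()
del X2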
Do you see a way I could refactor function 2 (and/or function 1) so I can access X2 more efficiently?
CodePudding user response:
At least the for k in range(Nf): loop can be accelerated using map on a split array:
import numpy as np
n1, n2 = 10, 100
Nb = 12
Nf = int(n2 / Nb)
X = np.random.rand(n1, n2)
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1)
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
# define worker function
count = 0
def do_stuff(chunk):
    global count
    chunk = chunk.reshape(Nb, n1)  # make the chunk equivalent to X2[:,:,k]
    # do work
    count += 1
    return count
res = map(do_stuff, np.split(X2, Nf, 2))
print(list(res))
X2.flush()
del X2
I have included a count variable to check whether the splitting works as expected. The chunk.reshape(Nb, n1) is needed to make the chunk equivalent to X2[:,:,k]. In case you can adapt to a different shape, you can omit the reshape.
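A quick way to check that equivalence on the small toy arrays is to compare each reshaped chunk against the corresponding slice:
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
chunks = np.split(X2, Nf, 2)  # Nf arrays of shape (Nb, n1, 1)
print(all(np.array_equal(chunks[k].reshape(Nb, n1), X2[:,:,k]) for k in range(Nf)))  # True
del X2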
CodePudding user response:
An easy optimization is to swap the last two axes, which shouldn't change the speed of function 1 (assuming the out-of-order memory accesses from the transpose are negligible compared to disk access) but should make function 2 run faster, for the same reasons discussed in the question you linked:
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:,k,:]
    # do things with Y...
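To see why this layout helps, you can compare the strides of the two slicing patterns on small in-memory arrays of the same shapes (a rough illustration using the toy sizes from the question; the real arrays live on disk, but the same offsets apply to the file):
import numpy as np

n1, Nf, Nb = 10, 8, 12  # toy values from the question (Nf = n2 // Nb)
a = np.zeros((Nb, Nf, n1), dtype=np.complex128)  # swapped layout
b = np.zeros((Nb, n1, Nf), dtype=np.complex128)  # original layout
print(a[:,0,:].strides)  # (1280, 16): each of the Nb rows is a contiguous run of n1 values
print(b[:,:,0].strides)  # (1280, 128): every single value read skips Nf elements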
You can make function 2 even faster at the cost of a slowdown in function 1 by moving Nf to axis 0:
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[:,n,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[k,:,:]
    # do things with Y...
It might make sense to use this version if X2 is read more times than it's written. Also, the slowdown of function 1 should get smaller as n1 gets bigger, since the contiguous chunks are larger.
With the data files stored on a hard drive and n1, n2, Nb = 1000, 10000, 120, my timings are:
function 1, original: 1.53 s ± 41.9 ms
function 1, 1st optimization: 1.53 s ± 27.8 ms
function 1, 2nd optimization: 1.57 s ± 34.9 ms
function 2, original: 111 ms ± 1.2 ms
function 2, 1st optimization: 45.5 ms ± 197 µs
function 2, 2nd optimization: 27.8 ms ± 29.7 µs
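For reference, a rough sketch of how the function 2 numbers could be reproduced for the second optimization (assuming X2.dat was already written with the (Nf, Nb, n1) layout above; actual results depend heavily on the drive and on the OS page cache):
import time
import numpy as np

n1, n2, Nb = 1000, 10000, 120
Nf = n2 // Nb

X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='r')
t0 = time.perf_counter()
for k in range(Nf):
    Y = np.pi * X2[k,:,:]  # same slicing step as function 2
t1 = time.perf_counter()
print(f"function 2, 2nd optimization: {t1 - t0:.3f} s")
del X2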