I have a working algorithm to analyse experimental datasets. The algorithm is made of two main functions. The first one takes a large array as one of its inputs and returns an intermediate 3D, complex-valued array that often does not fit in memory. For this reason, I use NumPy's memmap to save this array to disk. With larger datasets, the second function starts to take a lot more time, and it appears to be linked to memory access. In some instances, if a computation lasts 20 minutes, then increasing n2 by 50% results in a computation that takes almost 24 hours.
After stripping about 99% of the whole program, a minimal working example looks like this:
import numpy as np
n1, n2 = 10, 100
Nb = 12
Nf = int(n2 / Nb)
X = np.random.rand(n1, n2)
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1)
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:,:,k]  # <- Problematic step
    # do things with Y...
The values of n1, n2, Nb and Nf are typically much larger.
As you can see, in function 1, X2 is populated using its first index, which according to this post is the most efficient way to do it in terms of speed. My problem occurs in function 2, where I need to work on X2 by slicing it on its third dimension. This is required by the nature of the algorithm.
I would like to find a way to refactor these functions to reduce the execution time. I have already improved some parts of the algorithm. For example, since I know X contains only real values, I can cut the size of X2 in half by neglecting its conjugate values. However, the slicing of X2 remains problematic. (Actually, I just learned about np.fft.rfft while writing this, which will definitely help me.)
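For reference, a minimal sketch of how function 1 could look with np.fft.rfft (Nh is just a name I use here for the halved frequency axis; only the non-negative frequency bins are stored):
Nh = Nf // 2 + 1  # length of the rfft output for a real input of length Nf
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nh), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.rfft(Xn, axis=1)  # conjugate half dropped, so the file is roughly half the size
X2.flush()
del X2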
Do you see a way I could refactor function 2 (and/or function 1) so I can access X2 more efficiently?
CodePudding user response:
At least the for k in range(Nf): loop can be accelerated using map on a split array:
import numpy as np
n1, n2 = 10, 100
Nb = 12
Nf = int(n2 / Nb)
X = np.random.rand(n1, n2)
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1)
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
# define worker function
count = 0
def do_stuff(chunk):
    global count
    chunk = chunk.reshape(Nb, n1)  # make the chunk equivalent to X2[:,:,k]
    # do work
    count += 1
    return count
res = map(do_stuff, np.split(X2, Nf, 2))
print(list(res))
X2.flush()
del X2
I have included a count variable to check whether the splitting works as expected. The chunk.reshape(Nb, n1) is needed to make the chunk equivalent to X2[:,:,k]. In case you can adapt to a different shape, you can omit the reshape.
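A quick way to check that equivalence on the small toy arrays is to compare each reshaped chunk against the corresponding slice:
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
chunks = np.split(X2, Nf, 2)  # Nf arrays of shape (Nb, n1, 1)
print(all(np.array_equal(chunks[k].reshape(Nb, n1), X2[:,:,k]) for k in range(Nf)))  # True
del X2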
CodePudding user response:
An easy optimization is to swap the last two axes, which shouldn't change the speed of function 1 (assuming the out-of-order memory accesses from the transpose are negligible compared to disk access) but should make function 2 run faster, for the same reasons discussed in the question you linked:
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:,k,:]
    # do things with Y...
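To see why this layout helps, you can compare the strides of the two slicing patterns on small in-memory arrays of the same shapes (a rough illustration using the toy sizes from the question; the real arrays live on disk, but the same offsets apply to the file):
import numpy as np

n1, Nf, Nb = 10, 8, 12  # toy values from the question (Nf = n2 // Nb)
a = np.zeros((Nb, Nf, n1), dtype=np.complex128)  # swapped layout
b = np.zeros((Nb, n1, Nf), dtype=np.complex128)  # original layout
print(a[:,0,:].strides)  # (1280, 16): each of the Nb rows is a contiguous run of n1 values
print(b[:,:,0].strides)  # (1280, 128): every single value read skips Nf elements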
You can make function 2 even faster at the cost of a slowdown in function 1 by moving Nf to axis 0:
# Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='w ')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[:,n,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower
X2.flush()
del X2
# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[k,:,:]
    # do things with Y...
It might make sense to use this version if X2 is read more times than it's written. Also, the slowdown of function 1 should get smaller as n1 gets bigger, since the contiguous chunks are larger.
With the data files stored on a hard drive and n1, n2, Nb = 1000, 10000, 120, my timings are:
function 1, original: 1.53 s ± 41.9 ms
function 1, 1st optimization: 1.53 s ± 27.8 ms
function 1, 2nd optimization: 1.57 s ± 34.9 ms
function 2, original: 111 ms ± 1.2 ms
function 2, 1st optimization: 45.5 ms ± 197 µs
function 2, 2nd optimization: 27.8 ms ± 29.7 µs
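For reference, a rough sketch of how the function 2 numbers could be reproduced for the second optimization (assuming X2.dat was already written with the (Nf, Nb, n1) layout above; actual results depend heavily on the drive and on the OS page cache):
import time
import numpy as np

n1, n2, Nb = 1000, 10000, 120
Nf = n2 // Nb

X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='r')
t0 = time.perf_counter()
for k in range(Nf):
    Y = np.pi * X2[k,:,:]  # same slicing step as function 2
t1 = time.perf_counter()
print(f"function 2, 2nd optimization: {t1 - t0:.3f} s")
del X2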