I have the following code:
import numpy as np
A = np.random.random((128,4,46,23)) + np.random.random((128,4,46,23)) * 1j
signal = np.random.random((355,256,4)) + np.random.random((355,256,4)) * 1j
# start timing here
Signal = np.fft.fft(signal, axis=1)[:,:128,:]
B = Signal[..., None, None] * A[None,...]
B = B.sum(1)
b = np.fft.ifft(B, axis=1).real
b_squared = b**2
res = b_squared.sum(1)
I need to run this code for two different values of A each time. The problem is that it is too slow for a real-time application. I tried np.einsum, and although it did speed things up a little, it wasn't enough for my application.
So now I'm trying to speed things up using a GPU, but I'm not sure how. I looked into OpenCL, and multiplying two matrices seems fine, but I'm not sure how to handle complex numbers or arrays with more than two dimensions (I guess I could use a for loop to send two axes at a time to the GPU). I also don't know how to do something like array.sum(axis). I have worked with a GPU before, using OpenGL.
I would have no problem using C to optimize the code if needed, or using something besides OpenCL, as long as it works with more than one GPU manufacturer (so no CUDA).
Edit: Running cProfile:
import cProfile

def f():
    Signal = np.fft.fft(signal, axis=1)[:,:128,:]
    B = Signal[..., None, None] * A[None,...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b**2
    res = b_squared.sum(1)

cProfile.run("f()", sort="cumtime")
Output:
56 function calls (52 primitive calls) in 1.555 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.555 1.555 {built-in method builtins.exec}
1 0.005 0.005 1.555 1.555 <string>:1(<module>)
1 1.240 1.240 1.550 1.550 <ipython-input-10-d4613cd45f64>:3(f)
2 0.000 0.000 0.263 0.131 {method 'sum' of 'numpy.ndarray' objects}
2 0.000 0.000 0.263 0.131 _methods.py:45(_sum)
2 0.263 0.131 0.263 0.131 {method 'reduce' of 'numpy.ufunc' objects}
6/2 0.000 0.000 0.047 0.024 {built-in method numpy.core._multiarray_umath.implement_array_function}
2 0.000 0.000 0.047 0.024 _pocketfft.py:49(_raw_fft)
2 0.047 0.023 0.047 0.023 {built-in method numpy.fft._pocketfft_internal.execute}
1 0.000 0.000 0.041 0.041 <__array_function__ internals>:2(ifft)
1 0.000 0.000 0.041 0.041 _pocketfft.py:189(ifft)
1 0.000 0.000 0.006 0.006 <__array_function__ internals>:2(fft)
1 0.000 0.000 0.006 0.006 _pocketfft.py:95(fft)
4 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:550(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:52(_wrapfunc)
4 0.000 0.000 0.000 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 _asarray.py:14(asarray)
4 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
2 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
4 0.000 0.000 0.000 0.000 fromnumeric.py:546(_swapaxes_dispatcher)
2 0.000 0.000 0.000 0.000 _pocketfft.py:91(_fft_dispatcher)
CodePudding user response:
Most of the number crunching libraries that interface well with the Python ecosystem have tight dependencies on Nvidia's ecosystem, but this is changing slowly. Here are some things you could try:
- Profile your code. The built-in profiler (cProfile) is probably a good place to start; I'm also a fan of snakeviz for looking at performance traces. This will actually tell you whether NumPy's FFT implementation is what's blocking you. Is memory being allocated efficiently? Is there some way you could hand more data off to np.fft.ifft? How much time is Python taking to read the signal from its source and convert it into a NumPy array?
- Numba is a JIT compiler which takes Python code and optimizes it further, or compiles it to either CUDA or ROCm (AMD). I'm not sure how far off you are from your performance goals, but perhaps it could help; see the hedged sketch after this list.
- Here is a list of C libraries to try.
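For the Numba route, here is a minimal sketch of what JIT-compiling the broadcast-multiply-and-sum step might look like. This is my own illustration, not tested against your data: Numba does not support np.fft, so only the contraction is compiled, and the function name and loop structure are assumptions.
import numba
import numpy as np

# Equivalent to np.einsum("ijk,jklm->iklm", signal_f, a), compiled and
# parallelized over the first axis; complex arithmetic is supported by Numba.
@numba.njit(parallel=True, cache=True)
def contract(signal_f, a):  # signal_f: (355, 128, 4), a: (128, 4, 46, 23)
    n, f, c = signal_f.shape
    _, _, h, w = a.shape
    out = np.zeros((n, c, h, w), dtype=np.complex128)
    for i in numba.prange(n):  # each i is independent, so this loop is safe to parallelize
        for j in range(f):
            for k in range(c):
                s = signal_f[i, j, k]
                for l in range(h):
                    for m in range(w):
                        out[i, k, l, m] += s * a[j, k, l, m]
    return out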
Honestly, I'm kind of surprised that the NumPy build distributed on PyPI isn't fast enough for real-time use. I'll post an update to this answer where I benchmark the code snippet.
UPDATE: Here's an implementation which uses a multiprocessing.Pool and the np.einsum trick kindly provided by @hpaulj.
import time
import numpy as np
import multiprocessing

NUM_RUNS = 50

A = np.random.random((128, 4, 46, 23)) + np.random.random((128, 4, 46, 23)) * 1j
signal = np.random.random((355, 256, 4)) + np.random.random((355, 256, 4)) * 1j

def worker(fft_chunk: np.ndarray, a_chunk: np.ndarray) -> np.ndarray:
    return (
        np.fft.ifft(np.einsum("ijk,jklm->iklm", fft_chunk, a_chunk), axis=1).real ** 2
    )

# original (serial) code
serial_times = []
for _ in range(NUM_RUNS):
    start = time.monotonic()
    Signal = np.fft.fft(signal, axis=1)[:, :128, :]
    B = Signal[..., None, None] * A[None, ...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b ** 2
    res = b_squared.sum(1)
    serial_times.append(time.monotonic() - start)

parallel_times = []
# My CPU is hyperthreaded, so I'm only spawning workers for the number of physical cores
with multiprocessing.Pool(multiprocessing.cpu_count() // 2) as p:
    for _ in range(NUM_RUNS):
        start = time.monotonic()
        # Get the FFT of the whole sample before splitting
        transformed = np.fft.fft(signal, axis=1)[:, :128, :]
        a_chunks = np.split(A, A.shape[0] // multiprocessing.cpu_count(), axis=0)
        signal_chunks = np.split(
            transformed, transformed.shape[1] // multiprocessing.cpu_count(), axis=1
        )
        res = np.sum(np.hstack(p.starmap(worker, zip(signal_chunks, a_chunks))), axis=1)
        parallel_times.append(time.monotonic() - start)

print(
    f"ORIGINAL AVG TIME: {np.mean(serial_times):.3f}\t POOLED TIME: {np.mean(parallel_times):.3f}"
)
And here are the results I'm getting on a Ryzen 3700X (8 cores, 16 threads):
ORIGINAL AVG TIME: 0.897 POOLED TIME: 0.315
I would have loved to offer you an FFT library written in OpenCL, but I'm not sure whether you'd have to write the Python bridge yourself (more code) or whether you'd trust the first implementation you came across on GitHub. If you're willing to accept CUDA's vendor lock-in, there is an almost drop-in replacement for NumPy called CuPy, and it has FFT and IFFT kernels. Hope this helps!
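For reference, a minimal sketch of the whole pipeline in CuPy might look like this (untested here; it assumes a working CuPy install and that both arrays fit in GPU memory):
import cupy as cp

A_gpu = cp.asarray(A)                 # copy the inputs to the GPU once
signal_gpu = cp.asarray(signal)

Signal = cp.fft.fft(signal_gpu, axis=1)[:, :128, :]
B = cp.einsum("ijk,jklm->iklm", Signal, A_gpu)
b = cp.fft.ifft(B, axis=1).real
res = cp.asnumpy((b ** 2).sum(1))     # copy the result back to host memory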
CodePudding user response:
With your arrays:
In [42]: Signal.shape
Out[42]: (355, 128, 4)
In [43]: A.shape
Out[43]: (128, 4, 46, 23)
The B calc takes minutes on my modest machine:
In [44]: B = (Signal[..., None, None] * A[None,...]).sum(1)
In [45]: B.shape
Out[45]: (355, 4, 46, 23)
einsum is much faster:
In [46]: B2=np.einsum('ijk,jklm->iklm',Signal,A)
In [47]: np.allclose(B,B2)
Out[47]: True
In [48]: timeit B2=np.einsum('ijk,jklm->iklm',Signal,A)
1.05 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reworking the dimensions to move the 128 to the classic dot positions:
In [49]: B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2))
In [50]: B21.shape
Out[50]: (355, 4, 1058)
In [51]: timeit B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2))
1.04 s ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With a bit more tweaking I can use matmul/@ and cut the time in half:
In [52]: B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
In [53]: timeit B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
497 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: B3.shape
Out[54]: (355, 4, 1, 1058)
In [56]: np.allclose(B, B3[:,:,0,:].reshape(B.shape))
Out[56]: True
Casting the arrays to the matmul format took a fair bit of experimentation. matmul makes optimal use of BLAS-like libraries. You may improve speed by installing better libraries.
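If you want to see which backend your build is using, NumPy can print its build configuration:
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was linked against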