Can I use the cores in my GPU to accelerate this problem and speed it up? If so, how do I do it? At around 10 trillion iterations a single thread on my CPU just can't do it, which is why I want to accelerate it with my GPU. I am also interested in seeing any multi-threaded CPU answers, but I really want to see it done on the GPU. Ideally I'd like the answers to be as simple as possible.
My code:
y = 0
for x in range(1, 10000000000):
    y += 1/x
print(y)
CodePudding user response:
Yes, this operation can be done on a GPU using a basic parallel reduction. In fact, it is well suited to GPUs since it is mainly embarrassingly parallel and makes heavy use of floating-point/integer operations.
Note that the asymptotic behaviour of such a basic series is AFAIK well known analytically (as pointed out by @jfaccioni), and thus you should prefer analytical solutions, which are generally far cheaper to compute. Note also that client-side GPUs are not great at computing 64-bit floating-point (FP) numbers efficiently, so you should generally use 32-bit ones so as to see a speed-up, at the expense of a lower precision. That being said, server-side GPUs can compute 64-bit FP numbers efficiently, so the best solution depends on the hardware you actually have.
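For reference, the analytical route here is just a couple of floating-point operations: the partial sums of 1/x (the harmonic numbers) grow like ln(n) plus the Euler-Mascheroni constant. A minimal sketch, assuming the goal is the sum of 1/x for x from 1 to n (the function name is just for illustration):
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_approx(n):
    # H_n = 1 + 1/2 + ... + 1/n ~ ln(n) + gamma + 1/(2n) - 1/(12n^2)
    return math.log(n) + GAMMA + 1.0 / (2.0 * n) - 1.0 / (12.0 * n * n)

# range(1, 10_000_000_000) covers x = 1 .. 9_999_999_999
print(harmonic_approx(10_000_000_000 - 1))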
Nvidia GPUs are usually programmed with CUDA, which is pretty low-level compared to basic pure-Python code. There are Python wrappers and higher-level libraries, but in this case most are not efficient since they will cause unnecessary memory loads/stores or other overheads. AFAIK, PyCUDA and Numba are likely the best tools for that yet. If your GPU is not an Nvidia GPU, then you can use libraries based on OpenCL (as CUDA is not really well supported on non-Nvidia GPUs yet).
Numba supports high-level reductions, so this can be done very easily (note that this uses Numba's CUDA backend, so you need an Nvidia GPU):
from numba import cuda, int64

# "out" is an array with 1 FP item that must be initialized to 0.0
@cuda.jit
def vec_add(out):
    x = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bdx = cuda.blockDim.x
    # Cast to 64-bit: the global thread index does not fit in 32 bits for this range
    i = int64(bx) * bdx + x
    if i < 10_000_000_000 - 1:
        cuda.atomic.add(out, 0, 1.0 / (i + 1))
This is only the GPU kernel, not the whole code. For more information about how to run it, please read the documentation. In general, one needs to care about data transfers, allocations, kernel dependencies, streams, etc. Keep in mind that GPUs are hard to program (efficiently). This kernel is simple but clearly not optimal, especially on old GPUs without hardware-accelerated atomic units. To write a faster kernel, you need to perform a local reduction using an inner loop. Also note that C++ is much better for writing efficient kernel code, especially with libraries like CUB (based on CUDA), which supports iterators and high-level, flexible, efficient primitives.
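For completeness, here is a minimal host-side sketch of how such a kernel could be launched (the 256-thread block size and the one-term-per-thread mapping are just assumptions to keep it short, and one atomic add per thread will be slow in practice):
import numpy as np
from numba import cuda

N = 10_000_000_000

# Single-item float64 output array, initialized to 0.0 on the device
out = cuda.to_device(np.zeros(1, dtype=np.float64))

# Enough 256-thread blocks to cover the N-1 terms of the sum
threads_per_block = 256
blocks_per_grid = (N - 1 + threads_per_block - 1) // threads_per_block
vec_add[blocks_per_grid, threads_per_block](out)

# Copy the single-item result back to the host
print(out.copy_to_host()[0])
A faster version would make each thread accumulate many terms in a local variable and issue a single atomic add at the end, as mentioned above.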
Note that Numba can also be used to implement fast parallel CPU codes. Here is an example:
import numba as nb

@nb.njit('float64(int64)', fastmath=True, parallel=True)
def compute(limit):
    y = 0.0
    # prange lets Numba parallelize the loop and treat y as a reduction
    for x in nb.prange(1, limit):
        y += 1 / x
    return y

print(compute(10000000000))
This takes only 0.6 seconds on my 10-core CPU machine. CPU code has the benefit of being simpler, easier to maintain, more portable and more flexible, despite possibly being slower.
CodePudding user response:
(Nvidia CUDA) GPUs can be used with the proper modules installed. However, a standard multiprocessing solution (not multithreading, since this is a computation-only task) is easy enough to achieve, and reasonably efficient:
import multiprocessing as mp

def inv_sum(start):
    # Sum 1/x over a chunk of one million consecutive integers
    return sum(1/x for x in range(start, start + 1000000))

def main():
    # One worker per CPU core; each chunk of the range goes to one worker
    pool = mp.Pool(mp.cpu_count())
    result = sum(pool.map(inv_sum, range(1, 1000000000, 1000000)))
    print(result)

if __name__ == "__main__":
    main()
I didn't dare to test-run it on one trillion, but one billion on my 8-core i5 laptop runs in about 20 seconds.