Is there any way to boost matrix multiplication using multiple GPUs?

Time:12-30

I want to multiply two huge matrices, each with more than 100,000 rows and columns. I run the task on a server with several GPUs, say 8 RTX 3090s with 24GB of RAM each. Obviously a matrix that large cannot fit in a single GPU's memory, so I cannot use cupy.array directly. Here is my idea:

  1. store the two matrices in main memory as numpy.array
  2. cut them into blocks, e.g. 4 or 9 blocks each
  3. send the blocks to the GPUs and compute the block products
  4. copy the resulting blocks back to main memory and reassemble them
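A minimal CPU sketch of these four steps, using numpy as a stand-in for the per-GPU work (the helper name `blocked_matmul` is my own illustration; on the server each block product would go through `cp.asarray` on a chosen device and `cp.asnumpy` on the way back):

```python
import numpy as np

def blocked_matmul(A, B, nb=2):
    """Multiply A @ B by cutting both matrices into an nb x nb grid of blocks."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % nb == 0
    s = n // nb  # block size
    C = np.zeros((n, n))
    for i in range(nb):          # row of output blocks
        for j in range(nb):      # column of output blocks
            for k in range(nb):  # accumulate the partial products
                # on a GPU server, this product would run as
                # cp.asarray(A_blk) @ cp.asarray(B_blk) on some device
                C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                    A[i*s:(i+1)*s, k*s:(k+1)*s] @ B[k*s:(k+1)*s, j*s:(j+1)*s]
                )
    return C

A = np.random.random((8, 8))
B = np.random.random((8, 8))
assert np.allclose(blocked_matmul(A, B), A @ B)
```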

Here are my questions:

  1. Is there any Python library that implements this idea automatically?
  2. I want to use the GPUs in parallel. I think the bottleneck is the data transfer between main memory and GPU memory (numpy.array -> cupy.array). Can I move the data in parallel using the multiprocessing library? And what about the PCIe bus?
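On question 2, here is a hedged sketch of dispatching block products to parallel workers. ThreadPoolExecutor and plain numpy stand in for the real setup; with cupy the usual pattern is one thread per GPU that enters `with cp.cuda.Device(i):` before allocating and multiplying (multiprocessing also works, but each process gets its own CUDA context). Whether host-to-device copies actually overlap on the PCIe bus additionally requires pinned host memory and separate streams.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_product(args):
    # hypothetical worker: on a real server, select a GPU here, e.g.
    #   with cp.cuda.Device(dev_id):
    #       return cp.asnumpy(cp.asarray(a) @ cp.asarray(b))
    a, b = args
    return a @ b

n, s = 8, 4  # matrix size and block size (2 x 2 blocking)
A = np.random.random((n, n))
B = np.random.random((n, n))

# one task per (i, j, k) partial product
tasks, coords = [], []
for i in range(0, n, s):
    for j in range(0, n, s):
        for k in range(0, n, s):
            tasks.append((A[i:i+s, k:k+s], B[k:k+s, j:j+s]))
            coords.append((i, j))

C = np.zeros((n, n))
with ThreadPoolExecutor(max_workers=4) as pool:
    # results come back in task order; accumulation happens in the main thread
    for (i, j), part in zip(coords, pool.map(block_product, tasks)):
        C[i:i+s, j:j+s] += part

assert np.allclose(C, A @ B)
```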

NOTE:

  1. assume the matrices are not sparse.
[[a1,b1],   *   [[a2,b2],   =   [[a1a2 + b1c2, a1b2 + b1d2],
 [c1,d1]]        [c2,d2]]        [c1a2 + d1c2, c1b2 + d1d2]]
import cupy as cp
import numpy as np

N = 27000
P = 27000

# init two matrices
source1 = np.random.random((N * 2, P * 2))
source2 = np.random.random((N * 2, P * 2))

# cut them in blocks
a1 = source1[:N, :P]
b1 = source1[:N, P:]
c1 = source1[N:, :P]
d1 = source1[N:, P:]

a2 = source2[:N, :P]
b2 = source2[:N, P:]
c2 = source2[N:, :P]
d2 = source2[N:, P:]

# move a1 and a2 to one gpu and compute the first partial product
m1 = cp.array(a1)
m2 = cp.array(a2)
r1 = m1 @ m2
# free memory so that m3 and m4 can fit in the gpu's ram
del m1
del m2

# move b1 and c2 to the same gpu and compute the second partial product
m3 = cp.array(b1)
m4 = cp.array(c2)
r2 = m3 @ m4
del m3
del m4

# accumulate: the top-left result block is a1 @ a2 + b1 @ c2
r1 += r2
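The pattern above targets the top-left output block, a1 @ a2 + b1 @ c2. On small matrices the full 2x2 block identity, including the reassembly step, can be checked with plain numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
S1 = rng.random((2 * N, 2 * N))
S2 = rng.random((2 * N, 2 * N))

# cut both matrices into 2x2 grids of blocks
a1, b1, c1, d1 = S1[:N, :N], S1[:N, N:], S1[N:, :N], S1[N:, N:]
a2, b2, c2, d2 = S2[:N, :N], S2[:N, N:], S2[N:, :N], S2[N:, N:]

# reassemble the four output blocks and compare with the direct product
result = np.block([
    [a1 @ a2 + b1 @ c2, a1 @ b2 + b1 @ d2],
    [c1 @ a2 + d1 @ c2, c1 @ b2 + d1 @ d2],
])
assert np.allclose(result, S1 @ S2)
```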

CodePudding user response:

Python has a dedicated library for this, PyCUDA: https://documen.tician.de/pycuda/

Simple example:

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print(dest - a * b)

CodePudding user response:

Look into the "cuBLAS Multi-GPU Extension": https://developer.nvidia.com/cublas

You'll have to apply for the early-access program. Existing Python libraries probably won't take advantage of this extension, but you may be able to enable it simply by updating your CUDA libraries. You'd have to read the documentation once you have access.
