How to obtain performance enhancement while multiplying two sub-matrices?


I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.

#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}

The algorithm accesses the elements of both input sub-matrices row-wise to improve cache usage through spatial locality.

What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?

You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.

CodePudding user response:

There are a few OpenMP directives and clauses that can be used to optimize the performance of this matrix multiplication algorithm:

collapse(n) clause : This clause lets you merge multiple nested loops into a single iteration space, exposing more parallelism and giving the runtime more iterations to balance across threads. You can use it to collapse the outer two loops, like this:

#pragma omp parallel for collapse(2)
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }

#pragma omp simd : This directive tells the compiler to vectorize the innermost loop, allowing the CPU to perform multiple operations in parallel using its SIMD (Single Instruction, Multiple Data) capabilities. This can be added to the innermost loop like this:

#pragma omp parallel for collapse(2)
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            #pragma omp simd
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }

schedule clause : This clause allows you to control how the iterations of the loop are divided among the threads. You can use static, dynamic, guided, or auto scheduling to balance the workload among threads.

#pragma omp parallel for collapse(2) schedule(dynamic)
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            #pragma omp simd
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }

#pragma omp atomic : This directive ensures that each update of the resultMatrix(i, j) element is atomic, preventing the race conditions that can occur when multiple threads try to update the same element at the same time.

#pragma omp parallel for collapse(2) schedule(dynamic)
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            #pragma omp simd
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                #pragma omp atomic
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }

It's worth noting that the performance gain from these OpenMP directives will depend on the specific hardware, compiler, and other factors.

CodePudding user response:

For the matrix-matrix product, any permutation of the i, j, k indices computes the right result sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you cannot simply collapse the outer two loops. Interchange the k and j loops and then the collapse is allowed.
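
A minimal sketch of that interchange (assuming the same matrixA/matrixB/resultMatrix accessors as in the question and double-valued elements): with i and j as the outer pair, every collapsed iteration owns its own resultMatrix(i, j), so collapse(2) is race-free.

#pragma omp parallel for collapse(2)
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
        double sum = 0.0; // private accumulator, one per (i, j)
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            sum += matrixA(i, k) * matrixB(k, j);
        }
        resultMatrix(i, j) += sum; // single write per element, no atomic needed
    }
}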

Of course, OpenMP only gets you from 5 percent efficiency on one core to 5 percent efficiency on all cores. You really want to block (tile) the loops, but that is a lot harder. See the paper by Goto and van de Geijn.
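
For reference, a rough sketch of one level of blocking (far from the full Goto/van de Geijn scheme; it assumes an edge size that is a multiple of the block size, and the block size 64 is only a placeholder to tune):

constexpr size_t BS = 64; // block size, tune for your cache

#pragma omp parallel for collapse(2)
for(size_t ii = 0; ii < matrixA.m_edgeSize; ii += BS) {
    for(size_t jj = 0; jj < matrixA.m_edgeSize; jj += BS) {
        for(size_t kk = 0; kk < matrixA.m_edgeSize; kk += BS) {
            // multiply one BS x BS tile of A by one BS x BS tile of B;
            // the three tiles are small enough to stay resident in cache
            for(size_t i = ii; i < ii + BS; i++) {
                for(size_t k = kk; k < kk + BS; k++) {
                    for(size_t j = jj; j < jj + BS; j++) {
                        resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
                    }
                }
            }
        }
    }
}

Each thread owns a distinct (ii, jj) tile of the result, so the kk loop can run sequentially inside it without races.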

CodePudding user response:

I'm adding something related to the container matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used across different iterations and is likely to benefit from the CPU cache. For example, if a matrix is split into 4 sub-matrices, each sub-matrix is used twice to compute a block of the result matrix.

To benefit most from the cache, the re-used data should stay in the cache of the same thread (core). To do this, it may be better to move the work distribution up to the level where you select the two sub-matrices.

So, something like this:

select sub-matrix A
    #pragma omp parallel for
    select sub-matrix B
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }
    

could work faster, since the whole data set always stays with the same thread (core).
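
One possible concretization of that pseudocode (the nBlocks count, the containerA/containerB/containerR matrices, and the subMatrix(...) helper are hypothetical names, used only to illustrate distributing whole sub-matrix products instead of individual rows):

// Hypothetical setup: each container matrix is split into an
// nBlocks x nBlocks grid of square sub-matrices, and subMatrix(M, r, c)
// is an assumed helper returning a view of one such block.
for(size_t br = 0; br < nBlocks; br++) {
    for(size_t bk = 0; bk < nBlocks; bk++) {
        auto matrixA = subMatrix(containerA, br, bk);   // "select sub-matrix A"

        #pragma omp parallel for
        for(size_t bc = 0; bc < nBlocks; bc++) {        // "select sub-matrix B"
            auto matrixB = subMatrix(containerB, bk, bc);
            auto resultMatrix = subMatrix(containerR, br, bc);

            // serial kernel from the question; each thread keeps re-using the
            // same matrixA block while streaming through its own matrixB block
            for(size_t i = 0; i < matrixA.m_edgeSize; i++)
                for(size_t k = 0; k < matrixA.m_edgeSize; k++)
                    for(size_t j = 0; j < matrixA.m_edgeSize; j++)
                        resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}

Different bc values write to different result blocks, so no two threads ever update the same element and no atomic is needed.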
