I am new to OpenMP and I have to complete a project where I need to solve a heat-conduction problem on a 2D grid using the Jacobi iteration method, parallelised with OpenMP.
Essentially, it is a plate whose four edges are held at fixed temperatures, and I need to work out the unknown temperatures in the interior.
The code has been given to us, and I am expected to do three simple things:
- Time the serial code
- Parallelise the serial code and compare
- Further optimise the parallel code if possible
I have run the serial code and parallelised it to make a comparison. Before any optimisation, the serial code is consistently doing better, and I can't help but think I am doing something wrong.
I will try compiler optimisations for both, but I expected the parallel code to be faster. I have chosen large problem sizes for the matrix, including 100 x 100 and 300 x 300 arrays, and almost every time the serial code does better.
The funny thing is that the more threads I add, the slower it gets.
I understand that for a small problem size the overhead would dominate, but I thought this was a large enough problem size?
This is before any significant optimisation; am I doing something obviously wrong that causes this?
Here is the code:
Serial code:
```c
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int m;
    int n;
    double tol;  // = 0.0001;
    int i, j, iter;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);

    /**
     * @var t, tnew, diff, difmax
     * t is the old temperature array, tnew is the new array
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, difmax;

    /**
     * Timer variables
     * @var start, end
     */
    double start, end;

    printf("%d %d %lf\n", m, n, tol);

    start = omp_get_wtime();

    // initialise temperature array
    for (i = 0; i <= m+1; i++) {
        for (j = 0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }

    // fix boundary conditions
    for (i = 1; i <= m; i++) {
        t[i][0] = 33.0;
        t[i][n+1] = 42.0;
    }
    for (j = 1; j <= n; j++) {
        t[0][j] = 20.0;
        t[m+1][j] = 25.0;
    }

    // main loop
    iter = 0;
    difmax = 1000000.0;
    while (difmax > tol) {
        iter++;

        // update temperature for next iteration
        for (i = 1; i <= m; i++) {
            for (j = 1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        difmax = 0.0;
        for (i = 1; i <= m; i++) {
            for (j = 1; j <= n; j++) {
                diff = fabs(tnew[i][j] - t[i][j]);
                if (diff > difmax) {
                    difmax = diff;
                }
                // copy new to old temperatures
                t[i][j] = tnew[i][j];
            }
        }
    }

    end = omp_get_wtime();

    // print results,
    // loop temperatures commented out to save performance
    printf("iter = %d difmax = %9.11lf\n", iter, difmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i = 0; i <= m+1; i++) {
    //     printf("\n");
    //     for (j = 0; j <= n+1; j++) {
    //         printf("%3.5lf ", t[i][j]);
    //     }
    // }
    // printf("\n");
}
```
Here is the parallel code:
```c
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int m;
    int n;
    double tol;  // = 0.0001;

    /**
     * @brief Integer variables
     * @var i external loop (y column array) counter,
     * @var j internal loop (x row array) counter,
     * @var iter number of iterations,
     * @var numThreads number of threads
     */
    int i, j, iter, numThreads;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);
    numThreads = atoi(argv[4]);

    /**
     * @brief Double variables
     * @var t, tnew -> the variables that hold the temperature: t is the old value and tnew is the new value,
     * @var diff measures the difference,
     * @var diffmax
     * t is the temperature array, I guess it holds the matrix?
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, diffmax, privDiffmax;

    /**
     * Timer variables
     * @var start, end
     */
    double start, end;

    /**
     * @brief Print the problem size & the tolerance
     * This print statement can stay as it is not part of the parallel region.
     * We also print the number of threads along with the problem size & tolerance.
     */
    //printf("%d %d %lf %d\n", m, n, tol, numThreads);

    omp_set_num_threads(numThreads);

    /**
     * @brief Initialise the timer
     */
    start = omp_get_wtime();

    /**
     * @brief Creating the parallel region:
     * Here both loop counters are private:
     */
    #pragma omp parallel private(i, j)
    {
        /**
         * @brief initialise temperature array
         * This can be in a parallel region by itself
         */
        #pragma omp for collapse(2) schedule(static)
        for (i = 0; i <= m+1; i++) {
            for (j = 0; j <= n+1; j++) {
                t[i][j] = 30.0;
            }
        }

        // fix boundary conditions
        #pragma omp for schedule(static)
        for (i = 1; i <= m; i++) {
            t[i][0] = 33.0;
            t[i][n+1] = 42.0;
        }
        #pragma omp for schedule(static)
        for (j = 1; j <= n; j++) {
            t[0][j] = 20.0;
            t[m+1][j] = 25.0;
        }
    }

    // main loop
    iter = 0;
    diffmax = 1000000.0;
    while (diffmax > tol) {
        iter = iter + 1;

        /**
         * @brief update temperature for next iteration
         * Here we have created a parallel for directive, this is the second parallel region
         */
        #pragma omp parallel for private(i, j) collapse(2) schedule(static)
        for (i = 1; i <= m; i++) {
            for (j = 1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        diffmax = 0.0;

        /**
         * @brief Third parallel region that compares the difference
         */
        #pragma omp parallel private(i, j, privDiffmax, diff)
        {
            privDiffmax = 0.0;
            #pragma omp for collapse(2) schedule(static)
            for (i = 1; i <= m; i++) {
                for (j = 1; j <= n; j++) {
                    diff = fabs(tnew[i][j] - t[i][j]);
                    if (diff > privDiffmax) {
                        privDiffmax = diff;
                    }
                    // copy new to old temperatures
                    t[i][j] = tnew[i][j];
                }
            }
            #pragma omp critical
            if (privDiffmax > diffmax)
            {
                diffmax = privDiffmax;
            }
        }
    }

    // add timer for the end
    end = omp_get_wtime();

    // print results,
    // loop temperatures commented out to save performance
    printf("iter = %d diffmax = %9.11lf\n", iter, diffmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i = 0; i <= m+1; i++) {
    //     printf("\n");
    //     for (j = 0; j <= n+1; j++) {
    //         printf("%3.5lf ", t[i][j]);
    //     }
    // }
    // printf("\n");
}
```
Here are some of the benchmarks for the serial code:
I have run the code and tested it; I have commented out the print statements since I don't need to see the output except when testing. The code runs fine, but somehow the parallel version is slower than the serial one.
I am on an 8-core Apple M1 Mac.
I am new to OpenMP and can't help but think I am missing something. Any advice would be appreciated.
CodePudding user response:
The problem comes from the overhead of `collapse(2)` on Clang. I can reproduce the problem with Clang 13.0.1 at both `-O0` and `-O2` on an x86-64 i5-9600KF processor, but not with GCC 11.2.0. Clang generates inefficient code when `collapse(2)` is used: it emits an expensive `div`/`idiv` instruction in the hot loop in order to compute the `i` and `j` indices. Indeed, here is the assembly code of the hot loop of the sequential version (at `-O1` to keep the code compact):
```asm
.LBB0_27:                               # Parent Loop BB0_15 Depth=1
        movsd   xmm3, qword ptr [rbx + 8*rsi]   # xmm3 = mem[0],zero
        movapd  xmm5, xmm3
        subsd   xmm5, qword ptr [rdi + 8*rsi]
        andpd   xmm5, xmm0
        maxsd   xmm5, xmm2
        movsd   qword ptr [rdi + 8*rsi], xmm3
        add     rsi, 1
        movapd  xmm2, xmm5
        cmp     r12, rsi
        jne     .LBB0_27
```
Here is the parallel counterpart (still at `-O1`):
```asm
.LBB3_4:
        mov     rax, rcx
        cqo
        idiv    r12                     # <-------------------
        shl     rax, 32
        add     rax, rdi
        sar     rax, 32
        mov     rbp, rax
        imul    rbp, r13                # <-------------------
        shl     rdx, 32
        add     rdx, rdi
        sar     rdx, 32
        add     rbp, rdx
        movsd   xmm2, qword ptr [r9 + 8*rbp]    # xmm2 = mem[0],zero
        imul    rax, r8                 # <-------------------
        add     rax, rdx
        movapd  xmm3, xmm2
        subsd   xmm3, qword ptr [rsi + 8*rax]
        andpd   xmm3, xmm0
        maxsd   xmm3, xmm1
        movsd   qword ptr [rsi + 8*rax], xmm2
        add     rcx, 1
        movapd  xmm1, xmm3
        cmp     rbx, rcx
        jne     .LBB3_4
```
There are many more instructions to execute because the loop spends most of its time computing indices. You can fix this by not using the `collapse` clause. Theoretically, it should be better to expose more parallelism to compilers and runtimes and let them make the best decisions, but in practice they are not optimal and often need to be assisted/guided. Note that GCC uses a more efficient approach that consists in computing the division only once per block, so compilers can do this optimization.
Results
With `collapse(2)`:
- Sequential: 0.221358 seconds
- Parallel: 0.274861 seconds
Without:
- Sequential: 0.222201 seconds
- Parallel: 0.055710 seconds
Additional notes on performance
For better performance, consider using `-O2` or even `-O3`. Also consider using `-march=native`. `-ffast-math` can also help if you do not use special floating-point (FP) values like NaN and you do not care about FP associativity. Not copying the array every time and instead using a double-buffering method also helps a lot (memory-bound codes do not scale well). Then consider reading a research paper for better performance (trapezoidal tiling can be used to boost the performance even more, but it is quite complex to do). Also note that not using `collapse(2)` reduces the amount of parallelism, which might be a problem on a processor with a lot of cores; in practice, however, having a lot of cores operating on a small array tends to be slow anyway (because of false sharing and communications).
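As a rough sketch of the double-buffering idea (not the assignment's original code): keep two full `(m+2) x (n+2)` grids and swap pointers after each sweep instead of copying `tnew` back into `t`. The pointer names `told`/`tnew` are illustrative, `m`, `n`, `i`, `j`, `tol` and `difmax` are reused from the code above, and an OpenMP `max` reduction replaces the question's `privDiffmax`/`critical` pattern just to keep the fragment short.

```c
// Two heap-allocated grids of size (m+2) x (n+2); boundaries must be set once
// in BOTH grids because the buffers are swapped instead of copied.
double (*told)[n+2] = malloc((m+2) * sizeof *told);
double (*tnew)[n+2] = malloc((m+2) * sizeof *tnew);
// ... initialise told (interior + boundaries) and copy the boundaries into tnew ...

difmax = 1000000.0;
while (difmax > tol) {
    difmax = 0.0;
    #pragma omp parallel for private(i, j) reduction(max:difmax) schedule(static)
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            tnew[i][j] = (told[i-1][j] + told[i+1][j] + told[i][j-1] + told[i][j+1]) / 4.0;
            double diff = fabs(tnew[i][j] - told[i][j]);
            if (diff > difmax) difmax = diff;
        }
    }
    // swap the buffers: the freshly computed grid becomes the "old" one,
    // so no O(m*n) copy is performed per iteration
    double (*tmp)[n+2] = told;
    told = tnew;
    tnew = tmp;
}
```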
Special note for M1 processors
M1 processors are based on a big/little architecture. Such an architecture is good at making sequential code faster thanks to the few "big" cores that run fast (but also consume a lot of space and energy). However, running parallel code efficiently is harder because the "little" cores (which are small and energy efficient) are much slower than the big ones, introducing load-imbalance issues if both kinds of cores run simultaneously (I do not know whether this is the case on the M1 by default). One solution is to control the execution so that only one kind of core is used. Another solution is to use dynamic scheduling so that the work is balanced automatically at runtime (e.g. using the clause `schedule(guided)` or even `schedule(dynamic)`). The second solution tends to add significant overhead and is known to cause other tricky issues on (NUMA-based) computing servers (or even recent AMD desktop processors). It is also important to note that the scaling will not be linear with the number of threads because of the performance difference between the big and little cores. Such architectures are currently poorly supported by many applications because of the above issues.
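For illustration, switching the row loop of the code above to dynamic scheduling is only a clause change; the chunk size of 4 rows below is an arbitrary example, not a tuned value.

```c
// Dynamic scheduling: idle (faster) cores grab the next chunk of rows as they
// finish, which mitigates the big/little load imbalance at the cost of some
// scheduling overhead per chunk.
#pragma omp parallel for private(i, j) schedule(dynamic, 4)
for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
        tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
    }
}
```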