OpenMP parallel loop much slower than regular loop


The whole program has been shrunk to a simple test:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
    #pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

In the first case it takes about 3 seconds to run through the loop; in the second case the result is inconsistent and may be anywhere from 4 to 9 seconds. Both loops run faster with some optimization enabled (like favouring speed and whole program optimization), but the second loop is still significantly slower. I tried adding a barrier at the end of the loop and explicitly specifying the array as shared, but that didn't help. The only case in which I managed to make the parallel loop run faster was by making the loop bodies empty. What may be the problem?

Windows 10 x64, CPU Intel Core i5 10300H (4 cores)

CodePudding user response:

As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is the typical case where one can experience it. However, there are also quite a few issues in your code, such as:

  • You will likely see overflows, both in your loops variable and in all of your j and k arrays;
  • Your timer isn't really the best choice ever (admittedly, that is a bit pedantic on my part in this case);
  • You do not use the values you computed, which allows the compiler to optimize the various computations away entirely;
  • Your two loops are not equivalent and won't give the same results; to make them comparable, I went back to your original i % 4 indexing and added a schedule( static, 1 ) clause (see the short sketch just after this list). This is not the proper way of doing it, but it reproduces the expected results without using the correct reduction clause yet.
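
As a side note (this sketch is not from the original answer), here is a minimal illustration of why schedule( static, 1 ) with 4 threads makes the parallel loop match the sequential i % 4 indexing: with a chunk size of 1, iterations are dealt out round-robin, so thread t executes exactly the iterations with i % 4 == t:

    #include <cstdio>
    #include <omp.h>

    int main() {
        #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
        for ( int i = 0; i < 8; i++ ) {
            // With a chunk size of 1, iteration i goes to thread i % 4,
            // so each thread only ever touches its own index of the array.
            std::printf( "iteration %d -> thread %d\n", i, omp_get_thread_num() );
        }
        return 0;
    }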

Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
    for ( long long i = 0; i < loops; i++ ) {
        j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) reduction( + : l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

Compiling and running without optimizations gives on my laptop:

$ g++ -O0 -fopenmp false.cc
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

Which speaks for itself with regard to the improvement the reduction clause brings. Now, enabling compiler optimizations gives a more nuanced picture:

$ g++ -O3 -fopenmp false.cc
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

If anything, this shows that compilers are quite good at mitigating most false sharing nowadays. Indeed, with your initial (erroneous) k[omp_get_thread_num()] indexing, there was no time difference with and without the reduction clause, showing that the compiler was able to avoid the issue.
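
For completeness, another common way to get rid of the false sharing, which the answer above does not show, is to pad or align each slot of the array so that every thread writes to its own cache line. A minimal sketch, assuming 64-byte cache lines (typical on x86) and 4 threads:

    #include <iostream>
    #include <omp.h>

    // Each counter is aligned (and therefore padded) to 64 bytes, so the
    // four slots end up on four different cache lines.
    struct alignas( 64 ) PaddedCounter {
        long long value;
    };

    int main() {
        const long long loops = 10000000000LL;
        PaddedCounter m[4] = { { 1 }, { 2 }, { 3 }, { 4 } };

        double time = omp_get_wtime();
        #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
        for ( long long i = 0; i < loops; i++ ) {
            m[i % 4].value += 2;   // no two threads share a cache line any more
        }
        std::cout << "padded: " << omp_get_wtime() - time << std::endl;

        // Use the results so the compiler cannot discard the loop.
        std::cout << m[0].value + m[1].value + m[2].value + m[3].value << std::endl;
        return 0;
    }

The reduction clause remains preferable here, since it removes the concurrent writes to shared memory altogether, but padding is the usual fix when the shared array itself has to stay visible to all threads.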
