C++ with OpenMP: trying to avoid false sharing for a tightly looped array

I am trying to introduce OpenMP into my C++ code to improve performance, using a simple test case:

#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));

    #ifdef _OPENMP
        auto t1 = omp_get_wtime();
    #else
        auto t1 = std::chrono::steady_clock::now();
    #endif

    for(long int k=0; k<100000; ++k)
    {
        // Each outer iteration refills the whole array in parallel.
        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i + k*k));
        }
    }

    #ifdef _OPENMP
        auto t2 = omp_get_wtime();
        auto duration = t2 - t1;
        cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
    #else
        auto t2 = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
    #endif

    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;    
}

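The program accesses an array in a tight loop and changes its elements, either in parallel or serially. (The first version of the test used an int array of size 10000; the code above uses a double array with NUM = 100000.)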

Build as

g++ -o test test.cpp

g++ -o test test.cpp -fopenmp

The program reported results as:

No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09

OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09

Environment: Intel 10th-generation CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.

~~I suspect that the false sharing in the cache line leads to the bad performance of the OpenMP version code.~~

Update: I have updated the test results after increasing the loop size; the OpenMP version now runs faster, as expected.

Thank you!

CodePudding user response：

Since you're writing C++, use the C++ random number generators, which are safe to use from multiple threads when each thread has its own engine, unlike the legacy C one you're using.
Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
You should touch all your data once before you run the timed loop. That way you ensure that the pages are instantiated and that the data is in or out of cache, depending on the array size.
Your loop is pretty short.
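
To illustrate the points above about touching the data before timing and actually using the result, here is a minimal sketch. The helper names `timed_fill` and `checksum` are hypothetical and not from the question; the fill expression is the one from the question's code. Timing uses std::chrono so the same source also builds without -fopenmp (the pragmas are then ignored).

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper: fills `data` `reps` times and returns the elapsed
// wall-clock seconds of the timed part only.
double timed_fill(std::vector<double>& data, long reps)
{
    const long n = static_cast<long>(data.size());

    // Warm-up: touch every element once, outside the timed region, so the
    // pages are instantiated before the clock starts.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = 0.0;

    const auto t1 = std::chrono::steady_clock::now();
    for (long k = 0; k < reps; ++k)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = std::cos(std::sin(static_cast<double>(i * i + k * k)));
    }
    const auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t2 - t1).count();
}

// Reads every element back so the compiler cannot discard the fill loop
// as dead code; print or otherwise use the returned value.
double checksum(const std::vector<double>& data)
{
    double sum = 0.0;
    for (double x : data)
        sum += x;
    return sum;
}
```

Calling `timed_fill` on a `std::vector<double>` of the desired size and then printing both the returned time and `checksum(data)` keeps the measured work observable. Build with `g++ -O2 -fopenmp` to enable the parallel loops.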

CodePudding user response：

rand() is not thread-safe. Use an array of C++ random-number generators instead, one for each thread. See std::uniform_int_distribution for details.
You can drop the #ifdef _OPENMP variations in your code. In a Bash terminal, you can run your application as OMP_NUM_THREADS=1 ./test.
You can then remove num_threads(4) as well, because the environment variable specifies the amount of parallelism explicitly.
Use Google Benchmark or command-line parameters so you can parameterize the number of threads and array size.
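
The suggestion above to use one generator per thread could look roughly like the following sketch. The function name `fill_random`, the value range [0, 99], and the seed + thread-index seeding are assumptions made for illustration; they are not from the question. The thread count is not hard-coded, so it follows OMP_NUM_THREADS.

```cpp
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: fills `data` with integers in [0, 99], drawing from
// one engine per thread instead of the shared state behind rand().
void fill_random(std::vector<int>& data, unsigned seed)
{
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;
#endif

    // One engine per thread. Seeding with seed + t is an assumption made for
    // brevity; it is not a statistically rigorous way to split streams.
    std::vector<std::mt19937> engines;
    for (int t = 0; t < max_threads; ++t)
        engines.emplace_back(seed + static_cast<unsigned>(t));

    const long n = static_cast<long>(data.size());

    #pragma omp parallel
    {
#ifdef _OPENMP
        std::mt19937& engine = engines[omp_get_thread_num()];
#else
        std::mt19937& engine = engines[0];
#endif
        // The distribution object is private to each thread.
        std::uniform_int_distribution<int> dist(0, 99);

        #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = dist(engine);
    }
}
```

Because no num_threads clause is used, running the program as `OMP_NUM_THREADS=1 ./test` or `OMP_NUM_THREADS=4 ./test` changes the parallelism without recompiling.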

From here, I expect you will see:

The performance when you run OMP_NUM_THREADS=1 ./test is close to that of your non-OpenMP version.
The array of C++ RNGs is faster than calling rand() from multiple threads.
The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.