I am trying to introduce OpenMP into my C++ code to improve performance, using a simple test case as shown:
#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>
using std::cout;
using std::endl;
#define NUM 100000
int main()
{
    double data[NUM] __attribute__ ((aligned (128)));
    // Start the timer (OpenMP wall clock if available, std::chrono otherwise)
#ifdef _OPENMP
    auto t1 = omp_get_wtime();
#else
    auto t1 = std::chrono::steady_clock::now();
#endif
    for(long int k=0; k<100000; k++)
    {
        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; i++)
        {
            data[i] = cos(sin(i*i + k*k));
        }
    }
    // Stop the timer and report the elapsed time in seconds
#ifdef _OPENMP
    auto t2 = omp_get_wtime();
    auto duration = t2 - t1;
    cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
#else
    auto t2 = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
#endif
    // Consume the data so the compiler cannot discard the timed loop
    double tempsum = 0.;
    for(long int i=0; i<NUM; i++)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;
}
The test accesses an int array (size = 10000) in a tight loop and changes its elements, either in parallel or serially.
Build with
g++ -o test test.cpp
or
g++ -o test test.cpp -fopenmp
The program reported the following results:
No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09
OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09
Environment: Intel 10th-generation CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.
I suspect that false sharing of cache lines causes the poor performance of the OpenMP version.
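The false-sharing suspicion can be checked with some arithmetic. The sketch below is not part of the original program: count_shared_lines is a hypothetical helper, and it assumes 64-byte cache lines and that schedule(static, chunk) hands chunk c to thread c % threads. It counts how many cache lines would be written by more than one thread for a given array layout.

```cpp
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines (typical for recent Intel CPUs).
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical helper: counts cache lines written by more than one thread
// when n elements of elem_bytes each, starting base_offset bytes past a
// cache-line boundary, are distributed with schedule(static, chunk) over
// `threads` threads. Chunk c is assumed to go to thread c % threads.
// Elements are assumed not to straddle a cache-line boundary.
std::size_t count_shared_lines(std::size_t n, std::size_t chunk,
                               std::size_t threads, std::size_t elem_bytes,
                               std::size_t base_offset) {
    const std::size_t lines =
        (base_offset + n * elem_bytes + kCacheLineBytes - 1) / kCacheLineBytes;
    std::vector<long> owner(lines, -1);      // first thread seen on each line
    std::vector<bool> shared(lines, false);  // line touched by 2+ threads
    std::size_t shared_count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const long thread = static_cast<long>((i / chunk) % threads);
        const std::size_t line =
            (base_offset + i * elem_bytes) / kCacheLineBytes;
        if (owner[line] == -1) {
            owner[line] = thread;
        } else if (owner[line] != thread && !shared[line]) {
            shared[line] = true;
            ++shared_count;
        }
    }
    return shared_count;
}
```

Under these assumptions, the layout in the code above (100000 doubles, chunk size 16, 4 threads, 128-byte alignment) gives count_shared_lines(100000, 16, 4, 8, 0) == 0, because each 16-element chunk covers exactly two whole cache lines. A chunk size of 1 would instead put all 12500 lines under contention.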
Update: I have updated the test results after increasing the loop size; the OpenMP version now runs faster, as expected.
Thank you!
CodePudding user response:
- Since you're writing C++, use the C++ random number generators from <random>, which can be used safely from multiple threads when each thread has its own engine, unlike the legacy C one you're using.
- Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
- You should touch all your data once before you do the timed loop. That way you ensure that pages are instantiated and data is in or out of cache depending.
- Your loop is pretty short.
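Two of the points above (touch the data before timing, and actually use the result) could look roughly like the following. This is a minimal sketch rather than the asker's code: warm_up, kernel and checksum are hypothetical names, and the kernel mirrors the cos(sin(...)) loop from the question. The pragmas are ignored when the file is built without -fopenmp.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Untimed first pass: writes every element once so that the pages backing
// the array are instantiated before the timed loop starts.
void warm_up(std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        data[i] = 0.0;
    }
}

// The loop to be timed, modelled on the loop body in the question.
void kernel(std::vector<double>& data, long k) {
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        data[i] = std::cos(std::sin(static_cast<double>(i * i + k * k)));
    }
}

// Reads every element after timing. Printing or otherwise using the
// returned value keeps the compiler from discarding the kernel as dead code.
double checksum(const std::vector<double>& data) {
    double sum = 0.0;
    for (double x : data) {
        sum += x;
    }
    return sum;
}
```

A caller would run warm_up(data) once, start the clock, call kernel(data, k) in the repeat loop, stop the clock, and then print checksum(data).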
CodePudding user response:
- rand() is not thread-safe. Use an array of C++ random-number generators instead, one for each thread. See std::uniform_int_distribution for details.
- You can drop the #ifdef _OPENMP variations in your code. In a Bash terminal, you can call your application as OMP_NUM_THREADS=1 ./test (a bare test would invoke the shell builtin instead of your program).
- You can then remove num_threads(4) as well, because you can specify the amount of parallelism explicitly through the environment.
- Use Google Benchmark or command-line parameters so you can parameterize the number of threads and array size.
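The per-thread generator suggestion could be sketched as follows. fill_random is a hypothetical function, and std::mt19937 seeded with seed + thread index is just one simple choice of engine and seeding, not something prescribed above. The #ifdef guards only let the same file compile without -fopenmp, in which case a single engine is used.

```cpp
#include <cstddef>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Fills `data` with uniformly distributed integers in [lo, hi], using one
// engine per thread so that no generator state is shared between threads.
void fill_random(std::vector<int>& data, int lo, int hi, unsigned seed) {
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;
#endif
    // One engine per thread, each with a different (illustrative) seed.
    std::vector<std::mt19937> engines;
    for (int t = 0; t < max_threads; ++t) {
        engines.emplace_back(seed + static_cast<unsigned>(t));
    }

    const long n = static_cast<long>(data.size());
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        std::mt19937& engine = engines[tid];
        std::uniform_int_distribution<int> dist(lo, hi);  // private per thread
        #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i) {
            data[i] = dist(engine);
        }
    }
}
```

Because the thread count comes from omp_get_max_threads(), running the program with OMP_NUM_THREADS=1 ./test or OMP_NUM_THREADS=4 ./test changes the parallelism without recompiling.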
From here, I expect you will see:
- The performance when you call OMP_NUM_THREADS=1 ./test is close to your non-OpenMP version.
- The array of C++ random-number generators is faster than calling rand() from multiple threads.
- The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.
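One way to test these expectations across array sizes and thread counts is to take both from the command line. The following is a rough sketch only: Config, Result, parse_args and run_benchmark are hypothetical names, the argument order (array size, then thread count) is an arbitrary choice, and the kernel is borrowed from the question. Without -fopenmp the pragma is ignored and the loop runs serially.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Config {
    long n = 10000;   // array size
    int threads = 1;  // requested number of threads
};

struct Result {
    double seconds;   // elapsed wall-clock time of the timed loops
    double checksum;  // sum of the data, so the work cannot be optimized away
};

// Reads "<program> [array_size] [threads]"; missing or non-positive values
// fall back to 10000 elements / 1 thread and a minimum of 1 respectively.
Config parse_args(int argc, const char* const* argv) {
    Config cfg;
    if (argc > 1) cfg.n = std::strtol(argv[1], nullptr, 10);
    if (argc > 2) cfg.threads = static_cast<int>(std::strtol(argv[2], nullptr, 10));
    if (cfg.n < 1) cfg.n = 1;
    if (cfg.threads < 1) cfg.threads = 1;
    return cfg;
}

// Runs the kernel `repeats` times with the configured size and thread count.
Result run_benchmark(const Config& cfg, long repeats) {
    const long n = cfg.n;
    const int threads = cfg.threads;
    // Zero-initialization also touches every page before timing starts.
    std::vector<double> data(static_cast<std::size_t>(n), 0.0);
    const auto t1 = std::chrono::steady_clock::now();
    for (long k = 0; k < repeats; ++k) {
        #pragma omp parallel for schedule(static) num_threads(threads)
        for (long i = 0; i < n; ++i) {
            data[i] = std::cos(std::sin(static_cast<double>(i * i + k * k)));
        }
    }
    const auto t2 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double x : data) sum += x;
    return {std::chrono::duration<double>(t2 - t1).count(), sum};
}
```

A main function would call parse_args(argc, argv), pass the result to run_benchmark, and print both fields; comparing, say, ./test 10000 1 against ./test 10000 4 and ./test 1000000 4 shows where threading starts to pay off on a given machine.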