I am working on an array in parallel with OpenMP (the "work" part below). If I initialize the array in parallel beforehand, the work part takes 18 ms. If I initialize the array serially without OpenMP, the work part takes 58 ms. What causes the worse performance?
The system:
- 2 × Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores / 56 threads total, 2 sockets)
Example code:
const unsigned long array_length = 160000000;
unsigned long sum = 0;
long* array = (long*)malloc(sizeof(long) * array_length);
// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++) {
    array[i] = i;
}
// Time start
// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+ : sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}
// Time End
CodePudding user response:
There are two aspects at work here:
NUMA allocation
In a NUMA system, memory pages can be local to a CPU or remote. By default Linux allocates memory in a first-touch policy, meaning the first write access to a memory page determines on which node the page is physically allocated.
If your malloc is large enough that new memory is requested from the OS (instead of reusing existing heap memory), this first touch happens during the initialization. Because you use static scheduling, the same thread that initialized a chunk of the array will also work on that chunk later. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, its memory will be local.
If you don't parallelize the initialization, all memory ends up local to the main thread's socket, which is worse for the threads running on the other socket.
Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
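If you want to check where the pages actually ended up, Linux can report the NUMA node of each page after the first touch. Below is a minimal sketch using libnuma's move_pages() in query mode; this is my own illustration, not code from the question (assumes Linux with the libnuma headers installed; compile with -fopenmp and link with -lnuma; the eight sample addresses are arbitrary):
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <numaif.h>   // move_pages(); link with -lnuma

int main() {
    const unsigned long n = 160000000;
    long* array = (long*)malloc(sizeof(long) * n);

    // Parallel first touch: each thread faults in its static chunk.
    #pragma omp parallel for schedule(static)
    for (unsigned long i = 0; i < n; i++)
        array[i] = i;

    // Pick a few evenly spaced sample addresses inside the array.
    std::vector<void*> pages;
    for (int s = 0; s < 8; s++)
        pages.push_back(array + s * (n / 8));

    // With nodes == NULL, move_pages() only queries: status[i]
    // receives the NUMA node that page currently resides on.
    std::vector<int> status(pages.size());
    if (move_pages(0, pages.size(), pages.data(), NULL, status.data(), 0) == 0)
        for (size_t i = 0; i < pages.size(); i++)
            printf("sample %zu -> node %d\n", i, status[i]);

    free(array);
    return 0;
}
With a parallel first touch on a two-socket machine you should see samples spread across nodes 0 and 1; with a serial initialization they all report the main thread's node.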
Caching
The same reasoning applies to caches. The initialization pulls array elements into the caches of the CPU doing it. If the same CPU accesses that memory during the second phase, it will be cache-hot and ready to use.
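To reproduce the measurement, it is enough to time just the work loop with omp_get_wtime() and switch the initialization between the serial and the parallel variant. A minimal sketch of such a harness, under the same sizes and thread count as the question (your timings will differ):
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    const unsigned long array_length = 160000000;
    long* array = (long*)malloc(sizeof(long) * array_length);
    unsigned long sum = 0;

    // Variant A: parallel first touch. For variant B, drop the pragma
    // so the main thread touches (and thus places) every page.
    #pragma omp parallel for num_threads(56) schedule(static)
    for (unsigned long i = 0; i < array_length; i++)
        array[i] = i;

    double t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(56) reduction(+ : sum)
    for (unsigned long i = 0; i < array_length; i++)
        if (array[i] < 4)
            sum += array[i];
    double t1 = omp_get_wtime();

    printf("work: %.1f ms, sum = %lu\n", (t1 - t0) * 1e3, sum);
    free(array);
    return 0;
}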
CodePudding user response:
First of all, the explanation by @Homer512 is completely correct.
Now I note that you marked this question "C++", but you're using malloc for your array. That is bad style in C++: you should use std::vector for your simple containers, and std::array for small enough ones.
And then you have a big problem, because std::vector uses "value initialization": the whole array is automatically filled with zeroes by the constructing thread, and there is no way to make OpenMP do that first touch in parallel.
Here is a big trick:
#include <vector>

// Wrapper whose default constructor deliberately does nothing, so
// std::vector's fill-construction leaves the memory untouched.
template<typename T>
struct uninitialized {
    uninitialized() {}
    T val;
    constexpr operator T() const { return val; }
    T operator=(const T& v) { val = v; return val; }
};

std::vector<uninitialized<double>> x(N), y(N);

#pragma omp parallel for
for (int i = 0; i < N; i++)
    y[i] = x[i] = 0.;      // first touch happens here, in parallel
x[0] = 0.; x[N-1] = 1.;
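The empty default constructor is the whole point: the vector's fill-construction now writes nothing, so the first touch of every page happens inside the parallel loop. Afterwards the wrapper converts implicitly back to its value type, so the vectors can be used as usual; for example, continuing the snippet above:
// Continues the snippet above: the conversion operator makes
// uninitialized<double> read like a plain double.
double total = 0.;
#pragma omp parallel for reduction(+ : total)
for (int i = 0; i < N; i++)
    total += y[i];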