I have this fragment of code which solves Newton's equations of motion and gives the positions and velocities of the particles for the next step. deltat_s and dtSqr are constants; X[], Z[] and V_X[], V_Z[] are double arrays holding the position and velocity values respectively, and a_x[], a_z[] hold the acceleration values, where X and Z represent the x and z components of the vectors. The fragment basically consists of two for loops, as given below. The function ComputeForces() calculates the forces and updates the a_x[] and a_z[] values, and after that update V_X and V_Z are computed again. V_X[] and V_Z[] are updated in the 1st loop because ComputeForces() needs the updated values to calculate a_x[] and a_z[].
for(i = 0; i < N_TOTAL; i++)
{
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
ComputeForces ();
for(i = 0; i < N_TOTAL; i++)
{
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
I have parallelized the code as follows:
#pragma omp parallel for ordered schedule(guided, 4) default(shared) private(i)
for(i = 0; i < N_TOTAL; i++)
{
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
ComputeForces ();
#pragma omp parallel for schedule(guided, 16) default(shared) private(i)
for(i = 0; i < N_TOTAL; i++)
{
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
The issue I am facing is that when I parallelize the loops, the code takes more time to execute than the serial version. The times I obtained are as follows:
serial code : 3.546000e-02
parallel code : 4.632966e-02
It is a slight difference, but when the values are updated for a longer duration, in other words when the loops are executed say 1 x 10^6 times, this small increase in time adds up and slows the code down. I am unable to find the issue. I guess there could be false sharing of the V_X[] and V_Z[] data. I have tried changing the chunk size to reduce false sharing and I have tried different scheduling, but I am unable to make it faster than the serial code.
Edit:
To check this, I ran 30000 steps. In each step the above code is evaluated once, followed by a few other calculations in serial. The times taken are as follows. Serial code : 1.102543e+03 and parallel code : 1.363923e+03. This is the problem I was talking about.
The processor I am using is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz x 40, and N_TOTAL = 1000.
CodePudding user response:
There are several issues with the parallelization method.
First of all, schedule(guided, 4) can cause false-sharing issues due to cache lines being shared between threads: double-precision (DP) floating-point (FP) numbers take 8 bytes in memory and a cache line is 64 bytes on your target processor (not to mention that some Intel processors prefetch data in 128-byte pairs of lines). Using a bigger scheduling grain, e.g. >= 16, should help.
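As an illustration (a sketch only, not a tested drop-in replacement), a plain static schedule gives each thread one large contiguous block of indices, so threads can only share the cache lines that straddle the block boundaries:
/* Sketch: same update loop with a static schedule. Each thread owns one
   contiguous block, so false sharing is limited to the block boundaries.
   The `ordered` clause is dropped because no ordered region appears in the
   snippet shown. */
#pragma omp parallel for schedule(static) default(shared) private(i)
for (i = 0; i < N_TOTAL; i++)
{
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}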
Moreover, a guided schedule causes issues on NUMA systems like your target platform. Put shortly, each processor has its own RAM, and while the other processors of the same compute node can access that RAM, remote accesses are more expensive and can be imbalanced. Indeed, a guided schedule causes cores to perform non-local accesses in a non-deterministic way. In the worst case, the data may all lie on a single NUMA node, saturating that node while the other stays unused. The NUMA policy must be carefully controlled to mitigate/avoid NUMA effects. Since your code is memory bound, this effect can make the parallelisation completely useless (using more cores when the RAM is saturated does not make a code run faster). It is better to use a static schedule with the data well distributed amongst the NUMA nodes. A common solution is to fill the arrays in parallel during initialization so the first-touch policy (generally the default one) distributes the memory pages properly (assuming the schedule clause is the same). Alternatively, you can use numactl. You can find more information in this post and this one. Besides, note that binding threads often has a significant impact on performance in this context (using 4 cores on each of 2 NUMA nodes should give better performance than 8 cores on the same NUMA node, assuming your application handles NUMA systems well).
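As a rough sketch of the first-touch approach (assuming the arrays are ordinary statically- or heap-allocated buffers and that the compute loops use the same static schedule), the initialization could look like this; the initial values below are placeholders:
/* Touch each element in parallel with the same static schedule as the
   compute loops, so the first-touch policy maps each memory page to the
   NUMA node of the thread that will use it later. */
#pragma omp parallel for schedule(static) default(shared) private(i)
for (i = 0; i < N_TOTAL; i++)
{
    X[i] = 0.0;   Z[i] = 0.0;   /* placeholder initial positions  */
    V_X[i] = 0.0; V_Z[i] = 0.0; /* placeholder initial velocities */
    a_x[i] = 0.0; a_z[i] = 0.0;
}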
Finally, the number of iterations is certainly too small for multiple threads to be useful. Indeed, the loop can be vectorized using the AVX and FMA3 instruction sets, which process 4 double-precision numbers per instruction, and the target processor can execute 2 FMA instructions per cycle, i.e. 8 DP FP numbers per cycle. Thus, I expect the computation time to be clearly less than a microsecond, with most of the time spent in memory accesses. This is especially true in parallel, since the guided schedule causes data to move between the caches of different cores rather than staying in the L1/L2 caches (which are significantly faster than the shared L3 cache). Moreover, distributing the work between cores takes some time (especially when the target cores are on another processor). One should also note that each parallel for ends with an implicit barrier, which adds further overhead. All these overheads together are generally significantly more than 1 us (though this is very dependent on the target machine and the OpenMP runtime).
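If the amount of work per step cannot be increased, one simple way to avoid paying these overheads for a small N_TOTAL is OpenMP's if clause, which keeps the loop serial below a threshold. This is only a sketch; PARALLEL_THRESHOLD is a made-up tuning constant that would have to be determined by measurement, not a value from the original code:
#define PARALLEL_THRESHOLD 100000  /* hypothetical cutoff, tune by measurement */
#pragma omp parallel for schedule(static) default(shared) private(i) if(N_TOTAL > PARALLEL_THRESHOLD)
for (i = 0; i < N_TOTAL; i++)
{
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}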
Note that AVX/AVX2 is only used if your code is compiled with -march=native, -mavx or similar flags. By default, compilers target older instruction sets which are slower but compatible with (much) older machines.
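For example, with GCC the code could be built along these lines (the source file name is just a placeholder); -O3 also enables auto-vectorization:
gcc -O3 -march=native -fopenmp nbody.c -o nbody -lm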
CodePudding user response:
There are two very likely explanations for your observation:
- The amount of calculation within each parallel region is too small (a loop executed 30000 times with only a few simple operations per iteration is nothing nowadays). OpenMP has some overhead that can cancel out the benefit of parallel execution in such cases.
- Your code is memory bound, or it becomes memory bound once parallelized on 8 threads. "Memory bound" means that the bottleneck is not the CPU load but the data transfer rate between the memory (RAM) and the CPU. In that case no performance improvement can be obtained with parallelization. When working on arrays, a code is typically memory bound when it performs only a few simple operations per array element (see the rough estimate after this list).
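As a rough, back-of-the-envelope illustration of the second point for the first loop above: each iteration loads X[i], Z[i], V_X[i], V_Z[i], a_x[i], a_z[i] and stores X[i], Z[i], V_X[i], V_Z[i], i.e. about 10 x 8 = 80 bytes of traffic for only on the order of 12-16 floating-point operations, which is well under 0.2 FLOP per byte. Once the arrays no longer fit in the caches, such a loop is limited by memory bandwidth rather than by the number of cores (and with N_TOTAL = 1000 the arrays are so small that the per-region thread-management overhead dominates instead).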