Once the program has executed a parallel section, all the threads except for the master get destroyed. As the program enters another parallel section, threads are spawned again. All this causes some overhead which may significantly limit the efficiency of multiprocessing.
Consider a nested loop with the inner one being parallelized:
for (int i = 0; i < outer_loops; i++) {
    // some code which does not affect the execution time noticeably
    #pragma omp parallel for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}
Is there a way to avoid spawning the slave threads anew each time the program enters the inner loop? For example, once the inner loop is done, the master thread takes care of the outer loop, but the slave threads are not destroyed and re-spawned; they just wait for the next iteration of the inner loop.
In my case, there is no way I can include the outer loop in the parallel section, as it must be sequential. I'm trying to minimize the overhead caused by thread spawning.
CodePudding user response:
Once the program has executed a parallel section, all the threads except for the master get destroyed. As the program enters another parallel section, threads are spawned again.
This is wrong in practice. An OpenMP implementation can keep threads alive even outside a parallel section and recycle them. In fact, GOMP (the OpenMP runtime of GCC) and IOMP (the OpenMP runtime of Clang and ICC) do this optimization, as long as some constraints are fulfilled, such as requesting the same number of threads in the different sections.
All this causes some overhead which may significantly limit the efficiency of multiprocessing.
Most of the overhead does not come from the creation of the threads but from the initialisation of some internal data structures, the wake-up of the threads (if needed, depending on your settings), their synchronisation, and the actual worksharing operation. With an active waiting policy (see OMP_WAIT_POLICY), the initialisation of the internal data structures should be the most expensive part, though it is generally pretty fast.
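For completeness (my addition, not part of the original answer): the waiting policy and thread reuse can be influenced through standard OpenMP environment variables set before launching the program, for example:

# Keep idle worker threads spinning between parallel regions
export OMP_WAIT_POLICY=active
# Keep the team size fixed so the runtime can recycle the same threads
export OMP_DYNAMIC=false
./your_program   # placeholder name for your executable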
Is there a way to avoid spawning the slave threads anew each time the program enters the inner loop?
You can use one parallel section and tell OpenMP that a part of it is only executed by the master thread:
#pragma omp parallel
for (int i = 0; i < outer_loops; i++) {
    #pragma omp master
    {
        // some code which does not affect the execution time noticeably
    }
    // This barrier is certainly required in your case so that workers can
    // read the results of the master thread.
    #pragma omp barrier
    // Worksharing construct with an implicit barrier at the end
    #pragma omp for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}
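As a side note (my addition, not from the original answer): since a single construct carries an implicit barrier at its end, it can replace the master + explicit barrier pair, at the price of losing the guarantee that the master thread executes the sequential part:

#pragma omp parallel
for (int i = 0; i < outer_loops; i++) {
    // Executed by one (arbitrary) thread; the implicit barrier at the
    // end of 'single' plays the role of the explicit barrier above.
    #pragma omp single
    {
        // sequential code of the outer iteration
    }
    #pragma omp for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}

Either way, each outer iteration still pays for two barriers.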
The two barriers (the explicit one and the implicit one at the end of the for construct) slow down the execution, but they are mandatory in your case. If the sequential code of the outer loop does not depend on the result of the inner parallel loop, then you can use OpenMP tasks to speed up the computation, as sketched below: tasks reduce the need for expensive synchronizations and provide more parallelism.
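A minimal sketch of the task-based variant (my addition; it requires OpenMP 4.5 for taskloop and assumes that neither the sequential part nor the next inner loop reads the results of the previous inner loop, otherwise task dependencies or explicit synchronization are needed):

#pragma omp parallel
#pragma omp single
{
    for (int i = 0; i < outer_loops; i++) {
        // sequential code (must not read this iteration's inner results)

        // Generate tasks for the inner iterations without waiting for
        // them (nogroup), so idle threads execute them while the
        // generating thread moves on to the next outer iteration.
        #pragma omp taskloop nogroup
        for (int j = 0; j < inner_loops; j++) {
            // some code to run through with multiple threads
        }
    }
    // Wait for all remaining tasks before leaving the parallel region.
    #pragma omp taskwait
}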
Note that this code is fundamentally limited by the speed of the sequential code. Amdahl's law states that this strongly limits the scalability of the approach. Synchronization and worksharing overheads tend to slow down the serial part further, reducing scalability. In the worst case, the parallel code can even be slower than the sequential one. The only solutions are to reduce the time taken by the serial part or to work on more data (see Gustafson's law).
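For reference (a standard statement of the law, not from the answer): with a parallel fraction p of the runtime and n threads, Amdahl's law bounds the speedup by

S(n) = 1 / ((1 - p) + p/n)

For example, with p = 0.9 and n = 8, S(8) ≈ 4.7, and no number of threads can push the speedup past 1/(1 - p) = 10.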
CodePudding user response:
The code in your other question had a number of obvious problems, so I coded it out. I ran the exact same code 10 thousand times, with a variable number of iterations outside and inside the parallel region:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
  // Scaffolding added for completeness: the original snippet assumes
  // N, x, y are declared elsewhere; N is chosen here so that each
  // array takes about a Megabyte, as mentioned below.
  const int N = 125000;
  double *x = (double*)malloc(N*sizeof(double));
  double *y = (double*)malloc(N*sizeof(double));

  const int n_iterations = 10000;
  for ( int outer=1; outer<=n_iterations; outer*=10 ) {
#   pragma omp parallel for
    for (int i=0; i<N; i++)
      y[i] = x[i] = 0.;
    x[0] = 0; x[N-1] = 1.;
    double s=0; int its=0;
    double timer = omp_get_wtime();
    int n_inner = n_iterations/outer;
    for ( int it_outer=0; it_outer<outer; it_outer++ ) {
#     pragma omp parallel
      {
        for ( int it_inner=0; it_inner<n_inner; it_inner++ ) {
#         pragma omp master
          its++;
#         pragma omp for
          for (int i=1; i<N-1; i++)
            y[i] = ( x[i-1] + x[i] + x[i+1] )/3.;
#         pragma omp for
          for (int i=1; i<N-1; i++)
            x[i] = y[i];
        } // end inner iteration
      } // end omp parallel
    } // end outer iteration
#   pragma omp parallel for reduction(+:s)
    for (int i=0; i<N; i++)
      s += y[i];
    timer = omp_get_wtime()-timer;
    printf("outer=%5d, n_inner=%5d, its=%d, time=%7.4f, checksum=%e\n",
           outer,n_inner,its,timer,s);
  }
  free(x); free(y);
  return 0;
}
Results:
outer=    1, n_inner=10000, its=10000, time= 0.0766, checksum=6.464904e+01
outer=   10, n_inner= 1000, its=10000, time= 0.0777, checksum=6.464904e+01
outer=  100, n_inner=  100, its=10000, time= 0.0776, checksum=6.464904e+01
outer= 1000, n_inner=   10, its=10000, time= 0.0807, checksum=6.464904e+01
outer=10000, n_inner=    1, its=10000, time= 0.1142, checksum=6.464904e+01
Timing depends a little on the array size; here I used about a Megabyte per array. You see that the thread overhead is very small: doing the thread "creation" 10 thousand times increases the time by about 50 percent, doing it a thousand times by only about 5 percent.
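For anyone who wants to reproduce this, a typical compile-and-run line would be (the file name, flags, and thread count are my assumption, not stated in the answer):

gcc -O2 -fopenmp bench.c -o bench
OMP_NUM_THREADS=4 ./bench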