I'm looking to multithread a for loop using OpenMP. As I understand it, when you do a loop like:
#pragma omp parallel for num_threads(NTHREADS)
for (size_t i = 0; i < length; i++)
{
    ...
}
All the threads will just grab an i and move on with their work. For my implementation, I need them to work "sequentially" in parallel. By that I mean that, e.g., for a length of 800 with 8 threads, I need thread 1 to work on iterations 0 to 99, thread 2 on 100 to 199, and so on.
Is this possible with OpenMP?
CodePudding user response:
Your desired behavior is the default. The loop can be scheduled in several ways, and schedule(static) is the default: the loop gets divided into blocks, and the first thread takes the first block, et cetera. So your initial understanding was wrong: a thread does not grab an index, but a block.
Just to note: if you want each thread to grab smaller blocks, you can specify schedule(static,8) or whatever number suits you, but chunks much smaller than that can run into cache performance problems: adjacent threads end up touching the same cache line (false sharing).
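As a minimal sketch of the default behavior (assuming the 800 iterations and 8 threads from the question; the modulus is only there to keep the output short), each thread can report where its block starts:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int length = 800;
    /* schedule(static) with no chunk size: each of the 8 threads gets one
       contiguous block of 100 iterations (assuming all 8 threads are granted) */
    #pragma omp parallel for num_threads(8) schedule(static)
    for (int i = 0; i < length; i++)
    {
        if (i % 100 == 0) /* first iteration of each 100-iteration block */
            printf("thread %d starts at i = %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
Compiled with, e.g., gcc -fopenmp, this should print one line per thread (in nondeterministic order), with thread 0 starting at 0, thread 1 at 100, and so on.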
CodePudding user response:
From the OpenMP specification:
schedule([modifier [, modifier]:]kind[, chunk_size])
When kind is static, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.
When the monotonic modifier is specified then each thread executes the chunks that it is assigned in increasing logical iteration order.
For a team of p threads and a loop of n iterations, let ⌈n/p⌉ be the integer q that satisfies n = p * q - r, with 0 <= r < p. One compliant implementation of the static schedule (with no specified chunk_size) would behave as though chunk_size had been specified with value q. Another compliant implementation would assign q iterations to the first p - r threads, and q - 1 iterations to the remaining r threads. This illustrates why a conforming program must not rely on the details of a particular implementation.
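For example, with n = 10 iterations, p = 4 threads and no chunk_size: q = ⌈10/4⌉ = 3 and r = 2 (since 10 = 4 * 3 - 2), so the first implementation above assigns chunks of 3, 3, 3 and 1 iterations to threads 0-3, while the second assigns 3, 3, 2 and 2. With an explicit chunk_size of 2, the five chunks {0,1}, {2,3}, {4,5}, {6,7}, {8,9} are dealt out round-robin, so thread 0 executes {0,1} and then {8,9}.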
The default schedule is taken from def-sched-var and it is implementation defined, so if your program relies on it, define it explicitly:
schedule(monotonic:static,chunk_size)
In this case it is clearly defined how your program behaves and does not depend on the implementation at all. Note also that your code should not depend on the number of threads, because OpenMP does not guarantee that your parallel region gets all the requested/available threads. Note also that the monotonic modifier is the default in the case of the static schedule, so you do not have to state it explicitly.
So, you should use something like
#include <omp.h>
#include <stddef.h>

#define NTHREADS 8

void work(size_t length)
{
    size_t chunk_size; /* shared: the value set inside 'single' is visible to all threads */
    #pragma omp parallel num_threads(NTHREADS)
    {
        #pragma omp single
        {
            int nthreads = omp_get_num_threads();
            /* ceiling division, so there is at most one chunk per thread */
            chunk_size = (length + nthreads - 1) / nthreads;
        } /* implicit barrier: all threads see chunk_size before the loop */
        #pragma omp for schedule(static, chunk_size)
        for (size_t i = 0; i < length; i++)
        {
            /* ... work on iteration i; monotonic is implied for static ... */
        }
    }
}
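With length = 800 and NTHREADS = 8 this yields chunk_size = 100, so thread 0 executes iterations 0-99, thread 1 executes 100-199, and so on, which is exactly the mapping asked for in the question. Computing the chunk size from omp_get_num_threads() rather than from NTHREADS keeps the partitioning correct even if the runtime grants fewer threads than requested.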