How to fully parallelise sequential inner loops in Fortran with OpenMP-CodePudding

I'm attempting to parallelise some old fixed format Fortran code with OpenMP. I can't figure out how to fully parallelise the following structure of nested loops, with one outer loop and 2 sequential inner loops:

      do y = 1,ny
        do x = 1,nx
          calculation 1
        enddo
        intermediate calculation (calculation 1)
        do x = 1,nx
          calculation 2 (intermediate calculation)
        enddo
      enddo

calculation 2 is such that it cannot be included in the first inner loop, and must be included in the outer loop as opposed to in a separate outer loop. This is due to a dependency on an intermediate calculation, itself dependent on all values of calculation 1

I am compiling with gfortran, and setting the environment variable OMP_NUM_THREADS=4. The following code illustrates the 3 approaches I've tested:

c     Test of using OpenMP to parallelise sequential inner loops and an
c       outer loop
c     Use export OMP_NUM_THREADS=4
      program parallelTest

c     declarations
      implicit none
      integer nx, ny
      parameter (nx = 2, ny = 5)
      integer omp_get_thread_num, i, j, A(nx,ny), B(nx,ny), C(nx,ny),
     >    D(nx,ny), E(nx,ny), F(nx,ny)

c     initialisation
      A = 7
      B = 7
      C = 7
      D = 7
      E = 7
      F = 7

c     attempt 1: just executes first loop, 2nd loop is ignored
c$omp parallel do shared(A,B) private(i,j) schedule(static) collapse(2)
      do j = 1,ny
        do i = 1,nx
          A(i,j) = omp_get_thread_num()
        enddo
        do i = 1,nx
          B(i,j) = omp_get_thread_num()
        enddo
      enddo
c$omp end parallel do

c     attempt 2: only parallelises outer loop
c$omp parallel do shared(C,D) private(i,j) schedule(static)
      do j = 1,ny
        do i = 1,nx
          C(i,j) = omp_get_thread_num()
        enddo
        do i = 1,nx
          D(i,j) = omp_get_thread_num()
        enddo
      enddo
c$omp end parallel do

c     attempt 3: only parallelises inner loops
c$omp parallel shared(E,F) private(i,j)
      do j = 1,ny
c$omp   do schedule(static)
        do i = 1,nx
          E(i,j) = omp_get_thread_num()
        enddo
c$omp   end do
c$omp   do schedule(static)
        do i = 1,nx
          F(i,j) = omp_get_thread_num()
        enddo
c$omp   end do
      enddo
c$omp end parallel

c     print output to terminal
      do i = 1,nx
        print *, 'A(', i, ',:) = ', A(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'B(', i, ',:) = ', B(i,:)
      enddo
      print *
      print *
      do i = 1,nx
        print *, 'C(', i, ',:) = ', C(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'D(', i, ',:) = ', D(i,:)
      enddo
      print *
      print *
      do i = 1,nx
        print *, 'E(', i, ',:) = ', E(i,:)
      enddo
      print *
      do i = 1,nx
        print *, 'F(', i, ',:) = ', F(i,:)
      enddo

      end

This gives the following output:

 A(           1 ,:) =            0           0           1           2           3
 A(           2 ,:) =            0           1           1           2           3

 B(           1 ,:) =            7           7           7           7           7
 B(           2 ,:) =            7           7           7           7           7


 C(           1 ,:) =            0           0           1           2           3
 C(           2 ,:) =            0           0           1           2           3

 D(           1 ,:) =            0           0           1           2           3
 D(           2 ,:) =            0           0           1           2           3


 E(           1 ,:) =            0           0           0           0           0
 E(           2 ,:) =            1           1           1           1           1

 F(           1 ,:) =            0           0           0           0           0
 F(           2 ,:) =            1           1           1           1           1

In approach 1, the outer loop and first inner loop are parallelised as I expect, but the 2nd loop doesn't execute (presumably because the do doesn't immediately follow a c$omp do?). In approach 2, only the outer loop is parallelised. In approach 3, only the inner loops are parallelised.

My question is: how do I get matrix B to be the same as matrix A? This seems like it should be a straightforward task; I'm assuming there's an OpenMP clause or structure I'm not utilising, but would much appreciate some pointing in the right direction of what to search for (new to OpenMP!).

CodePudding user response：

If the information flow is as you have described it, I would re-write your loops as

do y = 1,ny
  do x = 1,nx
    calculation 1
  enddo
enddo

do y = 1,ny
  intermediate calculation (calculation 1)
enddo

do y = 1,ny
  do x = 1,nx
    calculation 2 (intermediate calculation)
  enddo
enddo

And then parallelise each y loop separately.