I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working well. Each thread should iterate over all of the M points, but only iterate (in the second loop) over its own chunk of coordinates. So I want the first loop to go from 0 to M, and the second one from my_first_coord to my_last_coord. In the code I posted, the program is faster when launched with 4 threads than with 8, so there is some issue. I know one way to do this is to divide the coordinates "manually", meaning each thread gets its own num_of_coords / thread_count (and accounting for the remainder); I did that with Pthreads. Here I would prefer to use OpenMP pragmas. I'm sure I'm missing something. Let me show you the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // all threads iterate from 0 to M
    #pragma omp for nowait
    for (int coord = 0; coord < N; coord++) { // each works on its portion of coords
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
    }
}
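In case it helps to reproduce, here is a stripped-down, compilable sketch of the same pattern; the sizes and struct layouts are simplified stand-ins for my real ones:

#include <omp.h>
#include <stdlib.h>

#define M 100000   /* number of points       (placeholder size) */
#define N 128      /* coordinates per point  (placeholder size) */
#define K 16       /* number of clusters     (placeholder size) */

typedef struct { double coordinates[N]; } Accumulator;
typedef struct { Accumulator accumulator; } Centroid;
typedef struct { int cluster; double coordinates[N]; } Point;

int main(void) {
    Point    *points    = calloc(M, sizeof *points);    /* zeroed: all points in cluster 0 */
    Centroid *centroids = calloc(K, sizeof *centroids);
    if (!points || !centroids) return 1;

    #pragma omp parallel
    {
        for (int i = 0; i < M; i++) {       /* every thread walks all M points */
            #pragma omp for nowait          /* the N coords are split among the threads */
            for (int coord = 0; coord < N; coord++)
                centroids[points[i].cluster].accumulator.coordinates[coord]
                    += points[i].coordinates[coord];
        }
    }

    free(points);
    free(centroids);
    return 0;
}

Compiled with gcc -fopenmp.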
I'm including the Pthreads version too, so there's no misunderstanding about what I want to achieve; I want the same partitioning, but with pragmas:
/* M is global,
   first_nn and last_nn are local */
for (long i = 0; i < M; i++)
    for (long coord = first_nn; coord <= last_nn; coord++)
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
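For completeness, first_nn and last_nn come from the manual division I mentioned (num_of_coords / thread_count per thread, with the remainder spread over the first threads). A sketch of that computation, with illustrative variable names:

/* Block-partition N coordinates over thread_count threads;
   the first (N % thread_count) threads get one extra coordinate. */
long chunk    = N / thread_count;
long rem      = N % thread_count;
long first_nn = my_rank * chunk + (my_rank < rem ? my_rank : rem);
long last_nn  = first_nn + chunk + (my_rank < rem ? 1 : 0) - 1;   /* inclusive */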
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. Adding the -O3 flag improved the times. With larger inputs, the difference in speedup between 4 and 8 threads is more significant.
CodePudding user response:
Your comment indicates that you are worried about speedup.
1. How many physical cores does your processor have? Try every thread count from 1 to that number (see the timing sketch after this list).
2. Do not use hyperthreads.
3. You may find a good speedup for low thread counts, but then a leveling-off effect: that is because you have a "streaming" operation, which is limited by memory bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
4. You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
5. You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may make your parallel code not scale optimally.

But start with point 3 and report.
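A minimal sketch of that scaling experiment, using a stand-in streaming kernel (substitute your accumulation loop for kernel; the array size is a placeholder):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

enum { LEN = 20 * 1000 * 1000 };   /* placeholder size, large enough to be bandwidth-bound */

/* Stand-in streaming kernel: one read stream, one read-modify-write stream. */
static void kernel(double *dst, const double *src) {
    #pragma omp parallel for
    for (long i = 0; i < LEN; i++)
        dst[i] += src[i];
}

int main(void) {
    double *dst = calloc(LEN, sizeof *dst);
    double *src = calloc(LEN, sizeof *src);
    if (!dst || !src) return 1;

    kernel(dst, src);   /* warm-up: fault in the pages before timing */

    /* Query the maximum once; omp_set_num_threads changes what
       omp_get_max_threads returns afterwards. */
    int max_threads = omp_get_max_threads();
    for (int t = 1; t <= max_threads; t++) {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        kernel(dst, src);
        printf("%2d threads: %.3f s\n", t, omp_get_wtime() - start);
    }

    free(dst);
    free(src);
    return 0;
}

Build with gcc -O3 -fopenmp, then run it once as-is and once with OMP_PROC_BIND=true set in the environment to see the effect of binding (point 4).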
CodePudding user response:
I solved my problem thanks to the comments.