I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working well. Each thread should iterate over all of the M points, but only iterate (in the second loop) over its own chunk of coordinates. So I want the first loop to go from 0 to M, and the second one from my_first_coord to my_last_coord. In the code I posted, the program is faster when launched with 4 threads than with 8, so there is some issue. I know one way to do this is to divide the coordinates "manually", meaning each thread gets its own num_of_coords / thread_count (and accounting for the remainder); I did that with Pthreads. Here I would prefer to use OpenMP pragmas. I'm sure I'm missing something. Let me show you the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // all threads iterate from 0 to M
    #pragma omp for nowait
    for (int coord = 0; coord < N; coord++) { // each works on its portion of coords
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
    }
}
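In case it helps to reproduce, here is a stripped-down, compilable sketch of the same pattern; the sizes and struct layouts are simplified stand-ins for my real ones:

#include <omp.h>
#include <stdlib.h>

#define M 100000   /* number of points       (placeholder size) */
#define N 128      /* coordinates per point  (placeholder size) */
#define K 16       /* number of clusters     (placeholder size) */

typedef struct { double coordinates[N]; } Accumulator;
typedef struct { Accumulator accumulator; } Centroid;
typedef struct { int cluster; double coordinates[N]; } Point;

int main(void) {
    Point    *points    = calloc(M, sizeof *points);    /* zeroed: all points in cluster 0 */
    Centroid *centroids = calloc(K, sizeof *centroids);
    if (!points || !centroids) return 1;

    #pragma omp parallel
    {
        for (int i = 0; i < M; i++) {       /* every thread walks all M points */
            #pragma omp for nowait          /* the N coords are split among the threads */
            for (int coord = 0; coord < N; coord++)
                centroids[points[i].cluster].accumulator.coordinates[coord]
                    += points[i].coordinates[coord];
        }
    }

    free(points);
    free(centroids);
    return 0;
}

Compiled with gcc -fopenmp.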
I'm including the Pthreads version too, so there's no misunderstanding about what I want to achieve; I want the same partitioning, but with pragmas:
/* M is global,
   first_nn and last_nn are local */
for (long i = 0; i < M; i++)
    for (long coord = first_nn; coord <= last_nn; coord++)
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
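For completeness, first_nn and last_nn come from the manual division I mentioned (num_of_coords / thread_count per thread, with the remainder spread over the first threads). A sketch of that computation, with illustrative variable names:

/* Block-partition N coordinates over thread_count threads;
   the first (N % thread_count) threads get one extra coordinate. */
long chunk    = N / thread_count;
long rem      = N % thread_count;
long first_nn = my_rank * chunk + (my_rank < rem ? my_rank : rem);
long last_nn  = first_nn + chunk + (my_rank < rem ? 1 : 0) - 1;   /* inclusive */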
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. Adding the -O3 flag improved the times. With larger inputs, the difference in speedup between 4 and 8 threads is more significant.
CodePudding user response:
Your comment indicates that you are worried about speedup.
1. How many physical cores does your processor have? Try every thread count from 1 to that number (see the timing sketch after this list).
2. Do not use hyperthreads.
3. You may find a good speedup for low thread counts, but then a leveling-off effect: that is because you have a "streaming" operation, which is limited by memory bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
4. You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
5. You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may make your parallel code not scale optimally.

But start with point 3 and report.
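A minimal sketch of that scaling experiment, using a stand-in streaming kernel (substitute your accumulation loop for kernel; the array size is a placeholder):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

enum { LEN = 20 * 1000 * 1000 };   /* placeholder size, large enough to be bandwidth-bound */

/* Stand-in streaming kernel: one read stream, one read-modify-write stream. */
static void kernel(double *dst, const double *src) {
    #pragma omp parallel for
    for (long i = 0; i < LEN; i++)
        dst[i] += src[i];
}

int main(void) {
    double *dst = calloc(LEN, sizeof *dst);
    double *src = calloc(LEN, sizeof *src);
    if (!dst || !src) return 1;

    kernel(dst, src);   /* warm-up: fault in the pages before timing */

    /* Query the maximum once; omp_set_num_threads changes what
       omp_get_max_threads returns afterwards. */
    int max_threads = omp_get_max_threads();
    for (int t = 1; t <= max_threads; t++) {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        kernel(dst, src);
        printf("%2d threads: %.3f s\n", t, omp_get_wtime() - start);
    }

    free(dst);
    free(src);
    return 0;
}

Build with gcc -O3 -fopenmp, then run it once as-is and once with OMP_PROC_BIND=true set in the environment to see the effect of binding (point 4).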
CodePudding user response:
I solved my problem thanks to the comments.