I am trying to parallelize a loop of 50 million iterations with several thread counts: first 1, then 4, 8, and 16. Below is the code implementing this functionality.
#include <iostream>
#include <omp.h>
using namespace std;
void someFoo();
int main() {
    someFoo();
}
void someFoo() {
    long sum = 0;
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        start = omp_get_wtime();
        #pragma omp parallel for
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        #pragma omp end parallel
        end = omp_get_wtime();
        cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
    }
}
It is expected that with 4 threads the program runs faster than with 1, with 8 threads faster than with 4, and with 16 threads faster than with 8, but in practice this is not the case: the timings are chaotic and there is almost no difference between them. Also, the Task Manager does not show that the program is parallelized. My computer has 4 cores and 8 logical processors.
Please tell me where I made a mistake and how to properly parallelize the loop across N threads.
CodePudding user response:
There is a race condition in your code because sum is read/written from multiple threads at the same time. This should cause wrong results. You can fix this using a reduction with the directive #pragma omp parallel for reduction(+:sum). Note that OpenMP does not check if your loop can be parallelized; it is your responsibility.
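A minimal sketch of the fixed loop (using long long for the accumulator, since the result does not fit in a 32-bit long, as noted below):

    long long sum = 0;
    // Each thread accumulates into a private copy of sum; OpenMP combines
    // the copies at the end of the loop, so there is no race condition.
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < 50000000; i++) {
        sum += (long long)i * 10;
    }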
Additionally, the parallel computation might be slower than the sequential one, since a clever compiler can see that sum = 50000000*(50000000-1)/2*10 = 12499999750000000 (AFAIK, Clang does that). As a result, the benchmark is certainly flawed. Note that this value is bigger than what a 32-bit long (as on Windows) can contain, so there is certainly an overflow in your code.
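As an illustrative sketch (a common benchmarking technique, not part of the original code): one way to make the benchmark meaningful is to sum runtime data that the compiler cannot precompute, using long long so the result fits even where long is 32 bits:

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <omp.h>

int main() {
    const int n = 50000000;
    // Fill the input with values unknown at compile time, so the compiler
    // cannot replace the reduction loop with a closed-form constant.
    std::vector<long long> data(n);
    for(int i = 0; i < n; i++)
        data[i] = rand() % 10;

    long long sum = 0;
    double start = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < n; i++)
        sum += data[i];
    double end = omp_get_wtime();
    printf("Result: %lld. Spent time: %f\n", sum, end - start);
}

Keep in mind that such a memory-bound loop may still not scale linearly with the number of threads, since it is limited by memory bandwidth rather than by computation.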
Moreover, AFAIK, there is no such thing as the directive #pragma omp end parallel.
Finally, note that you can control the number of threads using the OMP_NUM_THREADS environment variable, which is generally more convenient than setting it in your application (hardwiring a given number of threads in the application code is generally not a good idea, even for benchmarks).
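For example, here is a minimal sketch that relies on the environment instead of omp_set_num_threads (run it as, e.g., OMP_NUM_THREADS=8 ./a.out):

#include <cstdio>
#include <omp.h>

int main() {
    // No omp_set_num_threads() call: the OpenMP runtime honours OMP_NUM_THREADS.
    #pragma omp parallel
    {
        #pragma omp single
        printf("Team size: %d threads\n", omp_get_num_threads());
    }
}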
CodePudding user response:
Please tell me where I made a mistake and how to properly parallelize the loop in N threads.
First you need to fix the compile issues in your code example, like removing invalid pragmas such as #pragma omp end parallel, declaring the variables correctly, and so on. Second, you need to fix the race condition during the update of the variable sum: that variable is shared among threads and updated concurrently. The easiest way is to use the reduction clause of OpenMP; your code would look like the following:
#include <stdio.h>
#include <omp.h>

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        double start = omp_get_wtime();
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < 50000000; i++) {
            sum += i * 10;
        }
        double end = omp_get_wtime();
        printf("Result: '%.0f' : '%f'\n", sum, (end - start));
    }
}
With that you should see some speedup when running on multiple cores.
NOTE: To solve the overflow first mentioned by @Jérôme Richard, I changed the sum variable from long to double.