I know OpenMP shares all variables declared in an outer scope between all workers, and that may be the answer to my question. But I am really confused about why function omp3 delivers the right result while function omp2 delivers a wrong result.
void omp2(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; i++) {
        #pragma omp parallel for
        for (int ki = 0; ki < k; ki++) {
            for (int j = 0; j < n; j++) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}
void omp3(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; i++) {
        for (int ki = 0; ki < k; ki++) {
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}
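For reference, here is a minimal test harness I use to compare both versions against a plain serial multiplication (this is my own sketch, not part of the original functions; serial and max_diff are names I made up). Since the race is timing-dependent, the difference reported for omp2 can vary from run to run:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void omp2(double *A, double *B, double *C, int m, int k, int n);
void omp3(double *A, double *B, double *C, int m, int k, int n);

/* Serial reference: C += A * B, same indexing as above. */
static void serial(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; i++)
        for (int ki = 0; ki < k; ki++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
}

/* Largest absolute element-wise difference between two buffers. */
static double max_diff(const double *X, const double *Y, int len) {
    double d = 0.0;
    for (int t = 0; t < len; t++)
        if (fabs(X[t] - Y[t]) > d) d = fabs(X[t] - Y[t]);
    return d;
}

int main(void) {
    int m = 200, k = 200, n = 200;
    double *A = malloc(sizeof(double) * m * k);
    double *B = malloc(sizeof(double) * k * n);
    double *Cref = calloc(m * n, sizeof(double));
    double *C2 = calloc(m * n, sizeof(double));
    double *C3 = calloc(m * n, sizeof(double));

    for (int t = 0; t < m * k; t++) A[t] = rand() / (double)RAND_MAX;
    for (int t = 0; t < k * n; t++) B[t] = rand() / (double)RAND_MAX;

    serial(A, B, Cref, m, k, n);
    omp2(A, B, C2, m, k, n);
    omp3(A, B, C3, m, k, n);

    printf("max |omp2 - serial| = %g\n", max_diff(C2, Cref, m * n));
    printf("max |omp3 - serial| = %g\n", max_diff(C3, Cref, m * n));

    free(A); free(B); free(Cref); free(C2); free(C3);
    return 0;
}

(compiled with gcc -O2 -fopenmp test.c -lm)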
Answer:
The problem is that there is a race condition in this line:

C[i * n + j] += ...

Different threads can read and write the same memory location (C[i * n + j]) simultaneously, which causes a data race. In omp2 the parallelized loop is the ki loop, so for a given i several threads update the same C[i * n + j] at the same time; in omp3 the parallelized loop is the j loop, so each thread writes its own set of j indices and no race occurs.
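Just to make the conflict concrete (this is not the recommended fix, only an illustration; the name omp2_atomic is mine): keeping omp2's loop order but protecting the update with #pragma omp atomic also removes the race, at the cost of serializing every update of C:

/* Illustration only: omp2's loop order made race-free with an atomic
 * update. Correct, but much slower than reordering the loops as below. */
void omp2_atomic(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; i++) {
        #pragma omp parallel for
        for (int ki = 0; ki < k; ki++) {
            for (int j = 0; j < n; j++) {
                #pragma omp atomic
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}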
The solution (as suggested by @Victor Eijkhout) is to reorder your loops and use a local variable to accumulate the sum of the innermost loop. In this case C[i * n + j] is updated only once, so you get rid of the data race, and the outermost loop can be parallelized (which gives the best performance):
#pragma omp parallel for
for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        double sum = 0;
        for (int ki = 0; ki < k; ki++) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] = sum;
    }
}
Note that you can also use the collapse(2) clause, which may further increase performance.
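For example, a sketch of the same kernel with the clause applied (the i and j loops are perfectly nested, which collapse requires):

/* Sketch: collapse(2) lets OpenMP distribute the combined i*j iteration
 * space across threads, which helps when m alone is small compared to
 * the number of threads. */
#pragma omp parallel for collapse(2)
for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        double sum = 0;
        for (int ki = 0; ki < k; ki++) {
            sum += A[i * k + ki] * B[ki * n + j];
        }
        C[i * n + j] = sum;
    }
}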