I have the following double loop where I compute the element Fisher_M[FX][FY] of a matrix.
I tried to optimize it by adding the OpenMP pragma #pragma omp parallel for schedule(dynamic, num_threads),
but the gain is not as good as expected.
Is there a way to do a reduction (of the sum) with OpenMP to compute the element Fisher_M[FX][FY] quickly?
Or maybe this is doable with MAGMA or CUDA?
#define num_threads 8
#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<CO_CL_WL.size(); i++){
for(int j=0; j<CO_CL_WL.size(); j++){
if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
Fisher_M[FX][FY] += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
}
}
}
CodePudding user response:
Your code has a race condition on the line that updates Fisher_M[FX][FY]: every thread writes to the same element concurrently. A reduction can be used to solve it:
double sum=0; //change the type as needed
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<CO_CL_WL.size(); i++){
for(int j=0; j<CO_CL_WL.size(); j++){
if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
sum += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
}
}
}
Fisher_M[FX][FY] = sum;
Note that this code is memory-bound, not compute-bound, so the performance gain from parallelization may be smaller than expected (and depends on your hardware).
PS: Why do you need this condition if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0)
? If either of them is zero, the sum does not change. If you remove it, the compiler can generate much better vectorized code.
PS2: In the schedule(dynamic, num_threads)
clause the second parameter is the chunk size, not the number of threads used. I suggest removing it in your case. If you wish to specify the number of threads, add a num_threads
clause or call the omp_set_num_threads
function.