I have a driver .cpp file that calls cblas_dgbmv with proper arguments. When I build OpenBLAS with plain "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv path is taken in the gbmv.c interface, and I assume this is the default behaviour). In contrast, when I set OPENBLAS_NUM_THREADS=1 with this same build, the sequential version runs and everything goes well. All good so far.
The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv for different thread counts, using a loop that calls this function 1000 times serially and measuring the total time. My driver itself is sequential. However, even the 2-threaded dgbmv degrades performance (execution time), and it does so even for a single multithreaded call, without the loop.
I have read up on multithreaded use of OpenBLAS and made sure everything conforms to the documented requirements. There is no thread spawning and there are no OpenMP pragmas in my driver (it runs only a master thread, used just to measure wall-clock time). In other words, I call dgbmv from a sequential region, so it does not conflict with OpenBLAS's own threads. Still, I suspect that excess threads are running and slowing down execution, even though I have already set all thread-related environment variables other than OPENBLAS_NUM_THREADS to 1.
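(For reference, a small check like the following could be added to the driver to see how many threads the library itself has decided to use; this is only a sketch and relies on the openblas_* helpers declared in OpenBLAS's cblas.h, which are OpenBLAS-specific extensions rather than standard CBLAS:)

#include <cstdio>
#include <cblas.h>   // OpenBLAS's cblas.h also declares the openblas_* helpers

int main() {
    // How many threads the library will use for the next call,
    // and how many cores it detected on the machine.
    std::printf("OpenBLAS threads : %d\n", openblas_get_num_threads());
    std::printf("Detected cores   : %d\n", openblas_get_num_procs());

    // Overrides OPENBLAS_NUM_THREADS for the rest of the run, if needed.
    openblas_set_num_threads(1);
    std::printf("After override   : %d\n", openblas_get_num_threads());
    return 0;
}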
I use the OpenMP wall-clock timer and measure the execution time with code surrounding only this 1000-call loop, so that part is fine as well:
double seconds, timing = 0.0;
//for(int i = 0; i < 10000; i++){   // loop currently disabled: timing a single call
seconds = omp_get_wtime();
cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
timing += omp_get_wtime() - seconds;   // accumulated, so the loop can be re-enabled as-is
//}
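For completeness, here is a rough, self-contained sketch of how the measurement looks with the loop enabled; the sizes, band widths, and operand values below are placeholders, not the ones from my actual driver:

#include <cblas.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    // Placeholder problem setup (column-major banded storage: lda >= kl + ku + 1).
    const int n = 10000, kl = 3, ku = 3, lda = kl + ku + 1;
    const double alpha = 1.0, beta = 0.0;
    std::vector<double> B(static_cast<size_t>(lda) * n, 1.0), X(n, 1.0), Y(n, 0.0);

    // Warm-up call so that thread creation is not charged to the measurement.
    cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku,
                alpha, B.data(), lda, X.data(), 1, beta, Y.data(), 1);

    double seconds = omp_get_wtime();
    for (int i = 0; i < 1000; i++) {
        cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku,
                    alpha, B.data(), lda, X.data(), 1, beta, Y.data(), 1);
    }
    double timing = omp_get_wtime() - seconds;   // total wall time for 1000 calls
    std::printf("total %.6f s, avg %.6f s per call\n", timing, timing / 1000.0);
    return 0;
}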
I run my binary with the proper environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile, which compiles both the library and the application:
myBinary: myBinary.cpp
	cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4 && make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1 install
	g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17
Architecture: 64-core shared-memory machine with AMD Opteron processors.
I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.
CodePudding user response:
In my own program, which scales well (a different program from the multithreaded OpenBLAS case above), I tried setting GOMP_CPU_AFFINITY to 0..8, OMP_PROC_BIND to true, and OMP_PLACES to threads(8), so that 8 threads would run on the first 8 CPUs (cores) without hyperthreading. I then checked visually with the htop utility that every thread was executing on the first NUMA node with 8 processors. After ensuring that, the result was 5 seconds slower; with these variables unset, I got a result 5 seconds faster. @JérômeRichard I'll try the same thing for the OpenBLAS driver as well.
CodePudding user response:
I have just tried, for OpenBLAS, what I described in the other comment (the settings for my own OpenMP program). I built the library with make USE_OPENMP=1 (as I said, the driver itself is sequential anyway). Then I unset all environment variables, including OPENBLAS_NUM_THREADS. My command to run it is:
OMP_NUM_THREADS="4" GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7" OMP_PLACES="threads(4)" OMP_PROC_BIND="TRUE" ./myBinary 3 0.008 0.3 1
Obviously the last values are the program's arguments. I then checked with htop again: mostly 2 threads were running at full (100%) CPU utilization, but not on the first 8 CPUs; one thread was on CPU 1 and the other on CPU 27 (out of 64). I think there is still a problem. @JérômeRichard
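To double-check placement beyond htop, a tiny test like the one below (compiled with -fopenmp) can print where the OpenMP runtime actually puts its threads under these environment variables; it uses sched_getcpu(), which is Linux/glibc-specific, and it only shows the placement policy of the OpenMP runtime itself, not OpenBLAS's internal threads directly:

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(): Linux/glibc-specific

int main() {
    // With OMP_PLACES / OMP_PROC_BIND / GOMP_CPU_AFFINITY exported as above,
    // this shows which CPU each OpenMP thread of the team lands on.
    #pragma omp parallel
    {
        std::printf("OpenMP thread %d running on CPU %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}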
For the second time, I built OpenBLAS with make USE_OPENMP=1 NUM_THREAD=4. It builds successfully and states that I need to use OMP_NUM_THREADS.
Then I ran my binary with:
OMP_NUM_THREADS="4" GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7" OMP_PLACES="threads(4)" OMP_PROC_BIND="TRUE" OPENBLAS_NUM_THREADS="1" USE_OPENMP="1" ./myBinary 3 0.008 0.3 1
This is even worse: there is only one thread at 100% utilization in htop, and the htop info box states that only 2 exist in the system.
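One more check that might help here: the linked library can report which threading backend it was actually built with via openblas_get_parallel() (0 = sequential, 1 = pthreads, 2 = OpenMP). This is only a sketch based on the OpenBLAS-specific helper in its cblas.h, but it would confirm whether the USE_OPENMP=1 build is really the one being picked up at runtime:

#include <cstdio>
#include <cblas.h>   // OpenBLAS's cblas.h declares openblas_get_parallel()

int main() {
    // 0 = sequential build, 1 = pthreads build, 2 = OpenMP build
    std::printf("OpenBLAS parallel mode: %d\n", openblas_get_parallel());
    return 0;
}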