OpenMP incredibly slow when another process is running


When trying to use OpenMP in a C++ application I ran into severe performance issues where multi-threaded performance could be up to 1000x worse than single-threaded. This only happens if at least one core is maxed out by another process.

After some digging I was able to isolate the problem in a small example; I hope someone can shed some light on this issue!

Minimal example

Here is a minimal example which illustrates the problem:

#include <iostream>

int main() {
    int sum = 0;
    for (size_t i = 0; i < 1000; i++) {
        #pragma omp parallel for reduction(+:sum)
        for (size_t j = 0; j < 100; j++) {
            sum += i;
        }
    }
    
    std::cout << "Sum was: " << sum << std::endl;
}

I need the OpenMP directive to be inside the outer for-loop since my real code is looping over timesteps which are dependent on one another.

My setup

I ran the example on Ubuntu 21.04 with an AMD Ryzen 9 5900X (12 cores, 24 threads), and compiled it with GCC 10.3.0 using g++ -fopenmp example.cc.

Benchmarking

If you run this program with nothing else in the background it terminates quickly:

> time ./a.out
Sum was: 999000

real    0m0,006s
user    0m0,098s
sys     0m0,000s

But if a single core is used by another process it runs incredibly slowly. In this case I ran stress -c 1 to simulate another process fully using a core in the background.

> time ./a.out
Sum was: 999000

real    0m8,060s
user    3m2,535s
sys     0m0,076s

This is a slowdown of roughly 1300x. My machine has 24 hardware threads, so with one of them busy and 23 still available, the theoretical slowdown should only be around 4%.

Findings

The problem seems to be related to how OpenMP allocates/assigns the threads.

  • If I move the omp-directive to the outer loop the issue goes away
  • If I explicitly set the thread count to 23 the issue goes away (num_threads(23); see the sketch after this list)
  • If I explicitly set the thread count to 24 the issue remains
  • How long it takes for the process to terminate varies from 1-8 seconds
  • The program constantly uses as much of the CPU as possible while it's running; I assume most of the OpenMP threads are stuck in spinlocks
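
For reference, here is a sketch of the num_threads(23) variant from the list above (the same toy example; the 23 simply matches my 24 hardware threads minus the one core occupied by stress):

#include <cstddef>
#include <iostream>

int main() {
    int sum = 0;
    for (size_t i = 0; i < 1000; i++) {
        // Cap the team at 23 threads so one hardware thread stays free
        // for the background process.
        #pragma omp parallel for num_threads(23) reduction(+:sum)
        for (size_t j = 0; j < 100; j++) {
            sum += i;
        }
    }

    std::cout << "Sum was: " << sum << std::endl;
}

With this clause the slowdown described above disappears, even while stress -c 1 is running.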

From these findings it would seem that OpenMP assigns work to all cores, including the one that is already maxed out, and then somehow forces each core to finish its share instead of redistributing it once the other cores are done.

I have also tried dynamic scheduling on the inner loop (schedule(dynamic)), but that didn't help either.

I would be very grateful for any suggestions. I'm new to OpenMP, so it's possible that I've made a mistake. What do you make of this?

CodePudding user response:

So here is what I could figure out:

Run the program with OMP_DISPLAY_ENV=verbose (see https://www.openmp.org/spec-html/5.0/openmpch6.html for a list of environment variables)
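
Roughly what this prints (heavily abridged; the exact variables and values depend on your GCC/libgomp version):

> OMP_DISPLAY_ENV=verbose ./a.out
OPENMP DISPLAY ENVIRONMENT BEGIN
  ...
  OMP_WAIT_POLICY = 'PASSIVE'
  ...
  GOMP_SPINCOUNT = '300000'
  ...
OPENMP DISPLAY ENVIRONMENT END
Sum was: 999000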

The verbose setting will show you OMP_WAIT_POLICY = 'PASSIVE' and GOMP_SPINCOUNT = '300000'. In other words, when a thread has to wait, it spins for some time before going to sleep, consuming CPU time and blocking one CPU. This happens each time a thread reaches the end of the loop, before the master thread distributes the next for loop, and possibly even before the parallel section starts.

Because GCC's libgomp does not call pthread_yield while spinning, a spinning thread effectively blocks one hardware thread. And because you have more runnable software threads than hardware threads, one of them is not running at any given moment, causing all the others to busy-wait until the kernel scheduler reassigns the CPU.

If you run your program with OMP_WAIT_POLICY=passive, GCC sets GOMP_SPINCOUNT = '0'. The kernel then puts waiting threads to sleep immediately and lets the others run, and performance is much better.
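
Concretely, no code change is needed; just set the variable when launching, e.g.:

> OMP_WAIT_POLICY=passive ./a.out
Sum was: 999000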

Interestingly enough, OMP_PROC_BIND=true also helps. I assume pinned (immovable) threads affect the kernel scheduler in a way that benefits us, but I'm not sure.

Clang's OpenMP implementation does not suffer from this performance degradation because it uses pthread_yield. Of course this has its own drawbacks if syscall overhead is large, and in most computing environments it should be unnecessary because you are not supposed to overcommit CPUs.
