Home > front end >  Severe performance loss alternating number of OpenMP parallel threads
Severe performance loss alternating number of OpenMP parallel threads

Time:12-01

The following code changes the number of parallel threads used for alternating parallel fors.

#include <iostream>
#include <chrono>
#include <vector>
#include <omp.h>

std::vector<float> v;

float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int>nthreads{threadsFirst,threadsSecond};
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
#pragma omp parallel for
        for (int i = 0; i < tasks;   i) {
            for (int n = 0; n < perTaskComputation;   n) {
                if (v[i] > 5) {
                    v[i] * 0.002;
                }
                v[i] *= 1.1F * (i   1);
            }
        }
        for (auto a : v) {
            total  = a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1;   threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; j  = 1) {
            float minT = 1e100;
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (float i = 0; i <= iters;   i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j   threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001;
                if (i > 20) {
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT  = ms;
                    samples  ;
                }
            }
            std::cout << "Run parallel fors with " <<j << " and " << j   threadAdd << " threads -- Min: "
                << minT << "ms   Max: " << maxT << "ms   Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}

When compiled and run with Visual Studio 2019 in Release mode, this is the output:

Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms   Max: 2.47ms   Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms   Max: 1.234ms   Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms   Max: 0.759ms   Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms   Max: 0.578ms   Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms   Max: 0.676ms   Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms   Max: 0.999ms   Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms   Max: 0.786ms   Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms   Max: 0.948ms   Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms   Max: 0.504ms   Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms   Max: 0.702ms   Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms   Max: 1.104ms   Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms   Max: 1.01ms   Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms   Max: 3.577ms   Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms   Max: 0.792ms   Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms   Max: 0.723ms   Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms   Max: 0.858ms   Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms   Max: 3.501ms   Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms   Max: 4.809ms   Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms   Max: 14.394ms   Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms   Max: 8.572ms   Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms   Max: 15.739ms   Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms   Max: 16.787ms   Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms   Max: 39.971ms   Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms   Max: 45.473ms   Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms   Max: 31.844ms   Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms   Max: 21.199ms   Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms   Max: 21.608ms   Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms   Max: 18.779ms   Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms   Max: 26.991ms   Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms   Max: 27.701ms   Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms   Max: 26.351ms   Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms   Max: 40.517ms   Avg: 22.0216ms

In a first batch, several runs with increasing number of threads are done, alternating parallel fors using the same number of threads. This batch produces an expected behavior, increasing preformance as the number of threads is increase.

Then a second batch is done, runing the same code but alternating parallel fors where one of them uses one more thread than the other. This second batch has a severe performance loss, increasing the computation time up to a factor of 50~100x.

Compiling and runing with gcc in Ubuntu leads to an expected behavior, with both batches performing similarly.

So, the question is, what is causing this huge performance loss when using Visual Studio?

CodePudding user response:

As to the experiments explained in the comments to the question, and with a lack of a better explanation, it seems to be a bug in the VS runtime.

  • Related