The following code alternates between two parallel fors, changing the number of threads used for each one.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
#include <omp.h>

std::vector<float> v;

float foo(const int tasks, const int perTaskComputation, int threadsFirst, int threadsSecond)
{
    float total = 0;
    std::vector<int> nthreads{ threadsFirst, threadsSecond };
    for (int nthread : nthreads) {
        omp_set_num_threads(nthread);
#pragma omp parallel for
        for (int i = 0; i < tasks; ++i) {
            for (int n = 0; n < perTaskComputation; ++n) {
                if (v[i] > 5) {
                    v[i] *= 0.002F;  // keep the values bounded
                }
                v[i] *= 1.1F * (i + 1);
            }
        }
        for (auto a : v) {
            total += a;
        }
    }
    return total;
}

int main()
{
    int tasks = 1000;
    int load = 1000;
    v.resize(tasks, 1);
    for (int threadAdd = 0; threadAdd <= 1; ++threadAdd) {
        std::cout << "Run batch\n";
        for (int j = 1; j <= 16; ++j) {
            float minT = 1e30F;
            float maxT = 0;
            float totalT = 0;
            int samples = 0;
            int iters = 100;
            for (int i = 0; i <= iters; ++i) {
                auto start = std::chrono::steady_clock::now();
                foo(tasks, load, j, j + threadAdd);
                auto end = std::chrono::steady_clock::now();
                float ms = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() * 0.001F;
                if (i > 20) {  // skip warm-up iterations
                    minT = std::min(minT, ms);
                    maxT = std::max(maxT, ms);
                    totalT += ms;
                    ++samples;
                }
            }
            std::cout << "Run parallel fors with " << j << " and " << j + threadAdd << " threads -- Min: "
                      << minT << "ms Max: " << maxT << "ms Avg: " << totalT / samples << "ms" << std::endl;
        }
    }
}
When compiled and run with Visual Studio 2019 in Release mode, this is the output:
Run batch
Run parallel fors with 1 and 1 threads -- Min: 2.065ms Max: 2.47ms Avg: 2.11139ms
Run parallel fors with 2 and 2 threads -- Min: 1.033ms Max: 1.234ms Avg: 1.04876ms
Run parallel fors with 3 and 3 threads -- Min: 0.689ms Max: 0.759ms Avg: 0.69705ms
Run parallel fors with 4 and 4 threads -- Min: 0.516ms Max: 0.578ms Avg: 0.52125ms
Run parallel fors with 5 and 5 threads -- Min: 0.413ms Max: 0.676ms Avg: 0.4519ms
Run parallel fors with 6 and 6 threads -- Min: 0.347ms Max: 0.999ms Avg: 0.404413ms
Run parallel fors with 7 and 7 threads -- Min: 0.299ms Max: 0.786ms Avg: 0.346387ms
Run parallel fors with 8 and 8 threads -- Min: 0.263ms Max: 0.948ms Avg: 0.334ms
Run parallel fors with 9 and 9 threads -- Min: 0.235ms Max: 0.504ms Avg: 0.273937ms
Run parallel fors with 10 and 10 threads -- Min: 0.212ms Max: 0.702ms Avg: 0.287325ms
Run parallel fors with 11 and 11 threads -- Min: 0.195ms Max: 1.104ms Avg: 0.414437ms
Run parallel fors with 12 and 12 threads -- Min: 0.354ms Max: 1.01ms Avg: 0.441238ms
Run parallel fors with 13 and 13 threads -- Min: 0.327ms Max: 3.577ms Avg: 0.462125ms
Run parallel fors with 14 and 14 threads -- Min: 0.33ms Max: 0.792ms Avg: 0.463063ms
Run parallel fors with 15 and 15 threads -- Min: 0.296ms Max: 0.723ms Avg: 0.342562ms
Run parallel fors with 16 and 16 threads -- Min: 0.287ms Max: 0.858ms Avg: 0.372075ms
Run batch
Run parallel fors with 1 and 2 threads -- Min: 2.228ms Max: 3.501ms Avg: 2.63219ms
Run parallel fors with 2 and 3 threads -- Min: 2.64ms Max: 4.809ms Avg: 3.07206ms
Run parallel fors with 3 and 4 threads -- Min: 5.184ms Max: 14.394ms Avg: 8.30909ms
Run parallel fors with 4 and 5 threads -- Min: 5.489ms Max: 8.572ms Avg: 6.45368ms
Run parallel fors with 5 and 6 threads -- Min: 6.084ms Max: 15.739ms Avg: 7.71035ms
Run parallel fors with 6 and 7 threads -- Min: 7.162ms Max: 16.787ms Avg: 7.8438ms
Run parallel fors with 7 and 8 threads -- Min: 8.32ms Max: 39.971ms Avg: 10.0409ms
Run parallel fors with 8 and 9 threads -- Min: 9.575ms Max: 45.473ms Avg: 11.1826ms
Run parallel fors with 9 and 10 threads -- Min: 10.918ms Max: 31.844ms Avg: 14.336ms
Run parallel fors with 10 and 11 threads -- Min: 12.134ms Max: 21.199ms Avg: 14.3733ms
Run parallel fors with 11 and 12 threads -- Min: 13.972ms Max: 21.608ms Avg: 16.3532ms
Run parallel fors with 12 and 13 threads -- Min: 14.605ms Max: 18.779ms Avg: 15.9164ms
Run parallel fors with 13 and 14 threads -- Min: 16.199ms Max: 26.991ms Avg: 19.3464ms
Run parallel fors with 14 and 15 threads -- Min: 17.432ms Max: 27.701ms Avg: 19.4463ms
Run parallel fors with 15 and 16 threads -- Min: 18.142ms Max: 26.351ms Avg: 20.6856ms
Run parallel fors with 16 and 17 threads -- Min: 20.179ms Max: 40.517ms Avg: 22.0216ms
In the first batch, several runs are done with an increasing number of threads, alternating parallel fors that use the same number of threads. This batch behaves as expected: performance improves as the number of threads increases.
Then a second batch runs the same code, but alternating parallel fors where one of them uses one more thread than the other. This second batch suffers a severe performance loss, increasing the computation time by a factor of up to 50~100x.
Compiling and running with gcc on Ubuntu leads to the expected behavior, with both batches performing similarly.
So, the question is, what is causing this huge performance loss when using Visual Studio?
CodePudding user response:
Going by the experiments described in the comments on the question, and in the absence of a better explanation, this appears to be a bug in the Visual Studio OpenMP runtime.
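Since the slowdown only appears when consecutive parallel regions request different thread counts, one possible mitigation (an untested sketch, not a confirmed fix for this specific VS issue) is to drive every region with the same team size via the num_threads clause, rather than calling omp_set_num_threads() with a different value before each region, so the runtime never has to resize its thread pool between regions:

    #include <cstdio>
    #include <vector>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    // Hypothetical helper: both alternating loops request the same team
    // size through the num_threads clause instead of omp_set_num_threads().
    static long long sumScaled(const std::vector<int>& v, int nthreads)
    {
        long long total = 0;
        // num_threads accepts a runtime expression, so one variable can
        // drive every region with a single, constant thread count.
        #pragma omp parallel for num_threads(nthreads) reduction(+:total)
        for (int i = 0; i < (int)v.size(); ++i)
            total += 2LL * v[i];
        return total;
    }

    int main()
    {
        std::vector<int> v(1000, 1);
        // Both "alternating" calls use the same team size, avoiding the
        // thread-count change that triggers the slowdown on VS.
        long long a = sumScaled(v, 4);
        long long b = sumScaled(v, 4);
        std::printf("%lld %lld\n", a, b); // prints "2000 2000"
    }

Whether this sidesteps the bug on a given Visual Studio version would need to be measured with the benchmark above; the integer reduction is used here only so the result is deterministic regardless of thread count.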