I have a function void dynamics(A a, std::vector<double> &, std::vector<double> &, std::vector<double>)
which I am calling from threads created by OpenMP. The inputs to the function are private to each thread (created within the parallel block):
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

class A {
    // some code
};

int main(void)
{
    vector<double> a(12, 0.0);
    vector<double> b(12, 0.0);
    #pragma omp parallel for shared(a,b)
    for (int id = 0; id < 6; id++) {
        vector<double> a_private(2, 0.0);
        vector<double> b_private(2, 0.0);
        vector<double> c_private(2, (double)id);
        A d;
        // start time for each thread - chrono
        dynamics(d, a_private, b_private, c_private);
        // end time for each thread - chrono
        // calculate time for each thread
        #pragma omp critical
        {
            for (int i = 0; i < 2; i++) a[i + 2*id] = a_private[i];
            for (int i = 0; i < 2; i++) b[i + 2*id] = b_private[i];
        }
    }
    print(a);
    print(b);
    return 0;
}
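The timing comments inside the loop stand for something like the following fragment, placed around the dynamics() call (a sketch using std::chrono::steady_clock and microseconds; the exact clock and unit are not important):
auto t_start = std::chrono::steady_clock::now();
dynamics(d, a_private, b_private, c_private);
auto t_end = std::chrono::steady_clock::now();
auto thread_time = std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count();
#pragma omp critical
{
    std::cout << "ID = " << id << ", Dynamics Time = " << thread_time << std::endl;
}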
Here, to avoid a race condition, I have put the assignment of a_private and b_private into a and b inside a critical section.
When I measure the per-thread time for the code above, it is larger than when I put the dynamics call itself inside the critical section:
#pragma omp critical
{
    // start time for each thread - chrono
    dynamics(d, a_private, b_private, c_private);
    // end time for each thread - chrono
    // calculate time for each thread
    for (int i = 0; i < 2; i++) a[i + 2*id] = a_private[i];
    for (int i = 0; i < 2; i++) b[i + 2*id] = b_private[i];
}
The output (a and b) at the end is the same in both cases (running the code multiple times gives the same results). Thus, I believe dynamics is thread safe (could it not be thread safe?).
The inputs to dynamics are created within the parallel region, so they should be private to each thread (are they?).
Why do the threads compute dynamics more slowly when running together than when running one after another (inside the critical section)?
I believe the overhead of creating and managing threads is not the problem, since I am comparing cases in which threads are created either way.
The total time after parallelizing dynamics is lower than for the serial version (so a speedup is achieved), but why do the threads take significantly different times in the two variants (dynamics inside the critical section vs. outside)?
The only explanation I could come up with is that running dynamics creates a race condition even though its inputs and outputs are private to each thread. (Could this be?)
Also, I am not using omp_get_num_threads or omp_get_thread_num.
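For completeness, I also considered timing the whole parallel region with a single wall-clock measurement, to compare against the per-thread times; a minimal sketch of what I mean, using omp_get_wtime() from <omp.h>:
double t0 = omp_get_wtime();
#pragma omp parallel for shared(a,b)
for (int id = 0; id < 6; id++) {
    // ... same per-thread work as above ...
}
double total_time = omp_get_wtime() - t0;
std::cout << "Total wall time = " << total_time << " s" << std::endl;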
What could be the issue here?
When running dynamics in parallel
ID = 3, Dynamics Time = 410233
ID = 2, Dynamics Time = 447835
ID = 5, Dynamics Time = 532967
ID = 1, Dynamics Time = 545017
ID = 4, Dynamics Time = 576783
ID = 0, Dynamics Time = 624855
When running dynamics in critical section
ID = 0, Dynamics Time = 331579
ID = 2, Dynamics Time = 303294
ID = 5, Dynamics Time = 307622
ID = 1, Dynamics Time = 340489
ID = 3, Dynamics Time = 303066
ID = 4, Dynamics Time = 293090
(I would not be able to provide a minimal reproduction of dynamics, as it is proprietary to my professor.)
Thank you.
CodePudding user response:
This is a quite classical/common case where the speed-up is not proportional to the number of threads. There can be several explanations:
- when a single thread is running alone (that is, only one core is 100% loaded), the CPU frequency is boosted (turbo boost), whereas the CPU sticks to its nominal frequency when several threads are running concurrently
- the code inside dynamics() is (near-)memory bound. That is, when several threads run concurrently, you just tend to saturate the bandwidth between the CPU and the RAM, and the cores do not receive enough data to run at 100%
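A quick way to check the second point, independently of dynamics(), is to time a purely memory-bound loop in each thread; a minimal sketch (the array size, the loop count and the stand-in computation are arbitrary):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>
#include <omp.h>

int main() {
    const std::size_t n = 20000000;               // ~160 MB per array: far larger than the caches
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double sink = 0.0;                            // keeps the compiler from removing the loop

    #pragma omp parallel for reduction(+:sink)
    for (int t = 0; t < 6; ++t) {
        auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)       // streaming reads: limited by memory bandwidth
            s += x[i] * y[i];
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        sink += s;
        #pragma omp critical
        std::cout << "iteration " << t << " on thread " << omp_get_thread_num()
                  << ": " << ms << " ms" << std::endl;
    }
    std::cout << "checksum = " << sink << std::endl;
    return 0;
}
Running it once with OMP_NUM_THREADS=1 and once with all cores shows each iteration taking longer under contention, even though no data is shared between iterations.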
CodePudding user response:
I measured the calculation time of vector operations with code almost similar to yours, on a single thread and on multiple threads, using two different vector sizes. When the vector size is 10,000 the parallel version is only a little slower, but at n = 1,000,000 it ends up being much slower than the serial version. As the volume of data handled by the threads grows, the traffic between the CPU and memory becomes heavier, and the parallel performance degrades accordingly.
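The benchmark was roughly of the following shape (a sketch: dynamics_like() is only a stand-in for your dynamics(), and the ids printed are raw thread ids rather than the loop index):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>
#include <omp.h>

// Stand-in for dynamics(): a simple element-wise vector operation,
// repeated so that the per-call time is measurable.
void dynamics_like(std::vector<double>& a, std::vector<double>& b,
                   const std::vector<double>& c, int repeats) {
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < a.size(); ++i) {
            a[i] += 0.5 * c[i];
            b[i] += a[i] * c[i];
        }
}

int main() {
    const std::size_t n = 1000000;    // compare e.g. 10000 vs 1000000
    const int repeats = 100;

    #pragma omp parallel for          // remove this line (or set OMP_NUM_THREADS=1) for the serial run
    for (int id = 0; id < 7; ++id) {
        std::vector<double> a(n, 0.0), b(n, 0.0), c(n, (double)id);
        auto t0 = std::chrono::steady_clock::now();
        dynamics_like(a, b, c, repeats);
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        #pragma omp critical
        std::cout << "id=" << std::this_thread::get_id() << " time=" << ms << std::endl;
    }
    return 0;
}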
num vector = 10000
Parallel
id=0091152 time=10155
id=0082408 time=10169
id=0074644 time=10172
id=0135644 time=10269
id=0092996 time=10303
id=0133796 time=10348
id=0135884 time=10420
Serial
id=0132880 time=7635
id=0106048 time=7643
id=0072972 time=7618
id=0107080 time=7794
id=0100064 time=7942
id=0110648 time=8044
id=0111988 time=7849
---------------------
num vector = 1000000
Parallel
id=0069820 time=27000
id=0135668 time=27106
id=0118184 time=27144
id=0102572 time=27158
id=0046388 time=27165
id=0120604 time=27173
id=0044344 time=27188
Serial
id=0038320 time=5341
id=0133000 time=5253
id=0101508 time=5168
id=0004840 time=5212
id=0087408 time=5143
id=0130548 time=5199
id=0122764 time=5126
time unit=msec