Here is some code I wrote to test multithreading performance. In summary, it performs a long calculation in a loop, accumulates the results, and measures the time it takes. Accumulating the results requires placing a lock in one place. The problem is that using the lock on this single line kills multithreading performance. Why?
I have also measured the time it takes to lock/unlock the mutex. I compile the code with the g++ -O3 option.
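(For reference, the cost of an uncontested lock/unlock pair can be timed with a minimal loop like the one below. This is a reconstruction for illustration, not my original measurement code.)
#include <chrono>
#include <iostream>
#include <mutex>
int main()
{
std::mutex m;
const long int N = 10000000;
auto start = std::chrono::high_resolution_clock::now();
for(long int i = 0; i < N; i++)
{
m.lock(); //uncontested: only one thread, so no system call is expected
m.unlock();
}
auto elapsed = std::chrono::high_resolution_clock::now() - start;
auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed);
std::cout << "ns per lock/unlock pair: " << 1000.0 * us.count() / N << std::endl;
return 0;
}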
#include <chrono>
#include <cmath>
#include <functional>
#include <iomanip>
#include <iostream>
#include <mutex>
#include <vector>
#include <thread>
long double store;
std::mutex lock;
using ftype=std::function<long double(long int)>;
using loop_type=std::function<void(long int, long int, ftype)>;
///simple class to time the execution and print result.
struct time_n_print
{
time_n_print() :
start(std::chrono::high_resolution_clock::now())
{}
~time_n_print()
{
auto elapsed = std::chrono::high_resolution_clock::now() - start;
auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed);
std::cout << "Elapsed(us)=" << std::setw(7) << us.count();
std::cout << "; Result: " << (long int)(store);
}
std::chrono::high_resolution_clock::time_point start;
};//class time_n_print
///do long and pointless calculations which result in 1.0
long double slow(long int i)
{
long double pi=3.1415926536;
long double i_rad = (long double)(i) * pi / 180;
long double sin_i = std::sin(i_rad);
long double cos_i = std::cos(i_rad);
long double sin_sq = sin_i * sin_i;
long double cos_sq = cos_i * cos_i;
long double log_sin_sq = std::log(sin_sq);
long double log_cos_sq = std::log(cos_sq);
sin_sq = std::exp(log_sin_sq);
cos_sq = std::exp(log_cos_sq);
long double sum_sq = sin_sq + cos_sq;
long double result = std::sqrt(sum_sq);
return result;
}
///just return 1
long double fast(long int)
{
return 1.0;
}
///sum everything up with mutex
void loop_guarded(long int a, long int b, ftype increment)
{
for(long int i = a; i < b; i++)
{
long double inc = increment(i);
{
std::lock_guard<std::mutex> guard(lock);
store += inc;
}
}
}//loop_guarded
///sum everything up without locks
void loop_unguarded(long int a, long int b, ftype increment)
{
for(long int i = a; i < b; i++)
{
long double inc = increment(i);
{
store += inc;
}
}
}//loop_unguarded
//run calculations on multiple threads.
void run_calculations(int size,
int nthreads,
loop_type loop,
ftype increment)
{
store = 0.0;
std::vector<std::thread> tv;
long a(0), b(0);
for(int n = 0; n < nthreads; n++)
{
a = b;
b = n < nthreads - 1 ? a + size / nthreads : size;
tv.push_back(std::thread(loop, a, b, increment));
}
//Wait until all threads finish
for(auto& t : tv)
{
t.join();
}
}//run_calculations
int main()
{
long int size = 10000000;
{
std::cout << "\n1 thread - fast, unguarded : ";
time_n_print t;
run_calculations(size, 1, loop_unguarded, fast);
}
{
std::cout << "\n1 thread - fast, guarded : ";
time_n_print t;
run_calculations(size, 1, loop_guarded, fast);
}
std::cout << std::endl;
{
std::cout << "\n1 thread - slow, unguarded : ";
time_n_print t;
run_calculations(size, 1, loop_unguarded, slow);
}
{
std::cout << "\n2 threads - slow, unguarded : ";
time_n_print t;
run_calculations(size, 2, loop_unguarded, slow);
}
{
std::cout << "\n3 threads - slow, unguarded : ";
time_n_print t;
run_calculations(size, 3, loop_unguarded, slow);
}
{
std::cout << "\n4 threads - slow, unguarded : ";
time_n_print t;
run_calculations(size, 4, loop_unguarded, slow);
}
std::cout << std::endl;
{
std::cout << "\n1 thread - slow, guarded : ";
time_n_print t;
run_calculations(size, 1, loop_guarded, slow);
}
{
std::cout << "\n2 threads - slow, guarded : ";
time_n_print t;
run_calculations(size, 2, loop_guarded, slow);
}
{
std::cout << "\n3 threads - slow, guarded : ";
time_n_print t;
run_calculations(size, 3, loop_guarded, slow);
}
{
std::cout << "\n4 threads - slow, guarded : ";
time_n_print t;
run_calculations(size, 4, loop_guarded, slow);
}
std::cout << std::endl;
return 0;
}
Here is the typical output on a Linux machine with 4 cores:
>1 thread - fast, unguarded : Elapsed(us)=  32826; Result: 10000000
>1 thread - fast, guarded : Elapsed(us)= 172208; Result: 10000000
>
>1 thread - slow, unguarded : Elapsed(us)=2131659; Result: 10000000
>2 threads - slow, unguarded : Elapsed(us)=1079671; Result: 9079646
>3 threads - slow, unguarded : Elapsed(us)= 739284; Result: 8059758
>4 threads - slow, unguarded : Elapsed(us)= 564641; Result: 7137484
>
>1 thread - slow, guarded : Elapsed(us)=2198650; Result: 10000000
>2 threads - slow, guarded : Elapsed(us)=1468137; Result: 10000000
>3 threads - slow, guarded : Elapsed(us)=1306659; Result: 10000000
>4 threads - slow, guarded : Elapsed(us)=1549214; Result: 10000000
So what we can see:
- locking/unlocking the mutex actually takes quite a long time compared with, say, incrementing the long double value;
- without the mutex the gain from multithreading is very good, as expected; and, as expected, we lose quite a lot of increments due to racing;
- with the mutex there is no gain beyond 2 threads.
The main question -- why does the part of the code which takes < 10% of the execution time kill performance so dramatically?
I understand that I can work around this by accumulating results in each thread separately and then summing them up at the end. But why does this problem appear in the first place?
UPDATE: Thanks for the answers and comments. Basically, what it boils down to is that we cannot get a good performance gain if each thread spends 7-8% of its time holding the lock. If, in the code above, I repeat the calculation in the slow function 10 times, the performance gains of the guarded and unguarded versions are identical up to 4 threads. So my rule of thumb now is: the time spent in the locked state should not exceed 1% of the total execution time.
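In Amdahl's-law terms (a back-of-the-envelope check against the numbers above, not a measurement): if a fraction f of the single-threaded work is serialized by the lock, the speedup on n threads is bounded by

S(n) \le \frac{1}{f + (1 - f)/n}

With f ≈ 0.08 and n = 2 that allows at most 1/(0.08 + 0.46) ≈ 1.85, while the measured guarded speedup is 2198650/1468137 ≈ 1.50; the gap on top of the bound is the extra cost of the contention itself.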
CodePudding user response:
Locking a contested[1] mutex requires a system call and everything that involves: a context switch into the operating system, which will possibly schedule some other process, so that when you return, all your caches are invalidated, etc. This is inherently a fairly expensive operation. Because your slow function isn't that expensive, all things considered, there seems to remain enough contention on the mutex that the code has to go to the operating system relatively often, and this leads to a noticeable drag on performance.
It'd be good practice to have every thread aggregate its results in a variable of its own and then update the shared total once, in bulk, at the end, so that you need only a single mutex lock for the whole computation. Generally, if you're going to synchronize with a mutex and care about performance, you'll need to find ways to split your work into chunks coarse enough that the mutex doesn't become a significant drag. I'm afraid that's just normal.
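For illustration, such a per-thread accumulation variant of the loop could look like this (a sketch in terms of the types defined above; loop_local is a name I made up, not part of the original program):
///sum into a local accumulator; take the mutex only once, at the end
void loop_local(long int a, long int b, ftype increment)
{
long double local = 0.0;
for(long int i = a; i < b; i++)
{
local += increment(i);
}
std::lock_guard<std::mutex> guard(lock);
store += local;
}//loop_local
It matches the loop_type signature, so it can be passed to run_calculations unchanged, e.g. run_calculations(size, 4, loop_local, slow);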
Otherwise, lock-free data structures offer an alternative. They avoid going to the operating system for locking, but many of them will effectively busy-wait on each other if contention becomes too high. If contention stays low, though, they're well worth a look when you're going for performance.
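For instance, the guarded loop could be rewritten around std::atomic with a compare-and-swap retry loop instead of the mutex. This is only a sketch: I narrow the accumulator to double here, because std::atomic<long double> is often not lock-free (check is_lock_free() on your platform):
#include <atomic>
std::atomic<double> atomic_store{0.0};
///sum everything up with a compare-and-swap retry loop instead of a mutex
void loop_atomic(long int a, long int b, ftype increment)
{
for(long int i = a; i < b; i++)
{
double inc = (double)increment(i);
double old = atomic_store.load(std::memory_order_relaxed);
//if another thread updated atomic_store in the meantime,
//compare_exchange_weak reloads 'old' and we retry
while(!atomic_store.compare_exchange_weak(old, old + inc, std::memory_order_relaxed))
{}
}
}//loop_atomic
Under heavy contention this retry loop degenerates into exactly the busy-waiting described above, so per-thread accumulation remains the better fix for this particular pattern.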
[1] As Solomon points out in the comments, if the mutex is uncontested, the lock operation (with modern CPUs and operating systems) can be done in user space and is therefore much, much cheaper.