#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
using namespace std;
const int FLAG1 = 1, FLAG2 = 2, FLAG3 = 3;
int res = 0;
atomic<int> flagger;
void func1()
{
for (int i=1; i<=1000000; i ) {
while (flagger.load(std::memory_order_relaxed) != FLAG1) {}
res ; // maybe a bunch of other code here that don't modify flagger
// code here must not be moved outside the load/store (like mutex lock/unlock)
flagger.store(FLAG2, std::memory_order_relaxed);
}
cout << "Func1 finished\n";
}
void func2()
{
for (int i=1; i<=1000000; i ) {
while (flagger.load(std::memory_order_relaxed) != FLAG2) {}
res ; // same
flagger.store(FLAG3, std::memory_order_relaxed);
}
cout << "Func2 finished\n";
}
void func3() {
for (int i=1; i<=1000000; i ) {
while (flagger.load(std::memory_order_relaxed) != FLAG3) {}
res ; // same
flagger.store(FLAG1, std::memory_order_relaxed);
}
cout << "Func3 finished\n";
}
int main()
{
flagger = FLAG1;
std::thread first(func1);
std::thread second(func2);
std::thread third(func3);
first.join();
second.join();
third.join();
cout << "res = " << res << "\n";
return 0;
}
My program has a segment that is similar to this example. Basically, there are 3 threads: inputer, processor, and outputer. I found that busy wait using atomic is faster than putting threads to sleep using condition variable, such as:
std::mutex commonMtx;
std::condition_variable waiter1, waiter2, waiter3;
// then in func1()
unique_lock<std::mutex> uniquer1(commonMtx);
while (flagger != FLAG1) waiter1.wait(uniquer1);
However, is the code in the example safe? When I run it gives correct results (-std=c 17 -O3
flag). However, I don't know whether the compiler can reorder my instructions to outside the atomic check/set block, especially with std::memory_order_relaxed
. In case it's unsafe, then is there any way to make it safe while being faster than mutex?
Edit: it's guaranteed that the number of thread is < number of CPU cores
CodePudding user response:
std::memory_order_relaxed
results in no guarantees on the ordering of memory operations except on the atomic itself.
All your res ;
operations therefore are data races and your program has undefined behavior.
Example:
#include<atomic>
int x;
std::atomic<int> a{0};
void f() {
x = 1;
a.store(1, std::memory_order_relaxed);
x = 2;
}
Clang 13 on x86_64 with -O2
compiles this function to
mov dword ptr [rip a], 1
mov dword ptr [rip x], 2
ret
(https://godbolt.org/z/hxjYeG5dv)
Even on a cache-coherent platform, between the first and second mov
, another thread can observe a
set to 1
, but x
not set to 1
.
You must use memory_order_release
and memory_order_acquire
(or sequential consistency) to replace a mutex.
(I should add that I have not checked your code in detail, so I can't say that simply replacing the memory order is sufficient in your specific case.)
CodePudding user response:
As mentioned in the other answer, res ;
in different threads are not synchronized with each other, and cause a data race and undefined behavior. This can be checked with a thread sanitizer.
To fix this, you need to use memory_order_acquire
to loads and memory_order_release
for stores. The fix can too be confirmed with a thread sanitizer.
while (flagger.load(std::memory_order_acquire) != FLAG1) {}
res ;
flagger.store(FLAG2, std::memory_order_release);
Or, flagger.load(std::memory_order_acquire)
can be replaced with flagger.load(std::memory_order_relaxed)
, followed by std::atomic_thread_fence(std::memory_order_acquire);
after the loop.
while (flagger.load(std::memory_order_relaxed) != FLAG1) {}
std::atomic_thread_fence(std::memory_order_acquire);
res ;
flagger.store(FLAG2, std::memory_order_release);
I'm not sure how much it improves performance, if at all.
The idea is that only the last load in the loop needs to be an acquire operation, which is what the fence achieves.