I have the following program, which uses std::atomic_thread_fence:
#include <atomic>
#include <iostream>
#include <thread>

int data1 = 0;
std::atomic<int> data2{0};
std::atomic<int> state;

int main() {
    state.store(0);
    data1 = 0;
    data2 = 0;
    std::thread t1([&]{
        data1 = 1;
        state.store(1, std::memory_order_release);
    });
    std::thread t2([&]{
        auto s = state.load(std::memory_order_relaxed);
        if (s != 1) return;
        std::atomic_thread_fence(std::memory_order_acquire);
        data2.store(data1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        state.store(2, std::memory_order_relaxed);
    });
    std::thread t3([&]{
        auto d = data2.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (state.load(std::memory_order_relaxed) == 0) {
            std::cout << d;
        }
    });
    t1.join();
    t2.join();
    t3.join();
}
It consists of 3 threads and one global atomic variable state used for synchronization. The first thread writes some data to a global, non-atomic variable data1 and sets state to 1. The second thread reads state and, if it's equal to 1, assigns data1 to another global non-atomic variable data2. After that, it stores 2 into state. The third thread reads the content of data2 and then checks state.
Q: Will the third thread always print 0? Or is it possible for the third thread to see the update to data2 before the update to state? If so, is using the seq_cst memory order the only way to guarantee the ordering?
CodePudding user response:
I think that t3 can print 1.
I believe the basic issue is that the release fence in t2 is misplaced. It is supposed to be sequenced before the store that is to be "upgraded" to release, so that all earlier loads and stores become visible before the later store does. Here, it has the effect of "upgrading" the state.store(2). But that is not helpful, because nobody is trying to use the condition state.load() == 2 to order anything. So the release fence in t2 doesn't synchronize with the acquire fence in t3. Therefore you do not get any stores to happen-before any of the loads in t3, so you get no assurance at all about what values they might return.
The fence really ought to go before data2.store(data1), and then it should work. You would be assured that anyone who observes that store will thereafter observe all prior stores. That would include t1's state.store(1), which is ordered earlier because of the release/acquire pair between t1 and t2.
So if you change t2 to
auto s = state.load(std::memory_order_relaxed);
if (s != 1) return;
std::atomic_thread_fence(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_release); // moved
data2.store(data1, std::memory_order_relaxed);
state.store(2, std::memory_order_relaxed); // irrelevant
then whenever data2.load() in t3 returns 1, the release fence in t2 synchronizes with the acquire fence in t3 (see C++20 [atomics.fences] p2). The t2 store to data2 only happened if the t2 load of state returned 1, which ensures that the release store in t1 synchronizes with the acquire fence in t2 ([atomics.fences] p4). We then have
t1 state.store(1)
  synchronizes with
t2 acquire fence
  sequenced before
t2 release fence
  synchronizes with
t3 acquire fence
  sequenced before
t3 state.load()
so that state.store(1) happens before state.load(), and thus state.load() cannot return 0 in this case. This would ensure the desired ordering without requiring seq_cst.
To imagine how the original code could actually fail, think about something like POWER, where certain sets of cores get special early access to snoop stores from each others' store buffers, before they hit L1 cache and become globally visible. Then an acquire barrier just has to wait until all earlier loads are complete; while a release barrier should drain not only its own store buffer, but also all other store buffers that it has access to.
So suppose core1 and core2 are such a special pair, but core3 is further away and only gets to see stores after they are written to L1 cache. We could have:
core1          core2          L1 cache       core3
=====          =====          ========       =====
data1 <- 1
release                       data1 <- 1
state <- 1
(still in store buffer)
               1 <- state
               acquire
               1 <- data1
               data2 <- 1     data2 <- 1
                                             1 <- data2
                                             acquire
                                             0 <- state
               release        state <- 1
               state <- 2     state <- 2
The release barrier in core 2 does cause the store buffer of core 1 to drain and thus write state <- 1
to L1 cache, but by then it is too late.