For example, is calling std::mutex::lock() required by the Standard to provide a sequentially consistent fence, an acquire fence, or neither?
cppreference.com doesn't seem to address this topic. Is it addressed in any reference documentation that's easier to use than the Standard or working papers?
CodePudding user response:
I'm not sure about an easier source, but here's a quote from a note in the standard:
[...] a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A.
I think that answers the question about memory fences reasonably well (and although it's "only" a note, not a normative part of the standard, I'd say it's as reliable a description of the standard as any other site could hope to provide).
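To make the note concrete, here is a minimal sketch (the names `run_handoff`, `payload` and `ready` are made up for illustration, they don't come from the standard). The writer's unlock is the release operation and the reader's later lock of the same mutex is the acquire operation, so once the reader sees `ready == true` it is also guaranteed to see `payload == 42`, even though both variables are plain non-atomic objects.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: returns the payload value the reader observed
// after it saw ready == true under the mutex.
int run_handoff() {
    std::mutex m;
    int payload = 0;     // plain, non-atomic data guarded by m
    bool ready = false;  // also non-atomic, also guarded by m
    int seen = -1;

    std::thread writer([&] {
        std::lock_guard<std::mutex> lock(m);
        payload = 42;
        ready = true;
    });  // leaving the scope unlocks m: a release operation on the mutex

    std::thread reader([&] {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(m);  // acquire operation
                if (ready) {
                    seen = payload;  // ordered after the writer's stores
                    return;
                }
            }
            std::this_thread::yield();  // let the writer run
        }
    });

    writer.join();
    reader.join();
    assert(seen != -1);  // the reader only returns after storing to seen
    return seen;
}
```

Nothing here relies on a fence: the guarantee comes purely from the acquire/release pairing on the same mutex.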
CodePudding user response:
std::atomic and std::mutex operations never require full 2-way fences. That does happen in practice on some ISAs as an implementation detail, notably x86, but not AArch64.
Even std::atomic<T> atomic RMWs with the default memory_order_seq_cst aren't as strong as full 2-way fences, I think. On real ISAs where SC RMWs can be done without being much stronger than required (specifically AArch64), I'm not sure they stop relaxed operations on opposite sides from reordering with each other (happening between the load and store parts of the atomic RMW).
As Jerry Coffin says, taking a std::mutex is only an acquire operation in the ISO C++ standard, not an acquire fence. It's not like std::atomic_thread_fence(std::memory_order_acquire); it's only required to be as strong as foo.exchange(desired, std::memory_order_acquire).
The lack of mention of requiring a 2-way fence makes it clear that one isn't required or guaranteed by the standard. An acquire operation like taking a mutex allows 1-way reordering with itself, so relaxed operations before/after it can potentially reorder with each other. (That's why fences and operations are different things.)
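To illustrate the operation-vs-fence distinction, here is a rough sketch of the same message-passing pattern written both ways (the function and variable names are hypothetical). In the first version the ordering is attached to the accesses of `published` themselves; in the second the accesses are relaxed and standalone fences provide the ordering. Both are correct here, but only the fences order all surrounding operations rather than just those relative to one atomic object.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Ordering attached to the operations on `published` themselves.
int handoff_with_operations() {
    int value = 0;                       // plain data being handed off
    std::atomic<bool> published{false};
    int seen = -1;

    std::thread producer([&] {
        value = 1;
        published.store(true, std::memory_order_release);  // release operation
    });
    std::thread consumer([&] {
        while (!published.load(std::memory_order_acquire)) {  // acquire operation
            std::this_thread::yield();
        }
        seen = value;
    });

    producer.join();
    consumer.join();
    assert(seen != -1);
    return seen;
}

// Relaxed operations on `published`, with standalone fences doing the ordering.
int handoff_with_fences() {
    int value = 0;
    std::atomic<bool> published{false};
    int seen = -1;

    std::thread producer([&] {
        value = 2;
        std::atomic_thread_fence(std::memory_order_release);  // release fence
        published.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!published.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
        seen = value;
    });

    producer.join();
    consumer.join();
    assert(seen != -1);
    return seen;
}
```

Taking a std::mutex is only required to behave like the first version, with the mutex playing the role of `published`.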
Being any stronger than that is an implementation detail. For example, on x86 any atomic RMW operation is a full barrier: it waits for the store buffer to drain, and for all earlier loads to complete, before RMWing the cache line. So it's like a std::atomic_thread_fence(std::memory_order_seq_cst) tied to the foo.exchange(); in fact a dummy lock add byte [rsp], 0 is how most compilers implement that C++ fence, because unfortunately mfence is slower on most CPUs.
Taking a mutex always requires an atomic RMW, but some machines can do that in ways that allow limited reordering with surrounding operations. E.g. AArch64 can use ldaxr (sequential-acquire load-linked) / stxr (plain store-conditional, not stlxr with release semantics) to implement .exchange(acquire) or .compare_exchange_weak(acquire). See an example compiling to asm for AArch64 on Godbolt, and also the related questions "atomic exchange with memory_order_acquire and memory_order_release" and "For purposes of ordering, is atomic read-modify-write one operation or two?"
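As a rough sketch of what "only as strong as an acquire RMW" looks like in source, here is a toy spinlock (the `SpinLock` class and `count_with_spinlock` helper are hypothetical, and this is not how any real std::mutex is implemented). Locking uses compare_exchange_weak with acquire ordering and unlocking is a release store; that is all the ordering a lock needs for mutual exclusion and for visibility of the protected data. On AArECH64-style LL/SC targets I'd expect the lock loop to compile to something along the lines of the ldaxr / stxr pair described above, but that depends on the compiler and target options.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy lock: acquire RMW to take it, release store to drop it.
class SpinLock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        bool expected = false;
        // Acquire on success; a failed attempt needs no ordering at all.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;           // a failed CAS overwrote it
            std::this_thread::yield();  // be polite while spinning
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};

// Increments a plain (non-atomic) counter from several threads under the lock.
long count_with_spinlock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;  // protected only by `lock`
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

The counter comes out exact even though no seq_cst operation or fence appears anywhere, which matches the point that a full barrier is not needed to take a lock.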