The strong-ness of x86 store instruction wrt. SC-DRF?

I read about Herb's atomic<> Weapons talk and had a question about page 42:

He mentioned that (50:00 in the video):

(x86) stores are much stronger than they need to be...

What I don't understand is: if the x86 "S" on the chart is a plain store, i.e. mov, I don't think it's stronger than SC-DRF because it's only a release store plus total store order (and that's why you need an xchg for a SC store). But if it means an SC store, i.e. xchg, it should fall on the "fully SC" bar because it's effectively a full barrier. How should I take this x86 "S"'s strong-ness on the chart?

(SC-DRF is a guarantee of Sequentially Consistent execution for Data Race Free programs, as long as they don't use any atomics with orders weaker than std::memory_order_seq_cst. ISO C++, Java, and other languages provide this.)

CodePudding user response：

It's the second case: he's showing xchg there (a full barrier and an RMW operation), not just a mov store. A plain mov would be below the SC-DRF bar, because on its own, without mfence or another barrier, it doesn't provide sequential consistency.
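To make the two candidates for "S" concrete, here is a minimal sketch of a release store next to a seq_cst store. The function names are hypothetical, and the instructions named in the comments are what mainstream compilers typically emit, per the mappings table linked at the end of this answer; check your own compiler's asm output rather than taking them as guaranteed.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag{0};

// Release store (hypothetical helper name).
// Typical x86-64 codegen: plain `mov`. Typical AArch64: `stlr`.
void store_release(int v) { flag.store(v, std::memory_order_release); }

// seq_cst store (hypothetical helper name).
// Typical x86-64 codegen: `xchg` (or `mov` + `mfence`). Typical AArch64: `stlr`.
void store_seq_cst(int v) { flag.store(v, std::memory_order_seq_cst); }

// seq_cst load. Typical x86-64 codegen: plain `mov`. Typical AArch64: `ldar`.
int load_seq_cst() { return flag.load(std::memory_order_seq_cst); }

// Single-threaded sanity check: both stores write the value; they differ
// only in the ordering they promise relative to other memory operations.
void self_check() {
    store_release(1);
    assert(load_seq_cst() == 1);
    store_seq_cst(2);
    assert(load_seq_cst() == 2);
}
```

Pasting something like this into a compiler explorer for both x86-64 and AArch64 is probably the quickest way to see which instruction the chart's "S" corresponds to.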

Compare ARM64 stlr / ldar: they can't reorder with each other (not even StoreLoad), but stlr can reorder with other later operations, except of course other release-store operations or some fences (as I mentioned in my answer to your previous question). See also "Does STLR(B) provide sequential consistency on ARM64?" re: interaction with ldar for SC vs. ldapr for just acquire / release or acq_rel. Also "Possible orderings with memory_order_seq_cst and memory_order_release" for another example of how AArch64 compiles (without ARMv8.3 LDAPR).

But x86 seq_cst stores drain the store buffer on the spot, even if there is no later seq_cst load, store, or RMW in the same thread. This lack of reordering with later non-SC or non-atomic loads/stores is what makes it stronger (and more expensive) than necessary.
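The extra strength can be seen in a store-buffer litmus test, sketched below under some assumptions: the helper name `count_both_zero` is made up for illustration, and spawning fresh threads per iteration is a crude harness, not a serious reordering detector. With seq_cst loads, the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model. With relaxed loads after the seq_cst stores, the C++ model permits it, and as far as I know AArch64 (stlr followed by a plain ldr) can actually produce it, while x86's xchg rules it out anyway because it's a full barrier.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iters` times and
// returns how many runs ended with both threads reading 0.
// The stores are always seq_cst; LoadOrder selects the order of the loads.
template <std::memory_order LoadOrder>
int count_both_zero(int iters) {
    int hits = 0;
    for (int i = 0; i < iters; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;  // written by one thread each, read after join

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(LoadOrder);  // later load in the same thread
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(LoadOrder);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++hits;
    }
    return hits;
}

// With seq_cst loads the "both zero" outcome must never happen.
void check_seq_cst() {
    assert(count_both_zero<std::memory_order_seq_cst>(100) == 0);
}
```

`count_both_zero<std::memory_order_relaxed>(n)` is the interesting variant: a nonzero result is legal C++, but you shouldn't expect to see one on x86, which is exactly the "stronger than necessary" part.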

Herb Sutter explained this earlier in the video, at around 36:00. He points out that xchg is stronger than necessary, not just an SC-release that can one-way reorder with later non-SC operations. "So what we have here, is overkill. Much stronger than is necessary" (at 36:30).

(Side note: right around 36:00, he misspoke: he said "we're not going to use these first 3 guarantees" (that x86 doesn't reorder loads with loads or stores with stores, or stores with older loads). But those guarantees are why an SC load can be just a plain mov. Same for acq/rel being just a plain mov for both load and store. That's why, as he says, lfence and sfence are irrelevant for std::atomic.)
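As an illustration of acq/rel needing nothing beyond plain mov on x86, here is a minimal message-passing sketch. The names `payload`, `ready`, and `message_passing` are hypothetical. The release store and acquire load are all the C++ model needs to make the non-atomic write visible to the reader, and on x86-64 both typically compile to ordinary mov instructions because the hardware already keeps loads ordered with loads and stores ordered with stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Hypothetical demo: returns the payload value the consumer observed.
int message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // x86-64: typically plain mov
    });
    std::thread consumer([&] {
        // x86-64: the acquire load is typically a plain mov as well.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load, so it reads 42
    });

    producer.join();
    consumer.join();
    return seen;
}

void self_check_message_passing() {
    assert(message_passing() == 42);
}
```

No lfence or sfence appears anywhere in the typical x86-64 output for this, which matches his point that those instructions don't matter for std::atomic.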

So anyway, ARM64 can hit the sweet spot with no extra barrier instructions, being exactly strong enough for seq_cst but no stronger. (ARMv8.3 with ldapr is slightly stronger than acq_rel requires, e.g. ARM64 still forbids IRIW reordering, but only a few machines can do IRIW reordering in practice, notably POWER.)

Other ISAs with both L and S below the bar need extra barriers as part of their seq_cst load and seq_cst store recipes (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).