Any operation/fence available weaker than release but still offering synchronize-with semantic?


std::memory_order_release and std::memory_order_acquire operations provide the synchronize-with semantic.

In addition to that, std::memory_order_release guarantees that no preceding loads or stores can be reordered past the release operation.

Questions:

  1. Is there anything in C++20/23 that provides the same synchronize-with semantics but isn't as strong as std::memory_order_release, such that loads can be reordered past the release operation? The hope is that the out-of-order code is better optimized (by the compiler or by the CPU).
  2. Assuming there's no such thing in C++20/23, is there any non-standard way to do it (e.g. some inline asm) for x86 on Linux?

CodePudding user response:

ISO C++ only has three orderings that apply to stores: relaxed, release, and seq_cst. Relaxed is clearly too weak, and seq_cst is strictly stronger than release. So, no.

The property that neither loads nor stores may be reordered past a release store is necessary to provide the synchronize-with semantics that you want, and can't be weakened in any way I can think of without breaking them. The point of synchronize-with is that a release store can be used as the end of a critical section. Operations within that critical section, both loads and stores, have to stay there.

Consider the following code:

#include <atomic>
#include <iostream>

std::atomic<bool> go{false};
int crit = 17;

void thr1() {
    int tmp = crit;      // non-atomic load of crit
    go.store(true, std::memory_order_release);
    std::cout << tmp << std::endl;
}

void thr2() {
    while (!go.load(std::memory_order_acquire)) {
        // delay
    }
    crit = 42;
}

This program is free of data races and must output 17. This is because the release store in thr1 synchronizes with the final acquire load in thr2, the one that returns true (thus taking its value from the store). This implies that the load of crit in thr1 happens-before the store in thr2, so they don't race, and the load does not observe the store.

If we replaced the release store in thr1 with your hypothetical half-release store, such that the load of crit could be reordered after go.store(true, half_release), then that load might take place any amount of time later. It could in particular happen concurrently with, or even after, the store of crit in thr2. So it could read 42, or garbage, or anything else could happen. This should not be possible if go.store(true, half_release) really did synchronize with go.load(acquire).
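To make the failure concrete, here is the program that the hypothetical half_release would effectively permit. half_release is not a real memory order; the sketch below just uses a relaxed store to stand in for the reordered result:

void thr1_reordered() {
    go.store(true, std::memory_order_relaxed);  // the flag becomes visible first
    int tmp = crit;      // this load can now run concurrently with crit = 42 in thr2:
                         // a data race, i.e. undefined behavior, not just "42 instead of 17"
    std::cout << tmp << std::endl;
}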

CodePudding user response:

ISO C++

In ISO C++, no: release is the minimum for the writer side of doing some (possibly non-atomic) stores and then storing a data_ready flag. It's also the minimum for locking / mutual exclusion, to keep loads before a release store and stores after an acquire load (no LoadStore reordering), and for anything else happens-before gives you. (C++'s model works in terms of guarantees on what a load can or must see, not in terms of local reordering of loads and stores from a coherent cache; here I'm talking about how those guarantees are mapped onto asm for normal ISAs.) acq_rel RMWs, and seq_cst stores or RMWs, also work, but are stronger than release.
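For reference, here is the data_ready pattern described above, as a minimal sketch; payload and data_ready are names chosen for illustration:

#include <atomic>

int payload;                          // non-atomic data
std::atomic<bool> data_ready{false};

void writer() {
    payload = 42;                                         // plain store
    data_ready.store(true, std::memory_order_release);    // must stay after the payload store
}

int reader() {
    while (!data_ready.load(std::memory_order_acquire)) {
        // spin until the writer sets the flag
    }
    return payload;   // guaranteed to see 42: the acquire load synchronized-with the release store
}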


Asm with weaker guarantees that might be sufficient for some cases

In asm for some specific platform, there might be something weaker you could do, but it wouldn't give you full happens-before. I don't think there are any requirements on release which are superfluous to happens-before and normal acq/rel synchronization. (https://preshing.com/20120913/acquire-and-release-semantics/)

Some common use cases for acq/rel sync only need StoreStore ordering on the writer side and LoadLoad on the reader side (e.g. producer / consumer with one-way communication: non-atomic stores and a data_ready flag). Without the LoadStore ordering requirement, I could imagine either the writer or the reader being cheaper on some platforms; see the sketch below.
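ISO C++ has no StoreStore-only or LoadLoad-only fence, but the standalone-fence formulation below is where such a weaker fence would slot in if one existed. As written with std::atomic_thread_fence it is at least as strong as the acq/rel operations above; a sketch reusing the payload / data_ready names from earlier:

#include <atomic>

int payload;
std::atomic<bool> data_ready{false};

void writer_fence() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // RISC-V compilers typically emit: fence rw, w
    data_ready.store(true, std::memory_order_relaxed);
}

int reader_fence() {
    while (!data_ready.load(std::memory_order_relaxed)) {
        // spin
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // RISC-V compilers typically emit: fence r, rw
    return payload;
}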

Perhaps PowerPC or RISC-V? I checked what compilers do on Godbolt for a.load(acquire) and a.store(1, release).

# clang(trunk) for RISC-V -O3
load(std::atomic<int>&):     # acquire
        lw      a0, 0(a0)    # apparently RISC-V just has barriers, not acquire *operations*
        fence   r, rw        # but the barriers do let you block only what is necessary
        ret
store(std::atomic<int>&):    # release
        fence   rw, w
        li      a1, 1
        sw      a1, 0(a0)
        ret

If fence r,r and/or fence w,w are ever cheaper than fence r,rw or fence rw,w, then yes, RISC-V can do something slightly cheaper than acq/rel. Unless I'm missing something, that would still be strong enough if you just want loads after an acquire load to see stores from before a release store, but don't care about LoadStore ordering: other loads staying before a release store, and other stores staying after an acquire load.

CPUs naturally want to load early and store late to hide latencies, so it's usually not much of an extra burden to block LoadStore reordering on top of blocking LoadLoad or StoreStore. At least that's true as long as the ISA makes it possible to get just the ordering you need, without the only sufficient option being far beyond it, like 32-bit ARMv7, where blocking LoadStore takes a dmb ish full barrier that also blocks StoreLoad.


release is free on x86; other ISAs are more interesting.

memory_order_release is basically free on x86, only needing to block compile-time reordering. (See C++ - How is release-and-acquire achieved on x86 only using MOV? - the x86 memory model is program order plus a store buffer with store-forwarding.)
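To illustrate "basically free": a release store compiles to a plain mov on x86-64, while a seq_cst store needs a full barrier. The asm in the comments is typical compiler output; exact instructions vary by compiler:

#include <atomic>

std::atomic<int> flag{0};

void set_flag_release() {
    flag.store(1, std::memory_order_release);
    // typical x86-64 output: just a plain store, no barrier instruction
    //     mov dword ptr [rip + flag], 1
    //     ret
}

void set_flag_seq_cst() {
    flag.store(1, std::memory_order_seq_cst);
    // by contrast, seq_cst must also block StoreLoad reordering
    //     mov eax, 1
    //     xchg dword ptr [rip + flag], eax    ; implicitly locked: a full barrier
    //     ret
}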

x86 is a silly choice to ask about; something like PowerPC, where there are multiple different choices of light-weight barrier, would be more interesting. It turns out PowerPC only needs one barrier each for acquire and release, but seq_cst needs multiple different barriers before and after.

PowerPC asm looks like this for load(acquire) and store(1, release):

load(std::atomic<int>&):
        lwz %r3,0(%r3)
        cmpw %cr0,%r3,%r3     # I think this creates a data dependency on the load
        bne- %cr0,$+4         # never-taken branch, if I'm reading this right?
        isync                 # instruction sync, blocking the front-end until older instructions retire?
        blr
store(std::atomic<int>&):
        li %r9,1
        lwsync               # light-weight sync = LoadLoad + StoreStore + LoadStore  (but not blocking StoreLoad)
        stw %r9,0(%r3)
        blr

I don't know if isync is always cheaper than lwsync, which I'd think would also work there; I'd have thought stalling the front-end might be worse than imposing some ordering on loads and stores.

I suspect the reason for the compare-and-branch instead of just isync (documentation) is that a load can retire from the back-end ("complete") once it's known to be non-faulting, before the data actually arrives.

(x86 doesn't do this, but weakly-ordered ISAs do; it's how you get LoadStore reordering on CPUs like ARM, with in-order or out-of-order exec. Retirement goes in program order, but stores can't commit to L1d cache until after they retire. x86 requiring loads to produce a value before they can retire is one way to guarantee LoadStore ordering. How is load->store reordering possible with in-order commit?)

So on PowerPC, the compare into condition-register 0 (%cr0) has a data dependency on the load, and can't execute until the data arrives, so the load can't complete early. I don't know why there's also an always-false branch on it. I think the $+4 branch destination is the isync instruction, in case that matters. I wonder if the branch could be omitted if you only need LoadLoad, not LoadStore? Unlikely.


IDK if ARMv7 can block just LoadLoad or StoreStore. If so, that would be a big win over dmb ish, which compilers use because they also need to block LoadStore.


Loads cheaper than acquire: memory_order_consume

This is the useful hardware feature that ISO C++ doesn't currently expose: std::memory_order_consume is defined in a way that's too hard for compilers to implement correctly in every corner case without introducing more barriers, so it's deprecated, and compilers handle it the same as acquire.

Dependency ordering (on all CPUs except DEC Alpha) makes it safe to load a pointer and deref it without any barriers or special load instructions, and still see the pointed-to data if the writer used a release store.

If you want to do something cheaper than ISO C++ acq/rel, the load side is where the savings are on ISAs like POWER and ARMv7. (Not on x86, where a full acquire load is free.) To a much lesser extent on ARMv8, I think, as ldapr should be cheap-ish.

See C++11: the difference between memory_order_relaxed and memory_order_consume for more, including a talk from Paul McKenney about how Linux uses plain loads (effectively relaxed) to make the read side of RCU very cheap, with no barriers, as long as the code is careful not to let the compiler optimize away the data dependency into just a control dependency, or into nothing.
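As an illustration of the pattern (not a recommendation: with consume promoted to acquire, portable C++ currently gets no savings from it; Node and published are names invented for this sketch):

#include <atomic>

struct Node { int value; };

std::atomic<Node*> published{nullptr};

void publisher() {
    Node* n = new Node{17};                          // initialize the pointed-to data first
    published.store(n, std::memory_order_release);   // then publish the pointer
}

int consumer() {
    Node* n;
    while (!(n = published.load(std::memory_order_consume))) {
        // spin until a node is published
    }
    // The dereference carries a data dependency on the load; that's what
    // dependency ordering exploits. Current compilers just treat the
    // consume load as acquire, so this is no cheaper today.
    return n->value;
}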
