Does std::atomic<> guarantee that a store() operation is propagated (almost) immediately to other threads?

I have std::atomic<T> atomic_value; (where T is bool, int32_t, int64_t, or any other type). If the 1st thread does

atomic_value.store(value, std::memory_order_relaxed);

and the 2nd thread, at some points in its code, does

auto value = atomic_value.load(std::memory_order_relaxed);

How fast is the updated atomic value propagated from the 1st thread to the 2nd, between CPU cores? (For all CPU models.)

Is it propagated almost immediately? For example, at the speed of cache coherence propagation on Intel, meaning 0-2 cycles or so, and maybe a few more cycles on other CPU models or manufacturers.

Or can the value sometimes stay stale for many, many cycles?

Does atomic guarantee that the value is propagated between CPU cores as fast as possible for the given CPU?

Suppose that instead, on the 1st thread, I do

atomic_value.store(value, std::memory_order_release);

and on the 2nd thread

auto value = atomic_value.load(std::memory_order_acquire);

Will that help the value propagate faster? (Note that both memory orders changed.) And does it now come with a speed guarantee, or is it the same guarantee of speed as for relaxed order?

As a side question: does replacing relaxed order with release/acquire also synchronize all modifications to other (non-atomic) variables?

That is, is everything the 1st thread wrote to memory before the store-with-release guaranteed to be seen by the 2nd thread in exactly its final state (the same as in the 1st thread) at the point of the load-with-acquire, provided of course that the loaded value was the new (updated) one?

So does this mean that, for ANY type of std::atomic<> (or std::atomic_flag), a store-with-release in one thread synchronizes all memory writes before it with the point in another thread that does a load-with-acquire of the same atomic, provided that the other thread actually observed the updated value? (Of course, if the value seen in the 2nd thread is not yet the new one, we have to assume the memory writes may not have finished yet.)
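
To make the side question concrete, here is a minimal sketch of the pattern I mean. The names payload, ready and publish_and_consume are made up for illustration. If release/acquire works the way I describe above, the consumer should never read anything other than 42 once it has seen ready == true.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                   // plain, non-atomic data
std::atomic<bool> ready{false};    // flag used to publish the data

// Returns the payload value observed by the consumer thread.
int publish_and_consume() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&seen] {
        // Spin until the release store has been observed.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // read of non-atomic data after the acquire load
    });

    producer.join();
    consumer.join();
    return seen;
}
```

With both orders changed to memory_order_relaxed, the flag itself would still be atomic, but the read of payload would be a data race, which is exactly the difference I am asking about.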

PS. Why this question arose: from the name "atomic" it is natural to conclude (perhaps wrongly) that by default (without extra constraints, i.e. with just relaxed memory order) std::atomic<> only makes each operation atomic and nothing else, with no other guarantees about synchronization or speed of propagation. That is, a write to the memory location is done as a whole (e.g. all 4 bytes at once for int32_t), an exchange with an atomic location does its read and write atomically (effectively in a locked fashion), and an increment performs its three steps (read, add, write) atomically.

CodePudding user response：

The C++ standard says only this [C++20 intro.progress p18]:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

Technically this is only a "should", and "finite time" is not very specific. But the C++ standard covers such a broad range of hardware that you can't expect it to specify a particular number of cycles or nanoseconds or what have you.

In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.
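
As a rough illustration of that expectation, here is a minimal sketch in which one thread polls with relaxed loads while another does a single relaxed store. The function name spin_until_visible is invented for this example. Termination relies on the "finite period of time" recommendation quoted above, plus the fact that mainstream compilers emit a real load on every iteration rather than hoisting it out of the loop.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns how many times the reader polled
// before it observed the new value.
long spin_until_visible() {
    std::atomic<int> atomic_value{0};
    long polls = 0;

    std::thread reader([&] {
        // Each iteration performs an actual relaxed load of atomic_value.
        while (atomic_value.load(std::memory_order_relaxed) == 0) {
            ++polls;
        }
    });

    // A relaxed store is still expected to become an actual store instruction.
    atomic_value.store(1, std::memory_order_relaxed);
    reader.join();  // returns once the reader has seen the store
    return polls;
}
```

The poll count varies from run to run and may well be zero if the store lands before the reader starts; the point is only that the loop ends.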

(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
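
A sketch of the kind of code that parenthetical has in mind, with a made-up progress counter. Under the reasoning above, a compiler would be permitted to emit only the final store of 3, since no reader could have counted on catching 1 or 2; whether any particular compiler actually does this is not something the sketch demonstrates.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> progress{0};  // hypothetical progress counter

// Three back-to-back stores to the same atomic object.
int report_progress() {
    progress.store(1, std::memory_order_relaxed);  // could legally be elided
    progress.store(2, std::memory_order_relaxed);  // could legally be elided
    progress.store(3, std::memory_order_relaxed);  // last value in modification order
    return progress.load(std::memory_order_relaxed);
}
```

Either way, the calling thread reads back 3, and another thread that keeps polling progress should eventually see 3 as well.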

A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.

Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.
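
If you want to see what this looks like on your own hardware, one rough approach is a ping-pong between two threads through a single atomic. The sketch below is only that, a sketch: the name ping_pong_ns and the turn-counter scheme are invented here, the result includes two cross-core transfers per round plus polling overhead, and it depends heavily on which cores the threads land on.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: average nanoseconds per round trip between two threads.
double ping_pong_ns(int rounds) {
    // Odd values mean "helper's turn", even values mean "main thread's turn".
    std::atomic<int> turn{0};

    std::thread helper([&] {
        for (int i = 0; i < rounds; ++i) {
            // Wait for the main thread's store, then answer it.
            while (turn.load(std::memory_order_relaxed) != 2 * i + 1) {}
            turn.store(2 * i + 2, std::memory_order_relaxed);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        turn.store(2 * i + 1, std::memory_order_relaxed);
        // Wait for the helper's reply before starting the next round.
        while (turn.load(std::memory_order_relaxed) != 2 * i + 2) {}
    }
    auto stop = std::chrono::steady_clock::now();
    helper.join();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / rounds;
}
```

Swapping the relaxed orders for release and acquire and rerunning gives a crude comparison; per the reasoning above, you should not expect the stronger orders to make the round trip any faster.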