C - volatile and memory barriers in lockless shared memory access?-CodePudding

Hi I had a general question regarding usage of volatile and memory barriers in C while making memory changes in shared memory being concurrently accessed by multiple threads without locks. As I understand volatile and memory barriers serve the following general purposes

memory barriers

A) make sure all pending memory accesses (read/writes(depending on the barrier)) have been properly completed before the barrier and only then the memory accesses following the barrier are executed.

B) Make sure that the compiler does not reorder load/store instructions(depending on the barrier) across the barriers.

Basically the purpose of point A is to handle out of order execution and write buffer flush delay scenarios where the processor itself ends up reordering instructions generated by the compiler OR memory accesses made by the said instructions. The purpose of the point B is that when C code is translated to machine code the compiler does not itself move those accesses in assembly around.

Now for volatile volatile is basically meant in a loose way so that so that the compiler does not perform its optimisations while optimising code written with volatile variables. The following broad purposes are served

A) memory accesses are not cached in cpu registers when translating C code to machine level code and every time a read in code happens it’s converted into a load instruction to be done through the memory in assembly.

B) relative order of memory accesses in assembly with other volatile variables are kept in the same order when the compiler transforms C code to machine code while the memory accesses in assembly with non volatile variables can be interleaved.

I have the following questions

is my understanding correct and complete ? Like are there cases I am missing or something I am saying incorrect.
so then whenever we are writing code making memory changes in shared memory being concurrently accessed by multiple threads we need to make sure we have barriers so that behaviour corresponding to point 1.A and 1.B doesn’t happen. The behaviour corresponding to 2.B will be handled by 1.B and for 2.A we need to cast our pointer to a volatile pointer for the access. Basically I am trying to understand should we always be casting the pointer to a volatile pointer and then making the memory access so that we are sure 2.A doesn’t happen or are there are cases where only using barriers suffice ?

CodePudding user response：

is my understanding correct and complete ?

Yeah, it looks that way, except for not mentioning that C11 <stdatomic.h> made all this obsolete for almost all purposes.

There are more bad/weird things that can happen without volatile (or better, _Atomic) that you didn't list: the LWN article Who's afraid of a big bad optimizing compiler? goes into detail about things like inventing extra loads (and expecting them both to read the same value). It's aimed at Linux kernel code, where C11 _Atomic isn't how they do things.

Other than the Linux kernel, new code should pretty much always use <stdatomic.h> instead of rolling your own atomics with volatile and inline asm for RMWs and barriers. But that does continue to work because all real-world CPUs that we run threads across have coherent shared memory, so making a memory access happen in the asm is enough for inter-thread visibility, like memory_order_relaxed. See When to use volatile with multi threading? (basically never, except in the Linux kernel or maybe a handful of other codebases that already have good implementations of hand-rolled stuff).

In ISO C11, it's data-race undefined behaviour for two threads to do unsynchronized read write on the same object, but mainstream compilers do define the behaviour, just compiling the way you'd expect so hardware guarantees or lack thereof come into play.

Other than that, yeah, looks accurate except for your final question 2: there are use-cases for memory_order_relaxed atomics, which is like volatile with no barriers, e.g. an exit_now flag.

or are there are cases where only using barriers suffice ?

No, unless you get lucky and the compiler happens to generate correct asm anyway.

Or unless other synchronization means this code only runs while no other threads are reading/writing the object. (C 20 has std::atomic_ref<T> to handle the case where some parts of the code need to do atomic accesses to data, but other parts of your program don't, and you want to let them auto-vectorize or whatever. C doesn't have any such thing yet, other than using plain variables with/without GNU C __atomic_load_n() and other builtins, which is how C headers implement std::atomic<T>, and which is the same underlying support that C11 _Atomic compiles to. Probably also the C11 functions like atomic_load_explicit defined in stdatomic.h, but unlike C , _Atomic is a true keyword not defined in any header.)