What exactly is the problem that memory barriers deal with?


I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.

Let me start by stating the problem accurately. Let's take the following classic example as the basis for the discussion: suppose we have 2 threads running on 2 different cores.

This is pseudo-code!

We start with int f = 0; int x = 0; and then run those threads:

# Thread 1

while (f == 0);
print(x)

# Thread 2

x = 42;
f = 1;

Of course, the desired result of this program is that thread 1 will print 42.

NOTE: I'm leaving "compile-time reordering" out of this discussion; I only want to focus on what happens at runtime, so ignore all the optimizations the compiler might do.

Ok, so from what I understand, the problem here is what is called "memory reordering": the CPU is free to reorder memory operations as long as the result is indistinguishable from the point of view of the thread executing them. In this case, within thread 2, f = 1 may be executed before x = 42. If that happens, thread 1 will print 0, which is not what the programmer wants.

At this point, Wikipedia points at another possible scenario that may occur:

Similarly, thread #1's load operations may be executed out-of-order and it is possible for x to be read before f is checked

Since we're talking right now about "out-of-order execution", let's set the cores' caches aside for a moment. So let's analyze what happens here. Start with thread 2: the compiled instructions will look (in pseudo-assembly) something like:

1 put 42 into register1
2 write register1 to memory location of x
3 put 1 into register2
4 write register2 to memory location of f

Ok so I understand that 3-4 may be executed before 1-2. But I don't understand the equivalent in thread 1:

Let's say the instructions of thread 1 will be something like:

1 load f to register1
2 if register1 is 0 - jump to 1
3 load x to register2
4 print register2

What exactly may be out of order here? Can 3 execute before 1-2?

Let's go on. Up until now we've talked about out-of-order execution, which brings me to my primary confusion:

In this great post the author describes the problem as follows: each core has its own cache, and the core performs its memory operations against that cache, not against main memory. The movement of data from the core-specific caches to main memory (or a shared cache) occurs at unpredictable times and in unpredictable order. So in our example, even if thread 2 executes its instructions in order - the write of x = 42 happens before f = 1 - that happens only in core 2's cache. The movement of these values to shared memory may happen in the opposite order, and hence the problem.

So I don't understand: when we talk about "memory reordering", are we talking about out-of-order execution, or about the movement of data across caches?

CodePudding user response:

when we talk about "memory reordering", are we talking about out-of-order execution, or about the movement of data across caches?

When a thread observes changes of values in a particular order, it is indistinguishable from the programmer's perspective whether that was due to out-of-order execution of loads, a store buffer delaying stores relative to loads (and possibly letting them commit out of order, regardless of execution order), or - hypothetically, in a CPU without coherent cache - cache synchronization.

Or even by forwarding store data between logical cores without going through cache, before it commits to cache and becomes visible to all cores. Some POWER CPUs can do this in real life but few if any others.

Real CPUs have coherent caches: once a value commits to cache, it's visible to all cores, and the commit can't happen until all other copies of the line are already invalidated, so this is not the mechanism for reading "stale" data. Memory reordering on real-world CPUs is something that happens within a core, with reads and writes of coherent cache possibly happening in a different order than program order. Cache doesn't re-sync after getting out of sync; it maintains coherency in the first place.

The important effect, regardless of mechanism, is that another thread observing the same variables you're reading/writing can see the effects happen in a different order than assembly program order.
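To make that concrete, here is a minimal C++11 sketch of the usual remedy (my own illustration, not part of the original answer; the names writer and reader are made up): a release store on f paired with an acquire load forbids exactly the reorderings described above, whatever the hardware mechanism behind them.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> f{0};   // the flag, now an atomic
int x = 0;               // the payload, still a plain int

void writer() {
    x = 42;                                 // payload first
    f.store(1, std::memory_order_release);  // release: earlier writes can't be observed after this store
}

void reader() {
    while (f.load(std::memory_order_acquire) == 0)
        ;                                   // acquire: later reads can't be observed before this load
    std::cout << x << '\n';                 // guaranteed to print 42
}

int main() {
    std::thread t2(writer);
    std::thread t1(reader);
    t1.join();
    t2.join();
}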

CodePudding user response:

The two main questions you have both have the same answer (yes!), but for different reasons.

First, let's look at this particular piece of pseudo-machine-code:

Let's say the instructions of thread 1 will be something like:

1 load f to register1
2 if register1 is 0 - jump to 1
3 load x to register2
4 print register2

What exactly may be out of order here? Can 3 execute before 1-2?

To answer your question: this is a resounding "YES!". Since the contents of register1 are not tied in any way to the contents of register2, the CPU may happily (and correctly, for that matter) preload register2, so that when the 1-2 loop finally breaks, it can immediately go to 4.

For a practical example, register1 might be an I/O peripheral register tied to a polled serial clock, and the CPU is just waiting for the clock to transition to low so that it can bit-bang the next value onto the data output lines. Doing it that way, for one, saves precious time on the data fetch, and more importantly may avoid contention on the peripheral data bus.

So, yes, this kind of reordering is perfectly fine and allowed, even with optimizations turned off and even on a single-threaded, single-core CPU. The only way to make sure that register2 is definitely read after the loop breaks is to insert a barrier, as in the sketch below.
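A rough sketch of what such a barrier could look like in C++ (assuming C++11 atomics; the function name read_after_flag is made up): an acquire fence after the spin loop keeps the load of x from being hoisted above it. Note it only works when paired with a release on the writer side, sketched after the next paragraph.

#include <atomic>

std::atomic<int> f{0};   // shared flag
int x = 0;               // shared payload

int read_after_flag() {
    while (f.load(std::memory_order_relaxed) == 0)
        ;                                                 // spin until the flag is raised
    std::atomic_thread_fence(std::memory_order_acquire);  // barrier: loads below it can't move above it
    return x;                                             // read strictly after the loop breaks
}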

The second question is about cache coherence, and again the answer to whether you need memory barriers is "yes, you need them". Cache coherence is an issue because modern CPUs don't talk to system memory directly, but through their caches. As long as you're dealing with only a single CPU core and a single cache, coherence is not an issue, since all the threads running on the same core work against the same cache. However, the moment you have multiple cores with independent caches, their individual views of the system memory contents may differ, and some form of memory consistency model is required: either through explicit insertion of memory barriers (the writer-side counterpart of the sketch above is shown below), or at the hardware level.
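That writer-side counterpart could look like this (again only a sketch, under the same assumptions as above; publish is a made-up name): a release fence keeps the payload store from sinking below the flag store.

#include <atomic>

extern std::atomic<int> f;   // the shared flag from the reader sketch above
extern int x;                // the shared payload

void publish() {
    x = 42;                                               // payload first
    std::atomic_thread_fence(std::memory_order_release);  // barrier: stores above it can't move below it
    f.store(1, std::memory_order_relaxed);                // only now raise the flag
}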

CodePudding user response:

From my point of view, you missed the most important thing!

Since the compiler does not see that the change of x or f has any side effect, it can optimize all of that away. The loop with the condition f == 0 will also result in "nothing": because the compiler only sees that you assigned the constant f = 0 beforehand, it can assume that f == 0 will always be true and optimize the loop away.

And for all of that, you have to tell the compiler that something will happen which is not visible from the given flow of code. That can be something like a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.

If you compile your code, I assume you get more or less "nothing", as neither of the two code parts has any visible effect, and the compiler does not see that the variables are used from two thread contexts, so it optimizes everything away.

If we implement the code as in the following example, we see that it fails and prints 0 on my system.

#include <iostream>
#include <thread>

int main()
{
    int f = 0;
    int x = 0;

    // writer thread: stores the payload, then raises the flag
    std::thread s( [&f,&x](){ x=42; f=1; } );

    // reader: busy-waits on a plain int - a data race, which is
    // exactly why this can hang or print 0
    while( f==0 );
    std::cout << x << std::endl;

    s.join();
}

and if we change int f = 0; to std::atomic<int> f{0}; we get the expected result.
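For completeness, here is a variant with the memory orders spelled out (plain std::atomic operations default to the stronger memory_order_seq_cst; release/acquire is the minimum this flag-and-payload pattern needs):

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<int> f{0};
    int x = 0;

    // writer: the release store keeps x = 42 from becoming visible after f = 1
    std::thread s( [&f,&x](){ x = 42; f.store(1, std::memory_order_release); } );

    // reader: the acquire load pairs with the release store above
    while( f.load(std::memory_order_acquire) == 0 );
    std::cout << x << std::endl;   // prints 42

    s.join();
}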
