Home > Software design >  Changing code from Sequntial Consistency to a less stringent ordering
Changing code from Sequntial Consistency to a less stringent ordering

Time:08-18

I came across this code for a simple implementation of a barrier (for code that can't use std::experimental::barrier in C 17 or std::barrier in C 20) in C Concurrency in Action book.

[Edit] A barrier is a synchronization mechanism where a group of threads (the number of threads is passed to the constructor of the barrier) can arrive at and wait (by calling wait method) or arrive at and drop off (by calling done_waiting). If all the threads in the group arrive at the barrier, then the barrier is reset and the threads can proceed with the next set of actions. If some threads in the group drops off, then the number of threads in the group is reduced accordingly for thenext round of synchronization with the barrier. [End of Edit]

Here is the code provided for a simple implementation of a barrier.

struct barrier
{
   std::atomic<unsigned> count;
   std::atomic<unsigned> spaces;
   std::atomic<unsigned> generation;
   barrier(unsigned count_):count(count_),spaces(count_),generation(0)
   {}
   void wait(){
      unsigned const gen=generation.load();
      if(!--spaces){
         spaces=count.load();
           generation;
      }else{
         while(generation.load()==gen){
            std::this_thread::yield();
         }
      }
   }
   void done_waiting(){
      --count;
      if(!--spaces){
         spaces=count.load();
           generation;
      }
   }
};

The author, Anthony Williams, mentions that he chose sequential consistency ordering to make it easier to reason about the code and said that relaxed ordering could be used to make the code more efficient. This is how I changed the code to employ relaxed ordering. Please help me understand if my code is right.

struct barrier
{
   std::atomic<unsigned> count;
   std::atomic<unsigned> spaces;
   std::atomic<unsigned> generation;
   barrier(unsigned count_):count(count_),spaces(count_),generation(0)
   {}
   void wait(){
      unsigned const gen=generation.load(std::memory_order_acquire);
      if(1 == spaces.fetch_sub(1, std::memory_order_relaxed)){
         spaces=count.load(std::memory_order_relaxed);
         generation.fetch_add(1, std::memory_order_release);
      }else{
         while(generation.load(std::memory_order_relaxed)==gen){
            std::this_thread::yield();
         }
      }
   }
   void done_waiting(){
      count.fetch_sub(1, std::memory_order_relaxed);
      if(1 == spaces.fetch_sub(1, std::memory_order_relaxed)){
         spaces=count.load(std::memory_order_relaxed);
         generation.fetch_add(1, std::memory_order_release);
      }
   }
};

The reasoning is this. The increment of generation is a release operation that synchronizes with the loading of generation in the wait call. This ensures that the load from count into spaces is visible to all the threads that call wait and read the new value of generation that was stored with release semantics.

All the operations on spaces thereafter are RMW operations which take part in a release sequence and hence can be relaxed operations. Is this reasoning correct or is this code wrong? Please help me understand. Thanks in advance.

[Edit]I tried using my barrier code like this.

void fun(barrier* b){
        std::cout << "In Thread " << std::this_thread::get_id() << std::endl;
        b->wait();
        std::cout << std::this_thread::get_id() << " First wait done" << std::endl;
        b->wait();
        std::cout << std::this_thread::get_id() << " Second wait done" << std::endl;
        b->done_waiting();
}

int main(){
        barrier b{2};
        std::thread t(fun, &b);
        fun(&b);
        std::cout << std::this_thread::get_id() << " " <<  b.get_count() << std::endl;
        t.join();
}

I also tried testing it with many more threads and in superficial runs, it seems to be doing the right thing. But I would still like to understand if my reasoning is correct or if I am missing something very obvious.[End of Edit]

CodePudding user response:

I believe there are at least two bugs.

Let's analyze wait() by itself. Consider code that does:

int x = 0;
barrier b(2);

void thrA() {
    x = 1;
    b.wait();
}

void thrB() {
    b.wait();
    std::cout << x << std::endl;
}

We wish to prove that x = 1 in thrA happens before the evaluation of x in thrB, so that the code would have no data race and be forced to print the value 1.

But I don't think we can. Let's suppose thrB reaches the barrier first, which is to say that it observes spaces.fetch_sub returning 2. So the loads and stores performed in each thread are sequenced as follows:

thrA:
x = 1;
gen=generation.load(std::memory_order_acquire);     // returns 0
spaces.fetch_sub(1, std::memory_order_relaxed);     // returns 1, stores 0
spaces=count.load(std::memory_order_relaxed);       // stores 2
generation.fetch_add(1, std::memory_order_release); // returns 0, stores 1

thrB:
gen=generation.load(std::memory_order_acquire);     // returns 0
spaces.fetch_sub(1, std::memory_order_relaxed);     // returns 2, stores 1
generation.load(std::memory_order_relaxed);         // returns 0
... many iterations
generation.load(std::memory_order_relaxed);         // returns 1
x;                                                  // returns ??

To have any hope, we have to get some operation A in thrA to synchronize with some operation B in thrB. This is only possible if B is an acquire operation that takes its value from a side effect in the release sequence headed by A. But there is only one acquire operation in thrB, namely the initial generation.load(std::memory_order_acquire). And it does not take its value (namely 0) from any of the operations in thrB, but from the initialization of generation that happened before either thread was started. This side effect is not part of any useful release sequence, certainly not of any release sequence headed by an operation which happens after x=1. So our attempt at a proof breaks down.

More informally, if we inspect the sequencing of thrB, we see that the evaluation of x could be reordered before any or all of the relaxed operations. The fact that it's only evaluated conditionally on generation.load(std::memory_order_relaxed) returning 1 doesn't help; we could have x loaded speculatively much earlier, and the value only used after generation.load(std::memory_order_relaxed) finally returns 1. So all we know is that x is evaluated sometime after generation.load(std::memory_order_acquire) returns 0, which gives us precisely no useful information at all about what thrA might or might not have done by then.

This particular issue could be fixed by upgrading the load of generation in the spin loop to acquire, or by placing an acquire fence after the loop exits but before wait() returns.


As for done_waiting, it seems problematic too. If we have

void thrA() {
    x = 1;
    b.done_waiting();
}

void thrB() {
    b.wait();
    std::cout << x;
}

then presumably we again want 1 to be printed, without data races. But suppose thrA reaches the barrier first. Then all that it does is

x = 1;
count.fetch_sub(1, std::memory_order_relaxed); // irrelevant
spaces.fetch_sub(1, std::memory_order_relaxed); // returns 2, stores 1

with no release stores at all, so it cannot synchronize with thrB.

Informally, there is no barrier to prevent the store x=1 from being delayed indefinitely, so there cannot be any guarantee that thrB will observe it.

It is late here, so for the moment I leave it as an exercise how to fix this.


By the way, ThreadSanitizer detects data races for both cases: https://godbolt.org/z/1MdbMbYob. I probably should have tried that first, but initially it wasn't so clear to me what to actually test.


I am not sure at this point if these are the only bugs, or if there are more.

CodePudding user response:

I am also reading that book: which is a pleasent suprise.

Well i ran your code and i got an ifinite loop as the result:

#include<atomic>
#include<thread>
#include<iostream>

struct barrier
{
   std::atomic<unsigned> count;
   std::atomic<unsigned> spaces;
   std::atomic<unsigned> generation;
   barrier(unsigned count_):count(count_),spaces(count_),generation(0)
   {}
   void wait(){
      unsigned const gen=generation.load(std::memory_order_acquire);
      if(1 == spaces.fetch_sub(1, std::memory_order_relaxed)){
          std::cout << "1 == spaces.fetch_sub(1, std::memory_order_relaxed\n";
         spaces=count.load(std::memory_order_relaxed);
         generation.fetch_add(1, std::memory_order_release);
      }else{
          std::cout << "1 == spaces.fetch_sub(1, std::memory_order_relaxed\n";
         while(generation.load(std::memory_order_relaxed)==gen){
             std::cout << "generation.load(std::memory_order_relaxed)==gen\n";
            std::this_thread::yield();
            std::cout << "std::this_thread::yield()\n";
         }
      }
   }
   void done_waiting(){
      count.fetch_sub(1, std::memory_order_relaxed);
      if(1 == spaces.fetch_sub(1, std::memory_order_relaxed)){
          std::cout << "1 == spaces.fetch_sub(1, std::memory_order_relaxed)\n";
         spaces=count.load(std::memory_order_relaxed);
         generation.fetch_add(1, std::memory_order_release);
      }
   }
};

int main() {

    barrier b(10);
    b.wait();

    return 0;

It comes up with an ifinite output:

1 == spaces.fetch_sub(1, std::memory_order_relaxed
generation.load(std::memory_order_relaxed)==gen
std::this_thread::yield()
generation.load(std::memory_order_relaxed)==gen
std::this_thread::yield()
generation.load(std::memory_order_relaxed)==gen
std::this_thread::yield()

But this code oviously gets Trucated in the end

But when i ran Anthony's code it resulted in an infinite loop:

#include<atomic>
#include<thread>
#include<iostream>

struct barrier
{
   std::atomic<unsigned> count;
   std::atomic<unsigned> spaces;
   std::atomic<unsigned> generation;
   barrier(unsigned count_):count(count_),spaces(count_),generation(0)
   {}
   void wait(){
      unsigned const gen=generation.load();
      std::cout << "a ";
      if(!--spaces){
          std::cout << "e ";
         spaces=count.load();
           generation;
      }else{
         while(generation.load()==gen){
             std::cout << "b ";
            std::this_thread::yield();
         }
      }
   }
   void done_waiting(){
       std::cout << "c ";
      --count;
      if(!--spaces){
          std::cout << "d ";
         spaces=count.load();
           generation;
      }
   }
};

int main() {

    barrier b(10);
    b.wait();
    
}

output:

a b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b  b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b b etc.
[Truncated]

Note: i used std::cout to help for the debugging purpose

So as you said you were tying to replace the code expample from the book with using the relaxed memory ordering as it could be more efficient:

So in answer to your question:

Yes, your code does work as intended.

  • Related