I am reading the second edition of C++ Concurrency in Action. The code below is from Listing 7.6. It implements pop() for a lock-free stack using hazard pointers.
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&
             !head.compare_exchange_strong(old_head, old_head->next));
    hp.store(nullptr);
    // ...
}
The book explains the role of the inner loop:
You have to do this in a while loop to ensure that the node hasn't been deleted between the reading of the old head pointer #1 and the setting of the hazard pointer #2. During this window no other thread knows you're accessing this particular node. Fortunately, if the old head node is going to be deleted, head itself must have changed, so you can check this and keep looping until you know that the head pointer still has the same value you set your hazard pointer to #4.
According to the implementation of pop, if another thread deletes the head node via pop between #1 and #2, then head will have been modified to point to the next node.

What confuses me is: can the modification of head by other threads be seen by the current thread in time? For example, if the new value of head has not yet been propagated to the current thread, then #1 and #3 will still read the same (old) value, causing the inner while loop to exit and in turn the outer while loop to access old_head->next, resulting in undefined behavior.
I've searched Stack Overflow for some answers. For example, this answer says:

The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
This comment says:

Every atomic variable has its own modification order that all threads can agree on, but that only serializes modifications, not reads. The coherency requirements involving reads just guarantee that if you've seen one value in the modification order, you can't see an earlier one.
But cppreference says:

Each memory_order_seq_cst operation B that loads from atomic variable M observes one of the following:
- the result of the last operation A that modified M, which appears before B in the single total order...

So what exactly should the answer to my question be?
Also, if a weaker ordering (like release-acquire or relaxed) is used here, would the above problem arise?
Here is the code of pop:
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&
             !head.compare_exchange_strong(old_head, old_head->next));
    hp.store(nullptr);   // Clear hazard pointer once you're finished
    std::shared_ptr<T> res;
    if (old_head) {
        res.swap(old_head->data);
        if (outstanding_hazard_pointers_for(old_head)) // Check for hazard pointers
            reclaim_later(old_head);                   // referencing a node before deleting it
        else
            delete old_head;
        delete_nodes_with_no_hazards();
    }
    return res;
}
pop() removes the node pointed to by head and frees it once no hazard pointer points to it. Modifying head is done with compare_exchange_strong.
CodePudding user response:
No, I don't think it's flawed.
To verify that the algorithm is correct, let me annotate a few more lines of code here. I also rewrote the hazard-pointer check (the outstanding_hazard_pointers_for line) to make it clear that it does a load from the other thread's hazard pointer.
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&                           // #5 deref of old_head (old_head->next)
             !head.compare_exchange_strong(old_head, old_head->next)); // #6 the compare_exchange
    hp.store(nullptr);                             // #7
    std::shared_ptr<T> res;
    if (old_head) {
        res.swap(old_head->data);
        if (other_thread_hp.load() == old_head)    // #8
            reclaim_later(old_head);
        else
            delete old_head;                       // #9
        delete_nodes_with_no_hazards();
    }
    return res;
}
Informal justification
Suppose thread A is trying to safely dereference old_head while thread B wants to delete it.
It is entirely possible that, for instance, the compare_exchange_strong in thread B at #6 (B6 for short) occurs between the loads A1 and A3 with respect to real time, but does not become visible to thread A until later.
However, we have sequential consistency. Every thread must observe the operations B6 and B8 in that order. So if B6 didn't become visible to A until after A3, then B8 cannot become visible to A until later still, and by that time, store A2 will have taken effect. In other words, load B8 must observe all stores that preceded A3, including in particular A2. So either B8 will return the non-null hazard pointer from A, in which case the delete is not done; or else it returns nullptr, which can only happen if store A7 has become visible, and by that time the dereference A5 has already completed.
Thus, if there can be a delay between when B6 is executed and when it becomes globally visible, then the implementation must arrange that all subsequent operations in thread B, including in particular B8, are stalled until after B6 does become visible. On present-day hardware, the typical reason for such a delay is that the store at B6 went into a store buffer. So to ensure sequential consistency, the compiler has to insert a barrier instruction between B6 and B8, which will wait for the store buffer to drain and commit to coherent L1 cache before continuing on.
Edit: To your question in comments about delaying non-atomic operations: things are complicated a bit because B6 is a read-modify-write, but for memory-ordering purposes you can think of it as consisting of a load (B6L) and a store (B6S), both of which are seq_cst. For purposes of ordering with respect to non-seq_cst operations, including all non-atomic operations, a seq_cst store is simply a release store, and a seq_cst load is merely an acquire load. So indeed, non-atomic operations that follow B6S do not have to "stall" and, unless otherwise restricted, may become visible before B6S does. (They cannot become visible before B6L, however, because it is acquire.)
But by the same token, B8 is acquire. This does require later operations, including non-atomic operations such as B9, to stall until after B8 has become visible. (Here B9 is on one side of a conditional branch which depends on the value loaded by B8, but without the acquire barrier, it could still start doing loads speculatively.) So indeed, the net result is that B9 must become visible only after B6S, because B9 must wait for B8 which must wait for B6S.
Formal proof of correctness
Remember that the C++ memory model has no concept of "in time" or "stale"; everything is defined in terms of abstract orderings. Here all the atomic operations are seq_cst by default, so there is a total order on all of them, with coherency rules ensuring that they observe each other in all the usual ways.
We'll write A1, B1, etc., for operation #1 as executed by threads A or B respectively. Also, let hpA, hpB be the hazard pointers belonging to the corresponding threads. Let H0 be the value of head going into the code here, which both threads initially load as their old_head, and H1, H2 the addresses of the following nodes.
We want to make sure that A5 happens before B9. If A5 is going to dereference pointer H0, it must be that loads A1, A3 both returned H0, which means that A2 stored H0 to hpA. (Or if A1 didn't, then both the second-to-last and last iterations of A3 must have loaded H0; either way the conclusion is that A2 stored H0.)
Likewise, if B6 is going to delete H0, it must be that B6 loaded H0 from head and stored H1. So if both these conditions hold, then A3 precedes B6 in the total order (or A3 < B6 for short); otherwise A3 would have loaded H1 instead. By transitivity, and the fact that the seq_cst total order is consistent with sequencing (program order), we have A2 < A3 < B6 < B8.
Now since it is a total order, either A7 < B8 or A7 > B8.
- If A7 < B8, then B8 observes nullptr stored by A7. Since A7 was a release store (implied by seq_cst), and B8 was an acquire load that observed the value stored by A7, we have that A7 happens before B8. By sequencing, A5 happens before A7, and B8 happens before B9, so A5 happens before B9. Thus B9 will delete H0, but A5 has already finished dereferencing it and there is no data race or use-after-free.
- If A7 > B8, then we have A3 < B8 < A7. So B8 must observe the store A2 (which stored H0 to hpA), and it must not observe store A7, which stored nullptr. So in this case B8 loads the value H0, which equals thread B's old_head, and the deletion of H0 is deferred by thread B. Thus the dereference at A5 was safe, because H0 is not being deleted at all.
Acquire-release ordering would not be good enough for this algorithm. Informally, acquire-release forbids LoadLoad, StoreStore, and LoadStore reordering, but still allows for StoreLoad reordering. As such, A2 could become visible after A3, which would be disastrous. Edit: And to your comment below, yes, B6S could also become visible after B8 and B9 (with B7 becoming visible later still).
Relaxed ordering would be even worse; for instance, A7 could become visible before A5 has completed.