I am reading the second edition of C++ Concurrency in Action. The code below is from Listing 7.6. It implements pop() for a lock-free stack using hazard pointers.
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&
             !head.compare_exchange_strong(old_head, old_head->next));
    hp.store(nullptr);
    // ...
}
The book explains the role of the inner loop:
You have to do this in a while loop to ensure that the node hasn't been deleted between the reading of the old head pointer #1 and the setting of the hazard pointer #2. During this window no other thread knows you're accessing this particular node. Fortunately, if the old head node is going to be deleted, head itself must have changed, so you can check this and keep looping until you know that the head pointer still has the same value you set your hazard pointer to #4.
According to the implementation of pop, if another thread deletes the head node via pop between #1 and #2, then head will have been modified to point to the next node.

What confuses me is: can the modification of head by other threads be seen by the current thread in time? For example, if the new value of head has not yet been propagated to the current thread, then #1 and #3 will still read the same (old) value, causing the inner while loop to exit and in turn the outer while loop to access old_head->next, resulting in undefined behavior.
I've searched Stack Overflow for some answers. For example, this answer says:

The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
This comment says:

Every atomic variable has its own modification order that all threads can agree on, but that only serializes modifications, not reads. The coherency requirements involving reads just guarantee that if you've seen one value in the modification order, you can't see an earlier one.
But cppreference says:

Each memory_order_seq_cst operation B that loads from atomic variable M observes one of the following:
- the result of the last operation A that modified M, which appears before B in the single total order...

So what exactly should the answer to my question be?
Also, if a weaker ordering (like release-acquire or relaxed) is used here, would the above problem arise?
Here is the code of pop:
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&
             !head.compare_exchange_strong(old_head, old_head->next));
    hp.store(nullptr);   // Clear hazard pointer once you're finished
    std::shared_ptr<T> res;
    if (old_head) {
        res.swap(old_head->data);
        if (outstanding_hazard_pointers_for(old_head)) // Check for hazard pointers
            reclaim_later(old_head);                   // referencing a node before deleting it
        else
            delete old_head;
        delete_nodes_with_no_hazards();
    }
    return res;
}
pop() removes the node pointed to by head and frees it once no hazard pointer points to it. Modifying head is done with compare_exchange_strong.
CodePudding user response:
No, I don't think it's flawed.
To verify that the algorithm is correct, let me annotate a few more lines of code here. I also rewrote the hazard-pointer check (the outstanding_hazard_pointers_for line) to make it clear that it does a load from the other thread's hazard pointer.
std::shared_ptr<T> pop() {
    std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
    node* old_head = head.load();                  // #1
    do {
        node* temp;
        do {
            temp = old_head;
            hp.store(old_head);                    // #2
            old_head = head.load();                // #3
        } while (old_head != temp);                // #4
    } while (old_head &&                           // #5 deref of old_head (old_head->next)
             !head.compare_exchange_strong(old_head, old_head->next)); // #6 the compare_exchange
    hp.store(nullptr);                             // #7
    std::shared_ptr<T> res;
    if (old_head) {
        res.swap(old_head->data);
        if (other_thread_hp.load() == old_head)    // #8
            reclaim_later(old_head);
        else
            delete old_head;                       // #9
        delete_nodes_with_no_hazards();
    }
    return res;
}
Informal justification
Suppose thread A is trying to safely dereference old_head while thread B wants to delete it.
It is entirely possible that, for instance, the compare_exchange_strong in thread B at #6 (B6 for short) occurs between the loads A1 and A3 with respect to real time, but does not become visible to thread A until later.
However, we have sequential consistency. Every thread must observe the operations B6 and B8 in that order. So if B6 didn't become visible to A until after A3, then B8 cannot become visible to A until later still, and by that time, store A2 will have taken effect. In other words, load B8 must observe all stores that preceded A3, including in particular A2. So either B8 will return the non-null hazard pointer from A, in which case the delete is not done; or else it returns nullptr, which can only happen if store A7 has become visible, and by that time the dereference A5 has already completed.
Thus, if there can be a delay between when B6 is executed and when it becomes globally visible, then the implementation must arrange that all subsequent operations in thread B, including in particular B8, are stalled until after B6 does become visible. On present-day hardware, the typical reason for such a delay is that the store at B6 went into a store buffer. So to ensure sequential consistency, the compiler has to insert a barrier instruction between B6 and B8, which will wait for the store buffer to drain and commit to coherent L1 cache before continuing on.
Edit: To your question in comments about delaying non-atomic operations: things are complicated a bit because B6 is a read-modify-write, but for memory-ordering purposes you can think of it as consisting of a load (B6L) and a store (B6S), both of which are seq_cst. For purposes of ordering with respect to non-seq_cst operations, including all non-atomic operations, a seq_cst store is simply a release store, and a seq_cst load is merely an acquire load. So indeed, non-atomic operations that follow B6S do not have to "stall" and, unless otherwise restricted, may become visible before B6S does. (They cannot become visible before B6L, however, because it is acquire.)
But by the same token, B8 is acquire. This does require later operations, including non-atomic operations such as B9, to stall until after B8 has become visible. (Here B9 is on one side of a conditional branch which depends on the value loaded by B8, but without the acquire barrier, it could still start doing loads speculatively.) So indeed, the net result is that B9 must become visible only after B6S, because B9 must wait for B8 which must wait for B6S.
Formal proof of correctness
Remember that the C++ memory model has no concept of "in time" or "stale"; everything is defined in terms of abstract orderings. Here all the atomic operations are seq_cst by default, so there is a total order on all of them, with coherency rules ensuring that they observe each other in all the usual ways.
We'll write A1, B1, etc., for operation #1 as executed by threads A or B respectively. Also, let hpA, hpB be the hazard pointers belonging to the corresponding threads. Let H0 be the value of head going into the code here, which both threads initially load as their old_head, and H1, H2 the addresses of the following nodes.
We want to make sure that A5 happens before B9. If A5 is going to dereference pointer H0, it must be that loads A1, A3 both returned H0, which means that A2 stored H0 to hpA. (Or if A1 didn't, then both the second-to-last and last iterations of A3 must have loaded H0; either way the conclusion is that A2 stored H0.)
Likewise, if B6 is going to delete H0, it must be that B6 loaded H0 from head and stored H1. So if both these conditions hold, then A3 precedes B6 in the total order (or A3 < B6 for short); otherwise A3 would have loaded H1 instead. By transitivity, and the fact that the seq_cst total order is consistent with sequencing (program order), we have A2 < A3 < B6 < B8.
Now since it is a total order, either A7 < B8 or A7 > B8.
- If A7 < B8, then B8 observes nullptr stored by A7. Since A7 was a release store (implied by seq_cst), and B8 was an acquire load that observed the value stored by A7, we have that A7 happens before B8. By sequencing, A5 happens before A7, and B8 happens before B9, so A5 happens before B9. Thus B9 will delete H0, but A5 has already finished dereferencing it and there is no data race or use-after-free.
- If A7 > B8, then we have A3 < B8 < A7. So B8 must observe the store A2 (which stored H0 to hpA), and it must not observe store A7, which stored nullptr. So in this case B8 loads the value H0, which equals thread B's old_head, and the deletion of H0 is deferred by thread B. Thus the dereference at A5 was safe, because H0 is not being deleted at all.
Acquire-release ordering would not be good enough for this algorithm. Informally, acquire-release forbids LoadLoad, StoreStore, and LoadStore reordering, but still allows for StoreLoad reordering. As such, A2 could become visible after A3, which would be disastrous. Edit: And to your comment below, yes, B6S could also become visible after B8 and B9 (with B7 becoming visible later still).
Relaxed ordering would be even worse; for instance, A7 could become visible before A5 has completed.