What happens to outstanding stores after an object is deleted?


Consider the following simple function (assume most compiler optimizations are turned off) being executed by two threads on different cores of an x86 CPU with a store buffer:

#include <iostream>

struct ABC
{
  int x;
  // other members.
};

void dummy(int index)
{
  while (true)
  {
    auto abc = new ABC;
    abc->x = index;
    std::cout << abc->x;  // print this thread's index
    // do some other things.
    delete abc;
  }
}

Here, index is the index of the thread: thread1 passes 1 and thread2 passes 2. So thread1 should always print 1, and thread2 should always print 2.

Can there be a situation where the store to x is put in the store buffer and is committed after delete is executed? Or is there an implicit memory barrier that ensures the store is committed before delete? Or is it that any outstanding stores are just discarded once delete is encountered?

Situation where this becomes important:

Since delete returns the object's memory to the free list (with libc's allocator), it is possible that a piece of memory that was just freed in thread1 is returned by the new operator in thread2 (not just the same virtual address; even the underlying physical address can be the same). If outstanding stores can commit after delete, it is possible that after thread2 sets abc->x to 2, some older outstanding store from thread1 overwrites it with 1.
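
As an illustrative sketch of that reuse (whether the allocator actually hands the same block back is implementation-dependent, so this may print 0 or 1):

#include <cstdint>
#include <iostream>

struct ABC { int x; };

int main()
{
  ABC* first = new ABC;
  // Record the address as an integer before the pointer is invalidated.
  std::uintptr_t old_addr = reinterpret_cast<std::uintptr_t>(first);
  delete first;            // block goes back to the allocator's free list

  ABC* second = new ABC;   // often gets the very same block straight back
  std::cout << "same address reused: "
            << (reinterpret_cast<std::uintptr_t>(second) == old_addr)
            << '\n';
  delete second;
}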

This means that in the above program, thread2 could print 1, which is absolutely wrong. Thread1 and thread2 are completely independent; there is no data sharing between the threads from the programmer's point of view, so they should not have to worry about any synchronization.

What am I missing here?

CodePudding user response:

The CPU has to preserve the illusion of instructions executing one at a time, in program order, for a single thread. This is the cardinal rule of out-of-order (OoO) execution. That means tracking what program order was, making sure loads always see values consistent with it, and making sure values eventually written to cache are consistent as well.

Loads by this core snoop the store buffer, forwarding data from it if the load is reloading a store that hasn't committed yet.

And for any individual memory location, the CPU makes sure its modification order matches program order, i.e. it doesn't reorder stores to the same location, so the final value after the dust settles is the last one in program order. Observation by other threads will also see a consistent modification order for that location; that's why std::atomic is able to guarantee that a modification order exists for every object separately: no extra changes to A then B then back to A if program order stored B then A. ISO C++ can guarantee this because all real-world CPUs also guarantee it.
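
A minimal sketch of that per-object guarantee (variable names are illustrative): even with relaxed atomics, once a reader has seen the later value in the modification order, it can never subsequently see the earlier one.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> loc{0};

int main()
{
  std::thread writer([] {
    loc.store(1, std::memory_order_relaxed);  // earlier in program order
    loc.store(2, std::memory_order_relaxed);  // later in program order
  });
  std::thread reader([] {
    int first  = loc.load(std::memory_order_relaxed);
    int second = loc.load(std::memory_order_relaxed);
    // Modification-order coherence: seeing 2 and then 1 is impossible.
    assert(!(first == 2 && second == 1));
  });
  writer.join();
  reader.join();
}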

A system call like munmap is a special case, but otherwise new/delete (and malloc/free) aren't special as far as the CPU is concerned: putting a block on the free list and having other code allocate it is just another case of messing around with pointer-based data structures. As always, the CPU tracks any reordering it's doing to make sure loads see correct values.

If the memory is placed on a global free list that could be reused by another thread, a thread-safe allocator will have used sufficient synchronization to create a C++ happens-before relationship between the code that previously used and then deleted the memory, and the code in another thread that has just allocated it.
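
As a sketch of what that synchronization looks like (this is a toy Treiber-style free list, not any real allocator's code, and it ignores the ABA problem for brevity): the release on the free pairs with the acquire on the allocate.

#include <atomic>

struct Block { Block* next; /* user data follows */ };

std::atomic<Block*> free_list{nullptr};

// "deallocate": the release store means every earlier store into the
// block happens-before any thread that later pops it with acquire.
void push_free(Block* b)
{
  b->next = free_list.load(std::memory_order_relaxed);
  while (!free_list.compare_exchange_weak(b->next, b,
                                          std::memory_order_release,
                                          std::memory_order_relaxed))
  { /* retry; b->next was reloaded with the current head */ }
}

// "allocate": the acquire load pairs with the release above.
Block* pop_free()
{
  Block* head = free_list.load(std::memory_order_acquire);
  while (head &&
         !free_list.compare_exchange_weak(head, head->next,
                                          std::memory_order_acquire,
                                          std::memory_order_acquire))
  { /* retry; head was reloaded with the current head */ }
  return head;  // nullptr if the list was empty
}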

So any old-thread stores into this memory block will already be visible to the thread that has just allocated the memory, and they won't step on its stores. If the new thread passes a pointer to this memory to a third thread, it had better use acquire/release (or consume/release) synchronization itself to make sure that third thread sees its stores, not stale stores from the first thread.
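
A minimal sketch of that handoff (the mailbox variable here is illustrative): the second thread publishes the pointer with a release store; the third consumes it with an acquire load.

#include <atomic>

struct ABC { int x; };

std::atomic<ABC*> mailbox{nullptr};

void thread2_publish(ABC* abc)
{
  abc->x = 2;                                     // write the payload first
  mailbox.store(abc, std::memory_order_release);  // then publish the pointer
}

void thread3_consume()
{
  ABC* abc;
  while (!(abc = mailbox.load(std::memory_order_acquire)))
  { /* spin until published */ }
  // The acquire pairs with the release: abc->x is guaranteed to read 2,
  // never a stale value left over from the block's previous owner.
  int v = abc->x;
  (void)v;
}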

It doesn't work like reordering statements in the C++ source; that would violate the as-if rule. The C++ as-if rule is the same idea that CPUs follow as they execute, just that the ISA's rules are what govern the requirements on external observables. No ISA has memory-ordering rules as weak as ISO C++'s; e.g. they all guarantee a coherent shared cache, and many CPU ISAs don't have UB. (Although some do, e.g. calling it "unpredictable" behaviour.)


If the free involves a munmap that uses a syscall instruction to run kernel code that changes page tables (to invalidate a mapping so loads/stores to it would fault), that itself will provide sufficient serialization. Existing CPUs don't rename the privilege level, so they don't do out-of-order exec into the kernel through a syscall instruction.

It's up to the OS to do sufficient memory-barriering around modifying page tables, although on x86-64 invlpg is already a serializing instruction. (In x86 terminology, that means draining the ROB and store buffer, so all previous instructions are fully done executing, with their results written back to L1d cache in the case of stores.) So there's no possibility of it reordering with earlier loads/stores that depend on that TLB entry, even apart from the switch to kernel mode.

(Switching into kernel mode doesn't necessarily drain the store buffer, though; the physical addresses of those stores are already known, because the TLB checks were done as the store-address uops executed. So changes to the page tables don't affect the process of committing those stores to memory.)


Fun fact: on strongly-ordered x86 specifically, store and load ordering have to be more closely tied together than on most ISAs; Intel calls the combination of store buffer + load buffer the Memory Order Buffer, because it also has to detect cases where a load took a value early, before it was architecturally allowed to (LoadLoad ordering), but then this core lost access to the cache line. Or cases of mis-speculation about store-forwarding, e.g. dynamically predicting that a load would be reloading a store from a not-yet-known address, but then it turns out the store was non-overlapping. In either case, the CPU rewinds the out-of-order back-end to a consistent retirement state. (This is called a pipeline nuke; this specific cause is counted by the machine_clears.memory_ordering perf event.)

CodePudding user response:

According to C++20 ([new.delete.dataraces]/p1) we have the following guarantee:

Calls to these functions that allocate or deallocate a particular unit of storage shall occur in a single total order, and each such deallocation call shall happen before (6.9.2) the next allocation (if any) in this order.

Since each deallocation happens before the next allocation of the same storage, whatever is sequenced before the delete also happens before whatever is sequenced after the matching new. Applied to your example:

abc->x = index; is sequenced before delete abc;, which happens before the other thread's auto abc = new ABC;, so transitively abc->x = index; happens before auto abc = new ABC;. That guarantees the store abc->x = index; is complete before the other thread's auto abc = new ABC; returns that storage.
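
Restating the question's loop with that chain annotated (the labels are illustrative):

#include <iostream>

struct ABC { int x; };

void dummy(int index)  // thread1 passes 1, thread2 passes 2
{
  while (true)
  {
    auto abc = new ABC;  // (c) happens after the other thread's delete (b)
                         //     of this storage, per [new.delete.dataraces]
    abc->x = index;      // (d) sequenced after (c); transitively the other
                         //     thread's store (a) happens before this one
    std::cout << abc->x; // always prints this thread's own index
    delete abc;          // (b) sequenced after this thread's store (a)
  }
}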

CodePudding user response:

Stores from the "store buffer" becoming globally visible (committing to coherent cache) after the delete statement in Thread 1 is not by itself a problem.

An error would require that thread 2 writes to the same location and that the subsequent read operation in thread 2 somehow obtains the value stored by thread 1.

That would require the read operation to ignore the "store buffer" and take the value directly from RAM. I am not an expert in x86 memory control, but clearly such a design would not work. A read or write operation for a given memory location has to wait for earlier write operations to that location to complete, or consult the "store buffer" in the same way it would consult a memory cache. It is the responsibility of the CPU design to ensure that memory operations are sequenced in this way.

This is not in scope of the C++ standard, in the same way that the operation of a RAM cache is not a concern of the standard. It is simply assumed that the CPU ensures that the value read from a memory location is the last value written to that location.
