Can some load instructions never be globally visible due to store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from the cache.
Since it is generally stated that a load is globally visible when it reads from the L1D cache, loads that do not read from L1D should be globally invisible.
CodePudding user response:
The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.
But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)
At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent, consistent state of cache/memory. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with the program order of each thread (i.e. the total order is some interleaving of program order from each thread). SPARC TSO is also this strongly ordered.
(Correctly observing the global order of your own stores relative to other stores requires mfence or similar: otherwise store-forwarding means you can see your own stores right away, before they become visible to other cores. x86 TSO is basically program order plus store-forwarding.)
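To make that concrete, here's a minimal C++ sketch of the classic StoreLoad litmus test (the variable names and single-trial harness are mine):

    // Without a full barrier, each thread can read its own store from the
    // store buffer before the other thread's store is visible, so
    // r0 == 0 && r1 == 0 is a possible outcome on x86.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r0, r1;

    void t0() {
        x.store(1, std::memory_order_relaxed);
        // std::atomic_thread_fence(std::memory_order_seq_cst); // mfence (or a locked op) on x86
        r0 = y.load(std::memory_order_relaxed);
    }

    void t1() {
        y.store(1, std::memory_order_relaxed);
        // std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t0), b(t1);
        a.join(); b.join();
        std::printf("r0=%d r1=%d\n", r0, r1); // with the fences uncommented, 0,0 is forbidden
    }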
(For cache-bypassing stores, global visibility is when they're flushed from private write-combining buffers into DRAM. Intel's Line Fill Buffers, or any equivalent private write-combining mechanism where store data is still not visible to other CPUs, are effectively part of the store buffer for our reordering purposes.)
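For example (my own sketch using the standard SSE intrinsics), publishing data written with a non-temporal store requires sfence to drain the WC buffers first:

    // An NT store sits in a private write-combining buffer and is only
    // guaranteed visible to other cores after sfence drains those buffers.
    #include <immintrin.h>
    #include <atomic>

    int payload;
    std::atomic<int> ready{0};

    void producer() {
        _mm_stream_si32(&payload, 42);              // NT store: goes to a WC buffer, bypassing cache
        _mm_sfence();                               // flush WC buffers: now the data is globally visible
        ready.store(1, std::memory_order_release);  // safe to publish
    }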
On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone having it not be the same as program order.
The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). But the mechanism in practice on POWER is that (retired aka graduated) stores become visible to some other cores before they become globally visible by committing to L1d cache. Cache itself really is coherent even in POWER systems, like in all normal CPUs, and allows sequential consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) providing a way to see stores from other logical cores without going through cache.
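That's exactly the IRIW (Independent Reads of Independent Writes) litmus test; a C++ sketch of it (my framing of the standard example):

    // Writers C and D store to x and y; readers A and B each read both, in
    // opposite orders. Under acquire/release, a1==1,a2==0,b1==1,b2==0 is a
    // permitted outcome: the two readers disagree on the order of the two
    // independent stores. seq_cst on every access forbids it.
    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int a1, a2, b1, b2;

    void writer_c() { x.store(1, std::memory_order_release); }
    void writer_d() { y.store(1, std::memory_order_release); }

    void reader_a() {
        a1 = x.load(std::memory_order_acquire);
        a2 = y.load(std::memory_order_acquire);
    }
    void reader_b() {
        b1 = y.load(std::memory_order_acquire);
        b2 = x.load(std::memory_order_acquire);
    }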
(One possible mechanism is letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, only keeping not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)
Weakly-ordered ISAs that aren't as weak as POWER (in practice and/or on paper) still do reordering in the local store buffer of each core if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but committing in program order would evict the line from a set-associative cache between each store; e.g. a nasty histogram access pattern).
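For example, the message-passing pattern is what release-stores exist for on such ISAs (a minimal sketch):

    // Without release/acquire, a weakly-ordered CPU may commit the flag
    // store before the data store, so the reader can see flag==1 with stale
    // data. release/acquire restores the required StoreStore/LoadLoad order.
    #include <atomic>

    int data;
    std::atomic<int> flag{0};

    void producer() {
        data = 42;                                 // plain store
        flag.store(1, std::memory_order_release);  // orders the data store before it
    }

    void consumer() {
        if (flag.load(std::memory_order_acquire))  // orders the data load after it
            (void)data;                            // guaranteed to read 42
    }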
Let's do a thought experiment about where load data comes from
The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?
If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.
It turns out that no, we can't; the store buffer breaks this: partial store-to-load forwarding gives us a counter-example (on x86, for example). A narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have the real experiments to prove it.
If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.
(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)
Partial store-forwarding exposes the fact that load data doesn't always come from the global coherent cache domain.
If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.
See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.
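A sketch of that invalid scheme (my rendering; mixed-size access like this is outside the C++ memory model and is shown only to illustrate the hardware behaviour, assuming little-endian x86):

    // DON'T use this. Each thread stores to its own byte, then reloads the
    // containing 16-bit word hoping to see the other thread's byte.
    #include <cstdint>

    volatile uint16_t pair = 0; // two "lock" bytes side by side

    bool broken_try_lock(int my_byte_index) { // 0 or 1
        volatile uint8_t *bytes = reinterpret_cast<volatile uint8_t *>(&pair);
        bytes[my_byte_index] = 1;   // narrow store
        uint16_t both = pair;       // wider reload: my byte is forwarded from the
                                    // store buffer, but the OTHER byte comes from
                                    // L1d *before* my store is globally visible
        // "Other byte was 0, so I got the lock" -- but both threads can
        // observe this simultaneously, so both acquire the lock.
        return both == (1u << (8 * my_byte_index));
    }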
Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)
Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 lock or) to make sure you don't just see your own stores when you reload.
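A hedged sketch of that pattern (the function name is mine, and the retry logic is simplified relative to the full protocol discussed there):

    // Set bits with cheap non-atomic ORs, then use a StoreLoad barrier before
    // re-checking, so the reload sees other threads' stores rather than just
    // forwarding our own store back from the store buffer.
    #include <atomic>
    #include <cstdint>

    void set_bit(std::atomic<uint64_t> *word, uint64_t mask) {
        // Non-atomic read-modify-write: may clobber a concurrent thread's bit.
        word->store(word->load(std::memory_order_relaxed) | mask,
                    std::memory_order_relaxed);

        // StoreLoad barrier: mfence or a lock-prefixed op on x86.
        std::atomic_thread_fence(std::memory_order_seq_cst);

        // Recheck: if our bit was lost to a conflicting store, redo it atomically.
        if (!(word->load(std::memory_order_relaxed) & mask))
            word->fetch_or(mask, std::memory_order_relaxed);
    }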
Proposed definition: A load becomes globally visible when it reads its data. Normally from L1d, but the store buffer or MMIO or uncacheable memory are other possible sources.
This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.
The load itself can become globally visible independently of whether any other thread could ever load that value from that address.
Although perhaps it would make more sense not to talk about "global visibility" of cacheable loads at all, because they're pulling data from somewhere, not doing anything with a visible effect. Only uncacheable loads (e.g. from an MMIO region) should be considered visible side-effects.
(On x86, uncacheable stores and loads are very strongly ordered, so store-forwarding to an uncacheable load is, I think, impossible. Unless maybe the store was done via a WB mapping of the same physical page as the UC load is accessing.)
CodePudding user response:
Let me expand the question a little bit and discuss the correctness aspect of implementing store-load forwarding. (The second half of Peter's answer directly answers the question I think).
Store-load forwarding changes the latency of the load, not its visibility. Unless it got flushed due to some misspeculation, the store is eventually going to become globally visible anyway. Without store-load forwarding, the load has to wait for all conflicting stores to retire. Then the load can fetch the data normally.
(The exact definition of a conflicting store depends on the memory ordering model of the ISA. In x86, assuming the WB memory type, which allows store-load forwarding, any store that is earlier in program order and whose target physical memory location overlaps that of the load is a conflicting store).
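As a concrete illustration (my own example): on typical x86 implementations, a load fully contained within an older store can be forwarded, while other overlap cases conflict without being forwardable in one piece:

    #include <cstdint>

    void cases(uint64_t *p) {
        *p = 0x1122334455667788;  // 8-byte store, sits in the store buffer

        // Narrow load fully contained in the store: forwards from the store buffer.
        uint8_t low = *reinterpret_cast<uint8_t *>(p);

        // A load from a non-overlapping address doesn't conflict at all.
        // A load that is wider than, or only partially overlaps, the store
        // conflicts but typically can't be forwarded in one piece: it waits
        // for the store to commit (or merges buffer data with L1d, as above).
        (void)low;
    }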
Although if there is any concurrent conflicting store from another agent in the system, that might actually change the value loaded, because the foreign store may take effect after the local store but before the local load. Typically, the store buffer is not in the coherence domain, and so store-load forwarding may reduce the probability of something like that happening. This depends on the limitations of the store-load forwarding implementation; there are usually no guarantees that forwarding will happen for any particular load and store operations.
Store-load forwarding may also result in global memory orders that would have not been possible without it. For example, in the strong model of x86, store-load reordering is allowed and together with store-load forwarding may allow each agent in the system to view all memory operations in different orders.
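The canonical example is the Intel SDM's intra-processor-forwarding litmus test (my C++ framing of it):

    // Each agent stores, reloads its own store (forwarded), then reads the
    // other variable. On x86, r1==1, r2==0, r3==1, r4==0 is allowed: each
    // agent saw its own store (via forwarding) before the other agent's, so
    // the two agents effectively observed the two stores in opposite orders.
    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void agent0() {
        x.store(1, std::memory_order_relaxed);
        r1 = x.load(std::memory_order_relaxed); // forwarded from the store buffer
        r2 = y.load(std::memory_order_relaxed);
    }

    void agent1() {
        y.store(1, std::memory_order_relaxed);
        r3 = y.load(std::memory_order_relaxed); // forwarded from the store buffer
        r4 = x.load(std::memory_order_relaxed);
    }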
In general, consider a shared memory system with exactly two agents. Let S1(A, B) be the set of possible global memory orders for the sequences A and B with store-load forwarding and let S2(A, B) be the set of possible global memory orders for the sequences A and B without store-load forwarding. Both S1(A, B) and S2(A, B) are subsets of the set of all legal global memory orders S3(A, B). Store-load forwarding can make S1(A, B) not be a subset of S2(A, B). This means that if S2(A, B) = S3(A, B), then store-load forwarding would be an illegal optimization.
Store-load forwarding may change the probability of each global memory order to occur because it reduces the latency of the load.
CodePudding user response:
A load is dispatched from the RS (Reservation Station) and goes through the AGU (Address Generation Unit) to the load buffer entry that was allocated for the corresponding ROB (Reorder Buffer) entry at the allocate stage. When the load buffer entry was allocated, it was coloured with the most recent SBID (store buffer ID) at that time; coloured means the entry number (aka ID) of the most recent store in the store buffer is inserted into the load buffer entry. The store buffer comprises the SAB (Store Address Buffer) and SDB (Store Data Buffer); each store has an entry in both (because each store is 2 uops, usually microfused), and they both have the same index (entry number, aka SBID).
I think once the address is valid, the valid bit in the entry then gets set, meaning the load is ready to dispatch (and it is cleared when the data is eventually written back to the ROB).
There is also a speculative memory disambiguation predictor, which may be involved in the setting of the valid bit: it predicts whether the load will alias any store between the SBID that it is coloured with and the tail-pointer store in the store buffer (store address in the SAB and the data in the SDB). If the load is predicted to alias, or actually does alias, it performs speculative store-to-load forwarding. To do so, it searches the store buffer for a matching address, uses the bitmask in the SAB to determine whether the entry can satisfy it (the bitmask states the privilege level of the bytes: supervisor / non-supervisor), and uses the size implied by the opcode to get the range of addresses being stored to by the store operation. If the load can be satisfied, it reads from the SDB entry, the data is inserted into the load buffer, and the load is completed in the LB (Load Buffer), but it does not retire from the LB. Store-to-load forwarding ensures that reads can never be reordered with older writes to the same location, because the read will always use store-to-load forwarding. I think all store addresses before an LFENCE's SBID need to be calculated before a prediction can be made for a load after the LFENCE.
If it is not predicted to alias, the load is dispatched (loads are always dispatched in strict order with respect to other loads, unless the load has a non-temporal hint or is to USWC (Uncacheable Speculative Write Combining) memory, although, unlike stores, it doesn't know whether or not it is USWC at this stage). The load goes to the dTLB (data TLB) / L1d (L1 data cache) in parallel.
At any time, when a store address completes in the SAB with an SBID less than or equal (taking wrap-around into account) to the coloured SBID of the load in question, it can invalidate the memory disambiguation prediction that was made, and the pipeline is flushed, because the pipeline is now either using stale data stored before the store with which it should have performed store-to-load forwarding, or it is using false store-to-load forwarding data from a store that it actually had no dependency with.
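To summarise the mechanism so far in software terms, here is a rough, heavily simplified C++ model of the forwarding decision (all types and field names are invented; real hardware checks every SAB entry in parallel with CAM logic, not a loop):

    #include <cstdint>
    #include <optional>

    struct SabEntry {
        bool     addr_valid; // store address has been computed
        uint64_t addr;       // target physical address
        unsigned size;       // bytes stored
    };

    struct Load { uint64_t addr; unsigned size; unsigned colour_sbid; };

    // Returns the SBID to forward from, or nullopt if the load reads L1d.
    // sbid_mask models the circular store buffer's wrap-around.
    std::optional<unsigned> forwarding_source(const Load &ld, const SabEntry *sab,
                                              unsigned tail_sbid, unsigned sbid_mask) {
        // Walk from the youngest older store (the load's colour) back to the tail.
        for (unsigned sbid = ld.colour_sbid; ; sbid = (sbid - 1) & sbid_mask) {
            const SabEntry &st = sab[sbid];
            if (st.addr_valid &&
                st.addr <= ld.addr && ld.addr + ld.size <= st.addr + st.size)
                return sbid; // full overlap: forward from the matching SDB entry
            // Partial overlap would stall the load until the store commits; an
            // overlapping store whose address isn't computed yet is what the
            // disambiguation predictor guesses about (mispredict => flush).
            if (sbid == tail_sbid) break;
        }
        return std::nullopt; // no conflicting older store: read from dTLB/L1d
    }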
When the data is loaded into the designated physical destination register, the data becomes valid in the ROB. When the data in the ROB is valid and a retirement pointer is pointing to the entry, the load is no longer speculative and acquires a senior bit. The load can then retire from (be removed from) the LB if a bit is set that indicates all stores between the SAB tail pointer and the coloured SBID have had their addresses calculated. Unless it is a senior load instruction, in which case it can now execute, since it is senior and has retired from the ROB.
LFENCE is dispatched to the load buffer and only executes (is sent to the L1d cache) when all previous uops have retired from the ROB and when all previous load instructions have retired from the LB. (According to the instruction-stream serialising properties it is claimed to have, it is probably retired in a cycle on its own rather than in the same cycle as 1 or 2 other instructions before it in the ROB.) Load instructions retire when the ROB tells them they can retire (they are no longer speculative), the data fetched is valid, and the load is no longer memory-speculative.

LFENCE dispatches when it is at the tail of the load buffer and ROB. It cannot retire until all read buffers are globally visible. I think this means that it makes sure that any senior load instructions (instructions that execute after retirement from the ROB, when they get marked as senior), such as PREFETCH, have allocated read buffers. Regular loads allocate read buffers and read their data, which becomes valid in the load buffer, before they can be retired.

Globally visible in this case means that all previous read LFBs (Line Fill Buffers) have received globally-visible notifications from the ring for the line (which could come before the read response containing the data, or could be packaged into the read response, which may mean that it has to wait for all reads to complete as opposed to merely be acknowledged). Of course, instructions that have retired from the MOB (Memory Order Buffer) are already globally visible, as their data has returned; but senior load instructions might not have allocated read buffers yet, or had them acknowledged as globally visible. This is similar to the definition of globally visible stores, where in response to an RFO (Read For Ownership), the global observation for the LFB likely comes in the notification that the core has permission (exclusive access) for the line and other cores have been invalidated, which will come before the actual data in the line to write is returned to the core, assuming that this will always be written back before responding to a snoop where it loses permission on the line.

When LFENCE dispatches, the L1d cache treats it as a nop and it completes, retires in the ROB, and becomes senior, i.e. is removed from the LB, and uops after it in the load buffer that were prevented from dispatching to the L1d cache are now allowed to be dispatched.
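One practical consequence of those dispatch rules is LFENCE's use as a cheap serialising point for timing, e.g. (a sketch using the standard _mm_lfence / __rdtsc intrinsics):

    // LFENCE doesn't dispatch until prior loads have completed and retired,
    // and later instructions can't begin until it completes, so rdtsc can't
    // be reordered around the load being measured.
    #include <x86intrin.h>
    #include <cstdint>

    uint64_t timed_load(const int *p, int &out) {
        _mm_lfence();             // everything earlier has finished
        uint64_t t0 = __rdtsc();
        _mm_lfence();             // rdtsc completes before the load starts
        out = *p;                 // the load being timed
        _mm_lfence();             // the load completes before the second rdtsc
        uint64_t t1 = __rdtsc();
        return t1 - t0;
    }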
Global visibility of loads does affect the cache coherence state of other cores, which I think is why LFENCE requires loads to be globally visible. A load miss in the core goes to the LLC (Last Level Cache), which has a snoop filter showing whether any other core owns the line. If one or more cores own the line, then it needs to downgrade the owning core to an S state and cause it to write back any modified data. The data written to the LLC can then be returned to the requesting core with an S state and a globally-visible notification. If a load miss in the core instead misses in the LLC, the LLC might send a globally-visible notification immediately while sending the request to the home agent to fetch it from memory (or, if it is a multi-socket system, the LLC has to wait for acknowledgement from the home agent that it does not need to snoop other cores before it can send the globally-observable notification to the core).
I think a senior load is a load that is no longer speculative and is waiting for the data to be returned and become valid (or it is already valid, so it retires instantly), whereas a senior load instruction is an instruction that dispatches after it has been retired from the ROB.