How is a critical path formed when there is a data dependency between loop iterations?


In the book Computer Systems: A Programmer's Perspective (CS:APP) there is the following assembly code for a loop:

.L3:
 1. movq %rax, (%rsi)
 2. movq (%rdi), %rax
 3. addq $1, %rax
 4. subq $1, %rdx
 5. jne  .L3

The abstract CPU used for modeling the problems has the following functional units:

Operation   Latency   Issue   Capacity
add         1         1       4
load        4         1       2

Assuming that %rsi and %rdi point to different memory locations, the author says that the critical path is formed by the subq instruction. Indeed, there is a data dependency, since to decrement %rdx on every iteration we need its value from the previous iteration.

But isn't there a data dependency between lines 2-3 and line 1 as well? If so, why doesn't it form a critical path? My thoughts toward an answer are the following:

  • The CPU can execute lines 2 and 3 well ahead of time for future iterations, using a technique called register renaming.
  • Thus, when line 1 is eventually executed for some later iteration, the data for it is already ready.

If my reasoning makes sense, then what determines how many write operations that use the same register (in this case the movq load and addq writing %rax) a CPU can execute in advance?

My goal here is to get a very high-level understanding of how these things might work. Please let me know if the information provided is not enough to answer the question.

CodePudding user response:

Yes, there are many separate short 3-instruction dependency chains, each one started by the load (2) in one iteration and ending with the store (1) in the next iteration. (Assuming no memory aliasing.)

But there's no loop-carried dependency chain except the loop counter; that 3-instruction dep chain doesn't couple back into input dependencies for any of those instructions the next time they execute. The load (2) breaks the dependency on the old value of RAX, so it is the start of another short dep chain, not a continuation of the chain started the last time it ran.
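A rough sketch of how the chains lay out across iterations, with made-up physical register names (pr7, pr8, ...) standing in for what register renaming does behind the scenes:

    # iteration N
    movq (%rdi), %rax    # pr7 = load: breaks the dep on the old RAX, starts a new chain
    addq $1, %rax        # pr8 = pr7 + 1
    # iteration N+1
    movq %rax, (%rsi)    # store reads pr8: the end of that chain
    movq (%rdi), %rax    # pr9 = load: an independent new chain, free to start early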

It would be loop-carried if add $1, %rax were instead add %rax, %rdi or some other instruction that made the next load address dependent on the previous load. In that case only the store would be "forking off" each iteration, as in the random memory-access microbenchmark in Why one line of code to decline performance 10x?
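For illustration, a hypothetical pointer-chasing variant (not from CS:APP) where the load feeds the next iteration's load address:

    .L3:
        movq %rax, (%rsi)    # the store still just forks off the chain
        movq (%rdi), %rax    # this load's result...
        addq %rax, %rdi      # ...feeds the next load's address: loop-carried
        subq $1, %rdx
        jne  .L3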

So a wide out-of-order exec CPU can have any number of those dep chains in flight in parallel, overlapping their latencies.


In fact, Sandybridge and later should be able to run this loop at 1 iteration per clock, as 4 uops, thanks to macro-fusion of sub/jnz into a single uop for the front-end and back-end. It's bottlenecked on front-end throughput as well as the back-end latency of sub; Ice Lake's 5-wide front-end won't run it any faster.
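As a sanity check on that count, here is the loop annotated with fused-domain uops per instruction (my annotation, consistent with the uiCA output below):

    movq %rax, (%rsi)   # 1 fused uop (store-address + store-data, micro-fused)
    movq (%rdi), %rax   # 1 uop (load)
    addq $1, %rax       # 1 uop (ALU)
    subq $1, %rdx       # \ macro-fused with jne
    jne  .L3            # /  into a single uop
                        # total: 4 fused-domain uops -> 1 iter/clock on a 4-wide front-end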

uiCA can simulate it on Haswell to see how these instructions would really go through the actual front-end and back-end of a real pipeline, assuming all loads and stores hit in L1d cache and don't alias. (It simulates as accurately as possible, based on reverse-engineering, the front-end behaviour and uop scheduling for loops. It doesn't simulate cache.) That link shows Haswell sustaining 1.0 cycle per iteration. The trace table visualization unrolls the loop iterations and shows in which cycle each instruction issued from the front-end into the back-end, when it dispatched to an execution unit, when it finished execution, and when it retired. (If that trace-table link dies, use the first link to have it re-generated by re-analyzing from source; the asm source is part of the URL for the first link.)

The non-trace output is as follows. (LSD = loop buffer, MITE=legacy decode, DSB=uop cache, MS=microcode sequencer. All the uops here come from the loop buffer which "locks them down". Issued vs. Exec is fused-domain uops vs. unfused. Intel CPUs decode stores as separate store-address and store-data uops, micro-fused into one front-end uop that runs as 2 in the back-end, but can issue as 1 and use only 1 ROB entry.)

Throughput (in cycles per iteration): 1.00
Bottlenecks: Dependencies, Issue, LSD, Ports

The following throughputs could be achieved if the given property were the only bottleneck:

  - LSD: 1.00
  - Issue: 1.00
  - Ports: 1.00
  - Dependencies: 1.00

M - Macro-fused with previous instruction

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┬───────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │ Notes │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│                    1  │   1    │   2   │                    0.13     0.25      1                         0.63  │       │ mov qword ptr [rsi], rax
│                    1  │   1    │   1   │                    0.5      0.5                                       │       │ mov rax, qword ptr [rdi]
│                    1  │   1    │   1   │  0.32     0.33                                0.35                    │       │ add rax, 0x1
│                    1  │   1    │   1   │                                                         1             │       │ sub rdx, 0x1
│                       │        │       │                                                                       │   M   │ jnz 0x6
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│                    4  │   4    │   5   │  0.32     0.33     0.63     0.74      1       0.35      1       0.63  │       │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┴───────┘

Zen 1 and later can also keep up with this loop at 1/clock. AMD only fuses test or cmp with jcc, not instructions that also modify an integer register, so it's 5 uops to issue/rename. But Zen's front-end is 5 instructions / 6 uops wide, so its front-end bandwidth can just barely keep up with the 1-cycle latency bottleneck of subq $1, %rdx.
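The same annotation for Zen (my count, based on the fusion rule above):

    movq %rax, (%rsi)   # 1 uop
    movq (%rdi), %rax   # 1 uop
    addq $1, %rax       # 1 uop
    subq $1, %rdx       # 1 uop (no fusion: sub also writes an integer register)
    jne  .L3            # 1 uop
                        # total: 5 uops/iteration, just within Zen's front-end width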

FYI, the CPU model CS:APP uses is basically Intel Haswell (https://www.realworldtech.com/haswell-cpu/): note the 4/clock throughput for add/sub, up from 3 in Sandybridge-family. And in other CS:APP problems we see their CPU has 2/clock FP FMA/mul but 1/clock FP add, and the latencies match Haswell; e.g. Latency bounds and throughput bounds for processors for operations that must occur in sequence. Also note how a mulsd result is read every iteration, but by a different dep chain.

Anyway, if there had been any loop-carried dependency longer than 1 cycle, it would have been a bottleneck. E.g. add an or $0, %rdx to the loop, making the counter chain a 2-cycle loop-carried dependency, and uiCA correctly simulates that being the bottleneck, averaging 2.0 cycles per iteration.
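A sketch of that modified loop (my reconstruction of the experiment described above):

    .L3:
        movq %rax, (%rsi)
        movq (%rdi), %rax
        addq $1, %rax
        orq  $0, %rdx       # extra 1-cycle ALU op on the loop counter
        subq $1, %rdx       # loop-carried chain through RDX is now 2 cycles
        jne  .L3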


Limits of how far out-of-order exec can see

If my reasoning makes sense, then what determines how many write operations that use the same register (in this case the movq load and addq writing %rax) a CPU can execute in advance?

In this case, a front-end bottleneck probably stops the back-end from getting very far ahead. But if not for that (or on an Ice Lake CPU, whose 5-wide front-end runs this 4-uop loop with slack), it would be limited by out-of-order execution resources, specifically by the number of physical registers to rename onto.

Haswell's integer PRF has 168 entries. (https://www.realworldtech.com/haswell-cpu/3/).

Also potentially by the ReOrder Buffer (ROB) capacity. Haswell's ROB is 192 uops in size. The store doesn't write a register, so by some very simplistic approximations, only 3/4 of the uops in the loop need to allocate a physical register for their output. More than 16 of the 168 PRF entries are already needed to hold the architectural state (including values of registers you aren't modifying), but as a hand-wavy rough calculation, filling the ROB with 192 uops would also use up 192 * 3/4 = 144 physical register file entries.

See https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for real-world experiments like this.

A significantly smaller limit is the scheduler size (RS = reservation station). In Haswell that's only 60 uops. It tracks unexecuted uops (in the unfused domain for Sandybridge-family, so stores take 2 slots here), vs. the ROB, which tracks not-yet-retired uops.

See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths re: the effects of RS size on overlapping long dep chains.

With a wider front-end, the unexecuted uops filling the RS would mostly be the 1-per-clock sub $1, %rdx / jnz uops (macro-fused into a single uop for port 6, since the branch is predicted-taken), and the store-data uops (also limited to 1/clock on Haswell, unlike store-address, which can run on any of p2/p3/p7 for this simple addressing mode).

The load and add uops can execute and free up their RS entries (at more than 1 per cycle on average), then wait to retire from the ROB once all earlier instructions have executed.
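Putting that together, a rough steady-state picture of the five unfused back-end uops per iteration on Haswell (my annotation; port assignments from the uiCA table above):

    store-address   # p2/p3/p7: executes early, frees its RS entry quickly
    store-data      # p4 only, 1/clock: tends to wait in the RS
    load            # p2/p3: its address (%rdi) is always ready, executes early
    add             # p0/p1/p5: executes ~4 cycles after its load
    sub/jnz         # p6 only (macro-fused): paced to 1/clock by the RDX dep chain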

If you simulate this code on uiCA for a wider CPU like Ice Lake, notice the increasing number of cycles between issue and retire as you scroll down the trace. Unlike Haswell, the front-end does get far ahead of the back-end. (Ice Lake has 2/clock store throughput including store-data, so only the latency bottleneck will back up those sub/jnz uops. Other than that, it looks like it would for your Haswell back-end with a wider front-end. Of course, Ice Lake has a larger ROB, RS, and PRF than Haswell.)
