My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, as with dec cx / jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.
But earlier in the lecture he also mentioned that LOOP does the same work as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should roughly balance out. Why, then, is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than the 8086, although one of the answers (and the page it links) does point out that on the 8088 (closely related to the 8086), LOOP is faster.
Later, the professor used the same reasoning to explain why rep string operations might be faster than a LOOP around individual move instructions (see the sketch below), but since I was not entirely convinced by the previous argument, I am asking this question here.
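For concreteness, here are the two forms being compared, sketched in 8086 assembly (a sketch, not the professor's exact example; assume DS:SI and ES:DI point at source and destination, CX holds the word count, and DF is clear):

    ; copy CX words with an explicit loop: one fetch/decode of movsw
    ; and one of loop on every iteration
    copy_loop:
        movsw               ; word [ES:DI] <- [DS:SI]; SI += 2, DI += 2
        loop copy_loop      ; CX -= 1; branch back while CX != 0

    ; the same copy as one repeated string instruction, fetched and
    ; decoded once, then repeated CX times internally
        rep movsw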
CodePudding user response:
It's not decode that's the problem; it's usually fetch on 8086.
Starting two separate instruction-decode operations is probably more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code fetch. (8088 almost always is, breathing through a straw with its 8-bit bus, unlike 8086's 16-bit bus.)
dec cx is 1 byte and jnz rel8 is 2 bytes, so 3 bytes total, vs. 2 for loop rel8.
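For illustration, here are the two idioms with their machine-code bytes (labels are mine; the rel8 displacements assume the branch targets the label shown):

    dj_loop:                ; 3 bytes fetched per iteration
        dec  cx             ; 49
        jnz  dj_loop        ; 75 FD   (rel8 = -3)

    lp_loop:                ; 2 bytes fetched per iteration
        loop lp_loop        ; E2 FE   (rel8 = -2)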
8086 performance can be approximated by counting memory accesses and multiplying by four, since its 6-byte instruction prefetch buffer allows it to overlap code fetch with decode and execution of other instructions. (Except for very slow instructions like mul, which leave time for the buffer to fill up after at most three 2-byte fetches.)
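A rough back-of-envelope using that rule of thumb (my own estimate, not from the timing tables): the 3-byte dec cx / jnz pair spans one and a half words, so roughly two 2-byte fetches per iteration on 8086, about 8 bus clocks, while the 2-byte loop fits in a single word fetch, about 4 clocks. On 8088, which fetches one byte at a time, it's 3 vs. 2 byte fetches, roughly 12 vs. 8 clocks of bus time per iteration.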
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are execution-only timings, for when fetch isn't a bottleneck (i.e. just decoding from the prefetch buffer).
Decode / execute timings, not including fetch:
DEC    Decrement

    operand  bytes     8088   186   286    386    486   Pentium
    r8       2         3      3     2      2      1     1 UV
    r16      1         3      3     2      2      1     1 UV
    r32      1         3      3     2      2      1     1 UV
    mem      2+d(0,2)  23+EA  15    7      6      3     3 UV

Jcc    Jump on condition code

    operand  bytes     8088   186   286    386    486   Pentium
    near8    2         4/16   4/13  3/7 m  3/7 m  1/3   1 PV
    near16   3         -      -     -      3/7 m  1/3   1 PV

LOOP   Loop control with CX counter

    operand  bytes     8088   186   286    386    486   Pentium
    short    2         5/17   5/15  4/8 m  11 m   6/7   5/6 NP
So even ignoring code-fetch differences:

- dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
- taken loop takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
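As a usage sketch, this is why the classic one-instruction delay loop is written with loop (the count 1000 is arbitrary; cycle numbers are the 8086 figures from the table above):

        mov  cx, 1000       ; arbitrary iteration count
    delay:
        loop delay          ; 17 cycles per taken iteration,
                            ; 5 on the final not-taken one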
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next takes some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc / std / etc.) take 2 cycles. Bizarrely, even nop takes longer (3 cycles).
CodePudding user response:
"I presume that its decoding should also be more complex"
There's no reason the decoding would be more complex for the loop instruction. The instruction has to do multiple things, but decoding is not at issue: it should decode as easily as JMP, since there's just the opcode and the one operand, the branch target, like JMP.
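For illustration, both encode as one opcode byte plus one rel8 displacement byte, so the decoder sees the same shape:

    jmp  short target       ; EB cb   (opcode + rel8)
    loop target             ; E2 cb   (opcode + rel8)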
Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.
CodePudding user response:
Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, and JNZ takes 4 (not taken) or 16 (taken) cycles, so the combination requires 7 or 19 cycles. LOOP, on the other hand, requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
I do not see anything in the manual that explains why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.
According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement: it requires 8 cycles when taken and 4 cycles when not taken, while the combination requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).
The 8086 microcode has been disassembled, so one could theoretically look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.