I got this question about loop unrolling in MIPS, but I could not figure out what to do once I got to the step I will show you below (I am not even sure about these steps). I am new to computer architecture. I was given this code snippet, which is in assembly:
Loop: ld $f0, 0($s1)
ld $f4, 0($s2)
multd $f0, $f0, $f4
addd $f2, $f0, $f2
subi $s1, $s1, 8
subi $s2, $s2, 8
bneqz $s1, Loop
There is additional information given, as follows:
- a one-cycle delayed branch, and the following table:
| Instruction producing result | Instruction using result | Latency in clock cycles |
| --- | --- | --- |
| FP ALU op | Another FP ALU op | 3 |
| FP ALU op | Store double | 2 |
| Load double | FP ALU op | 1 |
| Load double | Store double | 0 |
So what I did was insert the stalls and then try to reorder the instructions:
Loop: ld $f0, 0($s1)
ld $f4, 0($s2)
STALL
multd $f0, $f0, $f4
STALL
STALL
STALL
addd $f2, $f0, $f2
subi $s1, $s1, 8
subi $s2, $s2, 8
bneqz $s1, Loop
I moved the `addd` instruction after the `bneqz` instruction, and then I got stuck. Can anyone help explain?
CodePudding user response:
Without unrolling or doing anything tricky, you can hide all but one stall cycle.
# $f0 = 0.0
# $t0 = end_pointer = $t1 + size
Loop: ld $f6, 0($t1)
ld $f8, 0($t2)
addiu $t1, $t1, 8 # fills load-delay slot
multd $f6, $f6, $f8
addiu $t2, $t2, 8 # multd in flight cycle 1
bneq $t1, $t0, Loop # multd in flight cycle 2
STALL waiting for $f6
addd $f0, $f0, $f6 # branch-delay slot
There are only 2 instructions between `multd` and `addd`. With software pipelining, we can avoid that stall without unrolling. You might also call this a loop rotation, since the 1-iteration window of the loop that appears in the source no longer starts with a load and ends with the `addd` that consumes it. Loop rotation enables software pipelining (putting the loads in the shadow of the `multd` to set up for the next iteration).
- Peel the 2 loads from the first iteration, so the top of the loop can start with `multd`.
- Peel the multd/addd that consume those loads from the last iteration, so you'll need an extra copy of those instructions after the loop.
ld $f6, 0($t1) # partially peeled first iteration
ld $f8, 0($t2)
Loop:
# this iteration's loads are already done
multd $f4, $f6, $f8 # into a different reg so loads won't overwrite it
ld $f6, 8($t1)
ld $f8, 8($t2) # offset=8 for next iter before the pointer increment
addiu $t1, $t1, 8
addiu $t2, $t2, 8
bneq $t1, $t0, Loop
addd $f0, $f0, $f4 # branch-delay slot
# peeled final iteration's multd/addd (its loads happened on the loop's last pass);
# note $t0 must point at the last element (end - 8) here, not one-past-end
multd $f4, $f6, $f8
addd $f0, $f0, $f4
Having the multiply write to a temp register instead of overwriting one of the registers we load into is essentially avoiding a WAW hazard to make this manual reordering possible. Having lots of architectural registers enables plenty of software-pipelining even for longer dependency chains. Speaking of which:
- 5 instructions between the `multd` and the `addd` that consumes its result.
- 4 instructions between the last `ld` and the `multd` in the next iteration that consumes its result. Higher-clocked CPUs won't have single-cycle load-use latency even for L1d cache hits, so this is a good thing even if it's not necessary in the pipeline you're tuning for. We could have put the loads after the pointer increments (using a displacement of 0), but it wouldn't save any code-size, and presumably the performance is the same on most MIPS CPUs whether or not the 16-bit immediate displacement is zero.
- 6 instructions between the `addd` in one iteration and the `addd` in the next, the loop-carried dependency in this reduction.
On higher-performance CPUs, that loop-carried dependency would become a bottleneck and you'd need to unroll with multiple accumulators to hide it. e.g. Intel Haswell can run two FMAs (Fused Multiply-Adds) per clock, with 5-cycle latency, so you need at least 10 accumulators (registers to sum into) to hide that latency. Or fewer in a dot product, since each FMA needs two loads and Haswell can only do 2 loads per clock, so load throughput becomes the bottleneck first. (Superscalar out-of-order exec of up to 4 front-end micro-ops per clock, with x86 able to fold a load into an FMA as a single uop for the pipeline, leaving room for the pointer increments. Very different from a 1-IPC MIPS pipeline where, even without unrolling, you don't hit any inter-iteration bottlenecks.)
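(Not part of the MIPS exercise, but to make the multiple-accumulator idea concrete, here's a minimal C sketch of a dot product with four accumulators. The function name, the choice of four accumulators, and the assumption that `n` is a multiple of 4 are mine, just for illustration; on something like Haswell you'd want more accumulators, or SIMD, to fully hide FMA latency.)

#include <stddef.h>

// Four independent sum chains can be in flight at once, instead of one
// serial sum += a[i]*b[i] chain limited by FP add (or FMA) latency.
double dot4(const double *a, const double *b, size_t n)   // assumes n % 4 == 0
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i+0] * b[i+0];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    return (s0 + s1) + (s2 + s3);   // reduce the accumulators once, after the loop
}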
With unrolling, you don't need software-pipelining on this pipeline:
Loop: ld $f6, 0($t1)
ld $f8, 0($t2)
ld $f10, 8($t1)
ld $f12, 8($t2)
multd $f6, $f6, $f8 # 2 insn gap after loading $f8
multd $f10, $f10, $f12 # 1 insn gap after loading $f12
addiu $t1, $t1, 16
addiu $t2, $t2, 16
addd $f0, $f0, $f6 # 3 insn gap after multd into $f6
bneq $t1, $t0, Loop # 2 insn gap after pointer increment
addd $f2, $f2, $f10 # 4 insn gap after multd into $f10
# after the loop: reduce the two accumulators to one
addd $f0, $f0, $f2 # runs once per dot product, not per element
# so a stall hardly matters even if you can't schedule it after other post-loop work.
We could almost hide FP latency without multiple accumulators or software pipelining, to avoid that extra addd outside the loop, if we schedule to the limits of load latency. Since we're mixing things around, I indented differently for the instructions associated with the two different mul/add ops.
Loop: ld    $f6, 0($t1)
      ld    $f8, 0($t2)
         ld    $f10, 8($t1)
      multd $f6, $f6, $f8         # 1 insn gap after loading $f8
         ld    $f12, 8($t2)
      addiu $t1, $t1, 16
         multd $f10, $f10, $f12   # 1 insn gap after loading $f12
      addd  $f0, $f0, $f6         # 3 insn gap after multd into $f6
      addiu $t2, $t2, 16
      bneq  $t1, $t0, Loop
         addd  $f0, $f0, $f10     # STALL: 2 insn gap after addd into $f0
So no, it turns out we can't hide all the latency this way without multiple accumulators or software pipelining.
If you had to match the numeric results of purely serial code (like compiling without `-ffast-math`, i.e. strict FP), you could software-pipeline this to hide that one stall cycle, rotating the loop so the loads fill the gaps after the mults and adds.
But multiple accumulators are typically better numerically, assuming uniformly distributed non-negative things you're summing: a single serial sum keeps adding small numbers of about the same size to an ever-larger total, with worse and worse rounding error. See Simd matmul program gives different numerical results - being different from a serial `sum += a[i] * b[i]` doesn't mean worse; usually it's numerically more accurate. As in Is this way of processing the tail of an array with SSE overkill?, where the elements of SIMD vectors act as multiple accumulators, we see less error than a simple serial sum.
Using two accumulators costs only one extra `addd` outside the loop (beyond any checks for unrolling), vs. loop rotation peeling a whole iteration of the loop body (except pointer increments), part before the loop and part after.
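To see the numerical point in isolation, here's a small, contrived C demo of my own (not from the linked Q&As): summing 2^26 copies of 1.0f. A single float accumulator gets stuck once the running total reaches 2^24, because adding 1.0f to a value that large rounds back to the same value, while four accumulators each stay small enough that every addition is exact.

#include <stdio.h>

int main(void)
{
    const long n = 1L << 26;                // sum 67108864 ones
    float serial = 0.0f;
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (long i = 0; i < n; i += 4) {
        serial += 1.0f; serial += 1.0f;     // one ever-growing accumulator
        serial += 1.0f; serial += 1.0f;
        s0 += 1.0f; s1 += 1.0f;             // four smaller accumulators
        s2 += 1.0f; s3 += 1.0f;
    }
    printf("serial: %.1f\n", serial);                  // 16777216.0 (stuck at 2^24)
    printf("split : %.1f\n", (s0 + s1) + (s2 + s3));   // 67108864.0 (exact)
    return 0;
}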
I made some other changes to your loop for sanity:
- Use `$t` temporary registers; no need to use call-preserved `$s` regs for loop counters.
- Increment pointers forward until reaching an end-pointer in another register, like `do{ p1++ ; p2++ ; }while(p1 != endp);` (see the C sketch after this list), instead of counting down until a NULL pointer (address `0`).
- Treat pointers as unsigned; you don't want to trap on signed overflow if one of the arrays happens to span the boundary between the low 2GiB and high 2GiB of memory. (MIPS `add`/`sub` instructions trap on signed overflow; `addu`/`subu` don't. They're the same binary operation on the data.)
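Putting those cleanups together, a C-level equivalent of the rewritten loop might look roughly like this (my sketch; the names `a`, `b`, `n` are assumptions, and a do/while like this needs n > 0):

#include <stddef.h>

double dotprod(const double *a, const double *b, size_t n)   // requires n > 0
{
    const double *end_a = a + n;     // end pointer kept in its own register
    double sum = 0.0;
    do {                             // pointers count up toward end_a
        sum += *a * *b;
        a++;
        b++;
    } while (a != end_a);            // compare against the end pointer, not 0
    return sum;
}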