I got this question about loop unrolling in MIPS, but I could not figure out what to do once I got to the step I will show you below (I am not even sure about these steps). I am new to computer architecture. I was given this code snippet, which is in assembly:
Loop: ld $f0, 0($s1)
ld $f4, 0($s2)
multd $f0, $f0, $f4
addd $f2, $f0, $f2
subi $s1, $s1, 8
subi $s2, $s2, 8
bneqz $s1, Loop
There is additional information given, as follows:
- a one-cycle delayed branch, and the following table:
| Instruction producing result | Instruction using result | Latency in clock cycles |
| --- | --- | --- |
| FP ALU op | Another FP ALU op | 3 |
| FP ALU op | Store double | 2 |
| Load double | FP ALU op | 1 |
| Load double | Store double | 0 |
So what I did was insert the stalls and then try to reorder the instructions:
Loop: ld $f0, 0($s1)
ld $f4, 0($s2)
STALL
multd $f0, $f0, $f4
STALL
STALL
STALL
addd $f2, $f0, $f2
subi $s1, $s1, 8
subi $s2, $s2, 8
bneqz $s1, Loop
I moved the `addd` instruction after the `bneqz` instruction, and then I got stuck. Can anyone help explain?
CodePudding user response:
Without unrolling or doing anything tricky, you can hide all but one stall cycle.
# $f0 = 0.0
# $t0 = end_pointer = $t1 + size
Loop: ld $f6, 0($t1)
ld $f8, 0($t2)
addiu $t1, $t1, 8 # fills load-delay slot
multd $f6, $f6, $f8
addiu $t2, $t2, 8 # multd in flight cycle 1
bneq $t1, $t0, Loop # multd in flight cycle 2
STALL waiting for $f6
addd $f0, $f0, $f6 # branch-delay slot
There are only 2 instructions between `multd` and `addd`. With software pipelining, we can avoid that stall without unrolling. You might also call this a loop rotation, since the 1-iteration window of the loop that appears in the source no longer starts with a load and ends with the `addd` that consumes it. Loop rotation enables software pipelining (putting the loads in the shadow of the `multd` to set up for the next iteration).
- Peel the 2 loads from the first iteration, so the top of the loop can start with `multd`.
- Peel the multd/addd that consume those loads from the last iteration, so you'll need an extra copy of those instructions after the loop.
ld $f6, 0($t1) # partially peeled first iteration
ld $f8, 0($t2)
Loop:
# this iteration's loads are already done
multd $f4, $f6, $f8 # into a different reg so loads won't overwrite it
ld $f6, 8($t1)
ld $f8, 8($t2) # offset=8 for next iter before the pointer increment
addiu $t1, $t1, 8
addiu $t2, $t2, 8
bneq $t1, $t0, Loop
addd $f0, $f0, $f4 # branch-delay slot
# peeled final iteration's multd/addd (its loads happened on the loop's last pass);
# note $t0 must point at the last element (end - 8) here, not one-past-end
multd $f4, $f6, $f8
addd $f0, $f0, $f4
Having the multiply write to a temp register instead of overwriting one of the registers we load into is essentially avoiding a WAW hazard to make this manual reordering possible. Having lots of architectural registers enables plenty of software-pipelining even for longer dependency chains. Speaking of which:
- 5 instructions between the `multd` and the `addd` that consumes its result.
- 4 instructions between the last `ld` and the `multd` in the next iteration that consumes its result. Higher-clocked CPUs won't have single-cycle load-use latency even for L1d cache hits, so this is a good thing even if it's not necessary in the pipeline you're tuning for. We could have put the loads after the pointer increments (using a displacement of 0), but it wouldn't save any code-size, and presumably the performance is the same on most MIPS CPUs whether or not the 16-bit immediate displacement is zero.
- 6 instructions between the `addd` in one iteration and the `addd` in the next, the loop-carried dependency in this reduction.
On higher-performance CPUs, that loop-carried dependency would become a bottleneck and you'd need to unroll with multiple accumulators to hide it. e.g. Intel Haswell can run two FMAs (Fused Multiply-Adds) per clock, with 5-cycle latency, so you need at least 10 accumulators (registers to sum into) to hide that latency. Or fewer in a dot product, since each FMA needs two loads and Haswell can only do 2 loads per clock, so load throughput becomes the bottleneck first. (Superscalar out-of-order exec of up to 4 front-end micro-ops per clock, with x86 able to fold a load into an FMA as a single uop for the pipeline, leaving room for the pointer increments. Very different from a 1-IPC MIPS pipeline where, even without unrolling, you don't hit any inter-iteration bottlenecks.)
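(Not part of the MIPS exercise, but to make the multiple-accumulator idea concrete, here's a minimal C sketch of a dot product with four accumulators. The function name, the choice of four accumulators, and the assumption that `n` is a multiple of 4 are mine, just for illustration; on something like Haswell you'd want more accumulators, or SIMD, to fully hide FMA latency.)

#include <stddef.h>

// Four independent sum chains can be in flight at once, instead of one
// serial sum += a[i]*b[i] chain limited by FP add (or FMA) latency.
double dot4(const double *a, const double *b, size_t n)   // assumes n % 4 == 0
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i+0] * b[i+0];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    return (s0 + s1) + (s2 + s3);   // reduce the accumulators once, after the loop
}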
With unrolling, you don't need software-pipelining on this pipeline:
Loop: ld $f6, 0($t1)
ld $f8, 0($t2)
ld $f10, 8($t1)
ld $f12, 8($t2)
multd $f6, $f6, $f8 # 2 insn gap after loading $f8
multd $f10, $f10, $f12 # 1 insn gap after loading $f12
addiu $t1, $t1, 16
addiu $t2, $t2, 16
addd $f0, $f0, $f6 # 3 insn gap after multd into $f6
bneq $t1, $t0, Loop # 2 insn gap after pointer increment
addd $f2, $f2, $f10 # 4 insn gap after multd into $f10
# after the loop: reduce the two accumulators to one
addd $f0, $f0, $f2 # runs once per dot product, not per element
# so a stall hardly matters even if you can't schedule it after other post-loop work.
We could almost hide FP latency without multiple accumulators or software pipelining, to avoid that extra addd outside the loop, if we schedule to the limits of load latency. Since we're mixing things around, I indented differently for the instructions associated with the two different mul/add ops.
Loop: ld    $f6, 0($t1)
      ld    $f8, 0($t2)
         ld    $f10, 8($t1)
      multd $f6, $f6, $f8         # 1 insn gap after loading $f8
         ld    $f12, 8($t2)
      addiu $t1, $t1, 16
         multd $f10, $f10, $f12   # 1 insn gap after loading $f12
      addd  $f0, $f0, $f6         # 3 insn gap after multd into $f6
      addiu $t2, $t2, 16
      bneq  $t1, $t0, Loop
         addd  $f0, $f0, $f10     # STALL: 2 insn gap after addd into $f0
So no, it turns out we can't hide all the latency this way without multiple accumulators or software pipelining.
If you had to match the numeric results of purely serial code (like compiling without `-ffast-math`, i.e. strict FP), you could software-pipeline this to hide that one stall cycle, rotating the loop so the loads fill the gaps after the mults and adds.
But multiple accumulators are typically better numerically, assuming uniformly distributed non-negative things you're summing: a single serial sum keeps adding small numbers of about the same size to an ever-larger total, with worse and worse rounding error. See Simd matmul program gives different numerical results - being different from a serial `sum += a[i] * b[i]` doesn't mean worse; usually it's numerically more accurate. As in Is this way of processing the tail of an array with SSE overkill?, where the elements of SIMD vectors act as multiple accumulators, we see less error than a simple serial sum.
Using two accumulators costs only one extra `addd` outside the loop (beyond any checks for unrolling), vs. loop rotation peeling a whole iteration of the loop body (except pointer increments), part before the loop and part after.
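To see the numerical point in isolation, here's a small, contrived C demo of my own (not from the linked Q&As): summing 2^26 copies of 1.0f. A single float accumulator gets stuck once the running total reaches 2^24, because adding 1.0f to a value that large rounds back to the same value, while four accumulators each stay small enough that every addition is exact.

#include <stdio.h>

int main(void)
{
    const long n = 1L << 26;                // sum 67108864 ones
    float serial = 0.0f;
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (long i = 0; i < n; i += 4) {
        serial += 1.0f; serial += 1.0f;     // one ever-growing accumulator
        serial += 1.0f; serial += 1.0f;
        s0 += 1.0f; s1 += 1.0f;             // four smaller accumulators
        s2 += 1.0f; s3 += 1.0f;
    }
    printf("serial: %.1f\n", serial);                  // 16777216.0 (stuck at 2^24)
    printf("split : %.1f\n", (s0 + s1) + (s2 + s3));   // 67108864.0 (exact)
    return 0;
}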
I made some other changes to your loop for sanity:
- Use `$t` temporary registers; no need to use call-preserved `$s` regs for loop counters.
- Increment pointers forward until reaching an end-pointer in another register, like `do{ p1++ ; p2++ ; }while(p1 != endp);` (see the C sketch after this list), instead of counting down until a NULL pointer (address `0`).
- Treat pointers as unsigned; you don't want to trap on signed overflow if one of the arrays happens to span the boundary between the low 2GiB and high 2GiB of memory. (MIPS `add`/`sub` instructions trap on signed overflow; `addu`/`subu` don't. They're the same binary operation on the data.)
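Putting those cleanups together, a C-level equivalent of the rewritten loop might look roughly like this (my sketch; the names `a`, `b`, `n` are assumptions, and a do/while like this needs n > 0):

#include <stddef.h>

double dotprod(const double *a, const double *b, size_t n)   // requires n > 0
{
    const double *end_a = a + n;     // end pointer kept in its own register
    double sum = 0.0;
    do {                             // pointers count up toward end_a
        sum += *a * *b;
        a++;
        b++;
    } while (a != end_a);            // compare against the end pointer, not 0
    return sum;
}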