Currently working on understanding MIPS architecture and assembly language, I've been asked to place NOPs in the following assembly code:
1 ADD R2,R1,R3
2 SW R3,0(R2)
3 ADD R4,R3,R2
4 LOOP: LW R8,2000(R4)
5 SUB R5,R4,R8
6 ORI R23,R8,370
7 ADD R19,R5,R8
8 LW R7,1000(R16)
9 SW R7,2000(R16)
10 LW R15,1000(R7)
11 SW R14,2000(R15)
12 SUBI R15,R15,99
13 AND R6,R14,R15
14 SLT R21,R11,R28
15 BEQ R21,R6,LOOP
I've placed the NOPs according to how I've been taught, and got this:
1 ADD R2,R1,R3
3 NOPs
2 SW R3,0(R2)
3 ADD R4,R3,R2
3 NOPs
4 LOOP: LW R8,2000(R4)
3 NOPs
5 SUB R5,R4,R8
6 ORI R23,R8,370
2 NOPs
7 ADD R19,R5,R8
8 LW R7,1000(R16)
3 NOPs
9 SW R7,2000(R16)
10 LW R15,1000(R7)
3 NOPs
11 SW R14,2000(R15)
12 SUBI R15,R15,99
3 NOPs
13 AND R6,R14,R15
14 SLT R21,R11,R28
3 NOPs
15 BEQ R21,R6,LOOP
This feels awfully off to me, I believe there is a way to place less NOPs and still handle all RAW data hazards in this code.
EDIT:
I've been able to reduce the number of NOPs by 3, after seeing that line 2 and 3 has no dependencies, due to SW not chaning the value stored in R3.
CodePudding user response:
This feels awfully off to me, I believe there is a way to place less NOPs and still handle all RAW data hazards in this code.
Looks to me like all those NOPs are required, since 3 cycles after decoding one instruction is when it writes back in a classic 5-stage RISC. That's the cycle another instruction can enter ID and read the value from the register file without any special detection of the RAW hazard to set up bypass forwarding or to stall the pipeline.
The cases of 2 NOPs look correctly placed where there's one independent instruction between an instruction and the one consuming its result.
It's normal for instructions to use the result of the previous one, especially if it wasn't a load, that's why real CPUs need to avoid having to stall like this to not suck. e.g. first-gen commercial MIPS R2000 had bypass forwarding, with no stalls for anything except cache-miss load/store, or mult / div. An instruction could use the result of the previous instruction with no penalty.
It didn't even stall for loads, instead software had to respect the "load delay slot", not using a load result until 1 instruction later (otherwise there was no guarantee whether you'd see the old or new register value, depending on cache miss!). Later MIPS revisions changed that, since compilers often couldn't fill it with anything except NOP, and that hurt I-cache density. The branch delay slot was another matter; it couldn't be changed, but was another part of what let classic MIPS hide branch latency and not need branch prediction, forwarding from the first half-cycle of beq
to the 2nd half-cycle of the IF stage.
(Also, this code is intentionally badly scheduled (especially store right after load), to give you something to do in the next step of the assignment.
It's also totally artificial, computing some results that aren't used inside the loop and could just be done after (aka sunk out of the loop), or done before (hoisted) like 14 SLT R21,R11,R28
which doesn't depend on any of the other instructions. But that would make scheduling "harder", because these useless instructions work as filler like NOPs.)
Presumably the point of this exercise is to show you that CPUs would have tons of stalls that are hard to schedule around if they didn't have bypass forwarding. Mostly only highly unrolled (and software-pipelined) loops could spend most of their cycles doing work instead of being stalled. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing
I cant seem to find a way to efficiently reduce the number of NOPs by rescheduling, most I've been able to reduce is around 5-6 NOPs, which feels to me like a lot more could be done. If you were to reschedule the MIPS assembly code I've shown above, what would you do?
There are a few WAW / WAR hazards that prevent reordering; if we could use different registers as well as reordering that would open up more possibilities. e.g. SUBI R15,R15,99
could subtract into a temporary so it could be done before / in parallel with SW R14,2000(R15)
. Or the store could use 2099(R15)
in the addressing mode to offset the subtract. (See How to unroll a loop of a dot product in mips after re-ordering instructions? for an example of that kind of optimization).
I just realized I haven't been considering the possibility of memory aliasing when reordering loads and stores wrt. each other. e.g. if SW R7,2000(R16)
/ LW R15,1000(R7)
could be a reload of that store, scheduling the store later would change the program. Let's pretend that we can assume no aliasing.
After looking at the dependency chains and using that to figure out which instructions to do earliest (starts of the longest dep chains first):
## Without considering memory aliasing, reordering some stores after loads.
1 ADD R2,R1,R3
3 NOPs
3 ADD R4,R3,R2 # dep on 1 (R2). R4 is read-only after this, in the loop
2 SW R3,0(R2) # dep on 1 (R2)
1 NOPs # put this NOP outside the loop because 4 only needs to stall on the first iter.
LOOP:
8 LW R7,1000(R16) # no deps on anything in the shown code
4 LW R8,2000(R4) # dep on 3 (R4) before loop. R8 is only written here.
14 SLT R21,R11,R28 # independent of everything.
1 NOPs
10 LW R15,1000(R7) # dep on 8 (R7)
5 SUB R5,R4,R8 # dep on 4 (R8). (which also depends on the same R4 value from 3 we read)
9 SW R7,2000(R16) # dep on 8 (R7). R16 is read-only
6 ORI R23,R8,370 # dep on 4 (R8). R23 is write-only
11 SW R14,2000(R15) # dep on 10 (R15). R14 is read-only
12 SUBI R15,R15,99 # dep on 10 (R15), WAR on 11
7 ADD R19,R5,R8 # dep on 5 (R5), and 4 (R8) indirectly via 5 as well as directly. R19 is write-only
2 NOPs
13 AND R6,R14,R15 # dep on 12 (R15). R14 is read-only
3 NOPs
15 BEQ R21,R6,LOOP # dep on 14 (R21), 13 (R6)
6 NOPs inside the loop, down from 17. (And saved 2 NOPs before the loop, too, which helps for code size / I-cache footprint.)
This may not be optimal, but I don't see any more movements of single instructions that would save anything. Converting 11 SW R14,2000(R15)
to SW R14,2099(R15)
would allow 12 SUBI R15,R15,99
to start a cycle sooner, and the store could fill one of the stall slots near the tail of the loop. And/or subi
into a different register would allow the sub to run first, allowing the same benefit without lengthening the dependency chain.
Dependency chain involving R15: 8 LW R7,1000(R16)
-> 10 LW R15, 1000(R7)
-> 12 SUBI R15,R15,99
-> 13 AND R6,R14,R15
-> 15 BEQ R21,R6,LOOP
. With 11 SW R14,2000(R15)
also reading the R15 load result, before SUBI.
Dep chain starting with 4 LW R8,2000(R4)
: nothing writes R4 in the loop, so this load could just get hoisted out of the loop, unless some other pointer aliases the same address. Multiple things read R8, but only the load writes it. From those reads of R8:
6 ORI R23,R8,370
is a dead end, nothing reads R23 (so we should sink it out of the loop in case something after the loop needsR8 | 370
in R23 later; we know nothing in the loop does.)5 SUB R5,R4,R8
->7 ADD R19,R5,R8
dead ends there; nothing reads R5 or R19 so those could similarly be sunk out of the loop. (R4 and R8 will still have the same value after the loop.)