I want to know why, when executing this assembly code on a pipelined RISC-V - one that does not stall automatically - with forwarding (except for internal register file WB->DEC forwarding), we need to place two NOP commands immediately after the third command. Wouldn't one NOP suffice?
addi t0, x0, 0
addi t1, x0, 5
addi s1, x0, 0x200 # why are two NOPs required after this command?
beq t1, t0, finish
Here's my line of thinking - after one NOP the first command has completed, and we can forward t1 from the second command's WB into the EXE of the beq. Where am I wrong?
CodePudding user response:
So after working on this for a few hours, here's the solution: two key facts are needed:
- Beq can only receive a forwarded value from WB, since its branch condition is calculated in the branch comparator and forwarding only exists to the ALU.
- as per the question's instructions, we can't forward from WB->DEC, so essentially we can't forward to Beq at all. Let's write out the stages and "run the program":
IF   DEC  EXE  MEM  WB
1    -    -    -    -
2    1    -    -    -
3    2    1    -    -
4    3    2    1    -
-    4    3    2    1
- notice we can't execute 4 (beq t1, t0, finish) since it's dependent on t1's value from instruction 2. We have to wait for t1's value. MEM->DEC forwarding doesn't exist. We can only fetch a new t1 at the DEC stage, since all the forwarding to EXE links up to the ALU and we calculate the branch condition at the comparator, which we can't affect; hence we must wait and place a single NOP. Let's continue.
IF   DEC  EXE  MEM  WB
-    4    NOP  3    2
- notice we STILL can't do anything - we're waiting for t1 but we don't have WB->DEC forwarding (as was stated in the question), so we must wait for 2 to finish its WB stage before 4 can pick up t1's updated value at DEC, hence we must place another NOP. Let's continue.
IF   DEC  EXE  MEM  WB
-    4    NOP  NOP  3      (notice 2 has finished, so we can now continue with the correct t1)
-    -    4    NOP  NOP
-    -    -    4    NOP
-    -    -    -    4
DONE.
yup
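The walkthrough above can be summarized as the padded program below. This is only a sketch under the same assumptions the answer makes: the branch comparator reads its operands in DEC, no forwarding path reaches it, and the register file has no internal WB->DEC forwarding. The placement of the `finish` label is hypothetical, since the question does not show where it is defined.

```asm
addi t0, x0, 0
addi t1, x0, 5        # instruction 2: produces t1
addi s1, x0, 0x200    # instruction 3: independent of the branch
nop                   # without this, beq would be in DEC while instruction 2 is still in MEM
nop                   # without this, beq would be in DEC while instruction 2 is in WB (no WB->DEC forwarding)
beq  t1, t0, finish   # beq now reaches DEC one cycle after instruction 2 has completed WB
# ... fall-through code would go here ...
finish:               # hypothetical location of the branch target
```

Counting cycles from the fetch of the first instruction, instruction 2 is in WB in cycle 6 and the beq is in DEC in cycle 7, so it reads the updated t1 from the register file. With only one NOP the beq would decode in cycle 6, which would need exactly the WB->DEC forwarding the question rules out.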
CodePudding user response:
As Erik said, there should not be a need for a NOP instruction. The CPU implementation should handle the dependencies and stall the pipeline when needed. If for some reason the implementation doesn't do it (I would refer to this as a BUG), there are workarounds to fix it at a later stage, such as a compiler that injects NOPs when it detects dependencies.
If the CPU supports forwarding, as you said, on a traditional 5-stage pipelined CPU, then there is no need for a NOP. When the BEQ instruction hits the CPU decode stage, t0 is already written to the register file while t1 can be forwarded.