I am currently trying to optimize some MIPS assembler that I've written for a program that triangulates a 24x24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.
Part of the algorithm involves the following loop that I'm trying to unroll (N will always be 24)
...
float inv = 1/A[k][k]
for (j = k 1; j < N; j ) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
}
...
I want
...
float inv = 1/A[k][k]
for (j = k 1; j < N; j =2) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
A[k][j 1] = A[k][j 1] * inv;
}
...
but it generates the incorrect result and I don't know why. The interesting thing is that the version with loop unrolling generates the first row of matrix correctly but the remaining ones incorrect. The version without loop unrolling correctly triangulates the matrix.
Here is my attempt at doing it.
...
# No loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 1 ## j = 2
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f0, 0($v0) ## Perform A[k][j] := A[k][j] * inv
# The matrix triangulates without problem with this original code.
...
...
# One loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 2 ## j = 2
lwc1 $f1, 4($v0) # $f1 <- A[k][j 1]
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
mul.s $f1, $f1, $f2 # Perform A[k][j 1] * inv
swc1 $f0, 0($v0) # Perform A[k][j] := A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f1, 4($v0) ## Perform A[k][j 1] := A[k][j 1] * inv
# The first row in the resulting matrix is correct, but the remaining ones not when using this once unrolled loop code.
...
CodePudding user response:
The unrolled C loop condition is buggy.
j < N; j =2
can start the loop body with j = N-1
,
accessing the array at A[k][N-1]
(fine) and A[k][N]
(not fine).
One common method is j < N-1
, or in general j < N-(unroll-1)
. But for unsigned N, you also have to separately check N >= unroll
before starting the loop, because N-1
could wrap to a huge unsigned value.
Keeping the j < limit
is generally good for C compilers vs. j 1 < N
which is a separate thing they'd have to calculate. And can also stop a compiler from proving that the loop isn't infinite for unsigned counts (like size_t
), because that's well-defined as wrapping around, so N = UINT_MAX could lead to the condition always being true depending on the starting point. (e.g. j = UINT_MAX-2 makes UINT_MAX-1 < UINT_MAX
, and j =2
makes 0 < UINT_MAX
, also true.) So it's a similar problem to using j <= limit
for unsigned counters. Compilers really like to know when a loop is potentially infinite. For some, that it disables auto-vectorization if the trip-count isn't calculable ahead of the first iteration.
If j
was starting at 0
, you can get away with a sloppy condition if N was guaranteed to be a multiple of the unroll factor. But here it's different, as Nate points out.
efficiency of your MIPS asm
generally the point of loop unrolling is performance. A non-inline call to a helper function inside the loop is kind of defeating the purpose.
jal getelem
I assume does a bunch of multiplies and stuff to redo the indexing with a pointer and two integers? Notice that you're scanning along contiguous memory in one row, so you can just increment a pointer.
Calculate an end-pointer to compare against, so your MIPS loop can look like
# some checking outside the loop, maybe with a bxx to the end of it.
looptop: # do{
lwc1 $f2, 0($t0)
lwc1 $f3, 4($t0)
addiu $t0, $t0, 4*2 # p =2 advance by 8 bytes, 2 floats
...
swc1 something, 0($t0)
swc1 something, 4($t0)
bne $t0, $t1 # }while(p!=endp)
# maybe another condition to check if you should run one last iteration.
MIPS bltu
is only a pseudo-instruction (sltu/bnez); that's why it's better to calculate an exact end-pointer so you can use a single machine instruction as the loop branch.
And yes, this might mean rounding the iteration count down to a multiple of 2 to ensure correctness. Or doing a scalar iteration and rounding up to a multiple of 2. e.g. x
/ x&=-2;
With software pipelining, e.g. doing a load and divide but not a store yet, you could maybe let the rounding-up have the loop redo that element if odd. (If the chance of a branch mispredict costs more than an FP multiply and a redundant store.) Haven't fully thought this through, but it's a similar idea to SIMD doing a first unaligned vector, then a potentially-partially-overlapping aligned vector. (SIMD vectorization is like unrolling, but then you roll back up into a single instruction that does 4 elements, for example.)