Maximizing the performance and efficiency of triangularizing a 24x24 matrix in C and then in MIPS as-CodePudding

As of recently an interest within the realm of computer architecture and performance has been sparked in me. With that said, I have been picking up an "easier" assembly language to really try and learn how stuff "works under the hood". Namely MIPS assembly. I feel comfortable enough to try and experiment with some more advanced stuff and as such I have decided to combine programming with my interest in mathematics.

My goal is simple, given a 24x24 (I don't care about any other size) matrix A, I want to write an algorithm that as efficiently as possible finds the upper triangular form of the matrix. With efficiently I mean that I want to eventually end up in a state where I use the processor's that I am using resources the best I can. High cache hit rate, efficient usage of memory (locality of reference principle etc.), performance as in time it takes to run the solution, etc.

Eventually my goal is to transform the C solution to MIPS-assembly and tailor it to fit the memory subsystem of the processor that I will be trying to run my algorithm on. Regarding the processor I will have different options to play around with when it comes to caches, write buffers and memory in the sense that I can play around with different cache sizes, block sizes, associativity levels, memory access times etc. Performance in this case will be measured in the time it takes to triangularize a 24x24 matrix.

To begin, I need to actually write some high level code and actually solve the problem there before diving into MIPS assembly. I have "looked around" and eventually came up with this seemingly standard solution. It isn't necessarily super fast, neither do I think it is optimal for triangularizing 24x24 matrices. Can I do better?

void triangularize(float **A, int N)
{
    int i, j, k;
    // Loop over the diagonal elements
    for (k = 0; k < N; k  )
    {
        // Loop over all the elements in the pivot row and right of the pivot ELEMENT
        for (j = k   1; j < N; j  )
        {
            // divide by the pivot element
            A[k][j] = A[k][j] / A[k][k];
        }
    
        // Set the pivot elements
        A[k][k] = 1.0;

        // Loop over all elements below the pivot right an right of the pivot COLUMN
        for (i = k   1; i < N; i  )
        {
            for (j = k   1; j < N; j  )
            {
                A[i][j] = A[i][j] - A[i][k] * A[k][j];
            }
            A[i][k] = 0.0;
        }
    }
}

Furthermore, what should be my next steps when trying to convert the C code to MIPS assembly with respect to maximizing performance and minimizing cost (cache hit rates, IO costs when dealing with memory etc.) to get a lightning fast and efficient solution?

CodePudding user response：

First of all, encoding a matrix as a jagged array (ie. float**) is generally not efficient as it cause unnecessary expensive indirections and the array may not be contiguous in memory resulting in more cache misses or even cache trashing in pathological cases. It is certainly better to copy the matrix in a contiguous flatten array. Please consider storing your matrices as flatten arrays that are generally more efficient (especially on MIPS). Flatten array can be indexed using something like array[i*24 j] instead of array[i][j].

Moreover, if you do not care about matrices other than 24x24 ones, then you can write a specialized code for 24x24 matrices. This help compilers to generate a more efficient assembly code (typically by unrolling loops and using more efficient instructions like multiplication by a constant).

Additionally, divisions are generally expensive, especially on embedded MIPS processors. Thus, you can replace divisions by multiplications with the inverse. For example:

float inv = 1.0f / A[k][k];

for (j = k   1; j < N; j  )
    A[k][j] *= inv;

Note that the result might be slightly different due to floating-point rounding. You can use the -ffast-math compiler flag so to help it generating such optimisation if you know that special values like NaN or Inf do not appear in the matrix.

Moreover, it may be faster to unroll the loop manually since not all compilers do that (properly). That being said, the benefit of loop unrolling is very dependent of the target processor (unspecified here). Without more information, it is very hard to know if this is useful. For example, some processor can execute multiple floating-point operation per cycles while some other cannot even do that natively (ie. no hardware FP unit): they are somehow emulated with many instruction which is very expensive (compilers like GCC do function calls for basic operations like addition/subtraction on such processors). If there is no hardware FP unit, then it might be faster to use fixed precision.

Finally, some MIPS processors have a 128-bit SIMD unit. Using it should significantly speed up the execution. Compilers should be able to mostly auto-vectorize your code but you need to tell them if your target processor support it (see the -march flag for GCC/Clang). For a fixed-size matrix, manual vectorization often result in a faster execution (than auto-vectorisation) assuming you write an efficient code.