Case of using OpenMP for multi-threading of a matrix factorization calculation of an existing serial-CodePudding

I came across a code that uses a low-performance series for loop for calculating so-called "Crout factorization". If you need to know more I think the concept is similar to what is described here. At first glance the code is a for loop that can simply become parallel by an omp directive:

SparseMtrx *Skyline :: factorized()
{
    // Returns the receiver in  U(transp).D.U  Crout factorization form.
        ...
        // this loop is very expensive and only uses single thread:
        for ( int k = 2; k <= n; k   ) {
            int ack = adr.at(k);
            int ack1 = adr.at(k   1);
            int acrk = k - ( ack1 - ack )   1;
            for ( int i = acrk   1; i < k; i   ) {
                int aci = adr.at(i);
                int aci1 = adr.at(i   1);
                int acri = i - ( aci1 - aci )   1;
                int ac;
                if ( acri < acrk ) {
                    ac = acrk;
                } else {
                    ac = acri;
                }
    
                int acj = k - ac   ack;
                int acj1 = k - i   ack;
                int acs = i - ac   aci;
                double s = 0.0;
                for ( int j = acj; j > acj1; j-- ) {
                    s  = mtrx [ j ] * mtrx [ acs ];
                    acs--;
                }
    
                mtrx [ acj1 ] -= s; //mtrx  here is the matrix values (shared)
            }
            double s = 0.0;
            for ( int i = ack1 - 1; i > ack; i-- ) {
                double g = mtrx [ i ];
                int acs = adr.at(acrk);
                acrk  ;
                mtrx [ i ] /= mtrx [ acs ];
                s  = mtrx [ i ] * g;
            }
    
            mtrx [ ack ] -= s;
        }
    ...

But the problem is it seems that each step uses the updated value of the matrix values, mtrx, to calculate the next factorized values.

My question is there any way to use multi-threading to do this calculation?

The complete code is available in github via this link.

CodePudding user response：

The Crout factorization is one variant of Gaussian elimination. You can characterize such algorithms by 1. their end-product 2. how they go about it. The Crout factorization computes LU where the diagonal of U is identity. Other factorizations normalize the diagonal of L, or they compute LDU with both L,U are normalized, or they compute LU with the diagonals of L & U equal. All that is beside the point for parallelization.

Next you can characterize Gaussian Elimination algorithms by how they do the computation. These are all mathematically equivalent re-organizations of the basic triply nested loop. Since they are mathematically equivalent, you can pick one or the other for favorable computational properties.

To get one thing out of the way: Gaussian Elimination has an intrinsically sequential component in the computation of the pivots, so you can not make it fully parallel. (Ok, some mumbling about determinants, but that's not realistic.) The outer loop is sequential with a number of steps equal to the matrix size. The question is whether you can make the inner loops parallel.

The Crout algorithm can be characterized as that in step "k" of the factorization it computes row "k" of U, and column "k" of L. Since the elements in both are not recursively related, these updates can be done in a parallel loop, which gives you a loop on "k" that is sequential, and for each k value two single loops that are parallel. That leaves the 3rd loop level: that one comes from the fact that each of the independent iterations involves a sum, therefore a reduction.

This does not sound great for parallelism: you can not collapse two loops if the inner is a reduction, so you have to choose which level to parallelize. Maybe you should use a different formulation of Gaussian Elimination. For instance, the normal algorithm is based on "rank-k updates", and those are very parallel. That algorithm has a single sequential loop, with in each step a collapse(2) parallel loop.