Solution of sparse linear system on GPU, paper from nvidia


I am reading through an NVIDIA paper about solving sparse triangular linear systems on the GPU. I got stuck on the construction of the chainPtrHost data structure. I understand what it does, but not how it is constructed. The paper mentions that it depends on the available parallelism across the levels, but how that is determined does not seem to be explained.

Does anyone know how it can be constructed?

Paper for reference: Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU

CodePudding user response:

My understanding is that one just iterates over the levels sequentially on the host, as in the following, very basic C++ snippet:

#include <vector>

// levels is a CSR-style level pointer array: level i contains rows
// levels[i] to levels[i + 1] - 1, so levels.size() == numLevels + 1.
std::vector<int> create_chains(std::vector<int> const &levels, int threshold) {
    std::vector<int> chains;
    chains.reserve(levels.size());

    bool no_open_chain = true;

    for (std::size_t i = 0; i < levels.size() - 1; ++i) {
        int const num_rows = levels[i + 1] - levels[i];
        bool const new_single_link_chain = (num_rows > threshold);

        if (no_open_chain || new_single_link_chain) { // new chain starts here
            chains.push_back(static_cast<int>(i) + 1); // 1-based index
        }

        no_open_chain = new_single_link_chain;
    }
    chains.push_back(static_cast<int>(levels.size())); // end of last chain

    return chains;
}

The right value for threshold depends on the implementation of the single-block solver kernel. If each thread operates on one row and there is no restriction on the block size for occupancy/shared memory/etc., it could be 1024, i.e. the maximum number of threads (rows) per block.
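To illustrate how the result is laid out, here is a hypothetical call with a made-up levels array and a threshold of 1024 (assuming the one-thread-per-row kernel described above); the row counts are chosen only to show when chains are merged or split:

// Hypothetical example: 5 levels with 2000, 10, 5, 8 and 3000 rows
// (levels is the cumulative row-offset array used by create_chains above).
std::vector<int> levels = {0, 2000, 2010, 2015, 2023, 5023};
std::vector<int> chains = create_chains(levels, 1024);
// chains == {1, 2, 5, 6} (1-based, end exclusive):
//   chain 1: level 1 alone    (2000 rows > threshold, multi-block kernel)
//   chain 2: levels 2 to 4    (10, 5, 8 rows, merged into one single-block launch)
//   chain 3: level 5 alone    (3000 rows > threshold, multi-block kernel)

Levels that exceed the threshold become single-link chains processed by a multi-block kernel, while consecutive small levels are merged into one chain so they can be handled in a single kernel launch.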
