Initialising two large 2D array of 401X401 and do fast matrix multiplication in C-CodePudding

I want to initialise two 2D matrices of size 401X401 and multiply them in a speedy way. But most probably due to stack overflow, two double 2D matrices were not initialised as stated in this question: No output after new arrays being initialized in C .

After following suggestions, I used vectors of vectors to store my 2D matrix. But I wanted to do fast matrix multiplication as time is a significant factor for me. There were suggestions not to use vectors of vectors for this purpose: How can we speedup matrix multiplication where matrices are initialized using vectors of vectors (2D vector) in C .

Maybe I can convert again to array, but I feel again there would be StackOverflow!! How can I do both the task of initialising two large matrices and matrix multiplication is also fast? If I stick to vectors of vectors, there is no way to use abilities of inbuilt libraries such as MKL etc.

CodePudding user response：

You can get the memory on the heap via new. Repeated use will not really help here as new is not a fast operation. You can skirt this problem, by getting one large block of memory once and pretending it's 2D. You can do this like:

double* raw_data = new double[401*401];
double* matrix[401];
for(size_t i = 0; i < 401;   i)
    matrix[i] = raw_data   401*i;
\\ Do_stuff
delete[] raw_data

You could do similar things with memory managed by vector I think, or something with smart pointers as well. This will of course not be as fast as allocating on the stack, but it will be passable.

With this approach you'll have to be careful of memory leaks as well as remembering that as soon as raw_data is returned matrix gets invalidated as well. Some of these problems are addressed via using some smart containers for raw_data and others you have to keep in mind.

CodePudding user response：

This can be sped up considerably.

First, to eliminate the stack usage issue declare the matrix like so:

std::vector<std::array<double,401>> matrix(401);

This achieves cache coherence since all the matrix memory is contiguous.

Next, transpose one of the matrixes. This will make the multiplication proceed in sequential memory accesses.

And finally, this lends itself to further speed improvements by multithreading. For instance create 4 threads that each process 100,100,100,101 rows. No thread sync required. since all writes are specific to each thread. Just join them all at the end and you're done.