I'm trying to write code for the serial and parallel LDLT decomposition algorithms in C++ using MPI. Here's the code:
#include <iostream>
#include <mpi.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdlib>   // rand, srand, atoi
#include <ctime>     // time
namespace para
{
double solveSym(int n, double* a, double* b)
{
    int nproc, myid;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    int i = myid;   // each rank handles row/column i = myid
    if (myid != 0)
    {
        // receive the updated diagonal entry from the previous rank
        double data;
        MPI_Request rreq;
        MPI_Irecv((void*)&data, 1, MPI_DOUBLE, myid - 1, 0, MPI_COMM_WORLD, &rreq);
        MPI_Status st;
        MPI_Wait(&rreq, &st);
        a[i * n + i] = data;
    }
    double invp = 1.0 / a[i * n + i];
    int send_size = 0;
    int recv_size = 1;
    for (int j = i + 1; j < n; j++)
    {
        send_size++;
        recv_size++;
        if (myid != 0)
        {
            // receive the partially updated segment of row j from the previous rank
            double* data = new double[j];
            MPI_Request rreq;
            MPI_Irecv((void*)data, recv_size, MPI_DOUBLE, myid - 1, 0, MPI_COMM_WORLD, &rreq);
            MPI_Status st;
            MPI_Wait(&rreq, &st);
            for (int k = 0; k < recv_size; k++)
                a[j * n + (myid + k)] = data[k];
        }
        double aji = a[j * n + i];
        a[j * n + i] *= invp;
        for (int k = i + 1; k <= j; k++)
            a[j * n + k] -= aji * a[k * n + i];
        if (myid != nproc - 1)
        {
            // forward the updated part of row j to the next rank
            MPI_Request sreq;
            double* send_data = new double[send_size];
            for (int k = 0; k < send_size; k++)
                send_data[k] = a[j * n + (i + 1 + k)];
            MPI_Isend((void*)send_data, send_size, MPI_DOUBLE, myid + 1, 0, MPI_COMM_WORLD, &sreq);
            MPI_Status st;
            MPI_Wait(&sreq, &st);
        }
    }
    return 0;
}
}
namespace seq
{
void symMatVec(int n, double* a, double* x, double* y)
{
    int i, j;
    for (i = 0; i < n; i++)
    {
        double t = 0.0;
        for (j = 0; j <= i; j++)
            t += a[i * n + j] * x[j];
        for (j = i + 1; j < n; j++)
            t += a[j * n + i] * x[j];
        y[i] = t;
    }
}
void solveSym(int n, double* a, double* x, double* b)
{
    for (int i = 0; i < n; i++)
    {
        double invp = 1.0 / a[i * n + i];
        for (int j = i + 1; j < n; j++)
        {
            double aji = a[j * n + i];
            a[j * n + i] *= invp;
            for (int k = i + 1; k <= j; k++)
                a[j * n + k] -= aji * a[k * n + i];
        }
    }
    for (int i = 0; i < n; i++)
    {
        double t = b[i];
        for (int j = 0; j < i; j++)
            t -= a[i * n + j] * x[j];
        x[i] = t;
    }
    for (int i = n - 1; i >= 0; i--)
    {
        double t = x[i] / a[i * n + i];
        for (int j = i + 1; j < n; j++)
            t -= a[j * n + i] * x[j];
        x[i] = t;
    }
}
}
int main(int argc, char** argv)
{
    srand((unsigned)time(NULL));
    MPI_Init(&argc, &argv);
    int nproc, myid;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    int n = nproc;
    double* a = new double[n * n];
    assert(a != NULL);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++)
            a[i * n + j] = rand() / (RAND_MAX + 1.0);
    for (int i = 0; i < n; i++)
    {
        double s = 0.0;
        for (int j = 0; j < i; j++)
            s += a[i * n + j];
        for (int j = i + 1; j < n; j++)
            s += a[j * n + i];
        a[i * n + i] = s + 1.0;
    }
    double start, end;
    double* xx = new double[n];
    assert(xx != NULL);
    for (int i = 0; i < n; i++)
        xx[i] = 1.0;
    double* b = new double[n];
    assert(b != NULL);
    seq::symMatVec(n, a, xx, b);
    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    double x = para::solveSym(n, a, b);
    MPI_Barrier(MPI_COMM_WORLD);
    end = MPI_Wtime();
    double* output = new double[n];
    MPI_Gather((void*)&x, 1, MPI_DOUBLE, (void*)output, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (myid == 0)
    {
        std::cout << "processors num = " << nproc << " execution time = " << (end - start) / 1000.0 << " seconds" << std::endl;
    }
    MPI_Finalize();
    return 0;
}
When I execute this code (4 processes, 100x100 matrix) with:
mpiexec -np 4 LDLT 100
I get strange results. For example, with a 100x100 matrix the execution time is 1.2e-9 seconds on 1 process, 5.48e-9 seconds on 2 processes, and 5.55e-9 seconds on 4 processes.
Why do I get such results? What's wrong with this code? Help me to correct it. Thanks!
EDIT: I made some changes according to your suggestions, and the execution time improved (it is no longer that tiny), but the problem is still the same: I changed the matrix size to 1000x1000, and with 1 process the execution time is 0.0016 seconds, while with 2 processes it takes 0.014 seconds. Here's the code of the main() function:
int main(int argc, char** argv)
{
    srand((unsigned)time(NULL));
    MPI_Init(&argc, &argv);
    int nproc, myid;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    int n = atoi(argv[1]);
    double* a = new double[n * n];
    assert(a != NULL);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++)
            a[i * n + j] = rand() / (RAND_MAX + 1.0);
    for (int i = 0; i < n; i++)
    {
        double s = 0.0;
        for (int j = 0; j < i; j++)
            s += a[i * n + j];
        for (int j = i + 1; j < n; j++)
            s += a[j * n + i];
        a[i * n + i] = s + 1.0;
    }
    double start, end;
    double* xx = new double[n];
    assert(xx != NULL);
    for (int i = 0; i < n; i++)
        xx[i] = 1.0;
    double* b = new double[n];
    assert(b != NULL);
    start = MPI_Wtime();
    if (nproc == 1)
    {
        seq::symMatVec(n, a, xx, b);
        end = MPI_Wtime();
        std::cout << "processors num = " << nproc << " execution time = " << (end - start) << " seconds" << std::endl;
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
    }
    else
    {
        double x = para::solveSym(n, a, b);
        double* output = new double[n];
        MPI_Gather((void*)&x, 1, MPI_DOUBLE, (void*)output, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (myid == 0)
        {
            end = MPI_Wtime();
            std::cout << "processors num = " << nproc << " execution time = " << (end - start) << " seconds" << std::endl;
        }
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
    }
    return 0;
}
CodePudding user response:
Probably this happens because the problem is so small (an execution time of 1.2e-9 s) that the overhead of the MPI calls takes far more time than the actual computation. Try with much larger matrices and share the results.
CodePudding user response:
Firstly, the problem size could be an issue. The time spent on MPI communication (though MPI_Isend and MPI_Irecv are used, they still behave like MPI_Send and MPI_Recv because of the MPI_Wait immediately after each call, i.e. there is no communication/computation overlap) and on synchronisation (the MPI_Barrier before the timing calculation) might exceed the actual compute time.
On the synchronisation aspect, your approach to measuring time is flawed (in my opinion). Ideally, you should measure the time taken by the call to the solveSym function like this:
start = MPI_Wtime();
double x = para::solveSym(n, a, b);
end = MPI_Wtime();
MPI_Barrier(MPI_COMM_WORLD);
Putting a barrier inside the timed region will skew your results. Each process should then compute the time it spent in the function, and an MPI_Reduce can be used to calculate the average (or maximum) across ranks to get correct timing information. Currently, you are only printing the timing from one single process.
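For example, a minimal sketch of that pattern (the variable names t0, t_local, t_sum and t_max are illustrative, not taken from your code):

    double t0 = MPI_Wtime();
    double x = para::solveSym(n, a, b);
    double t_local = MPI_Wtime() - t0;      // time spent by this rank only

    double t_sum = 0.0, t_max = 0.0;
    MPI_Reduce(&t_local, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (myid == 0)
        std::cout << "avg time = " << t_sum / nproc
                  << " s, max time = " << t_max << " s" << std::endl;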
Suggestion (as pointed out by davideperrone in his answer): you should increase the problem size and run the code again. If the behaviour persists, you should change the MPI implementation of your code (by overlapping computation with communication if possible), as sketched below.
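A minimal sketch of what such overlap looks like in general (this is not a drop-in change to your solveSym; buf, count, do_independent_work() and use_received_data() are placeholders):

    MPI_Request rreq;
    MPI_Irecv(buf, count, MPI_DOUBLE, myid - 1, 0, MPI_COMM_WORLD, &rreq);

    do_independent_work();                 // computation that does not touch buf

    MPI_Wait(&rreq, MPI_STATUS_IGNORE);    // block only when buf is actually needed
    use_received_data(buf);                // now it is safe to read buf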
If I am not mistaken, your solveSym function always returns 0, and you are using MPI_Gather to collect that value.
From a quick glance at the code, the implementation looks very inefficient: every process holds all the data (each MPI process allocates the full n*n array, and so on). In my opinion, you can improve your code a lot.
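For instance, a minimal sketch of a row-block distribution, assuming n is divisible by nproc and that only rank 0 fills the full matrix (this only illustrates the data layout, not the factorisation itself):

    int rows = n / nproc;                      // rows owned by each rank
    double* a_local = new double[rows * n];    // each rank stores only its block

    // rank 0 holds the full matrix in 'a'; every rank receives its own rows
    MPI_Scatter(a, rows * n, MPI_DOUBLE,
                a_local, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // the factorisation would then work on a_local, exchanging only the
    // row segments that neighbouring ranks actually need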