MPI with C slower if more processes are used

I am learning MPI with C and I wrote a program based on the one presented in this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml.

In this code a vector containing 1e8 values is summed. However, I am observing that the run time gets longer when more processes are used. The code is given below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which splits a vector and sends the pieces to other processes.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process; it only distributes the data and does not compute a partial sum.

Each process will calculate the partial sum of vector values and send it back to root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + the root process. For example: if x = 3, the vector will be split in two.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){
        // Root process: Define vector size, how to split vector and send information to workers
        vec_size = 1e8; // size of main vector

        for(i = 0; i < vec_size; i++){
            //vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector... Correct answer for total sum = 0.78539816339
            vec1[i] = pow(i,2)+1.0; // defining main vector...
            //printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if you wish to print the main vector
        }

        rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
        rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // splitting and sending the values
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
                start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
            }
            else{
                num_2_send = rows_per_proc;
                start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
            }

            ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the number of values going to each worker.
            ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
        }

        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);

    }
    else{
        // Workers: define which portion of the vector each one will sum
        ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the number of values the worker must expect.
        ierr = MPI_Recv(vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving main vector pieces.
        
        partial_sum = 0;
        for(i=0; i < num_2_recv; i++){
            //printf("Position %d from worker id %d: %f\n", i, my_id, vec2[i]); // uncomment if you wish to print position, id and value of the split vector
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
        
    }

    ierr = MPI_Finalize();
    return 0;

}

Note: Compile as


mpicc -o vector_sum vector_send.c -lm

and run as:

time mpirun -n x vector_sum 

with x = 2 and 5. You will see that with x=5 it takes more time to run.

Did I do something wrong? I did not expect it to be slower, since the summation of each chunk is independent. Or is it a matter of how the program sends the information to each process? It seems to me that the loops that send the data to each process are responsible for the longer run time.
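One way to check where the time actually goes is to bracket the send loop and the computation with MPI_Wtime(). The sketch below is not the original program: the vector size N, the message tag, and the buffer handling are placeholder choices just to illustrate the measurement; the idea is simply to print how long the root spends in MPI_Send versus how long each worker spends receiving and summing.

/* timing sketch: separate communication time from computation time */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000  /* illustration size, not the original 1e8 */

int main(int argc, char *argv[]){
    int my_id, num_proc;
    double *buf = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

    if(my_id == 0){
        for(int i = 0; i < N; i++) buf[i] = (double)i; /* fill some data */

        double t0 = MPI_Wtime();
        for(int an_id = 1; an_id < num_proc; an_id++)
            MPI_Send(buf, N, MPI_DOUBLE, an_id, 0, MPI_COMM_WORLD); /* distribution phase */
        double t1 = MPI_Wtime();
        printf("Root spent %f s just sending data.\n", t1 - t0);
    }
    else{
        double t0 = MPI_Wtime();
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* receive phase */
        double t1 = MPI_Wtime();

        double sum = 0.0;
        for(int i = 0; i < N; i++) sum += buf[i]; /* compute phase */
        double t2 = MPI_Wtime();
        printf("Rank %d: recv %f s, compute %f s (sum %f)\n", my_id, t1 - t0, t2 - t1, sum);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

If the receive and send times dominate the compute time, the distribution loop is indeed what grows with the number of processes.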

CodePudding user response:

As suggested by Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet): I modified the code to generate the vector pieces in each process instead of passing them from the root process. It worked! Now the elapsed time is smaller for more processes. I am posting the new code below:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 1.
Process id 0 is the root process; it only distributes the information and does not compute a partial sum.

Each process will generate and calculate the partial sum of the vector values and send it back to the root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + the root process. For example: if x = 3, the vector will be split in two.

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){

        vec_size = 1e8; // defining main vector size

        rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
        rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // defining the number of data and position corresponding to main vector
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_gen = rows_per_proc + leftover; // counting the amount of data to be generated.
                start_point = (an_id - 1)*num_2_gen; // defining corresponding initial position in the main vector.
            }
            else{
                num_2_gen = rows_per_proc;
                start_point = (an_id - 1)*num_2_gen + leftover; // defining corresponding initial position in the main vector for other processes if there is leftover.
            }

            ierr = MPI_Send(&num_2_gen, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the number of values each worker must generate.
            ierr = MPI_Send(&start_point, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the initial position in the main vector.
        }
        
        
        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);
        
    }
    else{
        ierr = MPI_Recv(&num_2_gen, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the number of values the worker must generate.
        ierr = MPI_Recv(&start_point, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receiving the initial position.

        // generate and sum vector pieces
        partial_sum = 0;
        for(i = start_point; i < start_point + num_2_gen; i++){
            vec2[i] = pow(i,2)+1.0;
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
               
    }

    ierr = MPI_Finalize();
    return 0;
    
}

In this new version, instead of passing the pieces of the main vector, only the information needed to generate each piece is passed to the corresponding process.
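If the data itself ever does need to travel (for example, if only the root can read the vector from a file), a collective such as MPI_Scatterv() typically replaces the loop of point-to-point sends with a single call. The sketch below is only an illustration of that pattern, with a much smaller vector and a simplified leftover scheme; it is not part of the code above.

/* sketch: distribute uneven chunks with MPI_Scatterv instead of a send loop */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define VEC_LEN 1000000  /* illustration size, not the original 1e8 */

int main(int argc, char *argv[]){
    int my_id, num_proc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

    int base = VEC_LEN / num_proc;
    int leftover = VEC_LEN - base * num_proc;

    int *counts = malloc(num_proc * sizeof(int));
    int *displs = malloc(num_proc * sizeof(int));
    for(int r = 0; r < num_proc; r++){
        counts[r] = base + (r == 0 ? leftover : 0); /* rank 0 absorbs the remainder */
        displs[r] = (r == 0) ? 0 : leftover + r * base;
    }

    double *vec1 = NULL;
    if(my_id == 0){ /* only the root owns the full vector */
        vec1 = malloc(VEC_LEN * sizeof(double));
        for(int i = 0; i < VEC_LEN; i++) vec1[i] = (double)i * i + 1.0;
    }

    double *chunk = malloc(counts[my_id] * sizeof(double));
    MPI_Scatterv(vec1, counts, displs, MPI_DOUBLE,
                 chunk, counts[my_id], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double partial_sum = 0.0, sum = 0.0;
    for(int i = 0; i < counts[my_id]; i++) partial_sum += chunk[i];

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if(my_id == 0) printf("Total sum = %f\n", sum);

    free(chunk); free(counts); free(displs);
    if(my_id == 0) free(vec1);
    MPI_Finalize();
    return 0;
}

Note that the whole vector still has to cross the network once, so for this particular problem generating the data locally (as above) remains the cheaper option.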

CodePudding user response:

The new code using MPI_Reduce() is faster and simpler than the previous one:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculates the sum of a vector using parallel computation.
If the main vector does not split equally among all processes, the leftover is passed to process id 0.
Process id 0 is the root process. However, it also performs part of the calculations.

Each process will generate and calculate the partial sum of its vector values. MPI_Reduce() is used to calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum

x = number of processes. For example: if x = 3, the vector will be split in three (the root also computes a part).

Acknowledgements: I would like to thank Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process;
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    vec_size = 1e8; // defining the main vector size
    
    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    rows_per_proc = vec_size/num_proc; // getting the number of elements for each process.
    rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
    leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.

    if(my_id == 0){
        num_2_gen = rows_per_proc + leftover; // if there is leftover, it is calculated in process 0
        start_point = my_id*num_2_gen; // the corresponding position on the main vector
    }
    else{
        num_2_gen = rows_per_proc;
        start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
    }

    partial_sum = 0;
    for(i = start_point; i < start_point + num_2_gen; i++){
        vec2[i] = pow(i,2) + 1.0; // defining vector values
        partial_sum += vec2[i]; // calculating partial sum
    }

    printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum

    if(my_id == root_process){
        printf("Total sum is %f.\n", sum);
    }

    ierr = MPI_Finalize();
    return 0;
    
}
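One caveat about the version above: vec2 is still declared as a full vec_len (1e8-element) global array on every rank, even though each rank only touches its own slice. A possible variant, just a sketch that keeps the same partitioning and formula but uses illustrative names, allocates only the local chunk on the heap:

/* sketch: same MPI_Reduce approach, but each rank allocates only its own chunk */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    int my_id, num_proc;
    int vec_size = 100000000; // total number of elements

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    int rows_per_proc = vec_size / num_proc;
    int leftover = vec_size - num_proc * rows_per_proc;

    // rank 0 absorbs the leftover, exactly as in the code above
    int num_2_gen   = rows_per_proc + (my_id == 0 ? leftover : 0);
    int start_point = (my_id == 0) ? 0 : my_id * rows_per_proc + leftover;

    double *chunk = malloc(num_2_gen * sizeof(double)); // only the local slice

    double partial_sum = 0.0, sum = 0.0;
    for(int i = 0; i < num_2_gen; i++){
        int j = start_point + i;          // global index of this element
        chunk[i] = pow(j, 2) + 1.0;
        partial_sum += chunk[i];
    }

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if(my_id == 0) printf("Total sum = %f\n", sum);

    free(chunk);
    MPI_Finalize();
    return 0;
}

It compiles and runs the same way as before (mpicc ... -lm, then time mpirun -n x ...); the result should match the version above, and the only difference is the per-rank memory footprint.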