I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code compiles without errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (i.e., produce speedup) according to Amdahl's law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size, start_val, end_val, upperLimit;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    upperLimit = 33;
    start_val = rank * (upperLimit / size) + 1;

    int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
    double start, end;

    // begin timing
    start = MPI_Wtime();

    // arbitrarily inflate the number of computations
    // to make the program take longer to compute
    // change to i < 1 for only 1 computation
    for (int i = 0; i < 10000000; i++) {
        // determine the range of Fibonacci positions each process handles
        if (rank == (size - 1)) {
            end_val = upperLimit;
        }
        else {
            end_val = start_val + (upperLimit / size) - 1;
        }

        /*
        Calculations done before writing this code indicate that it takes exactly 32 separate
        terms to reach the largest Fibonacci number before exceeding 4,000,000. This can be
        found with a simple computation, but that calculation is not included in this code.
        */
        long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;

        // calculate this process's starting Fibonacci values using the closed-form formula
        double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
        double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
        position11 = floor(position11);
        position12 = floor(position12);

        // assign a slice of the sequence to each process to generate a solution quickly
        if (rank == 0) {
            for (int i = start_val; i < end_val; i++) {
                SumNum = num1 + num2;
                num1 = num2;
                num2 = SumNum;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process 0 reports %d \n \n", SumNum);
                    //fflush(stdout);
                }
            }
        }
        else {
            for (int i = start_val; i < end_val; i++) {
                SumNum = position12 + position11;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process %d reports %d \n \n", rank, SumNum);
                    //fflush(stdout);
                }
                position11 = position12;
                position12 = SumNum;
            }
        }

        int recieve_buf = 0;
        MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            //printf("This is the final solution: %d \n \n", recieve_buf);
            //fflush(stdout);
        }
    }

    // end timing
    end = MPI_Wtime();

    double elapsed_time = end - start;
    printf("I am rank %d, and I report a walltime of %f seconds.\n", rank, elapsed_time);

    // end the MPI code
    MPI_Finalize();
    return 0;
}
Note that I run the outer for loop 10,000,000 times to intentionally increase the computation time.
I have also tried timing with time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, the computation time seems to increase as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
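(For reference, a common way to report a single walltime for an MPI run is to synchronize all ranks before starting the clock and then take the maximum elapsed time across ranks, so the slowest process defines the reported time. The sketch below only illustrates that pattern; do_work() is a hypothetical placeholder for the measured section and is not part of my actual code.)

#include <mpi.h>
#include <stdio.h>

// placeholder for whatever each rank computes (hypothetical, for illustration only)
static void do_work(void) { /* ... */ }

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);         // ensure every rank starts the clock together
    double start = MPI_Wtime();

    do_work();                           // the section being measured

    double local = MPI_Wtime() - start;  // per-rank elapsed time
    double slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Walltime (slowest rank): %f seconds\n", slowest);
    }

    MPI_Finalize();
    return 0;
}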
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again.
CodePudding user response:
This is a case of the granularity being too fine. Granularity is the amount of work between synchronization points relative to the cost of synchronization.
Let's say your MPI_Reduce takes a couple of microseconds. That's enough time to do a few thousand arithmetic operations. So for speedup to occur, you need many thousands of operations between reductions. You don't have that: the reduction sits inside the 10,000,000-iteration loop with very little work around it, so the runtime of your code is completely dominated by the cost of the MPI calls, and that cost does not go down with the number of processes.
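One way to see this is to restructure the code so that the reduction happens once, after all the repeated local work, instead of once per iteration. The sketch below is a rough illustration of that idea, not a drop-in replacement for your program; local_sum() is a hypothetical placeholder for the per-rank Fibonacci work.

#include <mpi.h>
#include <stdio.h>

// hypothetical placeholder for the per-rank work (e.g., summing even
// Fibonacci terms in this rank's slice); NOT the original function
static int local_sum(int rank, int size) {
    (void)rank; (void)size;
    return 0;
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int x = 0;
    // Do all the repeated local work first: no MPI calls inside this loop,
    // so the granularity (work per synchronization) is large.
    for (int i = 0; i < 10000000; i++) {
        x = local_sum(rank, size);
    }

    // Pay the synchronization cost exactly once.
    int total = 0;
    MPI_Reduce(&x, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Total: %d\n", total);

    MPI_Finalize();
    return 0;
}

With the reduction hoisted out of the loop, the synchronization cost is paid once (a few microseconds in total) rather than ten million times, so the measured time is dominated by the local work and can actually decrease as you add processes, up to the limits of the problem size.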