I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code compiles without errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (i.e., produce speedup) according to Amdahl's law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size, start_val, end_val, upperLimit;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    upperLimit = 33;
    start_val = rank * (upperLimit / size) + 1;

    int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
    double start, end;

    // begin timing
    start = MPI_Wtime();

    // arbitrarily inflate the number of computations
    // to make the program take longer to compute
    // change to i < 1 for only 1 computation
    for (int i = 0; i < 10000000; i++) {
        // determine the range of Fibonacci positions each process handles
        if (rank == (size - 1)) {
            end_val = upperLimit;
        }
        else {
            end_val = start_val + (upperLimit / size) - 1;
        }

        /*
        Calculations done before writing this code indicate that it takes exactly 32 separate
        terms to reach the largest Fibonacci number before exceeding 4,000,000. This can be
        found with a simple computation, but that calculation is not included in this code.
        */
        long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;

        // calculate this process's starting Fibonacci values using the closed-form formula
        double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
        double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
        position11 = floor(position11);
        position12 = floor(position12);

        // assign a slice of the sequence to each process to generate a solution quickly
        if (rank == 0) {
            for (int i = start_val; i < end_val; i++) {
                SumNum = num1 + num2;
                num1 = num2;
                num2 = SumNum;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process 0 reports %d \n \n", SumNum);
                    //fflush(stdout);
                }
            }
        }
        else {
            for (int i = start_val; i < end_val; i++) {
                SumNum = position12 + position11;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process %d reports %d \n \n", rank, SumNum);
                    //fflush(stdout);
                }
                position11 = position12;
                position12 = SumNum;
            }
        }

        int recieve_buf = 0;
        MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            //printf("This is the final solution: %d \n \n", recieve_buf);
            //fflush(stdout);
        }
    }

    // end timing
    end = MPI_Wtime();

    double elapsed_time = end - start;
    printf("I am rank %d, and I report a walltime of %f seconds.\n", rank, elapsed_time);

    // end the MPI code
    MPI_Finalize();
    return 0;
}
Note that I run the outer for loop 10,000,000 times to intentionally increase the computation time.
I have also tried timing with time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, the computation time seems to increase as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
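(For reference, a common way to report a single walltime for an MPI run is to synchronize all ranks before starting the clock and then take the maximum elapsed time across ranks, so the slowest process defines the reported time. The sketch below only illustrates that pattern; do_work() is a hypothetical placeholder for the measured section and is not part of my actual code.)

#include <mpi.h>
#include <stdio.h>

// placeholder for whatever each rank computes (hypothetical, for illustration only)
static void do_work(void) { /* ... */ }

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);         // ensure every rank starts the clock together
    double start = MPI_Wtime();

    do_work();                           // the section being measured

    double local = MPI_Wtime() - start;  // per-rank elapsed time
    double slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Walltime (slowest rank): %f seconds\n", slowest);
    }

    MPI_Finalize();
    return 0;
}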
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again.
CodePudding user response:
This is a case of the granularity being too fine. Granularity is the amount of work between synchronization points relative to the cost of synchronization.
Let's say your MPI_Reduce takes a couple of microseconds. That's enough time to do a few thousand arithmetic operations. So for speedup to occur, you need many thousands of operations between reductions. You don't have that: the reduction sits inside the 10,000,000-iteration loop with very little work around it, so the runtime of your code is completely dominated by the cost of the MPI calls, and that cost does not go down with the number of processes.
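One way to see this is to restructure the code so that the reduction happens once, after all the repeated local work, instead of once per iteration. The sketch below is a rough illustration of that idea, not a drop-in replacement for your program; local_sum() is a hypothetical placeholder for the per-rank Fibonacci work.

#include <mpi.h>
#include <stdio.h>

// hypothetical placeholder for the per-rank work (e.g., summing even
// Fibonacci terms in this rank's slice); NOT the original function
static int local_sum(int rank, int size) {
    (void)rank; (void)size;
    return 0;
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int x = 0;
    // Do all the repeated local work first: no MPI calls inside this loop,
    // so the granularity (work per synchronization) is large.
    for (int i = 0; i < 10000000; i++) {
        x = local_sum(rank, size);
    }

    // Pay the synchronization cost exactly once.
    int total = 0;
    MPI_Reduce(&x, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Total: %d\n", total);

    MPI_Finalize();
    return 0;
}

With the reduction hoisted out of the loop, the synchronization cost is paid once (a few microseconds in total) rather than ten million times, so the measured time is dominated by the local work and can actually decrease as you add processes, up to the limits of the problem size.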