I am trying to run the following example MPI code that launches 20 threads and keeps those threads busy for a while. However, when I check the CPU utilization using a tool like nmon or top I see that only a single thread is being used.
#include <cstdlib>
#include <iostream>
#include <thread>
#include <mpi.h>
using namespace std;

int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result = 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; i++)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code using mpicxx (mpicxx -std=c++11 -pthread example.cpp -o example) and run it using mpirun (mpirun -np 1 example).
I am using Open MPI version 4.1.4 that is compiled with posix thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 threads and runs the example code above without MPI on all 20 threads. So, why does the code with MPI not run on all threads?
I suspect I might need to do something with MPI bindings, which I see being mentioned in some answers on the same topic (1, 2), but other answers entirely exclude these options, so I'm unsure whether this is the correct approach.
CodePudding user response:
mpirun -np 1 ./example assigns a single core to your program, so the 20 threads end up time-sharing it. This is the default behavior for Open MPI: each MPI process is bound to one core when running with -np 1 or -np 2.
./example (i.e. singleton mode) should use all the available cores, unless you are already running on a subset.
If you want to use all the available cores with mpirun, you can disable binding:
mpirun --bind-to none -np 1 ./example
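To see the effect of the binding from inside the program, one option is to read the process's CPU affinity mask. The following is a minimal Linux-only sketch, assuming glibc's sched_getaffinity is available; allowed_cpu_count and report_affinity are hypothetical helper names, not part of MPI or of the original code.

```cpp
// Linux-only sketch: count the CPUs this process is allowed to run on.
#include <sched.h>    // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (glibc)
#include <iostream>
#include <thread>

// Hypothetical helper: returns the number of logical CPUs in the calling
// thread's affinity mask, or -1 if the mask could not be read.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)  // pid 0 = calling thread
        return -1;
    return CPU_COUNT(&mask);
}

// Hypothetical helper: prints the affinity count next to the value the
// C++ runtime reports, so the two can be compared.
void report_affinity() {
    std::cout << "CPUs allowed by affinity mask: " << allowed_cpu_count() << '\n'
              << "hardware_concurrency():        "
              << std::thread::hardware_concurrency() << std::endl;
}
```

Calling report_affinity() from main before the threads are created should show a small count under mpirun -np 1 (the hardware threads of the single bound core) and the full count with --bind-to none or in singleton mode. Threads created later inherit the mask of the thread that creates them. Open MPI's mpirun also accepts --report-bindings, which prints the binding it applied to each process.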