numba and slurm submission with multiple nodes

Time:07-24

I want to write a Python script containing many Numba njit-compiled functions with parallel=True, so that it uses all the cores I request on a cluster.

On the cluster, I can only request the number of cores I want to use, via #SBATCH -n no_of_cores_you_want.

At the moment, having something like:

#SBATCH -n 150
NUMBA_NUM_THREADS=100 python main.py

makes main.py report that numba.config.NUMBA_DEFAULT_NUM_THREADS=20 and numba.config.NUMBA_NUM_THREADS=100. My explanation is that, judging by its specs, each node of the cluster consists of 20 single-threaded cores.

How can I make main.py use all the cores the cluster gives me? I stress that main.py should be run only once, not multiple times; the aim is for that single run to use all the available cores (spread across multiple separate nodes). (NUMBA_NUM_THREADS is 100 because setting it to 150 triggers a Slurm error. It can probably be higher than 100, but it must be less than 150.)
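One way to see how Slurm actually laid out the allocation is to read the environment variables it exports inside the job. A diagnostic sketch; the variable names are standard Slurm ones, but which of them are set depends on how the job was submitted:

```python
import os

def slurm_layout():
    """Collect the Slurm variables that describe how an allocation
    is split across nodes (values are None outside a Slurm job)."""
    names = ("SLURM_JOB_NUM_NODES", "SLURM_NTASKS", "SLURM_CPUS_ON_NODE")
    return {name: os.environ.get(name) for name in names}

for name, value in slurm_layout().items():
    print(name, "=", value)
```

If SLURM_JOB_NUM_NODES is greater than 1, the allocation spans several nodes, which matters for the answer below.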

CodePudding user response:

A computing cluster is not just a bag of cores; it is far more complex. A modern mainstream cluster is basically a set of computing nodes interconnected by a network. Each node contains one or more microprocessors, each microprocessor has many cores (typically dozens nowadays), and each core can have multiple hardware threads (typically 2). Each node has its own memory, and a process cannot access the memory of a remote node (unless the hardware supports it or software abstracts it away). This is called distributed memory. The cores of a node share the same main memory. Note that in practice this access is generally non-uniform (NUMA): some cores often have faster access to some parts of main memory (and if you ignore this, your application can scale poorly).
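Because threads are confined to the shared memory of one node, the effective thread limit for a shared-memory runtime like Numba's is the core count of the node the process lands on, not the size of the whole Slurm allocation. A minimal sketch of that clamp (the function name is mine, not part of Numba or Slurm):

```python
import os

def threads_for_this_node(requested: int) -> int:
    """Clamp a thread-count request to what one node can offer.

    Threads share memory, so they can never span nodes: the usable
    limit is the cores of the node this process runs on, not the
    total number of cores Slurm allocated across all nodes.
    """
    cores_on_node = os.cpu_count() or 1
    return min(requested, cores_on_node)
```

On a 20-core node, threads_for_this_node(150) would return 20, which matches the NUMBA_DEFAULT_NUM_THREADS=20 seen in the question.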

This means you need to write a distributed application to use all the cores of a cluster. MPI is a good way to write such an application. Numba supports only shared memory, not distributed memory, so with Numba alone you can use only one computing node. Note that writing a distributed application is not trivial. Note also that you can mix MPI code with Numba.
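A hybrid MPI + Numba job would then typically run one MPI rank per node, with Numba threads filling that node's cores. A hedged sbatch sketch, assuming a typical Slurm setup and the 20-core nodes from the question; the script layout is illustrative, and main.py would use something like mpi4py to split the work across ranks:

```shell
#!/bin/bash
#SBATCH --nodes=8              # take whole nodes instead of a flat -n core count
#SBATCH --ntasks-per-node=1    # one MPI rank per node
#SBATCH --cpus-per-task=20     # all 20 cores of each node go to that rank

# Each rank spawns Numba threads only on its own node's cores;
# SLURM_CPUS_PER_TASK is exported by Slurm inside the job.
export NUMBA_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun python main.py
```

With this layout, Numba parallelizes within each node while MPI handles the communication between nodes.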

By the way, please consider optimizing your application before using multiple nodes. Not only is it often simpler, it is also cheaper, uses less energy, and keeps your application easier to maintain (debugging distributed applications is tricky).

Also note that using more threads than the cores available on a node causes over-subscription, which often results in severe performance degradation. If your application is well optimized, hardware threads should not improve performance and can even slow your application down.
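To avoid over-subscription, the thread count can be tied to what Slurm actually granted on the node. A sketch under two assumptions: NUMBA_NUM_THREADS only takes effect if set before Numba is first imported, and SLURM_CPUS_ON_NODE is the standard Slurm variable for the cores granted on the current node:

```python
import os

# Fall back to the machine's core count when running outside Slurm.
granted = os.environ.get("SLURM_CPUS_ON_NODE", str(os.cpu_count() or 1))

# setdefault keeps any value already chosen on the command line;
# this must run before the first `import numba` to have any effect.
os.environ.setdefault("NUMBA_NUM_THREADS", granted)
print("NUMBA_NUM_THREADS =", os.environ["NUMBA_NUM_THREADS"])
```

Placing this at the very top of main.py keeps the thread count per node at or below the node's core count, regardless of how large the overall allocation is.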
