I'm designing a very compute-heavy algorithm, and I'm more constrained by time than by access to remote machines on which to run it.
My question is the following:
Let's say each machine I have access to has 24 cores, and I have 48 tasks to run. Currently, I'm dispatching the algorithm to two machines, each of which uses its 24 cores to handle 24 of the tasks.
If I instead dispatched the same process to 4 machines, each spawning 12 threads, would the tasks (likely) complete more quickly? I'm curious whether leaving some cores idle on a machine makes the remaining computations run faster than when every single core is occupied by its own thread.
CodePudding user response:
This is highly dependent on the actual algorithm, the actual dataset, and the target hardware, including the interconnection network if the tasks communicate or the input/output data are large (or if each task runs very quickly). Some applications scale better on many machines with few cores, and some scale better on few machines with many cores. In high-performance computing, researchers have worked for decades to understand the performance of hybrid applications, and there is no general answer: it depends. (Note that the question is already quite hard to answer for a single well-defined application with a well-defined dataset; hard enough that people write research papers on it.)
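Since it depends, the most reliable approach is to measure: time your real tasks at a few worker counts on one machine and see where the speedup flattens. Here is a minimal sketch in Python, assuming each task is an independent, CPU-bound function; `run_task` is a placeholder standing in for your actual workload:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def run_task(task_id):
    # Placeholder for your actual compute-heavy task.
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

def time_workers(n_tasks, n_workers):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(run_task, range(n_tasks)))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Compare e.g. 12 vs 24 workers for 24 tasks on one 24-core machine.
    # If 24 tasks on 12 workers finish in less than twice the time of
    # 12 tasks on 12 workers, packing all the cores is paying off; if
    # not, spreading the tasks over 4 machines should win.
    for n_workers in (6, 12, 24):
        elapsed = time_workers(24, n_workers)
        print(f"{n_workers:>2} workers: {elapsed:.2f}s")
```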
If your tasks are memory-bound, then using more machines with fewer cores each is often better. If the amount of transferred data is large, or the algorithm requires low latency, then using fewer machines is often better (typically one big SMP). There are many other things to consider, since machines are not just a bag of cores. NUMA effects should be considered, for example, as well as caches, the storage subsystem, and even the OS (not every OS subsystem scales well on a given machine).
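One quick way to check whether your tasks are memory-bound on a given machine is a strong-scaling test with a memory-streaming kernel: if the total time grows noticeably as you add workers doing identical work, the cores are contending for memory bandwidth, and spreading the tasks across more machines should help. A rough sketch, assuming NumPy is available (the array size and kernel are arbitrary stand-ins, not your workload):

```python
import time
import numpy as np
from multiprocessing import Pool

N = 50_000_000  # ~400 MB of float64 per worker; adjust to your RAM

def stream_sum(_):
    # Writes then reads a large array: dominated by memory bandwidth,
    # not by compute.
    a = np.ones(N)
    return a.sum()

def time_pool(n_workers):
    start = time.perf_counter()
    with Pool(n_workers) as pool:
        pool.map(stream_sum, range(n_workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    base = time_pool(1)
    for n in (2, 4, 8, 16, 24):
        t = time_pool(n)
        # Each worker does the same amount of work, so perfect scaling
        # keeps the elapsed time flat; a rising ratio means the workers
        # are fighting over memory bandwidth.
        print(f"{n:>2} workers: {t:.2f}s ({t / base:.2f}x the 1-worker time)")
```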