On a computer with an Intel CPU marketed as "6 cores / 12 threads", I want to run as many processes as possible, each doing similar math computations (each process has a single thread) on different input data. There is no GPU involved, and no inter-process communication is needed.
What is the optimal number of parallel processes of the same executable doing math computations?
Should I run 6 processes (one per physical core)? Or 12 processes (one per thread / virtual core)?
If one process does, say, 1000 computations per second, I'm pretty sure that running 6 of them will run at ~1000/sec each (so a total of ~6000/sec).
But won't running 12 processes make them only 500 computations per second each?
TL;DR: should I run one process per "core" or one process per "thread" on a "6 cores/12 threads Intel CPU"?
CodePudding user response:
It depends heavily on the actual computing code. Some applications benefit from hyper-threading while others do not. High-performance applications rarely benefit from hyper-threading, so using 1 process per core is most likely the best configuration, assuming the code is compute-bound and scales well.
The hyper-threads of a core on recent Intel processors (e.g. Skylake/Ice Lake) share that core's execution ports. As a result, the overall execution can be faster with hyper-threading if one process alone is not able to saturate the ports. In practice, this is a bit more complex (modern processors are very complex), since compute-bound processes can be limited by other parts of the processor, such as instruction decoding or trickier low-level units.
For example, the following C code should benefit from hyper-threading (assuming no fast-math optimizations are applied and the code is compiled with optimizations):
float sum = 0.f;
for(int i=0 ; i<maxi ; ++i)
    sum += array[i];
Indeed, the latency of a floating-point addition instruction is 3 to 4 cycles, while generally 2 of them can be executed per cycle (only 1 before Skylake). This means the code is bound by the latency of the chain of dependent additions. While one hyper-thread waits, the other can use the idle execution ports, resulting in up to twice faster execution (other bottlenecks keep it from being that fast in practice). If the code is compiled with fast-math optimizations, compilers can unroll the loop with several accumulators and make use of instruction-level parallelism (ILP), so a single thread already keeps the ports busier. A low number of instructions per cycle (IPC) often means that hyper-threading may be beneficial, especially if the low IPC is caused by latency issues (e.g. instruction latency and cache misses). Unfortunately, this is not always true. For example, the following code should not be faster with hyper-threading:
for(int i=0 ; i<maxi ; ++i)
    out_array[i] = in_array[i];
This is because there is generally 1 store execution port on Intel processors, and it should already be saturated by 1 hyper-thread (otherwise the code is bound by memory throughput, which hyper-threading does not help with either). Thus, using more hyper-threads should not improve the execution time. In fact, hyper-threading introduces a slight overhead that should make the execution slightly slower.
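To make the latency-bound case from the summation example more concrete, here is a minimal sketch of what unrolling with independent accumulators looks like when written by hand. The function names `sum_single` and `sum_unrolled4` and the choice of four accumulators are assumptions made for illustration; a compiler with fast-math enabled may pick a different factor and will usually vectorize as well.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition depends on the previous one,
   so the loop is bound by the latency of the addition. */
float sum_single(const float *array, size_t n) {
    assert(array != NULL || n == 0);
    float sum = 0.f;
    for (size_t i = 0; i < n; ++i)
        sum += array[i];
    return sum;
}

/* Four independent accumulators (hypothetical unroll factor):
   four dependency chains can be in flight at the same time. */
float sum_unrolled4(const float *array, size_t n) {
    assert(array != NULL || n == 0);
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += array[i];
        s1 += array[i + 1];
        s2 += array[i + 2];
        s3 += array[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)          /* leftover elements */
        sum += array[i];
    return sum;
}
```

The second version changes the order of the additions, so its result can differ slightly from the first one for general floating-point input. That reordering is exactly what a compiler is not allowed to do without fast-math. A single thread running the unrolled version leaves fewer idle ports, so it would be expected to gain less from a second hyper-thread.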
The thing is, applications are generally much more complex than that, and one does not know how math functions are implemented. As a result, it is nearly impossible for a developer to know the best configuration without a basic benchmark, unless the computing kernel is simple.
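A basic benchmark of this kind could look roughly like the sketch below. It assumes a POSIX system (`fork`, `wait`, `clock_gettime`). The `kernel` function is a hypothetical placeholder for the real computation, and `run_parallel` and `now_seconds` are names invented here for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the real math kernel: replace with your own code. */
static double kernel(long iterations) {
    double x = 0.0;
    for (long i = 1; i <= iterations; ++i)
        x += 1.0 / (double)i;
    return x;
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Forks nprocs single-threaded workers, waits for all of them, and returns
   the combined throughput in kernel iterations per second (-1.0 on error). */
double run_parallel(int nprocs, long iterations) {
    assert(nprocs > 0 && iterations > 0);
    double start = now_seconds();
    int launched = 0;
    for (; launched < nprocs; ++launched) {
        pid_t pid = fork();
        if (pid < 0)
            break;                                      /* fork failed: stop launching */
        if (pid == 0) {
            volatile double sink = kernel(iterations);  /* keep the result alive */
            (void)sink;
            _exit(0);
        }
    }
    int ok = (launched == nprocs);
    for (int p = 0; p < launched; ++p) {                /* reap every child we started */
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ok = 0;
    }
    double elapsed = now_seconds() - start;
    if (!ok || elapsed <= 0.0)
        return -1.0;
    return (double)nprocs * (double)iterations / elapsed;
}
```

Calling `run_parallel(6, n)` and `run_parallel(12, n)` from a small `main` with the same `n`, and comparing the two returned values, gives a first answer for a given kernel. The sketch does not pin processes to cores, so the placement of the 6 workers is left to the OS scheduler. Each run should last at least a few seconds and be repeated a few times for the numbers to be meaningful.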