I'm optimizing a solver (systems of linear equations) whose most critical part consists of
Many (1000 ) short (~10-1000 Microseconds) massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU cache size) arrays, pseudocode:
for(i=0;i<num_iter;i )
{
// SYNC-POINT
parallel_for(j=0;j<array_size;j )
array_out[j] = some_function( array_in[j] )
swap( array_in, array_out );
}
Unfortunately, the standard parallelization constructs provided by OMP or TBB I tried so far (serial outer loop plus parallel inner loop, e.g. via tbb::parallel_for) doesn't seem to handle this extremly fine grained parallelism very well, because the thread libraries' setup cost for the inner loop seems to dominates the time spent within the relatively short inner loop. (Note that very fine grained inner loops are crucial for the total performance of the algorithm because this way all data can be kept in L2/L3 CPU cache))
CodePudding user response:
The general solution to this problem is to create the threads and distribute the work only once, and then use fast synchronization point in the threads. In this case, the outer loop is moved in the threaded function. This is possible with the TBB library by providing a range (tbb::blocked_range<size_t>
) and a function to tbb::parallel_for
(see an example here).
Barriers are known to scale poorly on many core architectures, especially in such codes. The only way to make the program scale is to reduce the synchronization between threads so to make it more local. For example, for stencils, the solution is to replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key to solve such problem. In fact, note the fundamental laws of physics prevent barriers to scale because clocks cannot be fully synchronized in general relativity and computers (unfortunately) obeys to physics law.
Many core systems are now nearly always NUMA ones. Regarding your configuration, you certainly use an AMD Threadripper processor which have multiple NUMA nodes. This means you should care about locality and the NUMA allocation policy. The default policy is generally the first touch. This means that is your initialization is sequential or threads are mapped differently, then cores have to fetch data from remote NUMA nodes which is slow. In the worst case, all cores can access to the same NUMA node and saturate it resulting in a possibly slower execution than the sequential version. You should generally make it parallel for better performance. Getting high-performance on such architecture is far from being easy. You need to carefully control the allocation policy (numactl can help for that), the initialization (parallel), the thread binding (with taskset
, hwloc
and/or environment variables) and the memory access pattern (by reading articles/books about how NUMA machines work and applying dedicated algorithms).
By the way, there is probably a false-sharing effect happening in your code because cache lines of array_out
are certainly shared between thread. This should not have a critical impact but it does contribute to get a poor scalability (especially on your 64-core processor). The general solution to this problem is to align the array in memory on a cache line and take take the parallel splitting is done on a cache line boundary. Alternatively, you can allocate the array part in each thread. This is generally a better approach as is ensure data is locally allocated, locally filled and make data-sharing/communication between NUMA nodes and even cores more explicit (ie. better control), though it can make the code more complex (there is no free lunch).
CodePudding user response:
Sharing data across threads is slow. This is because each CPU core has at least one layer of personal cache. The minute you share data between cores/threads, the personal caches need to be synchronised which is slow.
Threads running in parallel across different cores work best if they do not share data.
In your case, if data is read only you might be best off giving each thread its own copy of the data. For read write data, you have to accept the synchronisation slowdown.