Is there a mathematically optimal way to parallelize a given algorithm?


Let us say we have an algorithm with multiple steps, each of which can be parallelized. Given a certain number of CPU cores, is there a mathematically optimal way to split the work across the cores? Given that there are multiple ways to split things up, and along multiple dimensions, how do you choose which slicing to go for? Is there a well-established mathematical/CS theory behind this?

CodePudding user response:

This is a very hard problem in computer science, even though it looks simple at first glance. Researchers have been working for decades on automatic parallelization and optimal parallel algorithms, and it is still an active field of research. Put shortly, there is no general, mathematically optimal way to parallelize a given algorithm. In practice, depending on the target algorithm, there are specific solutions that find good results on real-world hardware, but AFAIK no solution is guaranteed to find the practical optimum yet. Humans are still needed for that (and certainly will be for a long time)!


Let's start with the basics. Rice's theorem states that all non-trivial semantic properties of programs are undecidable. For example, the halting problem is undecidable. The same applies to the parallelization of algorithms: knowing whether an algorithm can be parallelized is undecidable. Knowing the optimal way to do it is therefore the least of the problems. In practice, this means there is no algorithm that can automatically decide whether another algorithm can be parallelized, let alone find an optimal parallelization. That being said, for one specific algorithm you can often answer the question (researchers do that, but it sometimes takes them years).

In practice, you can restrict the set of considered algorithms so as to make the problem decidable, and in fact even hope to find an optimal solution. The polyhedral model can be used for this. Indeed, for a sequence of nested loops that can be described in the language of Presburger arithmetic, it is possible to perform an automatic (optimal) parallelization. This works in theory and is used by some compilers in practice. But there is a catch: the worst-case complexity is HUGE, on the order of O(2^(2^(2^n))). As a result, any complex realistic problem quickly becomes impractical to compute. In practice, there are heuristics and approximations that make it practical (though still not fast), and the worst-case scenarios rarely occur; compilers also have to take care not to make the problem too complex to solve. There is another big limitation coming from the model itself: Presburger arithmetic. This arithmetic does not support operations like modulus, or any non-trivial user-defined operators. In practice, this means that if there is a modulus in a basic loop nest (impacting the iteration space), then compilers cannot optimize it (unless they use a use-case-dependent "cheat" to transform the problem into one without a modulus, which is rare). The same applies to a loop nest whose increment is a runtime-dependent array item.
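
To make that limitation concrete, here is a minimal sketch (hypothetical code, not tied to any particular compiler): the first loop nest is affine, i.e. its bounds and array subscripts are linear in the loop counters and parameters, so it fits the polyhedral model; the second uses a modulus in the subscript, which falls outside Presburger arithmetic, so a polyhedral optimizer typically has to give up on it.

```cpp
// Affine loop nest: bounds and subscripts are linear in i, j and the
// parameter n, so it can be described in Presburger arithmetic and,
// in principle, analyzed and parallelized by a polyhedral optimizer.
void affine_stencil(int n, const float* in, float* out) {
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i * n + j] = 0.25f * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                    + in[i * n + j - 1] + in[i * n + j + 1]);
}

// Non-affine access: i % k is not expressible in Presburger arithmetic,
// so polyhedral tools typically cannot reason about it or parallelize it.
void non_affine_histogram(int n, int k, const float* in, float* out) {
    for (int i = 0; i < n; ++i)
        out[i % k] += in[i];
}
```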

While the polyhedral model is good at finding an optimal solution in theory, that solution may not be optimal on the target hardware. Modern processors have all sorts of quirks that make this insanely hard to solve. In practice, compilers manage to do a relatively good job in basic use cases (embarrassingly parallel loops and basic reductions, as in the sketch below) thanks to dozens of interrelated optimization steps. The optimization is not global (the loop optimization is relatively independent of the code generation), as a global optimization would be completely unrealistic even in simple cases. There is a set of tweaks so it can "work in practice" (i.e. optimization steps are tuned so that later ones can generate faster code), but not in general using a single mathematical model. Put shortly, engineering is used to compensate for the lack of a theoretical computer-science solution.
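
As an illustration of those basic use cases, here is a minimal OpenMP sketch (assuming a compiler with OpenMP support, e.g. built with -fopenmp): an embarrassingly parallel loop and a basic reduction, the two patterns current compilers and runtimes handle well.

```cpp
#include <cstddef>

// Embarrassingly parallel loop: iterations are independent, so the runtime
// only has to split the iteration space across threads.
void scale(float* x, std::size_t n, float a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= a;
}

// Basic reduction: each thread accumulates a private partial sum and the
// runtime combines the partial sums at the end.
double sum(const double* x, std::size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}
```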

Given a well-defined parallel algorithm, it is actually pretty hard to find the best parameters (e.g. block size for better caching and SIMD vectorization). The usual solution to this problem is to use auto-tuners to explore the huge search space and optimize a metric (speed). Finding a good metric is challenging given the complexity of current hardware. AFAIK, there is no model that fully matches reality on practical hardware, though some model specific parts of the hardware (e.g. caches) quite well. The best approaches are based on runtime feedback to compensate for the imprecision of theoretical models. The FFTW library is a good example of that.
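
Here is a toy sketch of the auto-tuning idea (the blocked_sum kernel and the candidate block sizes are made-up examples, not from FFTW): time each candidate configuration and keep the fastest one. Real auto-tuners such as FFTW's planner apply the same principle far more cleverly.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical tunable kernel: a sum computed block by block. The block size
// is the parameter we want to tune.
static double blocked_sum(const std::vector<double>& v, std::size_t block) {
    double s = 0.0;
    for (std::size_t b = 0; b < v.size(); b += block)
        for (std::size_t i = b; i < b + block && i < v.size(); ++i)
            s += v[i];
    return s;
}

int main() {
    std::vector<double> v(1 << 22, 1.0);
    std::size_t best_block = 0;
    double best_time = 1e300;
    // Naive auto-tuning loop: measure each candidate and keep the fastest.
    for (std::size_t block : {64, 256, 1024, 4096, 16384}) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double s = blocked_sum(v, block);  // volatile: keep the call alive
        auto t1 = std::chrono::steady_clock::now();
        double dt = std::chrono::duration<double>(t1 - t0).count();
        if (dt < best_time) { best_time = dt; best_block = block; }
        (void)s;
    }
    std::printf("fastest block size: %zu\n", best_block);
}
```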


CodePudding user response:

I believe that there isn’t one, simply because there are so many types of algorithms and ways to parallelize them that it always comes down to a particular implementation. Not only the algorithm but also the CPU architecture (or the GPU, if you are running it there) can differ and affect performance.

As a rule of thumb, for CPU computations you should have as many active workers (threads) as there are logical cores available to you. For GPU computations it is more difficult, and you need to tailor a solution that suits your needs.
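
For the CPU case, a minimal C++ sketch of that rule of thumb (do_chunk is a hypothetical per-worker function, not part of the original answer):

```cpp
#include <thread>
#include <vector>

// One worker per logical core; hardware_concurrency() may return 0 if the
// value is unknown, so fall back to a single worker in that case.
void run_workers(void (*do_chunk)(unsigned worker_id, unsigned worker_count)) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < n; ++w)
        pool.emplace_back(do_chunk, w, n);  // each worker knows its id and the total
    for (auto& t : pool)
        t.join();
}
```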

Another tip is not to split the task equally into big chunks, one per worker, but to divide it into even smaller chunks and keep them in a queue for workers to pick from. This lets you keep track of the upcoming workload and also gives you the ability to halt gracefully.
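
A rough sketch of that queue idea in C++ (the chunk size and the process callback are illustrative assumptions): threads repeatedly grab the next small chunk from a shared atomic counter until the work runs out or a stop flag is set.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Workers pull small chunks [begin, end) from a shared counter instead of
// being handed one big slice up front; a stop flag allows a graceful halt.
void parallel_chunks(std::size_t total, std::size_t chunk, unsigned workers,
                     void (*process)(std::size_t begin, std::size_t end)) {
    std::atomic<std::size_t> next{0};
    std::atomic<bool> stop{false};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunk);
                if (begin >= total || stop.load())
                    break;
                std::size_t end = begin + chunk < total ? begin + chunk : total;
                process(begin, end);
            }
        });
    for (auto& t : pool)
        t.join();
}
```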

I discovered this tip after learning a bit of Nvidia's CUDA: there you can spawn almost any number of tasks (blocks), but the graphics card will only execute a fixed number of them in parallel, as its internal scheduler decides.
