Efficiency of using omp parallel for inside omp parallel omp task


I am trying to improve the efficiency of a complicated project using OpenMP. I will use a toy example to show the situation I am facing.

First, I have a recursive function and I use omp task to parallelise it, like this:

void CollectMsg(Node *n) {
    // NOTE: a leaf/NULL check is assumed to exist in the real code,
    // otherwise the recursion would never terminate.
#pragma omp task
    {
        CollectMsg(n->child1);
        ProcessMsg1(n->array, n->child1->array);
    }

#pragma omp task
    {
        CollectMsg(n->child2);
        ProcessMsg1(n->array, n->child2->array);
    }

#pragma omp taskwait
    ProcessMsg2(n->array);
}

int main() {
#pragma omp parallel num_threads(2)
    {
#pragma omp single
        {
            CollectMsg(root);
        }
    }
    return 0;
}

Then, however, the function ProcessMsg1 contains some computation and I want to use omp parallel for on it. The following code shows a simple example. The project still runs correctly this way, but its efficiency is only slightly improved, and I am wondering about the working mechanism of this process. Does omp parallel for take effect when used inside an omp task? I ask because there is only a slight difference in efficiency between using and not using omp parallel for in ProcessMsg1. If the answer is yes, given that I first created 2 threads for omp task and then, while processing each task, created 2 threads for omp parallel for, should I think of this as 2*2=4-threaded parallelism? If the answer is no, how can I further improve the efficiency of this case?

void ProcessMsg1(int *a1, int *a2) {
#pragma omp parallel for num_threads(2)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}

CodePudding user response:

Nesting a parallel for inside a task will work, but it requires you to enable nested parallelism and will lead to all sorts of issues that are difficult to resolve, especially around load balancing and the actual number of threads being created.
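If you do go the nested route, a minimal sketch of enabling more than one active parallel level could look like this (the surrounding region is the question's code, abbreviated; the runtime call is the standard replacement for the older, deprecated omp_set_nested(1)):

#include <omp.h>

int main(void) {
    /* Allow two nested active parallel levels, so a parallel-for inside a
     * task is not serialised. Alternatively: export OMP_MAX_ACTIVE_LEVELS=2
     * in the environment. */
    omp_set_max_active_levels(2);

#pragma omp parallel num_threads(2)
    {
#pragma omp single
        {
            /* CollectMsg(root);  -- the task-generating call from the question */
        }
    }
    return 0;
}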

I would recommend using the taskloop construct instead:

void ProcessMsg1(int *a1, int *a2) {
#pragma omp taskloop grainsize(100)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}

This will create additional tasks to execute the loop. If ProcessMsg1 is executed from different tasks in the system, the tasks created by taskloop will automatically mix with all the other tasks that your code is creating.

Depending on how much work your code does per loop iteration, you may want to adjust the size of the tasks using the grainsize clause (I just put 100 to have an example).
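A variant (my addition, not part of the original suggestion) is to cap the number of generated tasks instead of fixing the per-task iteration count, using the num_tasks clause:

void ProcessMsg1(int *a1, int *a2) {
    // Sketch only: let the runtime split the loop into at most 4 tasks;
    // 'size' is assumed to be visible here, as in the original post.
#pragma omp taskloop num_tasks(4)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}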

CodePudding user response:

Economy of Costs

Using the number of ASM instructions as a simplified measure of how much work has to be done (where all of the CPU clocks, RAM-allocation costs, RAM-I/O and O/S system-management time spent count), we start to see the relative cost of all these (unavoidable) add-on costs, compared to the actually useful work (i.e. how many ASM instructions are finally spent on what we want to compute, contrasted with the amount of overhead costs that were already burnt to make this happen).

This fraction is cardinal when fighting for both performance and efficiency (of resource-usage patterns).

For cases where the add-on overhead costs dominate, the result is a straight performance anti-pattern.

For cases where the add-on overhead costs amount to less than 0.01 % of the useful work, we may still end up with unsatisfactorily low speedups (see the simulator and related details).

For cases where the scope of useful work dwarfs all the add-on overhead costs, we still see the Amdahl's Law ceiling, aptly called the "law of diminishing returns" (since adding more resources ceases to improve performance, even if we add infinitely many CPU cores and the like).
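In its classical, overhead-free form (the textbook formulation, not specific to this answer), that ceiling for a parallel fraction p executed on N processors is:

    S(N) = 1 / ( (1 - p) + p / N )

so even for N approaching infinity the speedup can never exceed 1 / (1 - p).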

GUI-live simulator

Tips for experimenting with the Costs-of-instantiation(s):

we start by assuming the code is 100% parallelisable (having no pure-[SERIAL] part, which is never real, yet let's use it):

- let's move the NCPUcores-slider in the simulator to anything above 64 cores

- next, move the Overhead-slider in the simulator to anything above plain zero (expressing the relative add-on cost of spawning one of the NCPUcores processes, as a percentage of the number of instructions in the [PARALLEL]-section part). Mathematically "dense" work has many such "useful" instructions and, supposing no other performance killers jump out of the box, may spend a reasonable amount of "add-on" costs to spawn some number of concurrent or parallel operations (the actual number depends only on the actual economy of costs, not on how many CPU cores are present, and even less on our "wishes" or on scholastic, or worse, copy/paste "advice"). On the contrary, mathematically "shallow" work almost always yields "speedups" << 1 (immense slow-downs), as there is almost no chance to justify the known add-on costs (paid on thread/process instantiation and on data SER/xfer/DES when moving parameters in and results back, worse still if between processes)

- next, move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case where the actual thread/process-spawning overhead (a time lost) is still no more than <= 1 % of all the computing-related instructions that will be performed for the "useful" part of the work inside such a spawned process instance. Even such a 1:100 proportion factor (doing 100x more "useful" work than the CPU time lost on arranging that many copies and making the O/S scheduler orchestrate their concurrent execution inside the available system Virtual Memory) already shows all the warnings visible in the graph of the progression of Speedup degradation, so just play a bit with the Overhead-slider in the simulator before touching the others (a minimal sketch of the same arithmetic follows this list)...

- only then touch and move the p-slider in the simulator to anything less than 100% (having no [SERIAL] part of the problem execution, which was nice in theory so far, is never doable in practice; even the program launch is pure-[SERIAL], by design)
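As a companion to the sliders, here is a minimal sketch of that arithmetic in C (my own illustration; the overhead model, N times a fixed per-worker cost o, is a deliberately simplistic assumption, not the simulator's exact formula):

#include <stdio.h>

/* Overhead-aware speedup estimate:
 *   p - parallelisable fraction of the work (0..1)
 *   N - number of CPU cores used
 *   o - add-on overhead per spawned worker, as a fraction of the total work
 */
static double speedup(double p, int N, double o)
{
    return 1.0 / ((1.0 - p) + p / (double)N + (double)N * o);
}

int main(void)
{
    double p = 1.0;   /* start 100% parallelisable, as in the tips above */
    double o = 0.01;  /* 1% add-on overhead per spawned worker           */

    for (int N = 1; N <= 1024; N *= 2)
        printf("N = %4d  ->  speedup = %6.2f\n", N, speedup(p, N, o));

    return 0;
}

Even with a perfectly parallelisable workload, the printed speedups peak and then fall as N grows, which is exactly the degradation the Overhead-slider demonstrates.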

So,
besides straight errors,
besides performance anti-patterns,
there is a lot of technical reasoning behind ILP, SIMD vectorisation and cache-line-respecting tricks, which are what finally squeeze out the maximum possible performance the task can ever get.
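For the toy loop from the question, one such step could look like this minimal sketch (my own addition; 'size' is again assumed to be visible, as in the original post). The taskloop simd construct asks the runtime to split the loop into tasks and the compiler to vectorise each chunk:

void ProcessMsg1(int *a1, int *a2) {
    // Sketch only: tasking plus SIMD vectorisation of each chunk;
    // grainsize(100) is an arbitrary example value, as above.
#pragma omp taskloop simd grainsize(100)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}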

  • refactoring of the real problem should never go against the collected knowledge about performance, as repeating things that do not work will not bring any advantage, will it?
  • respect your physical platform constraints; ignoring them will degrade your performance
  • benchmark, profile, refactor
  • benchmark, profile, refactor
  • benchmark, profile, refactor; there is no other magic wand available here

Details always matter. The CPU / NUMA architecture details matter, and a lot. Check the actual possibilities of the native architecture, because without all these details the runtime performance will not reach the capabilities that are technically available.
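A quick way to see what the runtime actually has to work with (a minimal probe of my own, using standard OpenMP runtime queries) is:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Print what the OpenMP runtime sees on this platform before tuning. */
    printf("logical processors : %d\n", omp_get_num_procs());
    printf("default max threads: %d\n", omp_get_max_threads());
    printf("max active levels  : %d\n", omp_get_max_active_levels());
    printf("proc-bind policy   : %d\n", (int)omp_get_proc_bind());
    return 0;
}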
