Why does isolating tasks in task arenas to NUMA nodes for memory locality slow down my embarrassingly


I have this self-contained example of a TBB application, run on a CPU with 2 NUMA nodes, that repeatedly performs a simple vector addition on dynamic arrays. It recreates an issue I am having with a somewhat more complicated example. I am trying to divide the computation cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are bound to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that memory accesses are performed on data that is local to the CPU that computes its task. A control example uses a plain parallel_for with a static_partitioner to perform the computation, while my intended example submits, per task_arena, a task that invokes a parallel_for over the designated region, i.e. the half of the dynamic arrays that was initialized beforehand on the corresponding NUMA node.

The task_arena version always takes twice as long to perform the vector addition as the control example. It cannot be the overhead of creating the tasks for the task_arenas that invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain what happens and why this performance penalty is so harsh? A pointer to resources would also be helpful, as I am doing this for a university project.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <tbb/tbb.h>
#include <vector>


int main(){

    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    std::vector<tbb::task_group> task_groups(numa_indexes.size()); // one task_group per NUMA-bound arena
    std::size_t numa_nodes = numa_indexes.size();
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
    }

    std::size_t size = 10000000;
    std::size_t part_size = std::ceil((float)size/numa_nodes);
    double * A = (double *) malloc(sizeof(double)*size);
    double * B = (double *) malloc(sizeof(double)*size);
    double * C = (double *) malloc(sizeof(double)*size);
    double * D = (double *) malloc(sizeof(double)*size);


    //DATA INITIALIZATION: first-touch each half inside the arena bound to its NUMA node
    for(unsigned k = 0; k < numa_indexes.size(); k++)
        arenas[k].execute(
        [&](){
            std::size_t local_start = k*part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                [&](std::size_t i)
            {
                C[i] = D[i] = 0;
                A[i] = B[i] = 1;
            }, tbb::static_partitioner());
        });

    //PARALLEL ALGORITHM (control): a single parallel_for over the whole range
    tbb::tick_count t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++)
        tbb::parallel_for(static_cast<std::size_t>(0), size,
            [&](std::size_t i)
            {
                C[i] += A[i] + B[i];
            }, tbb::static_partitioner());
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;


    //TASK ARENA & PARALLEL ALGORITHM: each arena computes its own half
    t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++){
        for(unsigned k = 0; k < numa_indexes.size(); k++){
            arenas[k].execute(
            [&](){
                // wait for all task_groups, then submit this arena's work asynchronously
                for(unsigned i=0; i<numa_indexes.size(); i++)
                    task_groups[i].wait();
                task_groups[k].run([&](){
                    std::size_t local_start = k*part_size;
                    std::size_t local_end = std::min(local_start + part_size, size);
                    tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                        [&](std::size_t i)
                        {
                            D[i] += A[i] + B[i];
                        });
                });
            });
        }
    }
    t1 = tbb::tick_count::now();
    std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;

    double sum1 = 0;
    double sum2 = 0;
    for(int i = 0; i<size; i++){
        sum1 += C[i];
        sum2 += D[i];
    }

    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;


    return 0;
}

Performance with:

for(unsigned j = 0; j < numa_indexes.size(); j++){
    arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07

Performance without constraints:

$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07

EDIT: I implemented the use of task_group as found in @AlekseiFedotov's suggested resources, but the issue still remains.

CodePudding user response:

The part of the provided example where the work with the arenas happens is not a one-to-one match with the example from the docs, "Setting the preferred NUMA node" section.

Looking further into the specification of the task_arena::execute() method, we find that it is a blocking API, i.e. it does not return until the passed lambda completes.

On the other hand, the specification of the task_group::run() method reveals that it is asynchronous, i.e. it returns immediately without waiting for the passed functor to complete.

That is where the problem lies, I believe. The code ends up executing the two parallel loops in the arenas one after the other, i.e. serially. Consider following the example from the docs carefully.
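For illustration, here is a minimal sketch of that serialization, assuming the pre-edit code submitted each half directly through execute() (variable names reused from the question). Since execute() only returns once its lambda has finished, the k-th half is fully processed before the next arena is even entered, so the two NUMA nodes never compute at the same time:

// Sketch only: execute() blocks, so the halves run back to back instead of concurrently.
for (unsigned k = 0; k < arenas.size(); ++k) {
    arenas[k].execute([&, k] {
        std::size_t local_start = k * part_size;
        std::size_t local_end   = std::min(local_start + part_size, size);
        tbb::parallel_for(local_start, local_end, [&](std::size_t i) {
            D[i] += A[i] + B[i];   // completes before execute() returns
        }, tbb::static_partitioner());
    });
}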

BTW, the oneTBB project, which is the revamped version of TBB, can be found here.

EDIT (answer to the edited question):

  1. See the comment to the question.
  2. The waiting should happen after the work is submitted, not before it. Also, there is no need to go to another arena's task_group to do the wait within the loop; just submit the work in the NUMA loop via arenas[i].execute( [&, i] { task_groups[i].run( [&, i] { /*...*/ } ); } ), and then, in a second loop, wait for each task_group within the corresponding task_arena.

Please note how I capture the NUMA loop index by copy. Otherwise, the lambda body might end up referring to the wrong data by the time it runs. A sketch of the whole pattern follows below.
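To make the suggested structure concrete, here is a minimal sketch of the submit-then-wait pattern, assuming the arenas, task_groups, part_size, size and the arrays from the question's code:

// Submit the work asynchronously in one loop, capturing k by copy...
for (unsigned k = 0; k < arenas.size(); ++k) {
    arenas[k].execute([&, k] {           // execute() returns as soon as run() has been called
        task_groups[k].run([&, k] {      // run() does not wait for the functor to complete
            std::size_t local_start = k * part_size;
            std::size_t local_end   = std::min(local_start + part_size, size);
            tbb::parallel_for(local_start, local_end, [&](std::size_t i) {
                D[i] += A[i] + B[i];
            });
        });
    });
}
// ...and only afterwards wait, each task_group within its corresponding arena.
for (unsigned k = 0; k < arenas.size(); ++k) {
    arenas[k].execute([&, k] { task_groups[k].wait(); });
}

This way both arenas have work in flight at the same time, and each wait() runs inside the arena whose threads are bound to the matching NUMA node.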
