How to properly synchronize threads at barriers


I am encountering an issue where I have a hard time telling which synchronization primitive I should use.

I am creating n parallel threads that work on a region of memory; each is assigned to a specific part of this region and can accomplish its task independently from the other ones. At some point, though, I need to collect the results of the work of all the threads, which is a good use case for barriers, so that is what I'm doing.

I must use one of the n worker threads to collect the result of all their work, for this I have the following code that follows the computation code in my thread function:

if (pthread_barrier_wait(thread_args->barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
   // Runs in exactly one (unspecified) thread once all threads have reached the barrier
   // This is where I want to collect the results of the worker threads
}

So far so good, but now is where I get stuck: the code above is in a loop, because I want the threads to do their work again for a certain number of iterations. The idea is that each time pthread_barrier_wait unblocks, all threads have finished their work and the next iteration of the loop / parallel work can start.

The problem with this is that the statements in the result-collector block are not guaranteed to execute before the other threads start working on this region again, so there is a race condition. I am thinking of using a POSIX condition variable like this:

// This code is placed in the thread entry point function, inside
// a loop that also contains the code doing the parallel
// processing code.

if (pthread_barrier_wait(thread_args->barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
    // We lock the mutex
    pthread_mutex_lock(thread_args->mutex);
    collectAllWork(); // We process the work from all threads
    // Set ready to 1
    thread_args->ready = 1;
    // We broadcast the condition variable and check it was successful
    if (pthread_cond_broadcast(thread_args->cond)) {
        printf("Error while broadcasting\n");
        exit(1);
    }
    // We unlock the mutex
    pthread_mutex_unlock(thread_args->mutex);
} else {
    // Wait until the other thread has finished its work so
    // we can start working again
    pthread_mutex_lock(thread_args->mutex);
    while (thread_args->ready == 0) {
        pthread_cond_wait(thread_args->cond, thread_args->mutex);
    }
    pthread_mutex_unlock(thread_args->mutex);
}

There are multiple issues with this:

  • For some reason pthread_cond_broadcast never wakes any of the other threads waiting in pthread_cond_wait, and I have no idea why.
  • What happens if a thread calls pthread_cond_wait after the collector thread has already broadcast? I believe while (thread_args->ready == 0) and thread_args->ready = 1 prevent this, but then see the next point...
  • On the next loop spin, ready will still be set to 1, hence no thread will call pthread_cond_wait again. I don't see any place to properly set ready back to 0: if I do it in the else block after pthread_cond_wait, there is the possibility that another thread that wasn't waiting on the condition yet reads 1 and starts waiting even though I already broadcast from the if block.

Note I am required to use barriers for this.

How can I solve this issue?

CodePudding user response:

You could use two barriers (work and collector):

while (true) {

    //do work

    //every thread waits until the last thread has finished its work
    if (pthread_barrier_wait(thread_args->work_barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
        //exactly one (unspecified) thread takes this branch and does the collecting
        collectAllWork();
    }

    //every thread will wait until the collector has reached this point
    pthread_barrier_wait(thread_args->collect_barrier);

}

CodePudding user response:

You could use a kind of double buffering.

Each worker would have two storage slots for results. Between the barriers the workers would store their results to one slot while the collector would read results from the other slot.

This approach has a few advantages:

  • no extra barriers
  • no condition queues
  • no locking
  • slot identifier does not even have to be atomic because each thread can keep its own copy of it and toggle it whenever it reaches a barrier
  • much more performant, as the workers can keep working while the collector is processing the other slot

Exemplary workflow:

Iteration 1.

  • workers write to slot 0
  • collector does nothing because no data is ready
  • all wait for barrier

Iteration 2.

  • workers write to slot 1
  • collector reads from slot 0
  • all wait for barrier

Iteration 3.

  • workers write to slot 0
  • collector reads from slot 1
  • all wait for barrier

Iteration 4.

  • go to iteration 2