Why does using multiple threads result in slower execution?


I am using MacBook Air M1 2020, Apple M1 7-Core GPU, RAM 8GB.

The problem: I am comparing pairs of arrays, which takes around 11 minutes when executed sequentially. Strangely, the more threads I put to work, the longer it takes to finish (even when NOT using a mutex). So far I have tried running it with 2 and then 4 threads.

What could be the problem? I assumed using 4 threads would be more efficient, since I have 7 cores available and the execution time seems (to me) long enough to compensate for the overhead of handling multiple threads.

This is the part of the code that I find relevant to this question:

int const xylen = 1024;
static uint8_t voxelGroups[321536][xylen];
int threadCount = 4;

bool areVoxelGroupsIdentical(uint8_t firstArray[xylen], uint8_t secondArray[xylen]){
    return memcmp(firstArray, secondArray, xylen*sizeof(uint8_t)) == 0;
}

void* getIdenticalVoxelGroupsCount(void* threadNumber){

    for(int i = (int)threadNumber-1; i < 321536-1; i += threadCount){
        for(int j = i + 1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j])){
                pthread_mutex_lock(&mutex);
                identicalVoxelGroupsCount++;
                pthread_mutex_unlock(&mutex);
             }
        }
    }
    return 0;
}

int main(){
    // some code
    pthread_create(&thread1, NULL, getIdenticalVoxelGroupsCount, (void *)1);
    pthread_create(&thread2, NULL, getIdenticalVoxelGroupsCount, (void *)2);
    pthread_create(&thread3, NULL, getIdenticalVoxelGroupsCount, (void *)3);
    pthread_create(&thread4, NULL, getIdenticalVoxelGroupsCount, (void *)4);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
    pthread_join(thread4, NULL);
    // some code
}

CodePudding user response:

First of all, the lock serializes all the identicalVoxelGroupsCount increments, so using more threads will not speed up this part. On the contrary, it will be slower because of cache-line bouncing: the cache lines containing the lock and the counter move from one core to another serially (see: cache coherence protocols). This is generally much slower than doing all the work sequentially, because moving a cache line from one core to another incurs a fairly large latency. You do not need a lock here. Instead, each thread can increment a local variable and perform a single final reduction (e.g. by adding the local count to an atomic variable at the end of getIdenticalVoxelGroupsCount).
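A minimal sketch of that idea, reusing voxelGroups, threadCount and areVoxelGroupsIdentical from the question (the atomic counter here stands in for the mutex-protected global; it is an assumption about the surrounding code, not the asker's exact program):

#include <stdatomic.h>
#include <stdint.h>

static atomic_long identicalVoxelGroupsCount;  // replaces the mutex-protected counter

void* getIdenticalVoxelGroupsCount(void* threadNumber){
    long localCount = 0;  // thread-private: no contention, no cache-line bouncing

    for(int i = (int)(intptr_t)threadNumber - 1; i < 321536 - 1; i += threadCount){
        for(int j = i + 1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j])){
                localCount++;  // no lock needed on the hot path
            }
        }
    }

    // one contended update per thread instead of one per matching pair
    atomic_fetch_add(&identicalVoxelGroupsCount, localCount);
    return NULL;
}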

Moreover, interleaving the loop iterations is not efficient, because most of the cache lines containing voxelGroups will be shared between threads. This is not as critical as the first point, since the threads only read those cache lines, but it still increases the required memory throughput and can become a bottleneck. A much more efficient approach is to split the iterations into relatively large contiguous chunks, as sketched below. It can be even better to split the blocks into medium-grained tiles to use the cache more efficiently (although this optimization is independent of the parallelization strategy).
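For illustration, a sketch of the contiguous-chunk partitioning (the function and variable names are mine, not the asker's; note that because the inner loop is triangular, equal-sized chunks carry unequal work, so in practice the chunk boundaries may need rebalancing):

#include <stdatomic.h>
#include <stdint.h>

#define GROUP_COUNT 321536  // same size as voxelGroups in the question

void* countIdenticalInChunk(void* threadNumber){
    int t = (int)(intptr_t)threadNumber - 1;                    // 0-based thread index
    int chunk = (GROUP_COUNT + threadCount - 1) / threadCount;  // rows per thread
    int begin = t * chunk;
    int end = begin + chunk < GROUP_COUNT ? begin + chunk : GROUP_COUNT;

    long localCount = 0;
    for(int i = begin; i < end; i++){             // contiguous rows: better locality
        for(int j = i + 1; j < GROUP_COUNT; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j]))
                localCount++;
        }
    }
    atomic_fetch_add(&identicalVoxelGroupsCount, localCount);
    return NULL;
}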

Note that you can use OpenMP to do this kind of operation easily and efficiently in C.
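For example, a sketch of the whole computation as an OpenMP reduction (compile with -fopenmp; schedule(dynamic, 64) is one way to balance the shrinking inner loop, and the chunk size 64 is just an illustrative choice):

#include <omp.h>

long countIdenticalVoxelGroups(void){
    long count = 0;
    // reduction(+:count) gives each thread a private counter and sums them at the end;
    // schedule(dynamic, 64) hands out 64-row chunks to balance the triangular workload
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for(int i = 0; i < 321536 - 1; i++){
        for(int j = i + 1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j]))
                count++;
        }
    }
    return count;
}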
