What is causing the threads to execute slower than the serial case?

Time:03-13

I have a simple function which computes the sum of "n" numbers.

I am attempting to use threads to implement the sum in parallel. The code is as follows,

#include <iostream>
#include <thread>

void Add(double &sum, const int startIndex, const int endIndex)
{
    sum = 0.0;
    for (int i = startIndex; i < endIndex; i++)
    {
        sum = sum + 0.1;
    }
}
int main()
{
    int n = 100'000'000;

    double sum1;
    double sum2;

    std::thread t1(Add, std::ref(sum1), 0, n / 2);
    std::thread t2(Add, std::ref(sum2), n / 2, n);

    t1.join();
    t2.join();

    std::cout << "sum: " << sum1 + sum2 << std::endl;

    // double serialSum;
    // Add(serialSum, 0, n);
    // std::cout << "sum: " << serialSum << std::endl;

    return 0;
}

However, the code runs much slower than the serial version. If I modify the function such that it does not take in the sum variable, then I obtain the desired speed-up (nearly 2x).

I read several resources online, but they all seem to suggest only that the same variable must not be accessed by multiple threads. I do not understand how that applies here, since each thread writes to a different variable.

Could someone please clarify my mistake?

CodePudding user response:

The problem here is hardware.

You probably know that CPUs have caches to speed up memory operations. These caches are many times faster than main memory, but they work in units called cache lines, probably 64 bytes on your system. Your two doubles are 8 bytes each and will almost certainly end up in the same 64-byte region on the stack. And each core in a CPU generally has its own L1 cache, while the larger caches may be shared between cores.

Now when one thread accesses sum1, its core loads the relevant cache line into its cache. When the second thread accesses sum2, the other core attempts to load the same cache line into its own cache. And the x86 architecture, trying to be helpful, asks the first cache to hand the cache line over so that both threads always see the same data.

So while you have two separate variables, they live in the same cache line, and on every access that cache line bounces from one core to the other and back, which is a rather slow operation. This is called false sharing.

So you need to put some separation between sum1 and sum2 to make this fast. See std::hardware_destructive_interference_size for the distance you need to achieve.

Another, and probably much simpler, way is to modify the worker function to accumulate into a local variable:

void Add(double &sum, const int startIndex, const int endIndex)
{
    double t = 0.0;
    for (int i = startIndex; i < endIndex; i++)
    {
        t = t + 0.1;
    }
    sum = t;
}

The final writes to sum1 and sum2 can still false-share, but now each thread touches the shared memory only once, so the cost becomes irrelevant.
