memcpy: significant performance differences when writing to a buffer multiple times

When using memcpy to write to a buffer multiple times, I see significant performance differences: writing to a specific address for the first time takes much longer than writing to it the second or any later time. The observation is 100% reproducible.

What could cause such a large difference?

See the following code example (compilable on Windows):

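#include <windows.h>  // LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency
#include <cstdio>     // printf
#include <cstring>    // memcpy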

LARGE_INTEGER getTimeStamp(void)
{
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t;
}

unsigned int getElapsedMicroseconds(LARGE_INTEGER start)
{
    LARGE_INTEGER end = getTimeStamp();
    
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    double t = (double)(end.QuadPart - start.QuadPart) * 1000000.0 / (double)freq.QuadPart;
    return (unsigned int)(t + 0.5);  // round to the nearest microsecond
}

int main(int argc, char** argv)
{
    static const size_t singleBuffSize = 36 * 1024 * 1024;
    static const size_t nrOfBuffers = 6;

    unsigned char* srcBuff = new unsigned char[singleBuffSize];
    unsigned char* dstBuff = new unsigned char[singleBuffSize * nrOfBuffers];

    for (int i = 0; i < (int)(nrOfBuffers * 3); i++)
    {
        size_t buffIdx = (i % nrOfBuffers) * singleBuffSize;

        LARGE_INTEGER start = getTimeStamp();
        memcpy(&dstBuff[buffIdx], srcBuff, singleBuffSize);
        unsigned int elapsedMicroseconds = getElapsedMicroseconds(start);

        printf("Loop %2d: buffer nr %2zu, elapsed time = %6u microseconds\n", i + 1, (i % nrOfBuffers) + 1, elapsedMicroseconds);
    }
    
    delete[] srcBuff;
    delete[] dstBuff;
    
    return 0;
}

Example result:

Loop  1: buffer nr  1, elapsed time =  76207 microseconds
Loop  2: buffer nr  2, elapsed time =  25552 microseconds
Loop  3: buffer nr  3, elapsed time =  24200 microseconds
Loop  4: buffer nr  4, elapsed time =  24036 microseconds
Loop  5: buffer nr  5, elapsed time =  28470 microseconds
Loop  6: buffer nr  6, elapsed time =  58528 microseconds
Loop  7: buffer nr  1, elapsed time =   6428 microseconds
Loop  8: buffer nr  2, elapsed time =   9324 microseconds
Loop  9: buffer nr  3, elapsed time =   9389 microseconds
Loop 10: buffer nr  4, elapsed time =   9434 microseconds
Loop 11: buffer nr  5, elapsed time =   9641 microseconds
Loop 12: buffer nr  6, elapsed time =   9953 microseconds
Loop 13: buffer nr  1, elapsed time =   9488 microseconds
Loop 14: buffer nr  2, elapsed time =   9834 microseconds
Loop 15: buffer nr  3, elapsed time =   6211 microseconds
Loop 16: buffer nr  4, elapsed time =   6282 microseconds
Loop 17: buffer nr  5, elapsed time =   5950 microseconds
Loop 18: buffer nr  6, elapsed time =   9570 microseconds

For example, for buffer nr. 1, the first memcpy call takes much longer than the subsequent calls.

CodePudding user response：

Reading from the L2 cache takes roughly 10 cycles, while reading from or writing to DDR (main memory) takes around 300 cycles on modern CPUs. That is about 30 times slower.

As you iterate over your memory again and again, the pieces will fall more and more into the L2/L3 cache, speeding up execution.

Here's a good read on the subject. The PDF is somewhat outdated, but it is still 99% valid:

What Every Programmer Should Know About Memory (Ulrich Drepper)

CodePudding user response：

The data size is 36MB for the source buffer and 216MB for the destination buffer. That’s way beyond the cache size of any computer I can afford to buy. This has nothing to do with caches.

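The explanation is that it takes time to allocate memory to the process: the operating system only backs a page with physical memory when that page is touched for the first time. The first copy touches 72MB of new memory (36MB of source plus 36MB of destination), the next five copies touch 36MB of new destination memory each, so they are faster. From the seventh copy on, no further memory is allocated, so from then on we copy several GB per second.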
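To put numbers on "several GB per second", the timings from the question can be converted to throughput. This is only a small sketch; the helper name throughputGBPerSecond is made up for illustration, and GB here means 10^9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a copy of `bytes` bytes that took
// `elapsedMicroseconds` into decimal gigabytes per second.
double throughputGBPerSecond(std::size_t bytes, unsigned int elapsedMicroseconds)
{
    assert(elapsedMicroseconds > 0);
    // bytes per microsecond equals (decimal) MB per second;
    // dividing by 1000 gives GB per second.
    double bytesPerMicrosecond = static_cast<double>(bytes) / elapsedMicroseconds;
    return bytesPerMicrosecond / 1000.0;
}
```

With the 36MB buffer from the question, the first copy (76207 microseconds) works out to roughly 0.5 GB/s, while the copies from loop 7 onward (about 6000 to 10000 microseconds) land at roughly 4 to 6 GB/s.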

The two new[] calls just reserve space; they don't allocate any physical memory yet.
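One way to check this explanation is to touch every page of both buffers before the timed loop, so that the page allocation happens outside the measurement. The following is a minimal sketch, not a definitive benchmark: the function names are hypothetical, the 4096-byte page size is an assumption, and std::chrono is used instead of QueryPerformanceCounter only to keep the sketch portable.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: writes one byte per page so the OS has to back
// every page of the buffer with physical memory.
// pageSize = 4096 is an assumption; query the real value on your system.
void touchPages(unsigned char* buf, std::size_t size, std::size_t pageSize = 4096)
{
    assert(pageSize > 0);
    for (std::size_t offset = 0; offset < size; offset += pageSize)
    {
        buf[offset] = 0;
    }
}

// Hypothetical helper: times a single memcpy and returns whole microseconds.
long long timedCopyMicroseconds(unsigned char* dst, const unsigned char* src, std::size_t size)
{
    auto start = std::chrono::steady_clock::now();
    std::memcpy(dst, src, size);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

If the explanation holds, calling touchPages(srcBuff, singleBuffSize) and touchPages(dstBuff, singleBuffSize * nrOfBuffers) before the loop in the question should make the first pass over the six buffers take about as long as the later passes.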