I am trying to optimize some code for speed, and it's spending a lot of time in memcpy. I decided to write a simple test program to measure memcpy on its own, to see how fast my memory transfers are, and they seem very slow to me. I am wondering what might cause this. Here is my test code:
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#define MEMBYTES 1000000000
int main() {
    clock_t begin, end;
    double time_spent[2];
    int i;

    // Allocate memory (MEMBYTES bytes = 250,000,000 floats)
    float *src = malloc(MEMBYTES);
    float *dst = malloc(MEMBYTES);

    // Fill the src array with some numbers
    begin = clock();
    for (i = 0; i < 250000000; i++)
        src[i] = (float) i;
    end = clock();
    time_spent[0] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Do the memcpy
    begin = clock();
    memcpy(dst, src, MEMBYTES);
    end = clock();
    time_spent[1] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Print results
    printf("Time spent in fill: %1.10f\n", time_spent[0]);
    printf("Time spent in memcpy: %1.10f\n", time_spent[1]);
    printf("dst[400]: %f\n", dst[400]);
    printf("dst[200000000]: %f\n", dst[200000000]);

    // Free memory
    free(src);
    free(dst);
}
/*
gcc -O3 -o mct memcpy_test.c
*/
When I run this, I get the following output:
Time spent in fill: 0.4263950000
Time spent in memcpy: 0.6350150000
dst[400]: 400.000000
dst[200000000]: 200000000.000000
I think the theoretical memory bandwidth for modern machines is tens of GB/s, or perhaps over 100 GB/s. I know that in practice one cannot expect to hit the theoretical limits, and that large memory transfers can be slow, but I have seen people report measured speeds of ~20 GB/s for large transfers (e.g. here). My results suggest I am getting 3.14 GB/s (edit: I originally had 1.57, but stark pointed out in a comment that I need to count both the read and the write). I am wondering if anyone has ideas as to why the performance I am seeing is so low.
My machine has two CPUs with 12 physical cores each (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz). There is 192GB of RAM (I believe it's 12x16GB DDR4-2666). The OS is Ubuntu 16.04.6 LTS.
My compiler is: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Update
Thanks to all the valuable feedback, I am now using a threaded implementation and getting much better performance. Thank you!
I had tried threading before posting, with poor results (or so I thought), but as pointed out below I should have ensured I was measuring wall time. Now my results with 24 threads are as follows:
Time spent in fill: 0.4229530000
Time spent in memcpy (clock): 1.2897100000
Time spent in memcpy (gettimeofday): 0.0589750000
I am also using asmlib's A_memcpy with a large SetMemcpyCacheLimit value.
Answer:
Saturating RAM is not as simple as it seems.
First of all, at first glance, here is the apparent throughput we can compute from the provided numbers:
- Fill: 1 / 0.4263950000 = 2.34 GB/s (1 GB is written);
- Memcpy: 2 / 0.6350150000 = 3.15 GB/s (1 GB is read and 1 GB is written).
The thing is that the pages allocated by malloc are not initially mapped in physical memory on Linux systems. Indeed, malloc reserves some space in virtual memory, but the pages are only mapped in physical memory when they are first touched, causing expensive page faults. AFAIK, the only way to speed up this process is to use multiple cores, or to prefill the buffers and reuse them later.
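For instance, here is a minimal sketch of the prefill idea, reusing MEMBYTES from the question (the memset calls and comments are mine, not the original code):
#include <stdlib.h>
#include <string.h>
#define MEMBYTES 1000000000
int main() {
    float *src = malloc(MEMBYTES);
    float *dst = malloc(MEMBYTES);
    // First touch: memset forces the kernel to map every page in physical
    // memory here, so the expensive page faults are paid before the timed
    // region instead of inside it.
    memset(src, 0, MEMBYTES);
    memset(dst, 0, MEMBYTES);
    // ... run the timed fill and memcpy here, on already-mapped pages ...
    free(src);
    free(dst);
}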
Additionally, due to architectural limitations (i.e. latency), one core of a Xeon processor cannot saturate the RAM. Again, the only way to fix that is to use multiple cores.
If you try to use multiple cores, then the result provided by the benchmark will be surprising, since clock does not measure wall-clock time but CPU time (which is the sum of the time spent in all threads). You need to use another function. In C, you can use gettimeofday (which is not perfect, as it is not monotonic, but certainly good enough for your benchmark; related post: How can I measure CPU time and wall clock time on both Linux/Windows?). In C++, you should use std::steady_clock (which is monotonic, as opposed to std::system_clock).
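A minimal sketch of timing the copy with gettimeofday rather than clock (wall_time is a hypothetical helper name, not a standard function):
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/time.h>
#define MEMBYTES 1000000000
// Wall-clock time in seconds, built on gettimeofday.
static double wall_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}
int main() {
    float *src = malloc(MEMBYTES);
    float *dst = malloc(MEMBYTES);
    memset(src, 1, MEMBYTES);  // prefault the source pages
    memset(dst, 1, MEMBYTES);  // prefault the destination pages
    double begin = wall_time();
    memcpy(dst, src, MEMBYTES);
    double end = wall_time();
    // Unlike clock(), this stays correct when the copy is multi-threaded.
    printf("memcpy wall time: %f s\n", end - begin);
    free(src);
    free(dst);
}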
In addition, the write-allocate cache policy on x86-64 platforms forces cache lines to be read when they are written. This means that to write 1 GB, you actually need to read 1 GB! That being said, x86-64 processors provide non-temporal store instructions that do not cause this issue (assuming your arrays are properly aligned and big enough). Compilers can use them, but GCC and Clang generally do not. However, memcpy is already optimized to use non-temporal stores on most machines. For more information, please read How do non-temporal instructions work?.
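To illustrate, here is a sketch of a copy loop built on non-temporal stores, assuming an AVX-capable CPU, 32-byte-aligned buffers, and a size that is a multiple of 32 bytes (copy_stream is a made-up name; compile with -mavx):
#include <immintrin.h>  // AVX intrinsics
#include <stdlib.h>
#define MEMBYTES 1000000000UL
// Streaming copy: destination lines go straight to memory without being
// read into the cache first, avoiding the write-allocate reads.
static void copy_stream(float *dst, const float *src, size_t nfloats) {
    for (size_t i = 0; i < nfloats; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);  // load 8 floats per iteration
        _mm256_stream_ps(dst + i, v);         // dst must be 32-byte aligned
    }
    _mm_sfence();  // make the streaming stores globally visible
}
int main() {
    // aligned_alloc (C11) provides the 32-byte alignment the stores need
    float *src = aligned_alloc(32, MEMBYTES);
    float *dst = aligned_alloc(32, MEMBYTES);
    copy_stream(dst, src, MEMBYTES / sizeof(float));
    free(src);
    free(dst);
}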
Finally, you can parallelize the benchmark easily using OpenMP, with simple #pragma omp parallel for directives on loops. Note that it also provides a user-friendly function for computing wall-clock time correctly: omp_get_wtime. For the memcpy, the best option is certainly to write a loop doing memcpy by (relatively big) chunks in parallel, as sketched below.
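A sketch of that chunked approach, assuming OpenMP (compile with -fopenmp); the 4 MiB CHUNK value is an arbitrary illustrative choice:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#define MEMBYTES 1000000000UL
#define CHUNK (4UL * 1024 * 1024)  // 4 MiB per memcpy call
int main() {
    char *src = malloc(MEMBYTES);
    char *dst = malloc(MEMBYTES);
    // Parallel first touch: pages get mapped by multiple cores (and spread
    // across both NUMA nodes on a two-socket machine).
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < MEMBYTES; i += CHUNK) {
        size_t n = MEMBYTES - i < CHUNK ? MEMBYTES - i : CHUNK;
        memset(src + i, 1, n);
        memset(dst + i, 0, n);
    }
    double begin = omp_get_wtime();  // wall-clock time, not CPU time
    // Each thread copies its own relatively big chunks.
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < MEMBYTES; i += CHUNK) {
        size_t n = MEMBYTES - i < CHUNK ? MEMBYTES - i : CHUNK;
        memcpy(dst + i, src + i, n);
    }
    double end = omp_get_wtime();
    // 2 GB moved in total: 1 GB read plus 1 GB written.
    printf("parallel memcpy: %f s (%.2f GB/s)\n",
           end - begin, 2.0 / (end - begin));
    free(src);
    free(dst);
}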
For more information on this subject, I advise you to read the famous document What Every Programmer Should Know About Memory. Since the document is a bit old, you can check updated information about this here. The document also describes additional important things to understand why you may still not succeed in saturating the RAM with the above information. One critical topic is NUMA.