I have a program which accesses single bytes in a large array at random. Since this array exceeds the L2 cache, it requires many queries to RAM. I created a benchmark which emulates this program by generating random numbers and querying a large array. It seems like the benchmark is under-performing relative to my RAM's advertised speed.
I have DDR4-2933 RAM which is supposed to handle 2933 MT/s. The maximum transfer rate I have been able to achieve is 56.8 MT/s. What is the bottleneck preventing faster execution? If I had to speculate, I would guess the CPU can only reorder instructions within some fixed window, which would limit how many fetches can run in parallel. I have no evidence beyond the benchmarks, though.
Benchmark & Methodology
I created a short program which populates a large array with random numbers. Then, it loads values at random offsets from the array and XORs them together. Memory accesses should be reorderable.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
/* xoshiro256** PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static uint64_t s[4];
uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 3) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
clock_t start = clock();
double cpu_time_used;
uint8_t val = 0;
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
cpu_time_used = ((double) (clock() - start)) / CLOCKS_PER_SEC;
printf("time: %.3f\n", cpu_time_used);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", cpu_time_used / calls);
printf("MT / sec: %.3e\n", calls / cpu_time_used / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
All benchmarks were executed on an Intel(R) Core(TM) i9-10885H CPU running at 2.40 GHz with CPU scaling disabled. The program was compiled with gcc -O3 main.c -o main -g -O3 -flto and called with nice -n -2 ./main 31 1000000000 for an array of size 2^31 and 1e9 accesses. The number of transactions per second issued to RAM can be approximated by the number of lookups per second, since virtually all of the lookups result in cache misses. I tested this using perf, and around 95% of cache lookups resulted in misses.
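For reference, a perf invocation along the following lines can confirm the miss rate (the event names are perf's generic aliases; which events are available varies by kernel and CPU):
perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses ./main 31 1000000000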
The random number generator accounts for about 7.1% of the program's runtime. This was measured by removing the line val ^= buf[idx]; which performs the fetch, and timing the remainder. perf reported that around 99% of memory overhead was due to last-level cache misses.
Follow-up Benchmarks
Loop unrolling:
By default, GCC 12.2.0 did not unroll the loop. It took some cajoling. I replaced:
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
with:
for(size_t i = 0; i < batches; i++) {
#pragma GCC unroll 3
for(size_t j = 0; j < 8; j++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
}
I took five timings at 2.0 GHz. The version without loop unrolling seemed to perform slightly better, but not significantly so. I'm not very surprised, since the branch would seem highly predictable.
   no unrolling  unrolling
0        23.249     24.287
1        24.044     24.520
2        23.455     24.358
3        23.233     24.167
4        22.969     23.833
Multi-processing
I implemented a threaded version of the benchmark. It divides the accesses evenly among the threads. I also switched to measuring wall-clock time. Here are the measurements (taken at 2.4 GHz):
   threads    time        MT/s
0        1  17.345   57.653502
1        2   8.949  111.744329
2        4   5.022  199.123855
3        8   3.533  283.045570
4       16   3.203  312.207306
5       32   3.200  312.500000
6       64   3.173  315.159155
7      124   3.165  315.955766
Modified program:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
#include <threads.h>
#include <pthread.h>
/* xoshiro256** PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static thread_local uint64_t s[4];
static uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
struct xor_args {
size_t iters;
uint8_t * buf;
uint8_t res;
uint64_t smask;
pthread_t id;
};
#include <sys/random.h>
void * xor_worker(struct xor_args * a) {
if(-1 == getrandom(&s, sizeof(s), GRND_RANDOM)) {
fprintf(stderr, "getrandom() failed\n");
exit(1);
}
uint8_t val = 0;
for(size_t i = 0; i < a->iters; i++) {
uint64_t idx = next() & a->smask;
val ^= a->buf[idx];
}
a->res = val;
return NULL;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 4) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs] [threads]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
int threads = strtol(argv[3], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
struct xor_args * args = malloc(sizeof(struct xor_args) * threads);
int rc; /* pthread call result */
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
arg->iters = calls / (uint64_t) threads;
if(i == 0) {
arg->iters += calls % threads; /* thread 0 also takes the remainder */
}
arg->smask = smask;
arg->buf = buf;
rc = pthread_create(&arg->id, NULL, (void * (*)(void *)) xor_worker, arg);
if(rc != 0) {
fprintf(stderr, "pthread_create() failed");
exit(1);
}
}
uint8_t val = 0;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
rc = pthread_join(arg->id, NULL);
if(rc != 0) {
fprintf(stderr, "pthread_join() failed");
exit(1);
}
val ^= arg->res;
}
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("time: %.3f\n", elapsed);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", elapsed / calls);
printf("M T / sec: %.3e\n", calls / elapsed / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
CodePudding user response:
The program is bound by the memory hierarchy. Indeed, on modern processors, accessing 1 byte from RAM causes a whole cache line to be fetched, which means 64 bytes on your processor. One channel of DDR4-2933 RAM can reach a theoretical bandwidth of 8 * 2933e6 / 1024**3 = 21.85 GiB/s. However, given that the cache is useless here and that a full 64-byte line is fetched for every useful byte, the maximum useful throughput is 21.85 / 64 * 1024 = 350 MiB/s. Based on this alone, the program should take at least 1000000000 / (350 * 1024**2) = 2.72 s.
In practice, no modern processor can fully saturate DDR4 memory (especially using only 1 core). Modern RAM is designed to be fast mainly for contiguous accesses, not random ones, despite the name. This is because speeding up random accesses is very hard, while reading contiguous data is frequent (and far simpler to optimize at the hardware level).
The main problem with DDR RAM is its latency, which has been huge for several decades (50-120 ns per cache-line fetch). It has not changed much over the last decade, while processors have become significantly faster. As a result, it is critical to mitigate this latency by keeping many requests in flight simultaneously, that is, by increasing the concurrency. However, the execution flow is sequential, and the number of simultaneous in-flight requests that 1 core can sustain is limited. Modern processors can execute a few instructions in parallel from a sequential program as long as those instructions are independent. The thing is, this program is pretty sequential due to the random number generator, though multiple loads can still be executed in parallel.
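To illustrate the concurrency point, here is a hypothetical restructuring (a sketch, not the original benchmark) that interleaves four independent PRNG streams so that consecutive loads do not share a dependency chain. xor_interleaved and next_stream are made-up names, and rotl is the helper from the benchmark above:
/* Sketch: four independent xoshiro states and accumulators give the
out-of-order core four independent chains, so more cache-line fetches
can be in flight at once. */
static uint64_t st[4][4]; /* four seeded PRNG states (seeding omitted) */
static inline uint64_t next_stream(uint64_t s[4]) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3];
s[2] ^= t; s[3] = rotl(s[3], 45);
return result;
}
uint8_t xor_interleaved(const uint8_t * buf, uint64_t smask, size_t calls) {
uint8_t v0 = 0, v1 = 0, v2 = 0, v3 = 0;
for(size_t i = 0; i < calls; i += 4) { /* assumes calls % 4 == 0 */
v0 ^= buf[next_stream(st[0]) & smask]; /* these four loads are */
v1 ^= buf[next_stream(st[1]) & smask]; /* independent of each other, */
v2 ^= buf[next_stream(st[2]) & smask]; /* unlike consecutive results */
v3 ^= buf[next_stream(st[3]) & smask]; /* of one serial PRNG stream */
}
return v0 ^ v1 ^ v2 ^ v3;
}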
Multiple read requests can be issued concurrently, but the buffers that receive data from the RAM and the caches have a limited size. On Intel processors, the units dedicated to this are the line-fill buffers (LFB) between the L1 and L2 caches, the super-queue between the L2 and the L3, and the integrated memory controller (iMC) between the L3 and the RAM. You can use perf to check which one is the bottleneck. AFAIK, the LFB has about 12-14 slots on your target processor, so it should not be a bottleneck here. The super-queue can read 32 bytes/cycle, that is, 1 cache line every 2 cycles, or 1 byte of buf every 2 cycles. That being said, the L3 latency is pretty big: at least about a dozen nanoseconds. In general, prefetching units are used to mitigate the latency, but random accesses are unpredictable, so hardware prefetchers are completely useless in this case.
Virtual memory makes random accesses slower. Indeed, when a cache line is fetched, the processor needs to translate the virtual address into a physical one. This is done by the Translation Lookaside Buffer (TLB) of each processor core, which is basically a cache that translates the address of a virtual memory page to a physical one. There are multiple TLBs for the different cache levels, and their reach depends on the size of the pages. Pages are typically 4 KiB on your processor, unless your OS decides to use huge pages here. Assuming it does not, the biggest TLB (the STLB) has 1536 entries for 4 KiB pages, which means it cannot be efficient for random accesses over a buffer bigger than 1536 * 4 KiB = 6 MiB. Beyond this limit, TLB misses become frequent. When a TLB miss happens, the processor has to walk the kernel-managed page tables in memory to translate the address, which introduces a significant additional latency. If huge pages are used, the threshold becomes 3 GiB (for 2 MiB pages) or even 16 GiB (for 1 GiB pages). Using huge pages can thus significantly speed things up compared to basic 4 KiB pages.
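As an illustration, here is one way to request huge pages for buf on Linux (a minimal sketch, assuming a Linux system; the benchmark already includes sys/mman.h). It tries an explicit hugetlb mapping and falls back to transparent huge pages via madvise:
/* Sketch: try an explicit huge-page mapping first; if the system has
no hugetlb pool configured, fall back to a normal mapping and hint the
kernel to back it with transparent huge pages. */
uint8_t * buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if(buf == MAP_FAILED) {
buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if(buf == MAP_FAILED) {
fprintf(stderr, "mmap failed\n");
exit(1);
}
madvise(buf, size, MADV_HUGEPAGE); /* hint: use THP for this range */
}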
Assuming the processor has large enough buffers and TLB caches not to limit the concurrency, the DDR4-SDRAM itself is typically the main bottleneck. Indeed, it not only has a huge latency, but it is also optimized for contiguous accesses. Such RAM devices are split into banks to reduce the device latency. Contiguous memory requests can be efficiently spread over multiple banks, but random requests are often significantly less efficient due to bank conflicts: when 2 requests are assigned to the same bank, they are processed serially rather than concurrently. While random requests tend to be spread amongst the banks, this load balancing is not as good as with contiguous requests, resulting in a somewhat lower throughput.
In the end, the speed of the program is certainly latency-bound, and the time taken is roughly calls * latency / concurrency. It looks like latency / concurrency is 10-15 ns in your case (your measured 56.8 MT/s corresponds to about 17.6 ns per access).
One solution to speed up this program a bit is to use prefetching instructions. They often help to hide the latency by telling the processor to prefetch data into the caches (or temporary buffers) early, so that subsequent accesses are faster since the data is supposed to be closer to the computing units. This optimization is brittle: data should be prefetched early enough (otherwise the processor will stall on cache misses anyway) and should be kept in the target caches (not evicted by newer prefetches). In your case, this is not so easy since the indices are random. Generating a relatively large batch of indices, prefetching them, and only then reading the values should help a bit. It should be slower for small arrays and faster for big ones. Note that software prefetching is not a silver bullet: it is generally not as good as hardware prefetching (due to the limited hardware concurrency).
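A minimal sketch of that batching idea, using GCC's __builtin_prefetch (BATCH = 16 is an assumed, tunable look-ahead, not a measured optimum; it also assumes calls is a multiple of BATCH):
enum { BATCH = 16 }; /* how far ahead to prefetch (tunable) */
uint64_t idx[BATCH];
uint8_t val = 0;
for(size_t i = 0; i < calls; i += BATCH) {
/* generate a batch of random indices and issue the prefetches early */
for(int j = 0; j < BATCH; j++) {
idx[j] = next() & smask;
__builtin_prefetch(&buf[idx[j]], 0, 0); /* read, low temporal locality */
}
/* by the time we read, the lines are hopefully in flight or arrived */
for(int j = 0; j < BATCH; j++) {
val ^= buf[idx[j]];
}
}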
An alternative solution is to use multiple cores (with multiple threads). That being said, this is not easy here either, due to the (inherently sequential) random number generator, unless it is OK to use one RNG per thread (as your threaded benchmark does). In that case, fetching full cache lines is generally the main bottleneck, due to the high concurrency, the limited RAM bandwidth, and mainly the poor efficiency (1 byte used out of 64 fetched). Unfortunately, there is currently no way to efficiently fetch single bytes from a DDR4-SDRAM using any mainstream x86-64 processor (including yours).