I have written a basic three-loop matrix multiplication with both a serial and an OpenMP implementation. For the same size (3200x3200), perf stat -a -e instructions,cycles shows:
Serial
265,755,992,060 instructions # 0.71 insn per cycle
375,319,584,656 cycles
85.923380841 seconds time elapsed
Parallel (16 threads)
264,524,937,733 instructions # 0.30 insn per cycle
883,342,095,910 cycles
13.381343295 seconds time elapsed
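The kernels are essentially the textbook triple loop; a simplified sketch (not the exact code, assuming square row-major double matrices):

```c
// Simplified sketch of the kernel; the OpenMP version just adds the pragma.
void matmul(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for        // absent in the serial version
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```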
In the parallel run, I expected the number of cycles to be roughly the same as in the serial run, but it isn't. On the other hand, the number of instructions is about the same, whereas I expected it to be 16x that of the serial run.
Any thoughts on the differences?
CodePudding user response:
Assuming your process scales perfectly, the instructions are simply shared amongst the cores, so the total number of instructions stays the same. The same applies to the total number of cycles. However, when your process does not scale, the number of instructions should still be the same but the number of cycles increases. This is generally due to contention on a shared resource, which causes stall cycles in the pipeline.
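Your own numbers show exactly that pattern: the instruction count is essentially unchanged (about 265G vs. 264G), while the total cycle count grows from about 375G to about 883G. Roughly 2.4x more core-cycles are spent on the same work, which is why the IPC drops from 0.71 to 0.30.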
In your parallel implementation, the memory hierarchy is not used efficiently: the kernel causes a lot of cache misses, which can saturate the L3 or the RAM when many threads are used, hence memory stalls. If you use simultaneous multithreading (aka Hyper-Threading), it can also contribute to this, because two threads on the same core often do not run truly in parallel (some parts of the core are shared between them).
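One way to check the memory-stall hypothesis is to add cache counters to the perf command you already use, for example the generic cache-references and cache-misses events (exact event names vary between CPUs). If the miss ratio jumps sharply when going from 1 thread to 16, the L3/RAM saturation explanation fits.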
CodePudding user response:
A constant instruction count makes sense; there's the same amount of total work to do, whether those instructions all run on the same core or not.
Are you using SMT such as Hyper-Threading? IPC goes down when the same physical core divides its time between two logical cores. For some programs, SMT increases overall throughput; for cache-bound programs (like a naive matmul), having two threads compete for the same L1/L2 cache hurts.
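A quick way to test this, assuming an OpenMP 4.0+ runtime: pin one thread per physical core with OMP_PLACES=cores and OMP_PROC_BIND=spread, and set OMP_NUM_THREADS to the number of physical cores. If the IPC recovers, SMT contention was a significant part of the slowdown.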
Otherwise, if these are 16 physical cores, you might be seeing the same effect from L3 contention: a single thread no longer has all of the L3 to itself.
Anyway, with proper cache-blocking, SMT usually hurts overall throughput for matmul / dense linear algebra. A physical core can be saturated with ALU work by one thread of well-tuned code, so contention for the per-core caches just hurts. (But multi-threading definitely helps overall time, as it did in your case.) A cache-blocked matmul is usually 5 nested loops, as in the example near the end of What Every Programmer Should Know About Memory.
Agner Fog (https://agner.org/optimize/) also mentions hyperthreading hurting for some workloads in his microarch PDF.
CodePudding user response:
Matrix-matrix multiplication is the typical example of an operation that theoretically has lots of cache reuse, and so can run close to peak speed, except that the standard three-loop implementation doesn't. Optimized implementations that use the cache efficiently are actually six loops deep: each of your three loops gets tiled, and you need to interchange the loops and choose the tile sizes just right. This is not trivial.
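A minimal sketch of such a tiled kernel (the tile size is a placeholder you would have to tune to your cache sizes; this is illustrative, not an optimized implementation):

```c
// Six-deep loop nest: each of i, j, k is tiled so the working set of one
// block of A, B and C stays cache-resident. Assumes C is zero-initialized.
#define TILE 64   /* placeholder; tune to your L1/L2 sizes */

void matmul_tiled(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```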
So your implementation basically gets no benefit from the cache. Maybe you get some effect from the L3, but, given a sufficient problem size, certainly not from the L1, and likely not from the L2. In other words: you are bandwidth-bound. And then the problem is that you may have 16 cores, but you don't have the memory bandwidth to feed all of them.
I must say that your speedup factor of 6-7 is a little disappointing, but maybe your architecture just doesn't have enough bandwidth. I know that on top-of-the-line nodes I would expect something like 12. But why don't you test it? Write a benchmark that does nothing but stream data to and from memory: vector-vector addition. Then see up to how many cores you get linear speedup.
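A minimal sketch of such a streaming benchmark, along the lines of STREAM (the array size is a placeholder; pick it well beyond your L3 capacity):

```c
// Memory-bandwidth probe: vector-vector addition over arrays much larger
// than the last-level cache. Compile with something like -O2 -fopenmp.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 27)   /* ~1 GiB per array of doubles; placeholder */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    double t1 = omp_get_wtime();

    /* Three streams: read a, read b, write c. */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("threads=%d  %.2f GB/s\n", omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```

Run it with OMP_NUM_THREADS set to 1, 2, 4, ... and watch where the GB/s figure stops growing; that is roughly the core count beyond which your naive matmul cannot speed up either.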