Is the "throughput" listed by Intel per thread or per core?

Is the throughput listed by the Intel intrinsics guide per thread or per core?

CodePudding user response:

It's per physical core.

SMT (Hyper-Threading) only helps overall throughput if you're bottlenecked on something other than back-end execution ports. If the threads sometimes stall on cache misses or branch mispredictions, SMT can come closer to keeping the execution units fed with a new uop to start every clock cycle, and so closer to the listed throughput limit. Having two instruction streams for the out-of-order scheduler to pick from can avoid starvation (stalling) even while the thread on one logical core is stuck waiting for something.
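As a rough illustration of that point, here is a toy model in Python. It is not a simulation of any real CPU: it assumes a single execution port that can start one uop per cycle, and it describes each hardware thread as a made-up per-cycle list of booleans saying whether that thread has a uop ready. The name `busy_fraction` and the stall patterns are hypothetical, chosen only for the example.

```python
def busy_fraction(*threads):
    """Fraction of cycles in which the (single, idealized) port starts a uop.

    Each argument is an equal-length sequence of booleans:
    True  = that thread has a uop ready this cycle,
    False = that thread is stalled (e.g. on a cache miss).
    The port is busy in a cycle if at least one thread is ready.
    """
    cycles = len(threads[0])
    busy = sum(any(t[c] for t in threads) for c in range(cycles))
    return busy / cycles


# Hypothetical stall patterns, purely for illustration.
stalls_a = [True, True, False, False, True, True, False, False]
stalls_b = [False, True, True, False, False, True, True, False]
never_stalls = [True] * 8

print(busy_fraction(stalls_a))                    # 0.5: one stalling thread alone
print(busy_fraction(stalls_a, stalls_b))          # 0.75: second thread fills some gaps
print(busy_fraction(never_stalls, never_stalls))  # 1.0: port already saturated, no gain
```

In this model a thread that already keeps the port busy every cycle gains nothing from a sibling thread, which matches the claim that SMT doesn't help code bottlenecked on back-end execution ports.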

Note that you can get better details about instruction timings from https://uops.info/, and about what the numbers mean from https://agner.org/ and/or Intel's optimization manuals.

The "throughput" of a single instruction doesn't tell you whether it competes with some other instruction. For example, FMA with 0.5c throughput runs on different ports (p0 and p1) than shuffles with 1c throughput (p5) on Intel CPUs like Haswell and Skylake. (The same holds on Ice Lake for shuffles that can't also run on the secondary shuffle unit.) That's why looking at back-end uops is more useful: how many uops an instruction needs, and which ports they can run on.
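To make the port argument concrete, here is a small Python sketch of an idealized port-pressure estimate. It assumes every instruction is a single uop, uses only the port assignments mentioned above (FMA on p0/p1, shuffle on p5), and ignores front-end width, latency, and dependency chains. The function name `min_cycles` is hypothetical. The bound it computes is the largest value, over every subset of ports, of (uops that can only run inside that subset) divided by (number of ports in the subset).

```python
from itertools import combinations


def min_cycles(uops):
    """Lower bound on cycles per loop iteration from port pressure alone.

    uops: list of (count, ports) pairs, where ports is the set of
    execution ports that kind of uop is allowed to run on.
    """
    all_ports = sorted(set().union(*(ports for _, ports in uops)))
    bound = 0.0
    for k in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, k):
            allowed = set(subset)
            # uops with no choice but to run on ports inside this subset
            forced = sum(n for n, ports in uops if set(ports) <= allowed)
            bound = max(bound, forced / k)
    return bound


FMA = {"p0", "p1"}   # 0.5c throughput: two ports to choose from
SHUFFLE = {"p5"}     # 1c throughput: one port

print(min_cycles([(2, FMA), (1, SHUFFLE)]))  # 1.0: FMAs and the shuffle don't compete
print(min_cycles([(2, SHUFFLE)]))            # 2.0: both shuffles need p5
print(min_cycles([(3, FMA)]))                # 1.5: three uops shared across p0 and p1
```

Under these assumptions, two FMAs plus one shuffle per iteration can sustain one iteration per cycle, which the per-instruction throughput numbers alone wouldn't tell you.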