Why can't a CPU achieve FLOPS performance equivalent to its GHz in a simple loop?


I was wondering why a simple loop such as this one can't hit my CPU's clock speed (4.2 GHz):

float sum = 0;
for (int i = 0; i < 1000000; i += 1) {
    sum = sum * 1 + 1;
}

Intuitively I would expect this to finish in less than 1 ms (around 0.238 ms), doing 4.2 billion iterations per second. But I get about 3 ms, which is about 333 million iterations per second.

I assume the math takes 2 cycles, one for the multiplication and another for the addition. So let's say I'm doing 666 million operations per second... that still seems slow. Then I assumed that the loop comparison takes a cycle and the loop counter increment takes another cycle...

So I created the following code to remove the loop...

void listOfSums() {
    float internalSum = 0;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    // Repeated 100k times
}

To my surprise it's slower: it now takes 10 ms, which comes to only 100 million iterations per second.

Given that modern CPUs use pipelining, out-of-order execution, branch prediction... it seems that I'm unable to saturate the 4.2 GHz speed by just doing two floating-point operations inside a loop.

Is it then safe to assume that the 4.2 GHz is only achievable with SIMD, which fully saturates the CPU core with work, and that a simple loop will get you about 1/6 of the GHz in floating-point performance? I've tried different processors and 1/6 seems to be in the ballpark (Intel, iPhone, iPad).

What exactly is the bottleneck? The CPU's ability to parse instructions? And can it only be overcome with SIMD?

CodePudding user response:

It is typical that a current processor can issue one or more floating-point additions in each processor cycle and can issue one or more floating-point multiplications in each cycle. It is also typical that a floating-point addition or multiplication will take four cycles to complete. This means, once you have started four floating-point additions—one in cycle n, one in cycle n+1, one in cycle n+2, and one in cycle n+3—the processor may be completing one addition per cycle—one in cycle n+4 (while a new one also starts in cycle n+4), one in n+5, and so on.
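
To put rough numbers on that (assuming a four-cycle latency and one issue per cycle, as above): at 4.2 GHz, a stream of independent operations can complete up to 4.2 billion per second, but a chain in which each operation must wait for the previous result completes only 4.2 GHz / 4 ≈ 1.05 billion operations per second.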

However, in order to start a floating-point operation, the inputs to that operation must be ready. Once sum * 1 has been started in cycle n, its result might not be ready until cycle n+4. So the addition of 1 will start in cycle n+4. And that addition will not complete until cycle n+8. And then the multiplication in the next iteration that uses that result cannot start until cycle n+8. Thus, with the nominal loop structure shown, one floating-point addition or multiplication will be completed every four cycles.
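
Applied to the loop above (still assuming four-cycle latencies), each iteration is a multiply followed by a dependent add, so roughly 8 cycles per iteration, which at 4.2 GHz would predict about 525 million iterations per second. The measured 333 million per second is the same order of magnitude; the remaining gap could come from somewhat longer latencies on the particular processor or from loop overhead.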

If instead you try:

float sum0 = 0;
float sum1 = 0;
float sum2 = 0;
float sum3 = 0;
for (int i = 0; i < 1000000; i += 1)
{
    sum0 = sum0 * 1 + 1;
    sum1 = sum1 * 1 + 1;
    sum2 = sum2 * 1 + 1;
    sum3 = sum3 * 1 + 1;
}

then you may find four times as many floating-point operations are completed in the same time.
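
As a concrete way to test this, here is a minimal, self-contained benchmark sketch in plain C. It times both loops with the POSIX clock_gettime function; the elapsed_ms helper and the N constant are illustrative, not from the original post. Be aware that an optimizer may fold the multiplication by 1 or delete the loops outright, so the comparison is only meaningful at a modest optimization level and with the results kept live (the volatile sink below):

#include <stdio.h>
#include <time.h>

#define N 1000000

/* Milliseconds between two timestamps. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    struct timespec t0, t1;
    volatile float sink;  /* keeps the results alive so the loops aren't removed */

    /* One dependency chain: each iteration waits on the previous result. */
    float sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i += 1)
        sum = sum * 1 + 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum;
    printf("1 chain:  %.3f ms\n", elapsed_ms(t0, t1));

    /* Four independent chains: the pipeline can overlap them. */
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i += 1) {
        sum0 = sum0 * 1 + 1;
        sum1 = sum1 * 1 + 1;
        sum2 = sum2 * 1 + 1;
        sum3 = sum3 * 1 + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum0 + sum1 + sum2 + sum3;
    printf("4 chains: %.3f ms\n", elapsed_ms(t0, t1));

    (void)sink;
    return 0;
}

If the latency-bound explanation is right, the two loops should take roughly the same wall-clock time even though the second performs four times the floating-point work.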

These details vary from processor model to processor model. Some processors might start working on certain instructions before all of their inputs are ready, some might offer early forwarding of results directly to other instruction execution units before the result is delivered to the regular result register, and so on, so the obtained performance is hugely dependent on processor characteristics.

Regarding the listOfSums example, the code grossly exceeds the size of the L1 instruction cache, and therefore the processor must fetch each instruction from slower levels of the memory hierarchy before executing it, which greatly reduces the performance.
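
For a rough sense of scale: 100,000 copies of that statement, at a few instructions of several bytes each, add up to on the order of a megabyte of machine code, while a typical L1 instruction cache is only 32–64 KB, so nearly every instruction fetch misses the L1.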
