For example, is addition faster than subtraction faster than multiplication faster than division faster than taking the modulus, etc? What about things like the cosine, the tangent, or logarithm?
I'm not looking for a specific use case, but rather in general. Something like, suppose my use case is "if my input is approximately 300 and I want to get 3, is it faster to ..." (the exact result doesn't matter, as long as the order of magnitude is right). Are there any operations I can immediately reject as inefficient? For example, can I say 300 * 0.01 is better than 300 % 5 before I run the timings?
This question seems to say that multiplication is usually better than division (although the compiler ought to handle it for you), but doesn't say anything about the other elementary operations.
If it matters the language I'm interested in is C .
CodePudding user response:
No. This is dependent of the target processor architecture although some operation tends to be slower than others on nearly all modern processor (eg. division vs addition). This is also dependent of the code itself since some instructions can well pipelined while some other are not (dependent of the target processor architecture). In fact, the compiler also matters since the order of the instructions, the dependencies and the use of SIMD instructions can also strongly impact performance results. The type is also very important: an operation can be fast on a given type and slow with another one (compared to another operation for the same types). Finally, for trigonometric function, the libmath implementation and its version also matters a lot (regarding compiler flags too).
One basic example is this naive basic loop:
// data is of type TYPE* and size is of type size_t
TYPE sum = 0;
for(size_t i=0 ; i<size ; i)
sum = data[i];
If TYPE
is an integral type, the compiler can automatically vectorize the loop. The operation will likely be memory bound. If TYPE
is a floating-point type, then the operation will be much slower (at least an order of magnitude slower) unless some fastmath flags are used. This is because the compiler cannot automatically vectorize the code and there is a loop carried dependency on sum
causing the instruction latency to dominate the time of the loop. This latency can change from one processor to another. For example, adding two FP numbers takes a latency of 4 cycles on Intel Skylake while 2 instructions can be computed per cycle (operating on 4 items at once). On Sandy Bridge, the latency is 3 and 1 instruction can be computed per cycle. Thus, a Sandy Bridge should get faster results than Skylake without fastmath and the opposite with this option (assuming the processors operate at the same frequency).
Timings cannot be naively added: modern processors are able to execute many instruction in parallel per cycle in an out-of-order way. Thus, a slow square root added to a loop doing many latency bound instruction may not impact the overall timings (it happened me once). An ADD MUL can be rewritten as one FMADD which can be faster and an ADD MUL. A modulus division are computed using the same instruction on x86-64 processors so computing both is not slower than computing only a division/modulus.
The division is also a good example: a 64-bit integer division by a constant is very fast but a division by a variable is generally slow. A 32-bit division can be much faster than a 64-bit one (regarding the processor). A 64-bit floating-point division can be faster than a 64-bit integer one on some processors. Division was generally very slow on old processors but there are quite fast on new processors like the Apple M1 and a bit on Intel Alder Lake too. For example, a 64-bit idiv
on Skylake has a latency of 37-101 cycle and a throughput of 14.5-24 cycles, a 64-bit FP number has a latency of 13-15 cycle and a throughput of 1-4 cycles. On Zen3, it is respectively, 10-22 cycle, 7 cycle, 13 cycle and 0.5-4.5 cycle. A reciprocal can be computed even faster on most mainstream x86-64 platforms: On Skylake it has a 4 cycle latency and 1 cycle throughput while it is respectively 3 cycle and 0.5 cycle on Zen3 (same latency than an integer multiplication and better throughput).
For trigonometric functions, switching from the GNU implementation to the Intel one can results in drastically different results. Not only the Intel libmath tends to be significantly faster, but the relative performance between trigonometric function also change (since the implementation is completely different). New version of GNU libmath tends to vectorize code resulting in better performance for vectorized code for some specific functions.
An addition on subnormal floating-point number can be much slower an a division on normal floating-point number even though a division is slower than an addition on normal floating-point numbers since subnormal number tends to be very slow to compute. This factor is often more important than the actual operation being computed.
On embedded processors (like the ARM Cortex-M1, not to be confused with the Apple M1), floating-point operations are generally insanely slow compared to integer-based ones since such operation are not done by the hardware directly but manually in software. One should really avoid such operations on embedded systems.
Finally, some processors like Xeon Phi KNL implementing AVX-512-ER can compute exponential instructions quickly: in only 10 cycle (7 cycle for the throughput). This is significantly faster than a 64-bit division!
To sum up: every details matter. I think it is good to check the instruction table of the target processor and care about the generated code of the compiler. Still, in general people tends to assume the following is true:
(add, sub, xor, or, and, mul, shift) < (division, modulus, sqrt) < (exp, cos, tan, atan, log)