#include <immintrin.h>
__m256 mult(__m256 num) {
    return 278 * num / (num + 1400);
}
.LCPI0_0:
.long 0x438b0000 # float 278
.LCPI0_1:
.long 0x44af0000 # float 1400
mult(float __vector(8)): # @mult(float __vector(8))
        vbroadcastss ymm1, dword ptr [rip + .LCPI0_0] # ymm1 = [2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2]
        vmulps ymm1, ymm0, ymm1
        vbroadcastss ymm2, dword ptr [rip + .LCPI0_1] # ymm2 = [1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3]
        vaddps ymm0, ymm0, ymm2
        vrcpps ymm2, ymm0
        vmulps ymm3, ymm1, ymm2
        vfmsub213ps ymm0, ymm3, ymm1 # ymm0 = (ymm3 * ymm0) - ymm1
        vfnmadd213ps ymm0, ymm2, ymm3 # ymm0 = -(ymm2 * ymm0) + ymm3
        ret
Why does Clang add the two extra FMA instructions to the code? The result should already be computed with vmulps ymm3, ymm1, ymm2. Don't the extra instructions increase the latency beyond just using vdivps, like with -O3?
Answer:
The extra FMAs compensate for the reduced precision of vrcpps. ymm3 is an estimate of the result, but at about half the usual precision.
For simplicity, let's say the division was q = a / b.
The first FMA, vfmsub213ps, computes the difference (a * b⁻¹) * b - a, which is an estimate of how much the division was "off" by (in the original scale, before dividing by b). The second FMA, vfnmadd213ps, approximately divides that difference by b (by multiplying by b⁻¹) so that it becomes a difference in the scale of q, and subtracts it from the estimate of q to bring it closer to a / b.
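To see why one such step is enough, suppose the reciprocal estimate is r = (1 + ε) / b. Then the first estimate is q0 = a * r = q(1 + ε), the first FMA gives e = q0 * b - a = aε, and the second gives q0 - e * r = q(1 + ε)(1 - ε) = q(1 - ε²), ignoring rounding. The scalar sketch below models this in plain C. Note the assumptions: crude_rcp is a hypothetical stand-in for vrcpps (an exact reciprocal with the low 12 mantissa bits cleared, not the real hardware approximation), the fused operations are emulated in double precision, and all function names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for vrcpps: the exact reciprocal with its low
   12 mantissa bits cleared, leaving roughly 12 bits of precision.
   The real instruction uses a hardware approximation instead. */
static float crude_rcp(float b) {
    float r = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Absolute difference, used to compare a result with a reference. */
static float abs_err(float x, float y) {
    return x > y ? x - y : y - x;
}

/* Unrefined quotient: the value held in ymm3 after vmulps. */
static float estimate_div(float a, float b) {
    return a * crude_rcp(b);
}

/* One refinement step, mirroring the two FMAs in the listing.
   The product of two floats is exact in double, so doing the
   arithmetic in double closely emulates a fused operation. */
static float refined_div(float a, float b) {
    float r  = crude_rcp(b);
    float q0 = a * r;                                              /* vmulps                 */
    float e  = (float)((double)q0 * (double)b - (double)a);        /* vfmsub213ps: q0*b - a  */
    return (float)((double)q0 - (double)e * (double)r);            /* vfnmadd213ps: q0 - e*r */
}
```

For example, with a = 834 and b = 1403 (the case num = 3), estimate_div is off by roughly 1e-4, while refined_div lands within a few units in the last place of a / b.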
If you're OK with reduced precision, you could explicitly use _mm256_rcp_ps and multiply by that; then there will be no extra FMAs to compensate.
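A minimal sketch of that explicit variant follows. The function name mult_approx and the pointer-based interface are assumptions made for illustration (the original function takes and returns __m256 directly), and the target attribute is a GCC/Clang extension used here only so the snippet builds without extra compiler flags.

```c
#include <immintrin.h>

/* Hypothetical reduced-precision variant of mult(): computes
   278 * x / (x + 1400) for 8 floats using the reciprocal estimate
   directly, with no refinement step afterwards. */
__attribute__((target("avx")))
void mult_approx(const float *in, float *out) {
    __m256 num   = _mm256_loadu_ps(in);                          /* load 8 inputs        */
    __m256 numer = _mm256_mul_ps(num, _mm256_set1_ps(278.0f));   /* 278 * x              */
    __m256 denom = _mm256_add_ps(num, _mm256_set1_ps(1400.0f));  /* x + 1400             */
    __m256 recip = _mm256_rcp_ps(denom);                         /* vrcpps, ~12 bits     */
    _mm256_storeu_ps(out, _mm256_mul_ps(numer, recip));          /* numer * (1 / denom)  */
}
```

Each result should agree with the exact quotient only to about three or four significant decimal digits, which is the trade-off being accepted here.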
"Don't the extra instructions increase the latency beyond just using vdivps?"
Yes. This 4-instruction sequence (vrcpps, vmulps, and the two FMAs) has a latency of 16 cycles on Ice Lake, while vdivps would take 11 cycles. However, throughput is approximately doubled compared to vdivps. Depending on the context, either latency or throughput could be more important; more often it's throughput. Compilers aren't necessarily very good at deciding which matters more, though in this case I can't blame the compiler (there is no context).