Why does Clang add extra FMA instructions?


#include <immintrin.h>

__m256 mult(__m256 num) {
    return 278*num/(num + 1400);
}
.LCPI0_0:
        .long   0x438b0000                      # float 278
.LCPI0_1:
        .long   0x44af0000                      # float 1400
mult(float __vector(8)):                           # @mult(float __vector(8))
        vbroadcastss    ymm1, dword ptr [rip + .LCPI0_0] # ymm1 = [2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2,2.78E+2]
        vmulps  ymm1, ymm0, ymm1
        vbroadcastss    ymm2, dword ptr [rip + .LCPI0_1] # ymm2 = [1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3,1.4E+3]
        vaddps  ymm0, ymm0, ymm2
        vrcpps  ymm2, ymm0
        vmulps  ymm3, ymm1, ymm2
        vfmsub213ps     ymm0, ymm3, ymm1        # ymm0 = (ymm3 * ymm0) - ymm1
        vfnmadd213ps    ymm0, ymm2, ymm3        # ymm0 = -(ymm2 * ymm0) + ymm3
        ret

Why does Clang add the two extra FMA instructions to the code? The result should already be computed with vmulps ymm3, ymm1, ymm2. Don't the extra instructions increase the latency beyond just using vdivps like with -O3?

Godbolt

CodePudding user response:

The extra FMAs compensate for the reduced precision of vrcpps. ymm3 is an estimate of the result, but at about half the usual precision.

For simplicity let's say the division was q = a / b.

The first FMA, vfmsub213ps, computes the difference (a * b⁻¹) * b - a, which is an estimate of how much the division was "off" by (in the original scale, before dividing by b). The second FMA approximately divides that difference by b (by multiplying it by b⁻¹) so it becomes a difference in the scale of q, and subtracts it from the estimate to bring it closer to the true a / b.
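
To see the correction step in scalar form, here is a rough sketch (hypothetical names; fmaf from <math.h> stands in for the fused multiply-adds, and rcp_b for the low-precision vrcpps estimate of 1/b):

#include <math.h>

/* Sketch of the compiled sequence for q = a / b.
   rcp_b is a low-precision estimate of 1/b (what vrcpps produces). */
float refine_div(float a, float b, float rcp_b) {
    float q_est = a * rcp_b;             /* vmulps: rough quotient          */
    float err   = fmaf(q_est, b, -a);    /* vfmsub213ps: q_est*b - a        */
    return fmaf(-rcp_b, err, q_est);     /* vfnmadd213ps: q_est - rcp_b*err */
}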

If you're OK with reduced precision, you can use _mm256_rcp_ps explicitly and multiply by its result; then there are no extra FMAs to compensate.
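
A minimal sketch of that reduced-precision variant (mult_fast is a made-up name; the vrcpps estimate is only accurate to roughly 12 bits):

#include <immintrin.h>

__m256 mult_fast(__m256 num) {
    __m256 a = _mm256_mul_ps(_mm256_set1_ps(278.0f), num);
    __m256 b = _mm256_add_ps(num, _mm256_set1_ps(1400.0f));
    return _mm256_mul_ps(a, _mm256_rcp_ps(b));   /* no correction FMAs */
}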

Don't the extra instructions increase the latency beyond just using vdivps

Yes, this 4-instruction sequence would take 16 cycles on Ice Lake, while vdivps would take 11 cycles. However, throughput is approximately doubled compared to vdivps. Depending on the context, either latency or throughput could be more important; more often it's throughput. Compilers aren't necessarily very good at deciding which matters more, though in this case I can't blame it (there is no context).
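
As a rough illustration of that distinction (hypothetical callers of the mult from the question): a loop over independent elements is limited by throughput, while a dependency chain pays the full latency on every step.

/* Throughput-bound: iterations are independent, so the out-of-order
   core can overlap many reciprocal sequences in flight. */
void mult_array(__m256 *dst, const __m256 *src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = mult(src[i]);
}

/* Latency-bound: each result feeds the next call, so the full
   latency of the division sequence is paid on every iteration. */
__m256 mult_chain(__m256 x, int n) {
    for (int i = 0; i < n; ++i)
        x = mult(x);
    return x;
}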
