Is vfmadd132pd slow on AMD Zen 3 architecture?

I've created two versions of a dot product in .NET using AVX-256 instructions. One uses a fused multiply-add, and the other separates it into a multiply and an add.

public static unsafe Vector256<double> Dot(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Avx.Add(Avx.Multiply(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i)), vresult);

    return vresult;
}

public static unsafe Vector256<double> Dot2(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), vresult);

    return vresult;
}

This compiles to the following JIT asm:

C.Dot(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vmulpd ymm1, ymm1, [r8+r10*8]
    L001f: vaddpd ymm0, ymm1, ymm0
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

C.Dot2(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vfmadd132pd ymm1, ymm0, [r8+r10*8]
    L001f: vmovaps ymm0, ymm1
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

When I benchmark this code using my Intel processor and benchmark.net, I see a modest speedup as expected. But when I run it on my AMD Ryzen 5900X, it's about 30% slower on nearly every size of array. Is this a bug in AMD's implementation of vfmadd132pd or in Microsoft's compiler?

CodePudding user response:

You can see instruction latencies and throughputs at https://www.agner.org/optimize/instruction_tables.pdf. vfmadd132pd is not slow on Zen 3 compared to recent Intel *Lake processors (Skylake, Ice Lake, and so on).
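
For example, as of those tables (approximate numbers; double-check the current revision for your exact CPU):

    vfmadd...pd ymm: 4-cycle latency on both Zen 3 and Skylake/Ice Lake, 2/clock throughput on both
    vmulpd ymm: 3-cycle latency on Zen 3 vs. 4 cycles on Skylake/Ice Lake, 2/clock throughput on both
    vaddpd ymm: 3-cycle latency on Zen 3 vs. 4 cycles on Skylake/Ice Lake, 2/clock throughput on both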

CodePudding user response:

I don't think there are bugs anywhere. The story is more interesting than that.

On modern Intel CPUs, floating-point addition, multiplication, and FMA instructions all compete for the same two execution ports, P0 and P1.

On modern AMD CPUs, multiplication and FMA instructions run on the FP0 or FP1 execution units, while floating-point addition runs on FP2 or FP3. For this reason, when throughput is measured in instructions per second rather than FLOPs, the mix of 50% additions / 50% multiplications you have in your Dot method is twice as fast as a program of 100% FMA instructions: the CPU can start 2 multiplies (FP0/FP1) plus 2 adds (FP2/FP3) per clock, versus only 2 FMAs per clock on FP0/FP1.

CodePudding user response:

Near duplicate of this Q&A about unrolling dot-product loops with multiple accumulators: you bottleneck on vaddpd or vfma...pd latency, not throughput, and yes, Zen 3 has lower-latency FP vaddpd (3 cycles) than FMA (4 cycles). With a single accumulator, the loop-carried dependency chain through vresult limits you to one iteration per 3 cycles with vaddpd but one per 4 cycles with FMA, and 4/3 ≈ 1.33 lines up with the ~30% slowdown you measured.

Intel Skylake / Ice Lake CPUs have 4-cycle latency for all FP add/sub/mul/fma operations, running them on the same execution units. So I wouldn't have expected the speedup from FMA that you saw on your Intel CPU, since the front-end shouldn't be a bottleneck. Maybe in the add version, an independent multiply sometimes delays an add by a cycle, hurting the critical-path dependency chain? Unlikely; oldest-ready-first uop scheduling should mean the independent work (the multiplies) runs way ahead of the adds.

Intel Haswell, Broadwell, and Alder Lake have a vaddpd latency of 3 cycles, less than their FMA latencies, so on those CPUs the separate multiply+add version would have the latency advantage too.


But if you unroll with multiple accumulators, you can hide FP latency and bottleneck on throughput. Or on load throughput, since you need 2 loads per FMA.
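
For example, here's a minimal sketch of that unrolling in the style of your code (illustrative and untested: it assumes n is a multiple of 16, and DotUnrolled is a made-up name):

public static unsafe Vector256<double> DotUnrolled(double* x, double* y, int n)
{
    // Four independent accumulators = four independent dependency chains.
    // Each FMA waits only on its own accumulator, so the chains overlap
    // instead of serializing on a single 4-cycle-latency chain.
    var acc0 = Vector256<double>.Zero;
    var acc1 = Vector256<double>.Zero;
    var acc2 = Vector256<double>.Zero;
    var acc3 = Vector256<double>.Zero;

    for (int i = 0; i < n; i += 16)
    {
        acc0 = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), acc0);
        acc1 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 4), Avx.LoadVector256(y + i + 4), acc1);
        acc2 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 8), Avx.LoadVector256(y + i + 8), acc2);
        acc3 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 12), Avx.LoadVector256(y + i + 12), acc3);
    }

    // Combine the partial sums once at the end; these adds are off the
    // loop-carried critical path.
    return Avx.Add(Avx.Add(acc0, acc1), Avx.Add(acc2, acc3));
}

As a rough rule of thumb you need about latency × per-clock throughput independent chains to hide the latency; with 4-cycle FMA latency and 2 loads per FMA, four accumulators is already enough to move the bottleneck to the loads.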

AMD does have the FP throughput to run 2 multiplies and 2 adds per clock on Zen 2 and later, but Intel doesn't. Although with the load bottleneck you'd only get 1 of each per clock anyway.

See also Latency bounds and throughput bounds for processors for operations that must occur in sequence re: dependency chains and latency.

Your current version takes advantage of the data parallelism in a dot product using SIMD: more work per instruction. But you're doing nothing to let the CPU find instruction-level parallelism between those SIMD vector operations, so you're missing one of the three factors that can scale performance (SIMD parallelism, ILP, and thread-level parallelism for huge arrays).
