I have a type mat4 which represents a float[4][4]. Internally it uses a 512-bit register.
union alignas(16 * sizeof(float)) mat4 {
private:
__m512 m512;
__m512d m512d;
ALWAYS_INLINE mat4(__m512 m512) : m512{m512} {}
ALWAYS_INLINE mat4(__m512d m512d) : m512d{m512d} {}
ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }
ALWAYS_INLINE mat4& operator=(__m512 _m512) {
m512 = _m512;
return *this;
}
ALWAYS_INLINE mat4& operator=(__m512d _m512d) {
m512d = _m512d;
return *this;
}
public:
friend void __vectorcall transform_children(mat4 parent, std::span<mat4> children);
};
I also have a function transform_children(mat4 parent, std::span<mat4> children). It treats all mat4s as transformation matrices and transforms all the children (in place) by multiplying them with the parent. I wrote¹ an optimised implementation using AVX512F intrinsics.
void __vectorcall transform_children(mat4 parent, std::span<mat4> children) {
mat4* const __restrict bs = children.data();
const size_t n = children.size();
ASSUME(n != 0);
const mat4 zmm1 = _mm512_permute_ps(parent, 0);
const mat4 zmm2 = _mm512_permute_ps(parent, 85);
const mat4 zmm3 = _mm512_permute_ps(parent, 170);
const mat4 zmm0 = _mm512_permute_ps(parent, 255);
for (int i = 0; i < n; ++i) {
mat4& __restrict zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f64x2(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f64x2(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
}
}
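For reference, here's a scalar sketch of what the vectorised loop computes: each child is replaced by parent × child. The Mat4 struct, the row-major float[4][4] layout (one matrix row per 128-bit lane), and the function name are illustrative assumptions inferred from the broadcast pattern of the intrinsics:

```cpp
#include <cstddef>

// Hypothetical scalar stand-in for mat4: row-major float[4][4].
struct Mat4 { float m[4][4]; };

// Scalar equivalent of transform_children: children[i] = parent * children[i].
void transform_children_scalar(const Mat4& parent, Mat4* children, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        Mat4 out{};  // zero-initialised accumulator
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                for (int k = 0; k < 4; ++k)
                    out.m[r][c] += parent.m[r][k] * children[i].m[k][c];
        children[i] = out;
    }
}
```

In the intrinsic version, zmm1/zmm2/zmm3/zmm0 hold parent.m[r][k] broadcast across each element of lane r (for k = 0..3), the shuffles broadcast row k of the child to every lane, and the multiply plus three FMAs accumulate exactly this sum.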
Both GCC and Clang compile this nearly literally to optimised assembly. Unfortunately, MSVC does something weird. For some reason, instead of loading the value of bs[i] into a register and then storing it back to the array at the end of the iteration, it accesses the memory 4 times:
void transform_children(mat4,std::span<mat4,4294967295>) PROC ; transform_children, COMDAT
mov ecx, DWORD PTR _children$[esp]
vpermilps zmm4, zmm0, 0
vpermilps zmm5, zmm0, 85
vpermilps zmm6, zmm0, 170
vpermilps zmm7, zmm0, 255
test ecx, ecx
je SHORT $LN36@transform_
mov eax, DWORD PTR _children$[esp-4]
npad 8
$LL4@transform_:
lea eax, DWORD PTR [eax+64]
vmovupd zmm3, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm0, zmm3, zmm3, 85
vmulps zmm0, zmm0, zmm5
vshuff64x2 zmm1, zmm3, zmm3, 0
vmovups zmm2, zmm4
vfmadd213ps zmm2, zmm1, zmm0
vshuff64x2 zmm0, zmm3, zmm3, 255
vmovupd ZMMWORD PTR [eax-64], zmm0 ; HERE
vfmadd231ps zmm2, zmm7, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm1, zmm3, zmm3, 170
vmovups zmm0, zmm6
vfmadd213ps zmm0, zmm1, zmm2
vmovups ZMMWORD PTR [eax-64], zmm0 ; HERE
sub ecx, 1
jne SHORT $LL4@transform_
$LN36@transform_:
vzeroupper
ret 8
void transform_children(mat4,std::span<mat4,4294967295>) ENDP ; transform_children
What could I do to make MSVC access memory only twice, like GCC and Clang² do?
1. To be precise, GCC and Clang wrote this implementation (sort of). First, I wrote the typical implementation using two nested loops. Then, I ran it through GCC using -mavx512f. GCC was smart enough to generate optimised vectorised code. Then, I converted this vectorised code from assembly back to C using intrinsics. Then, I compiled the new intrinsic code with Clang, and it generated even faster vectorised assembly. Then I converted Clang's assembly back to C intrinsics again.
2. Clang accesses memory 4 times, but it unrolls the loop, so it's still two accesses per iteration.
CodePudding user response:
TL:DR: it turns out that MSVC does a bad job when it has to convert between __m512d and __m512 through the overloaded conversions of your mat4 class. So just do everything with __m512 intrinsics, including the shuffling of 128-bit lanes.
MSVC making worse code is unfortunate but not shocking; MSVC's optimizer is well known to be not as good in general. MSVC doesn't do strict-aliasing, although __m512 can alias anything anyway, so IDK if that's relevant here.
Seems like you should just use a __m512 (or maybe mat4) temporary variable instead of telling the compiler to modify bs[i] repeatedly and hoping it actually doesn't. Especially across implicit conversion from __m512d (from the pd aka f64 shuffles) to mat4 to __m512 (for single-precision FMAs) and back.

_mm512_shuffle_f32x4 is a drop-in replacement for _mm512_shuffle_f64x2; both use the shuffle-control immediate to select 128-bit lanes, and the 32- vs. 64-bit element granularity only matters for masking, which you're not doing. It's more idiomatic to use f32x4 shuffles on packed-float data, so generally prefer that anyway.
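To see concretely why the two shuffles are interchangeable here, the lane selection can be modelled in scalar code (the Lanes type and shuffle_lanes name are illustrative, not real intrinsics). In the real instructions, result lanes 0-1 select from the first source operand and lanes 2-3 from the second; with both operands the same register, as in the loop, that reduces to result lane i = source lane ((imm >> 2*i) & 3), regardless of f32x4 vs. f64x2:

```cpp
#include <array>
#include <cstdint>

// A 512-bit register modeled as four 128-bit lanes of four floats.
using Lanes = std::array<std::array<float, 4>, 4>;

// Scalar model of _mm512_shuffle_f32x4(v, v, imm) (and equally of
// _mm512_shuffle_f64x2 with both sources equal): each 2-bit field of
// the immediate picks a source lane for one result lane.
Lanes shuffle_lanes(const Lanes& src, std::uint8_t imm) {
    Lanes out{};
    for (int i = 0; i < 4; ++i)
        out[i] = src[(imm >> (2 * i)) & 3];
    return out;
}
```

With imm = 85 (0b01010101), every 2-bit field is 01, so all four result lanes are copies of source lane 1, which is exactly the broadcast the loop relies on.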
Writing it like this gets MSVC to make the asm you want. Using a __m512 variable required me to make the intrinsic types all match (if I didn't want to sprinkle _mm512_castps_pd and _mm512_castpd_ps around the shuffles); in fact, that's what first led me to notice the __m512d vs. __m512 type difference.
for (int i = 0; i < n; ++i) {
__m512 zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
bs[i] = zmm4;
}
MSVC 19.32 (Godbolt, same as v19.latest) is reloading your zmm0 constant from _zmm0$1$[esp+64] at the bottom of the loop, right before the vmovups [eax-64], zmm1 store into bs[i]. It seems to use ZMM3 as a temporary later in the loop, overwriting the constant. It also has a couple of instructions like vmovups zmm1, zmm7.
But that only happens in a 32-bit build like you linked, not a normal 64-bit build like https://godbolt.org/z/GWszEnfP5 where it doesn't spill any vector constants to the stack. (It does save/restore XMM6 and 7, though; IDK if Windows x64 made XMM16..31 all call-preserved like XMM6..15 are. You'd hope not, that's way too many call-preserved registers.) It still only used ZMM0..7, so it could have done that in 32-bit code, it just failed.
GCC targeting 32-bit mode with -mabi=ms doesn't have those wasted zmm-to-zmm move instructions; it's able to arrange its FMAs to modify zmm4 (in ZMM0) in place, scheduling the shuffles appropriately so the registers can be reused. (https://godbolt.org/z/9sGbYn71o)
Using the same vector type for all intrinsics also works for MSVC
Even with the reference, we get asm without extra store/reload of zmm4 on Godbolt with x86 MSVC v19.latest after just changing the shuffles to be f32x4.
for (int i = 0; i < n; ++i) {
mat4& __restrict zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
//bs[i] = zmm4;
}
I think it's more idiomatic to write code as loading a vector into a register, then processing, then storing back to memory. Especially with a name like zmm4, a reference variable seems odd; if you're thinking in terms of asm and registers, reference variables aren't a thing. A name like zmm4 doesn't imply any magic that will leave memory updated after modifying a register.
Using a non-reference means you're only modifying a local __m512 (or a local mat4 if you prefer), which is always easier for compilers to optimize into a register. (Although in your loop there aren't any other memory references it could alias with, even without __restrict.)
BTW, intrinsics let you use slightly meaningful names for vector variables, like vmat, mati, vbsi, or vchild, not zmm4. It's unlikely that the compiler will actually keep your C zmm4 variable in the ZMM4 register, so it's extra mental effort to compare the asm to the C when naming vars this way; e.g. you get instructions like vmovups zmm3, ZMMWORD PTR _zmm0$1$[esp+64]. Using names like zmm0 is usually throwing away one of the clarity / readability advantages of intrinsics over assembly.
In fact you'd prefer the compiler to use ZMM16..31 so it wouldn't need a vzeroupper when it's done. Except you linked a 32-bit build on Godbolt?? That's weird; there you only have ZMM0..7. You linked a 64-bit build for GCC.
CodePudding user response:
By defining these
ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }
you technically break the grounds for __restrict: the references returned by those conversion operators at different places in the intrinsics using zmm4 all point to the same location, so you are aliasing. It seems that MSVC correctly concludes that you are aliasing, and thus the compiler reloads the value from memory each time.
Please note that your __restrict here applies to the zmm4 reference to the mat4 object, but not to the references returned by the conversion operators quoted above:
mat4& __restrict zmm4 = bs[i];
Not only are you aliasing, but you are also punning the type (though in a way compilers commonly support: through a union).
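A minimal scalar analogue of that pun, for illustration (the FloatBits name is made up here; reading a union member other than the one last written is well-defined in C, and GCC and MSVC support it in C++ too, though ISO C++ formally leaves it undefined):

```cpp
#include <cstdint>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Same storage viewed through two member types, like mat4's
// __m512 / __m512d members.
union FloatBits {
    float f;
    std::uint32_t u;
};

std::uint32_t bits_of(float x) {
    FloatBits p;
    p.f = x;     // write through one member...
    return p.u;  // ...read through the other: the type pun
}
```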
The best solution should be to use the casting intrinsics (_mm512_castps_pd / _mm512_castpd_ps) instead of the converting reference operators, and to store temporary values in dedicated const variables. This way you should get the optimisation you want.