I have a type mat4 which represents a float[4][4]. Internally it uses a 512-bit register.
union alignas(16 * sizeof(float)) mat4 {
private:
__m512 m512;
__m512d m512d;
ALWAYS_INLINE mat4(__m512 m512) : m512{m512} {}
ALWAYS_INLINE mat4(__m512d m512d) : m512d{m512d} {}
ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }
ALWAYS_INLINE mat4& operator=(__m512 _m512) {
m512 = _m512;
return *this;
}
ALWAYS_INLINE mat4& operator=(__m512d _m512d) {
m512d = _m512d;
return *this;
}
public:
friend void __vectorcall transform_children(mat4 parent, std::span<mat4> children);
};
I also have a function transform_children(mat4 parent, std::span<mat4> children). It treats all mat4s as transformation matrices and transforms all the children (in place) by multiplying them with the parent. I wrote¹ an optimised implementation using AVX512F intrinsics.
void __vectorcall transform_children(mat4 parent, std::span<mat4> children) {
mat4* const __restrict bs = children.data();
const size_t n = children.size();
ASSUME(n != 0);
const mat4 zmm1 = _mm512_permute_ps(parent, 0);
const mat4 zmm2 = _mm512_permute_ps(parent, 85);
const mat4 zmm3 = _mm512_permute_ps(parent, 170);
const mat4 zmm0 = _mm512_permute_ps(parent, 255);
for (int i = 0; i < n; ++i) {
mat4& __restrict zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f64x2(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f64x2(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f64x2(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
}
}
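For reference, here's a scalar sketch of what the vectorised loop computes: each child is replaced by parent × child. The Mat4 struct, the row-major float[4][4] layout (one matrix row per 128-bit lane), and the function name are illustrative assumptions inferred from the broadcast pattern of the intrinsics:

```cpp
#include <cstddef>

// Hypothetical scalar stand-in for mat4: row-major float[4][4].
struct Mat4 { float m[4][4]; };

// Scalar equivalent of transform_children: children[i] = parent * children[i].
void transform_children_scalar(const Mat4& parent, Mat4* children, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        Mat4 out{};  // zero-initialised accumulator
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                for (int k = 0; k < 4; ++k)
                    out.m[r][c] += parent.m[r][k] * children[i].m[k][c];
        children[i] = out;
    }
}
```

In the intrinsic version, zmm1/zmm2/zmm3/zmm0 hold parent.m[r][k] broadcast across each element of lane r (for k = 0..3), the shuffles broadcast row k of the child to every lane, and the multiply plus three FMAs accumulate exactly this sum.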
Both GCC and Clang compile this nearly literally to optimised assembly. Unfortunately, MSVC does something weird. For some reason, instead of loading the value of bs[i] into a register and then storing it back to the array at the end of the iteration, it accesses the memory 4 times:
void transform_children(mat4,std::span<mat4,4294967295>) PROC ; transform_children, COMDAT
mov ecx, DWORD PTR _children$[esp]
vpermilps zmm4, zmm0, 0
vpermilps zmm5, zmm0, 85
vpermilps zmm6, zmm0, 170
vpermilps zmm7, zmm0, 255
test ecx, ecx
je SHORT $LN36@transform_
mov eax, DWORD PTR _children$[esp-4]
npad 8
$LL4@transform_:
lea eax, DWORD PTR [eax+64]
vmovupd zmm3, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm0, zmm3, zmm3, 85
vmulps zmm0, zmm0, zmm5
vshuff64x2 zmm1, zmm3, zmm3, 0
vmovups zmm2, zmm4
vfmadd213ps zmm2, zmm1, zmm0
vshuff64x2 zmm0, zmm3, zmm3, 255
vmovupd ZMMWORD PTR [eax-64], zmm0 ; HERE
vfmadd231ps zmm2, zmm7, ZMMWORD PTR [eax-64] ; HERE
vshuff64x2 zmm1, zmm3, zmm3, 170
vmovups zmm0, zmm6
vfmadd213ps zmm0, zmm1, zmm2
vmovups ZMMWORD PTR [eax-64], zmm0 ; HERE
sub ecx, 1
jne SHORT $LL4@transform_
$LN36@transform_:
vzeroupper
ret 8
void transform_children(mat4,std::span<mat4,4294967295>) ENDP ; transform_children
What could I do to make MSVC access memory only twice, like GCC and Clang² do?
1. To be precise, GCC and Clang wrote this implementation (sort of). First, I wrote the typical implementation using two nested loops. Then, I ran it through GCC using -mavx512f. GCC was smart enough to generate optimised vectorised code. Then, I converted this vectorised code from assembly back to C using intrinsics. Then, I compiled the new intrinsic code with Clang, and it generated even faster vectorised assembly. Then I converted Clang's assembly back to C intrinsics again.
2. Clang accesses memory 4 times, but it unrolls the loop, so it's still two accesses per iteration.
CodePudding user response:
TL:DR: it turns out that MSVC does a bad job when it has to convert between __m512d and __m512 through the overloaded conversions of your mat4 class. So just do everything with __m512 intrinsics, including the shuffling of 128-bit lanes.
MSVC making worse code is unfortunate but not shocking; MSVC's optimizer is well known to be not as good in general. MSVC doesn't do strict-aliasing, although __m512 can alias anything anyway, so IDK if that's relevant here.
Seems like you should just use a __m512 (or maybe mat4) temporary variable instead of telling the compiler to modify bs[i] repeatedly and hoping it actually doesn't. Especially across implicit conversion from __m512d (from the pd aka f64 shuffles) to mat4 to __m512 (for single-precision FMAs) and back.

_mm512_shuffle_f32x4 is a drop-in replacement for _mm512_shuffle_f64x2; both use the shuffle-control immediate to select 128-bit lanes, and the 32- vs. 64-bit element granularity only matters for masking, which you're not doing. It's more idiomatic to use f32x4 shuffles on packed-float data, so generally prefer that anyway.
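To see concretely why the two shuffles are interchangeable here, the lane selection can be modelled in scalar code (the Lanes type and shuffle_lanes name are illustrative, not real intrinsics). In the real instructions, result lanes 0-1 select from the first source operand and lanes 2-3 from the second; with both operands the same register, as in the loop, that reduces to result lane i = source lane ((imm >> 2*i) & 3), regardless of f32x4 vs. f64x2:

```cpp
#include <array>
#include <cstdint>

// A 512-bit register modeled as four 128-bit lanes of four floats.
using Lanes = std::array<std::array<float, 4>, 4>;

// Scalar model of _mm512_shuffle_f32x4(v, v, imm) (and equally of
// _mm512_shuffle_f64x2 with both sources equal): each 2-bit field of
// the immediate picks a source lane for one result lane.
Lanes shuffle_lanes(const Lanes& src, std::uint8_t imm) {
    Lanes out{};
    for (int i = 0; i < 4; ++i)
        out[i] = src[(imm >> (2 * i)) & 3];
    return out;
}
```

With imm = 85 (0b01010101), every 2-bit field is 01, so all four result lanes are copies of source lane 1, which is exactly the broadcast the loop relies on.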
Writing it like this gets MSVC to make the asm you want. Using a __m512 variable required me to make the intrinsic types all match (if I didn't want to sprinkle _mm512_castps_pd and _mm512_castpd_ps around the shuffles); in fact, that's what first led me to notice the __m512d vs. __m512 type difference.
for (int i = 0; i < n; ++i) {
__m512 zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
bs[i] = zmm4;
}
MSVC 19.32 (Godbolt, same as v19.latest) is reloading your zmm0 constant from _zmm0$1$[esp+64] at the bottom of the loop, right before the vmovups [eax-64], zmm1 store into bs[i]. It seems to use ZMM3 as a temporary later in the loop, overwriting the constant. It also has a couple of instructions like vmovups zmm1, zmm7.
But that only happens in a 32-bit build like you linked, not a normal 64-bit build like https://godbolt.org/z/GWszEnfP5 where it doesn't spill any vector constants to the stack. (It does save/restore XMM6 and 7, though; IDK if Windows x64 made XMM16..31 all call-preserved like XMM6..15 are. You'd hope not, that's way too many call-preserved registers.) It still only used ZMM0..7, so it could have done that in 32-bit code, it just failed.
GCC targeting 32-bit mode with -mabi=ms doesn't have those wasted zmm-to-zmm move instructions; it's able to arrange its FMAs to modify zmm4 (in ZMM0) in place, scheduling the shuffles appropriately so the registers can be reused. (https://godbolt.org/z/9sGbYn71o)
Using the same vector type for all intrinsics also works for MSVC
Even with the reference, we get asm without extra store/reload of zmm4 on Godbolt with x86 MSVC v19.latest after just changing the shuffles to be f32x4.
for (int i = 0; i < n; ++i) {
mat4& __restrict zmm4 = bs[i];
mat4 zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 85);
zmm5 = _mm512_mul_ps(zmm5, zmm2);
mat4 zmm6 = _mm512_shuffle_f32x4(zmm4, zmm4, 0);
zmm6 = _mm512_fmadd_ps(zmm1, zmm6, zmm5);
zmm5 = _mm512_shuffle_f32x4(zmm4, zmm4, 170);
zmm4 = _mm512_shuffle_f32x4(zmm4, zmm4, 255);
zmm4 = _mm512_fmadd_ps(zmm0, zmm4, zmm6);
zmm4 = _mm512_fmadd_ps(zmm3, zmm5, zmm4);
//bs[i] = zmm4;
}
I think it's more idiomatic to write code as loading a vector into a register, then processing, then storing back to memory. Especially with a name like zmm4, a reference variable seems odd; if you're thinking in terms of asm and registers, reference variables aren't a thing. A name like zmm4 doesn't imply any magic that will leave memory updated after modifying a register.
Using a non-reference means you're only modifying a local __m512 (or a local mat4 if you prefer), which is always easier for compilers to optimize into a register. (Although in your loop there aren't any other memory references it could alias with, even without __restrict.)
BTW, intrinsics let you use slightly meaningful names for vector variables, like vmat, mati, vbsi, or vchild, not zmm4. It's unlikely that the compiler will actually keep your C zmm4 variable in the ZMM4 register, so it's extra mental effort to compare the asm to the C when naming vars this way; e.g. you get instructions like vmovups zmm3, ZMMWORD PTR _zmm0$1$[esp+64]. Using names like zmm0 is usually throwing away one of the clarity / readability advantages of intrinsics over assembly.
In fact you'd prefer the compiler to use ZMM16..31 so it wouldn't need a vzeroupper when it's done. Except you linked a 32-bit build on Godbolt?? That's weird; there you only have ZMM0..7. You linked a 64-bit build for GCC.
CodePudding user response:
By defining these
ALWAYS_INLINE operator __m512&() { return m512; }
ALWAYS_INLINE operator __m512d&() { return m512d; }
ALWAYS_INLINE operator const __m512&() const { return m512; }
ALWAYS_INLINE operator const __m512d&() const { return m512d; }
you technically break the grounds for __restrict: the references returned by those conversion operators at different places in the intrinsics using zmm4 all point to the same location, so you are aliasing. It seems that MSVC correctly concludes that you are aliasing, and thus the compiler reloads the value from memory each time.
Please note that your __restrict here applies to the zmm4 reference to the mat4 object, but not to the references returned by the conversion operators quoted above:
mat4& __restrict zmm4 = bs[i];
Not only are you aliasing, but you are also punning the type (though in a way compilers commonly support: through a union).
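A minimal scalar analogue of that pun, for illustration (the FloatBits name is made up here; reading a union member other than the one last written is well-defined in C, and GCC and MSVC support it in C++ too, though ISO C++ formally leaves it undefined):

```cpp
#include <cstdint>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Same storage viewed through two member types, like mat4's
// __m512 / __m512d members.
union FloatBits {
    float f;
    std::uint32_t u;
};

std::uint32_t bits_of(float x) {
    FloatBits p;
    p.f = x;     // write through one member...
    return p.u;  // ...read through the other: the type pun
}
```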
The best solution should be to use the casting intrinsics (_mm512_castps_pd / _mm512_castpd_ps) instead of the converting reference operators, and to store temporary values in dedicated const variables. This way you should get the optimisation you want.