SIMD intrinsics: difference between Vector<T>, AdvSimd and Sse?

If you look at the source code of the System.Numerics.Matrix4x4 struct in .NET, the multiply operator and other functions do an if check to see which instruction set the hardware supports:

if (AdvSimd.Arm64.IsSupported) {} else if (Sse.IsSupported) {}

But the generic System.Numerics.Vector<T> struct seems to do the same thing, so what is the difference? Doesn't Vector<T> simply check behind the scenes and use whichever instruction set is available, with a software fallback if none of them are?

CodePudding user response：

The generic SIMD API in C#, System.Numerics.Vector<T>, doesn't expose all the shuffles and other ISA-specific operations such as x86 movmskps. If you can get the job done efficiently with the common subset of functionality that the generic API exposes, I'd assume that would be a good choice and would still compile to the instructions you'd expect.
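As an illustration of that "common subset", here is a minimal sketch of a purely vertical operation (element-wise add) written against Vector<T> only. The class and method names (VectorTSketch, Add) are made up for this example; the number of lanes per vector depends on the hardware the JIT targets, so the code reads it from Vector<float>.Count instead of assuming 4 or 8.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
public static class VectorTSketch
{
    // Adds two equal-length arrays element by element into dst.
    public static void Add(float[] a, float[] b, float[] dst)
    {
        if (a.Length != b.Length || a.Length != dst.Length)
            throw new ArgumentException("All arrays must have the same length.");

        int width = Vector<float>.Count; // lanes per vector, decided at run time
        int i = 0;

        // Main loop: one full vector per iteration (pure vertical SIMD).
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(dst, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; i++)
            dst[i] = a[i] + b[i];
    }
}
```

Nothing here needs a shuffle or any other ISA-specific instruction, which is the kind of code the generic API is a good fit for.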

But the function you mentioned uses Sse.Shuffle (shufps) or AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar (?) to broadcast and multiply-add. If ARM64 can actually do that in a single instruction (a scalar broadcast source for a vector multiply), that's pretty cool. The predecessor to AVX-512 (the KNC new instructions in early Xeon Phi) could do that, but even AVX-512 needs a shuffle and a separate FMA. (Unless the operand is coming from memory: AVX-512 can use a broadcast memory source operand.)

I don't see any shuffles at all in the docs you linked for System.Numerics, only pure vertical SIMD, so that's not very useful for a 4x4 matrix product where each row[i] needs to get multiplied by a broadcast(col[i]) vector.
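To make the broadcast-and-multiply pattern concrete, here is a rough sketch of one row of a 4x4 product using the x86 intrinsics, with a scalar fallback for other hardware. It is not the actual Matrix4x4 source: the class name RowTimesMatrixSketch and its Multiply method are invented for this example, and a real implementation would also add an AdvSimd path.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class RowTimesMatrixSketch
{
    // Computes row[0]*m0 + row[1]*m1 + row[2]*m2 + row[3]*m3,
    // where m0..m3 are the four rows of the right-hand matrix.
    public static Vector128<float> Multiply(
        Vector128<float> row,
        Vector128<float> m0, Vector128<float> m1,
        Vector128<float> m2, Vector128<float> m3)
    {
        if (Sse.IsSupported)
        {
            // shufps with the same register as both sources and an immediate of
            // 0x00 / 0x55 / 0xAA / 0xFF broadcasts lane 0 / 1 / 2 / 3 to all lanes.
            Vector128<float> x = Sse.Shuffle(row, row, 0x00);
            Vector128<float> y = Sse.Shuffle(row, row, 0x55);
            Vector128<float> z = Sse.Shuffle(row, row, 0xAA);
            Vector128<float> w = Sse.Shuffle(row, row, 0xFF);

            return Sse.Add(
                Sse.Add(Sse.Multiply(x, m0), Sse.Multiply(y, m1)),
                Sse.Add(Sse.Multiply(z, m2), Sse.Multiply(w, m3)));
        }

        // Scalar fallback: same arithmetic, one output lane at a time.
        return Vector128.Create(
            Lane(row, m0, m1, m2, m3, 0),
            Lane(row, m0, m1, m2, m3, 1),
            Lane(row, m0, m1, m2, m3, 2),
            Lane(row, m0, m1, m2, m3, 3));
    }

    // Computes output lane i of the product without any SIMD instructions.
    private static float Lane(
        Vector128<float> row,
        Vector128<float> m0, Vector128<float> m1,
        Vector128<float> m2, Vector128<float> m3, int i)
    {
        return (row.GetElement(0) * m0.GetElement(i) + row.GetElement(1) * m1.GetElement(i))
             + (row.GetElement(2) * m2.GetElement(i) + row.GetElement(3) * m3.GetElement(i));
    }
}
```

The four Sse.Shuffle calls are exactly the step that Vector<T> has no way to express, which is why the hand-written per-ISA paths exist.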

So System.Numerics looks way more crippled than GNU C native vectors in C and C++, where there is at least a __builtin_shuffle. Even those still miss out on special shuffles, and on things like x86 movmskps, which gives you a scalar bitmap of SIMD compare results. (ARM and ARM64 have no direct equivalent for that.)
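For reference, this is roughly what the movmskps idiom looks like through the x86-specific API, as a sketch with a scalar fallback for hardware without SSE. The class name MoveMaskSketch and the method LessThanMask are invented for this example.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class MoveMaskSketch
{
    // Returns a 4-bit mask in which bit i is set when a[i] < b[i].
    public static int LessThanMask(Vector128<float> a, Vector128<float> b)
    {
        if (Sse.IsSupported)
        {
            // cmpltps produces all-ones lanes where the comparison is true;
            // movmskps then packs the sign bit of each lane into an integer.
            return Sse.MoveMask(Sse.CompareLessThan(a, b));
        }

        // Scalar fallback that builds the same bitmap lane by lane.
        int mask = 0;
        for (int i = 0; i < Vector128<float>.Count; i++)
        {
            if (a.GetElement(i) < b.GetElement(i))
                mask |= 1 << i;
        }
        return mask;
    }
}
```

With a = (1, 5, 3, 9) and b = (2, 4, 4, 8), lanes 0 and 2 compare true, so the result is 0b0101, that is 5.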