aarch64 xtn2 clearing lower half

Is there an operation similar to xtn2 that actually clears the lower half instead of leaving it as is? I have a 128-bit vector v0 whose view as 4s is {a,x,b,y}, where x and y are irrelevant. I want to obtain {0,0,a,b}. If I do

xtn2     v0.4s, v0.2d
mov      v0.d[0], xzr

I get the result I want. Is there a way to do this with one instruction or in a more efficient way?

CodePudding user response：

(Credit to fuz for the original suggestion)

If you can spare another register to hold zero, say v1, then you can do

uzp1 v0.4s, v1.4s, v0.4s

In general uzp1 vd.t, vn.t, vm.t packs the even-numbered elements (zero-based) of vn into the low half of vd, and the even elements of vm into the high half. (uzp2 does the same for the odd elements.) So if v1 is zero, then you get zeros in the low half of the result, and the 0th and 2nd elements of v0 in the high half, which are your a and b.
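To check the lane bookkeeping above, here is a minimal lane-level model in Python. It is only a sketch of the element selection, not of the real instruction encoding or timing; the function names uzp1 and uzp2 are just labels chosen for this illustration, and vectors are written as lists with lane 0 first.

```python
def uzp1(vn, vm):
    """Model of `uzp1 vd.T, vn.T, vm.T` on lists of lanes (lane 0 first).

    Even-numbered lanes of vn fill the low half of the result,
    even-numbered lanes of vm fill the high half.
    """
    assert len(vn) == len(vm)
    return vn[0::2] + vm[0::2]


def uzp2(vn, vm):
    """Same as uzp1, but selecting the odd-numbered lanes."""
    assert len(vn) == len(vm)
    return vn[1::2] + vm[1::2]


# v1 is assumed to have been zeroed beforehand; v0 viewed as 4s is {a, x, b, y}.
v1 = [0, 0, 0, 0]
v0 = ["a", "x", "b", "y"]

print(uzp1(v1, v0))  # [0, 0, 'a', 'b']
```

With the zero vector as the first source operand, the low half of the result is zero and the high half holds lanes 0 and 2 of v0, which is the {0,0,a,b} layout asked for.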

Note that if you're doing this many times, then v1 can be initialized to zero once and used throughout your code, since it is not written by this instruction. (It would be easier if ARM had supplied a zero SIMD register vzr.) If we neglect the overhead of that, then this is pretty efficient. Looking at Cortex-A72 timings for instance (because I happen to have its Optimization Guide handy), uzp1 is the cheapest kind of SIMD instruction: 3 cycles latency and a throughput of 2 per cycle (it can execute in either of the two SIMD arithmetic pipelines).

One performance note on your original version is that moving between SIMD and general-purpose registers is very expensive and should be avoided whenever possible. On Cortex-A72, mov v*, x* (a special-case alias of ins) has 8 cycles latency and a throughput of 1 per cycle, and it needs both the load pipeline (of which there is just one) and one of the two SIMD arithmetic pipelines. Conceivably there could be a special case for xzr, but the Optimization Guide does not mention one for Cortex-A72.

Unfortunately there's not a direct alternative for zeroing one element in a SIMD register, as far as I know; movi writes an immediate to all the elements, not just some of them. So even if you were sticking with xtn2, you might want to keep zero in another register because ins v0.d[0], v1.d[0] is cheap.
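As a sanity check on that alternative, the following Python sketch models xtn2 v0.4s, v0.2d followed by ins v0.d[0], v1.d[0] at the lane level. It is an illustration under stated assumptions: the helper names xtn2_4s_2d and ins_d0 are made up here, vectors are lists of four 32-bit lanes with lane 0 first, and v1 is assumed to already hold zero.

```python
MASK32 = 0xFFFFFFFF


def xtn2_4s_2d(vd, vn):
    """Model of `xtn2 vd.4s, vn.2d`.

    vn is viewed as two 64-bit elements; each is narrowed to its low
    32 bits and written to the high half of vd. The low half of vd is
    left unchanged.
    """
    d = [vn[0] | (vn[1] << 32), vn[2] | (vn[3] << 32)]
    return vd[:2] + [x & MASK32 for x in d]


def ins_d0(vd, vn):
    """Model of `ins vd.d[0], vn.d[0]`: copy the low 64 bits (lanes 0 and 1)."""
    return vn[:2] + vd[2:]


v1 = [0, 0, 0, 0]              # assumed pre-zeroed register
v0 = [0x11, 0xAA, 0x22, 0xBB]  # {a, x, b, y} with a = 0x11, b = 0x22

v0 = xtn2_4s_2d(v0, v0)        # high half now holds {a, b}
v0 = ins_d0(v0, v1)            # low half cleared from the zero register

print([hex(x) for x in v0])    # ['0x0', '0x0', '0x11', '0x22']
```

The end state matches what the single uzp1 with a zero source produces, so the two-instruction sequence only makes sense if xtn2 is needed for some other reason.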