Is there an operation similar to xtn2 that actually clears the lower half instead of leaving it as is? I have a 128-bit vector v0 whose view as 4s is {a,x,b,y}, with x and y irrelevant. I want to obtain {0,0,a,b}. If I do
xtn2 v0.4s, v0.2d
mov v0.d[0], xzr
I get the result I want. Is there a way to do this with one instruction or in a more efficient way?
CodePudding user response:
(Credit to fuz for the original suggestion)
If you can spare another register to hold zero, say v1, then you can do
uzp1 v0.4s, v1.4s, v0.4s
In general, uzp1 vd.t, vn.t, vm.t packs the even-numbered elements (zero-based) of vn into the low half of vd, and the even-numbered elements of vm into the high half. (uzp2 does the same for the odd-numbered elements.) So if v1 is zero, you get zeros in the low half of the result, and elements 0 and 2 of v0 in the high half, which are your a and b.
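To make the lane bookkeeping concrete, here is a small Python model of the element selection described above. It is only a sketch of the semantics, not NEON code: vectors are modelled as plain lists with element 0 first, and the helper names uzp1 and uzp2 are my own, chosen to mirror the instruction names.

```python
def uzp1(vn, vm):
    """Model of uzp1 vd, vn, vm: even-indexed elements of vn, then of vm."""
    return vn[0::2] + vm[0::2]


def uzp2(vn, vm):
    """Model of uzp2 vd, vn, vm: odd-indexed elements of vn, then of vm."""
    return vn[1::2] + vm[1::2]


# v0 viewed as 4s is {a, x, b, y}; v1 is the all-zero register.
v0 = ["a", "x", "b", "y"]
v1 = [0, 0, 0, 0]

print(uzp1(v1, v0))  # [0, 0, 'a', 'b']
print(uzp2(v1, v0))  # [0, 0, 'x', 'y']
```

Putting the zero register first is what places the zeros in the low half; swapping the operands would give {a,b,0,0} instead.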
Note that if you're doing this many times, v1 can be initialized to zero once and reused throughout your code, since this instruction does not write it. (It would be easier if ARM had supplied a zero SIMD register vzr.) If we neglect that overhead, this is pretty efficient. Looking at Cortex-A72 timings, for instance (because I happen to have its Optimization Guide handy), uzp1 is the cheapest kind of SIMD instruction: 3 cycles of latency and a throughput of 2 per cycle (it can execute in either of the two SIMD arithmetic pipelines).
One performance note on your original version: moving between SIMD and general-purpose registers is very expensive and should be avoided whenever possible. On Cortex-A72, mov v*, x* (a special-case alias of ins) has 8 cycles of latency and a throughput of 1 per cycle, and it needs both the load pipeline (of which there is just one) and one of the two SIMD arithmetic pipelines. Conceivably there could be a special case for xzr, but the guide does not mention one for Cortex-A72.
Unfortunately, as far as I know, there is no direct alternative for zeroing just one element of a SIMD register; movi writes an immediate to all the elements, not just some of them. So even if you were sticking with xtn2, you might want to keep zero in another register, because ins v0.d[0], v1.d[0] is cheap.
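As a sanity check that the two-instruction variant (xtn2 followed by ins from a zero register) and the single uzp1 produce the same vector, here is a hedged Python model. Registers are modelled as lists of four 32-bit lanes with lane 0 first, the sample lane values are arbitrary, and the function names are hypothetical helpers rather than a real API.

```python
def xtn2_4s_2d(vd, vn):
    """xtn2 vd.4s, vn.2d: vd's low half is kept; the high half receives the
    low 32 bits of each 64-bit element of vn (32-bit lanes 0 and 2)."""
    return [vd[0], vd[1], vn[0], vn[2]]


def ins_d0(vd, vn):
    """ins vd.d[0], vn.d[0]: copy the low 64 bits (32-bit lanes 0 and 1)."""
    return [vn[0], vn[1], vd[2], vd[3]]


def uzp1_4s(vn, vm):
    """uzp1 vd.4s, vn.4s, vm.4s: even lanes of vn, then even lanes of vm."""
    return [vn[0], vn[2], vm[0], vm[2]]


v0 = [0x11111111, 0xAAAAAAAA, 0x22222222, 0xBBBBBBBB]  # {a, x, b, y}
zero = [0, 0, 0, 0]                                    # plays the role of v1

two_step = ins_d0(xtn2_4s_2d(v0, v0), zero)  # xtn2 v0.4s, v0.2d ; ins v0.d[0], v1.d[0]
one_step = uzp1_4s(zero, v0)                 # uzp1 v0.4s, v1.4s, v0.4s

print(two_step == one_step)  # True
print([hex(lane) for lane in one_step])
```

The model says nothing about timing; it only confirms that both sequences end with {0,0,a,b}.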