I'm on an Apple M1 processor.
What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it via a cnt instruction (which counts 1s per 8-bit block) and then an addv to add each of the 8-bit blocks.
To start with I figured I would feed in a 64-bit uint.
use std::arch::asm;

fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do i get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }
    return outp;
}
This nearly works, but if the number is wider than 8 bits then the result isn't quite right: 225's bit count comes out as 4 (correct), but 425's comes out as 260 instead of 5.
So:
- I need a way to get the addv's output
- I need a way for it to work with an arbitrary char array (which would need a loop, pulling 128 bits at a time)
CodePudding user response:
When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and that rustc was going to do a good job vectorizing u8.count_ones(). That is unfortunately not the case, but you should only use asm if you're going to write a whole loop, not one u64 at a time. (Possibly still inline asm, but writing a whole function might be reasonable.)
If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.
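Rust does expose these as intrinsics in std::arch::aarch64 (stable since Rust 1.59). As a minimal sketch of the question's single-u64 version without any inline asm (function name is mine), where vaddv_u8 is exactly the addv whose result the question is trying to read back:

use std::arch::aarch64::{vaddv_u8, vcnt_u8, vcreate_u8};

// NEON is a baseline feature on AArch64 targets, so no runtime feature detection is needed there.
fn neon_hamming_weight_u64(inp: u64) -> u64 {
    unsafe {
        let bytes = vcreate_u8(inp);   // reinterpret the u64 as eight u8 lanes
        let counts = vcnt_u8(bytes);   // cnt: popcount per byte
        vaddv_u8(counts) as u64        // addv: sum the eight byte counts into a scalar
    }
}

For a single u64 this buys nothing over plain inp.count_ones(), which LLVM already lowers to a cnt plus a horizontal reduction on AArch64; intrinsics only start to pay off when you keep whole vectors live across a loop, as below.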
For counting a whole array, you don't want to reduce every cnt result to scalar separately, especially not all the way to a general-purpose integer register, and especially not for a single u64 at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.
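For example (a sketch with the intrinsics, assuming a &[u8] input and names of my choosing): widen each vector of cnt results out to a pair of u64 lanes and add them into a vector accumulator, with a single horizontal reduction after the loop. This mirrors the uaddlp/uadalp pattern in the compiler output shown further down.

use std::arch::aarch64::{
    vaddvq_u64, vcntq_u8, vdupq_n_u64, vld1q_u8, vpadalq_u32, vpaddlq_u16, vpaddlq_u8,
};

// Sketch: 16 bytes per iteration, keeping running totals in a vector of two u64
// lanes and reducing to scalar only once at the end.
fn popcount_neon_128(bytes: &[u8]) -> u64 {
    let chunks = bytes.chunks_exact(16);
    let tail = chunks.remainder();
    let mut acc = unsafe { vdupq_n_u64(0) };
    for chunk in chunks {
        unsafe {
            let cnt = vcntq_u8(vld1q_u8(chunk.as_ptr())); // per-byte popcounts, 0..=8
            let h16 = vpaddlq_u8(cnt);                    // uaddlp: u8 pairs -> u16
            let w32 = vpaddlq_u16(h16);                   // uaddlp: u16 pairs -> u32
            acc = vpadalq_u32(acc, w32);                  // uadalp: accumulate u32 pairs into u64 lanes
        }
    }
    // one horizontal reduction at the very end, plus scalar cleanup for the tail
    unsafe { vaddvq_u64(acc) + tail.iter().map(|&b| b.count_ones() as u64).sum::<u64>() }
}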
clang/LLVM knows how to auto-vectorize popcount (__builtin_popcount() in C) over an array, so I was hoping LLVM-based rustc would do OK for AArch64. It is somewhat OK for u64, but unfortunately terrible for u8. If you can safely point a u64 span at your data (or reference, or however Rust does things), then you can get a good fraction of the benefit portably with no brittle inline asm, and in a way that future compiler versions can hopefully improve on.
pub fn hamming_weight_arr(inp: &[u64]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        sum += v.count_ones() as u64; // maybe not terrible but a lot of hsumming.
        // doing ldp q2, q3 to load 32 bytes per iteration
    }
    return sum;
}
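As an aside (my own sketch, not something spelled out above): one way to point a u64 span at a byte buffer in Rust is slice::align_to, which is sound for u8 -> u64 because every bit pattern is a valid u64; the unaligned head and tail just go through scalar count_ones.

fn count_ones_bytes(bytes: &[u8]) -> u64 {
    // align_to is `unsafe` only because not every target type accepts arbitrary
    // bit patterns; u64 does, so this reinterpretation is sound.
    let (head, body, tail) = unsafe { bytes.align_to::<u64>() };
    hamming_weight_arr(body)
        + head.iter().chain(tail).map(|&b| b.count_ones() as u64).sum::<u64>()
}

(A fully safe alternative is chunks_exact(8) plus u64::from_le_bytes.)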
The hamming_weight_arr above, compiled on https://godbolt.org/z/7qP7cK6bv with -O --target aarch64-unknown-linux-gnu (with the default -C opt-level=3):
... some setup
.LBB1_5: // inner loop
ldp q2, q3, [x8, #-16] // load pair of 16-byte vectors = unroll by 4x u64
add x8, x8, #32 // pointer increment by 32 bytes
subs x12, x12, #4
cnt v2.16b, v2.16b
cnt v3.16b, v3.16b
uaddlp v2.8h, v2.16b // hsum byte pairs to u16 halfwords
uaddlp v3.8h, v3.16b
uaddlp v2.4s, v2.8h // hsum u16 pairs to u32 words
uaddlp v3.4s, v3.8h
uadalp v0.2d, v2.4s // sum u32 pairs into accumulator u64 vectors (doublewords)
uadalp v1.2d, v3.4s
b.ne .LBB1_5
add v0.2d, v1.2d, v0.2d // combine pair of accumulators
cmp x10, x11
addp d0, v0.2d // and reduce to one 64-bit
fmov x8, d0
b.eq .LBB1_9
.LBB1_7:
add x10, x0, x1, lsl #3
.LBB1_8:
ldr d0, [x9], #8 // scalar cleanup loop, one u64 at a time
cmp x9, x10
cnt v0.8b, v0.8b
uaddlv h0, v0.8b // slow instruction, as Jake mentioned. Or at least high latency
fmov w11, s0
add x8, x11, x8
b.ne .LBB1_8
.LBB1_9:
mov x0, x8
ret
You'd think sum: u32 would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, rustc still widens to 64-bit I think, and then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps LLVM's strategy is built around x86 psadbw, which sums bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with pshufb and psadbw for AVX2 on x86.)
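For reference, the sum: u32 variant being discussed is just the following (hypothetical name):

pub fn hamming_weight_arr_u32(inp: &[u64]) -> u32 {
    let mut sum: u32 = 0;
    for v in inp {
        sum += v.count_ones(); // count_ones() already returns u32, so no cast
    }
    sum
}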
And you'd think doing the same thing for a u8 array should vectorize to the same code, with just some extra scalar cleanup, right? Well it should, but actually it still only unrolls by 4 the way LLVM likes to, and that's 4 u8 elements, with the inner loop becoming total garbage:
// &[u8] version inner loop is a disaster
.LBB2_5:
ldurb w12, [x8, #-2] // scalar zero-extending byte load
subs x11, x11, #4
ldrb w13, [x8] // another scalar zero-extending byte load
fmov d2, x12 // copy it into vector reg
ldurb w12, [x8, #-1]
fmov d3, x13
ldrb w13, [x8, #1]
add x8, x8, #4
mov v2.d[1], x12 // get two more bytes into the high 64 bits of a vector
mov v3.d[1], x13
cnt v2.16b, v2.16b // same cnt / uaddlp sequence as before
cnt v3.16b, v3.16b
uaddlp v2.8h, v2.16b
uaddlp v3.8h, v3.16b
uaddlp v2.4s, v2.8h
uaddlp v3.4s, v3.8h
uadalp v0.2d, v2.4s
uadalp v1.2d, v3.4s
b.ne .LBB2_5
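For reference, the &[u8] source producing this inner loop is the direct byte-wise translation of the earlier function (my reconstruction of what's described above):

pub fn hamming_weight_arr_u8(inp: &[u8]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        sum += v.count_ones() as u64; // same loop as the u64 version, just over bytes
    }
    sum
}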
So it's vectorizing (v as u64).count_ones(), using a canned recipe that it doesn't know how to simplify for this case. (e.g. the uaddlp widening is pointless when the cnt results are zero except in the low byte of each 64-bit chunk; they could just use a vertical add with 64-bit chunks.)
Compare against what you get from a C compiler for manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM Neon code is as well-tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.
I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running for as many iterations as are possible without overflowing a u16 that gains at most 16 per iteration (from seeing all-ones in the pair of bytes that feed into it), i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.
Mula's version does only 31 inner loop iterations, accumulating into 8-bit elements with vaddq_u8. But uadalp v0.8h, v2.16b (horizontal-pair accumulation of u8 bytes into u16 halfword elements) wasn't available for 32-bit NEON, only for AArch64 ASIMD.
uadalp is a slightly slower instruction on some CPUs, but probably still worth using (or possibly worth avoiding on CPUs where it's slower). The Cortex-A57 optimization guide says it has 4-cycle latency and 1/clock throughput, vs. regular add on integer vectors being 3-cycle latency, 2/clock throughput. But cnt is only 1/clock throughput as well on that CPU. So with multiple accumulators to hide latency, and other loop overhead, you might be ok with an extra cycle of latency in the dependency chain for the vector accumulators in the inner loop.
The key point is to do the minimum amount of work in the inner-most loop, ideally using only one cheap instruction per vector of cnt results. If you can get some widening for free as you accumulate (or cheaply enough not to be a bottleneck), that lets execution stay in the inner loop for much longer without possible overflow. (And it means the later horizontal reduction step in the outer loop is cheaper.)
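A concrete sketch of that shape (mine, untested, and with only a single accumulator, so it ignores the multiple-accumulator latency point above): one vpadalq_u8 (uadalp) per vector of cnt results in the inner loop, draining the u16 accumulator into the scalar total every 4095 vectors so it can't overflow.

use std::arch::aarch64::{
    vaddvq_u64, vcntq_u8, vdupq_n_u16, vld1q_u8, vpadalq_u8, vpaddlq_u16, vpaddlq_u32,
};

// Inner loop: one load + cnt + uadalp per 16 bytes. Each u16 lane of the accumulator
// gains at most 16 per iteration, so draining every 4095 iterations (65535 / 16)
// means it can never overflow.
fn popcount_neon_u16_acc(bytes: &[u8]) -> u64 {
    const BLOCK: usize = 4095 * 16; // bytes processed per u16-accumulator drain
    let mut total = 0u64;
    for block in bytes.chunks(BLOCK) {
        let chunks = block.chunks_exact(16);
        let tail = chunks.remainder(); // non-empty only in the final block
        unsafe {
            let mut acc16 = vdupq_n_u16(0);
            for chunk in chunks {
                let cnt = vcntq_u8(vld1q_u8(chunk.as_ptr()));
                acc16 = vpadalq_u8(acc16, cnt); // uadalp: u8 pairs -> u16 accumulator
            }
            // widen u16 -> u32 -> u64 and reduce, once per block rather than per vector
            total += vaddvq_u64(vpaddlq_u32(vpaddlq_u16(acc16)));
        }
        total += tail.iter().map(|&b| b.count_ones() as u64).sum::<u64>();
    }
    total
}

Splitting acc16 into two or more accumulators, as discussed above, would help hide the uadalp latency and matters even more on in-order cores.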
It would be possible to do (pseudocode)
// Probably not a good idea
c1 = cnt(load1)
c2 = cnt(load2)
c1 += c2 // plain SIMD vertical add
uadalp accumulator, c1
That means only one uadalp in the loop, in case that's a throughput bottleneck, but it does nothing to help with the possible latency bottleneck of the loop-carried dependency chain through the accumulator.
And it makes the short independent dependency chain within one iteration slightly longer (from load to input of accumulation), instead of keeping it as two separate dep chains. Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.