I'm on an Apple M1 processor.
What I'm trying to do is efficiently count the 1 bits in a large char array in Rust. I looked up the ARM NEON instructions, and I think I can do it via a cnt instruction (which counts 1s per 8-bit block) and then an addv to add each of the 8-bit blocks.
To start with I figured I would feed in a 64-bit uint.
use std::arch::asm;

fn aarch64_hamming_weight(inp: u64) -> u64 {
    let mut outp: u64;
    unsafe {
        asm!(
            "dup v0.2d, {input}",
            "cnt v0.16b, v0.16b",
            "addv b1, v0.16b", // this adds all the bytes of the vector, how do i get the result?!
            //"fmov {output2}, b1",
            "fmov {output}, v0.d[1]",
            input = in(reg) inp,
            output = out(reg) outp,
        )
    }
    return outp;
}
This nearly works, but if the number is wider than 8 bits then the result isn't quite right: 225's bit count comes out as 4 (correct), but 425's comes out as 260 instead of 5.
So:
- I need a way to get the addv's output
- I need a way for it to work with an arbitrary char array (which would need a loop, pulling 128 bits at a time)
CodePudding user response:
When I started writing an answer, I thought it was going to show that inline asm was totally pointless, and that rustc was going to do a good job vectorizing u8.count_ones(). That is unfortunately not the case, but you should only use asm if you're going to write a whole loop, not one u64 at a time. (Possibly still inline asm, but writing a whole function might be reasonable.)
If Rust has SIMD intrinsics for ARM/AArch64, that would be a much better choice than inline asm, and definitely worth using.
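Rust does expose these as intrinsics in std::arch::aarch64 (stable since Rust 1.59). As a minimal sketch of the question's single-u64 version without any inline asm (function name is mine), where vaddv_u8 is exactly the addv whose result the question is trying to read back:

use std::arch::aarch64::{vaddv_u8, vcnt_u8, vcreate_u8};

// NEON is a baseline feature on AArch64 targets, so no runtime feature detection is needed there.
fn neon_hamming_weight_u64(inp: u64) -> u64 {
    unsafe {
        let bytes = vcreate_u8(inp);   // reinterpret the u64 as eight u8 lanes
        let counts = vcnt_u8(bytes);   // cnt: popcount per byte
        vaddv_u8(counts) as u64        // addv: sum the eight byte counts into a scalar
    }
}

For a single u64 this buys nothing over plain inp.count_ones(), which LLVM already lowers to a cnt plus a horizontal reduction on AArch64; intrinsics only start to pay off when you keep whole vectors live across a loop, as below.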
For counting a whole array, you don't want to reduce every cnt result to scalar separately, especially not all the way to a general-purpose integer register, and especially not for a single u64 at a time instead of a full 128 bits. Add into a vector of counts that you only reduce at the end.
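For example (a sketch with the intrinsics, assuming a &[u8] input and names of my choosing): widen each vector of cnt results out to a pair of u64 lanes and add them into a vector accumulator, with a single horizontal reduction after the loop. This mirrors the uaddlp/uadalp pattern in the compiler output shown further down.

use std::arch::aarch64::{
    vaddvq_u64, vcntq_u8, vdupq_n_u64, vld1q_u8, vpadalq_u32, vpaddlq_u16, vpaddlq_u8,
};

// Sketch: 16 bytes per iteration, keeping running totals in a vector of two u64
// lanes and reducing to scalar only once at the end.
fn popcount_neon_128(bytes: &[u8]) -> u64 {
    let chunks = bytes.chunks_exact(16);
    let tail = chunks.remainder();
    let mut acc = unsafe { vdupq_n_u64(0) };
    for chunk in chunks {
        unsafe {
            let cnt = vcntq_u8(vld1q_u8(chunk.as_ptr())); // per-byte popcounts, 0..=8
            let h16 = vpaddlq_u8(cnt);                    // uaddlp: u8 pairs -> u16
            let w32 = vpaddlq_u16(h16);                   // uaddlp: u16 pairs -> u32
            acc = vpadalq_u32(acc, w32);                  // uadalp: accumulate u32 pairs into u64 lanes
        }
    }
    // one horizontal reduction at the very end, plus scalar cleanup for the tail
    unsafe { vaddvq_u64(acc) + tail.iter().map(|&b| b.count_ones() as u64).sum::<u64>() }
}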
clang/LLVM knows how to auto-vectorize popcount (__builtin_popcount() in C) over an array, so I was hoping LLVM-based rustc would do OK for AArch64. It is somewhat OK for u64, but unfortunately terrible for u8. If you can safely point a u64 span at your data (or reference, or however Rust does things), then you can get a good fraction of the benefit portably with no brittle inline asm, and in a way that future compiler versions can hopefully improve on.
pub fn hamming_weight_arr(inp: &[u64]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        sum += v.count_ones() as u64; // maybe not terrible but a lot of hsumming.
        // doing ldp q2, q3 to load 32 bytes per iteration
    }
    return sum;
}
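As an aside (my own sketch, not something spelled out above): one way to point a u64 span at a byte buffer in Rust is slice::align_to, which is sound for u8 -> u64 because every bit pattern is a valid u64; the unaligned head and tail just go through scalar count_ones.

fn count_ones_bytes(bytes: &[u8]) -> u64 {
    // align_to is `unsafe` only because not every target type accepts arbitrary
    // bit patterns; u64 does, so this reinterpretation is sound.
    let (head, body, tail) = unsafe { bytes.align_to::<u64>() };
    hamming_weight_arr(body)
        + head.iter().chain(tail).map(|&b| b.count_ones() as u64).sum::<u64>()
}

(A fully safe alternative is chunks_exact(8) plus u64::from_le_bytes.)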
The hamming_weight_arr above, compiled on https://godbolt.org/z/7qP7cK6bv with -O --target aarch64-unknown-linux-gnu (with the default -C opt-level=3):
... some setup
.LBB1_5: // inner loop
ldp q2, q3, [x8, #-16] // load pair of 16-byte vectors = unroll by 4x u64
add x8, x8, #32 // pointer increment by 32 bytes
subs x12, x12, #4
cnt v2.16b, v2.16b
cnt v3.16b, v3.16b
uaddlp v2.8h, v2.16b // hsum byte pairs to u16 halfwords
uaddlp v3.8h, v3.16b
uaddlp v2.4s, v2.8h // hsum u16 pairs to u32 words
uaddlp v3.4s, v3.8h
uadalp v0.2d, v2.4s // sum u32 pairs into accumulator u64 vectors (doublewords)
uadalp v1.2d, v3.4s
b.ne .LBB1_5
add v0.2d, v1.2d, v0.2d // combine pair of accumulators
cmp x10, x11
addp d0, v0.2d // and reduce to one 64-bit
fmov x8, d0
b.eq .LBB1_9
.LBB1_7:
add x10, x0, x1, lsl #3
.LBB1_8:
ldr d0, [x9], #8 // scalar cleanup loop, one u64 at a time
cmp x9, x10
cnt v0.8b, v0.8b
uaddlv h0, v0.8b // slow instruction, as Jake mentioned. Or at least high latency
fmov w11, s0
add x8, x11, x8
b.ne .LBB1_8
.LBB1_9:
mov x0, x8
ret
You'd think sum: u32 would help, requiring less widening inside the loop. (If you have huge arrays that could overflow that, use an outer loop.) But actually, rustc still widens to 64-bit I think, and then does even more work to truncate those counts to 32-bit. Almost certainly a missed optimization. (Perhaps LLVM's strategy is built around x86 psadbw, which sums bytes into u64 chunks in one step; LLVM auto-vectorizes popcount with pshufb and psadbw for AVX2 on x86.)
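For reference, the sum: u32 variant being discussed is just the following (hypothetical name):

pub fn hamming_weight_arr_u32(inp: &[u64]) -> u32 {
    let mut sum: u32 = 0;
    for v in inp {
        sum += v.count_ones(); // count_ones() already returns u32, so no cast
    }
    sum
}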
And you'd think doing the same thing for a u8 array should vectorize to the same code, with just some extra scalar cleanup, right? Well it should, but actually it still only unrolls by 4 the way LLVM likes to, and that's 4 u8 elements, with the inner loop becoming total garbage:
// &[u8] version inner loop is a disaster
.LBB2_5:
ldurb w12, [x8, #-2] // scalar zero-extending byte load
subs x11, x11, #4
ldrb w13, [x8] // another scalar zero-extending byte load
fmov d2, x12 // copy it into vector reg
ldurb w12, [x8, #-1]
fmov d3, x13
ldrb w13, [x8, #1]
add x8, x8, #4
mov v2.d[1], x12 // get two more bytes into the high 64 bits of a vector
mov v3.d[1], x13
cnt v2.16b, v2.16b // same cnt / uaddlp sequence as before
cnt v3.16b, v3.16b
uaddlp v2.8h, v2.16b
uaddlp v3.8h, v3.16b
uaddlp v2.4s, v2.8h
uaddlp v3.4s, v3.8h
uadalp v0.2d, v2.4s
uadalp v1.2d, v3.4s
b.ne .LBB2_5
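For reference, the &[u8] source producing this inner loop is the direct byte-wise translation of the earlier function (my reconstruction of what's described above):

pub fn hamming_weight_arr_u8(inp: &[u8]) -> u64 {
    let mut sum: u64 = 0;
    for v in inp {
        sum += v.count_ones() as u64; // same loop as the u64 version, just over bytes
    }
    sum
}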
So it's vectorizing (v as u64).count_ones(), using a canned recipe that it doesn't know how to simplify for this case. (e.g. the uaddlp widening is pointless when the cnt results are zero except in the low byte of each 64-bit chunk; they could just use a vertical add with 64-bit chunks.)
Compare against what you get from a C compiler for manually vectorized ARM code from https://github.com/WojciechMula/sse-popcount/. If the ARM Neon code is as well-tuned as the x86 code in the same repo, that's what you should be aiming for in terms of final asm output from your compiler, however you get there.
I'd guess that an inner loop that only widens the per-byte counts to 16-bit would be good, running for as many iterations as are possible without overflowing a u16 that gains at most 16 per iteration (from seeing all-ones in the pair of bytes that feed into it), i.e. 65535/16 rounded down = 4095 vectors before widening out to 64-bit chunks.
Mula's version does only 31 inner loop iterations, accumulating into 8-bit elements with vaddq_u8. But uadalp v0.8h, v2.16b (horizontal-pair accumulation of u8 bytes into u16 halfword elements) wasn't available for 32-bit NEON, only for AArch64 ASIMD.
uadalp is a slightly slower instruction on some CPUs, but probably still worth using (or possibly worth avoiding on CPUs where it's slower). The Cortex-A57 optimization guide says it has 4-cycle latency and 1/clock throughput, vs. regular add on integer vectors being 3-cycle latency, 2/clock throughput. But cnt is only 1/clock throughput as well on that CPU. So with multiple accumulators to hide latency, and other loop overhead, you might be ok with an extra cycle of latency in the dependency chain for the vector accumulators in the inner loop.
The key point is to do the minimum amount of work in the inner-most loop, ideally using only one cheap instruction per vector of cnt results. If you can get some widening for free as you accumulate (or cheaply enough not to be a bottleneck), that lets execution stay in the inner loop for much longer without possible overflow. (And it means the later horizontal reduction step in the outer loop is cheaper.)
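A concrete sketch of that shape (mine, untested, and with only a single accumulator, so it ignores the multiple-accumulator latency point above): one vpadalq_u8 (uadalp) per vector of cnt results in the inner loop, draining the u16 accumulator into the scalar total every 4095 vectors so it can't overflow.

use std::arch::aarch64::{
    vaddvq_u64, vcntq_u8, vdupq_n_u16, vld1q_u8, vpadalq_u8, vpaddlq_u16, vpaddlq_u32,
};

// Inner loop: one load + cnt + uadalp per 16 bytes. Each u16 lane of the accumulator
// gains at most 16 per iteration, so draining every 4095 iterations (65535 / 16)
// means it can never overflow.
fn popcount_neon_u16_acc(bytes: &[u8]) -> u64 {
    const BLOCK: usize = 4095 * 16; // bytes processed per u16-accumulator drain
    let mut total = 0u64;
    for block in bytes.chunks(BLOCK) {
        let chunks = block.chunks_exact(16);
        let tail = chunks.remainder(); // non-empty only in the final block
        unsafe {
            let mut acc16 = vdupq_n_u16(0);
            for chunk in chunks {
                let cnt = vcntq_u8(vld1q_u8(chunk.as_ptr()));
                acc16 = vpadalq_u8(acc16, cnt); // uadalp: u8 pairs -> u16 accumulator
            }
            // widen u16 -> u32 -> u64 and reduce, once per block rather than per vector
            total += vaddvq_u64(vpaddlq_u32(vpaddlq_u16(acc16)));
        }
        total += tail.iter().map(|&b| b.count_ones() as u64).sum::<u64>();
    }
    total
}

Splitting acc16 into two or more accumulators, as discussed above, would help hide the uadalp latency and matters even more on in-order cores.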
It would be possible to do (pseudocode)
// Probably not a good idea
c1 = cnt(load1)
c2 = cnt(load2)
c1 += c2 // plain SIMD vertical add
uadalp accumulator, c1
That means only one uadalp in the loop, in case that's a throughput bottleneck, but it does nothing to help with the possible latency bottleneck of the loop-carried dependency chain through the accumulator.
And it makes the short independent dependency chain within one iteration slightly longer (from load to input of accumulation), instead of keeping it as two separate dep chains. Low-end ARM CPUs generally have very limited out-of-order execution capabilities (small scheduler buffers) to find instruction-level parallelism and overlap work across loop iterations. Keeping the non-loop-carried dep chains short makes that easier for the CPU. And it totally sucks for an in-order AArch64 like Cortex-A53 unless you're unrolling a lot.