Efficient overflow-immune arithmetic mean in C/C++


The arithmetic mean of two unsigned integers is defined as:

mean = (a + b)/2

Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:

mean = a/2 + b/2 + (a%2 + b%2)/2
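
For concreteness, here is the same formula as a complete function (a minimal sketch; the name mean_portable is illustrative). Halving each operand before adding can never overflow, and the last term restores the 1 that is lost when both operands are odd:

#include <stdint.h>

// Portable floor((a + b) / 2) without overflow.
uint32_t mean_portable(uint32_t a, uint32_t b) {
    return a/2 + b/2 + (a%2 + b%2)/2;
}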

But this produces rather a lot of code with typical compilers. In assembly, it can usually be done much more efficiently. For example, the x86 can do it the following way (assembler pseudo-code; I hope you get the point):

ADD a,b   ; addition, leaving the overflow condition in the carry bit
RCR a,1   ; rotate right through carry, effectively a division by 2

After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.

Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
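
As an aside, the ADD/RCR idea can be expressed through the add-with-carry intrinsic that both MSVC (<intrin.h>) and GCC/Clang (x86 headers) provide. A sketch, with no guarantee the compiler emits exactly ADD followed by RCR:

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif

// Capture the carry of a + b explicitly, then shift it back in as the
// most significant bit -- the same effect as ADD followed by RCR by 1.
uint32_t mean_adc(uint32_t a, uint32_t b) {
    uint32_t sum;
    unsigned char carry = _addcarry_u32(0, a, b, &sum);
    return (sum >> 1) | ((uint32_t)carry << 31);
}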

Is there a known C/C++ pattern that an optimizing compiler can be expected to recognize, so that it produces such efficient code? Or, more generally, is there a rational way to program at the C/C++ source level so that the carry bit is used by the compiler to optimize the generated code?

EDIT:

A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI

Wow!
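
(For reference, C++20's std::midpoint from <numeric> addresses exactly this problem: for integers it computes the mean without overflow, rounding toward the first argument. A one-line illustration:)

#include <cstdint>
#include <numeric>

// Overflow-free: yields 0x80000000, the exact midpoint.
uint32_t m = std::midpoint(uint32_t{0xFFFFFFFF}, uint32_t{1});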

CodePudding user response:

The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features. It relies on the identity a + b == 2*(a & b) + (a ^ b): a & b collects the bit positions that would generate carries and a ^ b those that differ, so the division by 2 can be applied before the addition ever overflows:

    mean = (a&b) + (a^b)/2;
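
A quick sanity check at the extremes (a minimal sketch; the values are illustrative):

    #include <assert.h>
    #include <stdint.h>

    uint32_t mean(uint32_t a, uint32_t b) {
        return (a & b) + (a ^ b)/2;  /* never overflows */
    }

    int main(void) {
        assert(mean(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFFu);
        assert(mean(0xFFFFFFFFu, 1u) == 0x80000000u);  /* floor(2^32 / 2) */
        return 0;
    }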

CodePudding user response:

There are three typical methods to compute the average without overflow; the last one is limited to types narrower than the widest available integer (e.g. uint32_t on a 64-bit architecture).

// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
   return (a & b)   ((a ^ b) >> 1);
}

// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
  return min   (max - min) / 2;
}
// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
  return a   (int32_t)(b - a) / 2;
}

// average increase width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
   return ((uint64_t)a   b) >> 1;
}
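
For illustration, a minimal driver exercising the functions above at an extreme input (the values are arbitrary; avg2 additionally assumes its first argument is the smaller one):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0xFFFFFFFEu, b = 0xFFFFFFFFu;
    /* all three print 4294967294 */
    printf("%u %u %u\n", avg(a, b), avg2(a, b), avg3(a, b));
    return 0;
}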

The corresponding assembly sequences produced by clang on two architectures (x86-64 and ARM64) are:

avg(unsigned int, unsigned int)
    mov     eax, esi
    and     eax, edi
    xor     esi, edi
    shr     esi
    add     eax, esi

avg2(unsigned int, unsigned int)
    sub     esi, edi
    shr     esi
    lea     eax, [rsi   rdi]

avg3(unsigned int, unsigned int)
    mov     ecx, edi
    mov     eax, esi
    add     rax, rcx
    shr     rax

vs.

avg(unsigned int, unsigned int)         
    and     w8, w1, w0
    eor     w9, w1, w0
    add     w0, w8, w9, lsr #1
    ret
avg2(unsigned int, unsigned int)
    sub     w8, w1, w0
    add     w0, w0, w8, lsr #1
    ret
avg3(unsigned int, unsigned int)
    mov     w8, w1
    add     x8, x8, w0, uxtw
    lsr     x0, x8, #1
    ret

Of these three versions, avg2 performs on ARM64 as well as the optimal carry-flag sequence would. avg3 would likely match it too: the mov w8, w1 merely clears the top 32 bits, which may be unnecessary when the compiler knows they were already cleared by whatever instruction produced the value.

A similar statement can be made about the x86-64 version of avg3, which in the optimal case would compile to just the two meaningful instructions:

add     rax, rcx
shr     rax

See https://godbolt.org/z/5TMd3zr81 for online comparison.

The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) (((a ^ b) >> 1) & ~kH;.
