Multiply by 2 with signed saturation in 6 operations in C?-CodePudding

The problem for signed 2's complement 32-bit integers:

satMul2 - multiplies by 2, saturating to Tmin or Tmax if overflow.
Examples: satMul2(0x30000000) = 0x60000000
satMul2(0x40000000) = 0x7FFFFFFF (saturate to TMax)
satMul2(0x60000000) = 0x80000000 (saturate to TMin)
Legal ops: ! ~ & ^ | << >>
Max ops: 20
Rating: 3

I want to implement a function similar to

if (a) return b;
else return c;

That's my solution (10 ops):

int satMul2(int x) {
    int rval = x << 1;
    int sign = rval ^ x;
    int minn = 1 << 31;
    int temp = sign >> 31;
    int flow = minn ^ (rval >> 31);
    return ((~temp) & rval) | (temp & flow);    
}

But professor told me there was a solution only 6 ops. Can someonne please give me some hints?

CodePudding user response：

Your solution has 2 instances of undefined behavior (one potential, one explicit) and 2 of explicit implementation defined behavior. Your professor would be well advised to teach about these instead of indulging in bogus challenges such as minimizing ops.

Here is a correct portable implementation:

#include <limits.h>

int satMul2(int x) {
    return x > INT_MAX / 2 ? INT_MAX :
           x < INT_MIN / 2 ? INT_MIN : x * 2;
}

As can be studied with Godbolt's Compiler Explorer, clang compiles this function to efficient branchless code:

satMul2:                                # @satMul2
        lea     eax, [rdi   rdi]
        cmp     edi, -1073741824
        mov     ecx, -2147483648
        cmovge  ecx, eax
        cmp     edi, 1073741824
        mov     eax, 2147483647
        cmovl   eax, ecx
        ret

CodePudding user response：

Unless any of the functionality of satMul2(), the available operations, or the counting of operations differs from what the question states, there is no solution in six operations, best I can tell.

For a straightforward and readily understandable implementation, see chqrlie's answer. If we insist on a bit-twiddling implementation, doing so in a portable way requires the use of unsigned int in intermediate computation. Overflow in signed integer addition can only occur if both addends have the same sign, and the sign of the result differs from the sign of both source operands. In two's complement arithmetic, which we assume here, the clamping bound is readily determined based on either the sign of the source operands or the sign of the raw addition result, as in the overflow case where the bound is needed these are opposites of each other.

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

unsigned int my_satmul2 (unsigned int a)
{
    const int max_shift = CHAR_BIT * sizeof (a) - 1;
    const unsigned int msb = 1u << max_shift;
    unsigned int r, bound;
    r = a   a;
    bound = (0 - (r >> max_shift))   msb; // alternative: msb - 1   (a >> max_shift)
    r = ((r ^ a) & msb) ? bound : r;
    return r;
}

int satMul2(int x) {
    return x > INT_MAX / 2 ? INT_MAX :
           x < INT_MIN / 2 ? INT_MIN : x * 2;
}

int main (void)
{
    unsigned int a, res, ref;

    a = 0;
    do {
        ref = satMul2 (a);
        res = my_satmul2 (a);
        if (res != ref) {
            printf ("a=x res=x ref=x\n", a, res, ref);
            return EXIT_FAILURE;
        }
        a  ;
    } while (a);
    return EXIT_SUCCESS;
}

Examining the generated code on Compiler Explorer, we find that clang 15.0.0 for x86-64 generates very efficient code:

my_satmul2:
        lea     ecx, [rdi   rdi]
        mov     eax, ecx
        sar     eax, 31
        add     eax, -2147483648
        xor     edi, ecx
        cmovns  eax, ecx
        ret

gcc 12.2 for x86-64 produces equally efficient but slightly different code:

my_satmul2:
        lea     eax, [rdi rdi]
        cdq
        add     edx, -2147483648
        xor     edi, eax
        cmovs   eax, edx
        ret

Since the question does not allow the use of the ternary operator, it must be replaced with a multiplexing construct based on a mask of all 0s or all 1s. Also, subtraction is not allowed, so a-b has to be replaced with a ~b.

unsigned int my_satmul2 (unsigned int a)
{
    const int max_shift = CHAR_BIT * sizeof (a) - 1;
    const unsigned int msb = 1u << max_shift;
    unsigned int r, bound, omask;
    r = a   a;
    omask = 1   ~((a ^ r) >> max_shift);
    bound = msb   ~(r >> max_shift)   1;
    r = ((bound ^ r) & omask) ^ r;
    return r;
}

On comparing to the asker's algorithm after creating the above code, I find that it uses essentially the same algorithm, but relies on the fact that right shifts on signed integer translate to an arithmetic right shift instruction, something which the C standard does not guarantee: "The result of E1 >> E2 is E1 right-shifted E2 bit positions ... If E1 has a signed type and a negative value, the resulting value is implementation-defined."

Interestingly, clang 15.0.0 for x86-64 generates exactly the same code for this now convoluted bit-twiddling implementation as for the simpler version using the ternary operator! gcc 12.2 for x86-64 does not fare quite as well:

my_satmul2:
        lea     edx, [rdi rdi]
        mov     eax, edx
        xor     edi, edx
        sar     eax, 31
        sar     edi, 31
        add     eax, -2147483648
        xor     eax, edx
        and     eax, edi
        xor     eax, edx
        ret

CodePudding user response：

Signed saturation bithack tricks

Note the saturation return values (0x7FFFFFFF and 0x80000000) are bitwise NOT of each other, and that we could generate them with tmp ^ 0x7FFFFFFF where tmp = x>>31, all-zero or all-one according to the sign. It's also ((unsigned)x>>31) (unsigned)INT_MAX, 0x7FFFFFFFu (0 or 1). (Unsigned to avoid signed-overflow UB in C.)

That's what Rust does, for x.saturating_mul(2) or x.saturating_add(x). (Yes, that's right, Rust has saturating add and mul as primitive operations on all primitive integer types. If you were using Rust, would you count this as one "operation"? CPUs don't run source code, they run asm. Some languages have more limited operations than most CPUs, especially C which also lacks rotates, popcount, clz/ctz; C only added those in C 20 with #include <bit>.)

// Rust
pub fn saturating_mul2(x: i32) -> i32 {
    x.saturating_mul(2)
}

pub fn saturating_add_self(x: i32) -> i32 {
    x.saturating_add(x)
}

Godbolt with asm for x86-64, AArch64, and --target armv7-unknown-linux-gnueabi:

example::saturating_mul2:           @@ ARMv7
        adds    r1, r0, r0               @ add, updating flags (S suffix)
        mvn     r2, #-2147483648         @ r2 = ~0x80000000 = 0x7FFFFFFF
        eorvs   r1, r2, r0, asr #31      @ if (adds overflowed) r1 = 0x7FFFFFFF ^ (x>>31);
                 @ eorvs is predicated on V Set = signed overflow, otherwise r1 left unmodified holding x x
        mov     r0, r1                @ get the result into the same reg as the input
        bx      lr

example::saturating_add_self:
        qadd    r0, r0, r0          @ yup, ARMv7 has an instruction for this!
        bx      lr                  @ the mul(2) version is a missed optimization

Or with AArch64:

example::saturating_mul2:
        asr     w8, w0, #31
        adds    w9, w0, w0                // x x and set flags
        eor     w8, w8, #0x7fffffff
        csel    w0, w8, w9, vs            // select 
        ret

example::saturating_add_self:
        adds    w8, w0, w0             // x x and set flags
        asr     w9, w8, #31            // (x<<1)>>31 is the opposite of the sign of the correct result, if there was overflow
        eor     w9, w9, #0x80000000
        csel    w0, w9, w8, vs
        ret

Note that LLVM's strategy for the add_self version is like a general x y saturation. It uses the opposite of the sign of the x x result from adds, so it has worse latency and instruction-level parallelism. (The right shift has to wait for the add, instead of running separately on the same input). Signed overflow is when the actual result has a different sign from the mathematically correct result, so this is a good trick if you have two separate inputs that might have different signs. But actually that's not necessary: can only overflow if both inputs have the same sign, so you can just pick either one.

x86-64 comes out pretty similar to the AArch64 versions, but with an extra mov instruction. (Or with the .saturating_add version, redoing the , once with lea to feed a sar right shift, once with add to generate the potential return value and set OF for a cmov.)

There's an optional / proposed C extension with _Sat types; see an SO answer, and an example of using it for addition: https://godbolt.org/z/5EdP1EnxT

How to get a C compiler to emit asm like this?

Generating the saturation result as (x>>31) ^ 0x7FFFFFFF is easy. At least if you can assume >> is arithmetic right shift. >> on signed negative integer is implementation-defined behaviour, but all mainstream compilers choose to define it in a useful way, at least on 2's complement systems.

So it's just a matter of detecting signed overflow in a left shift somehow.

Unfortunately in ISO C, signed integer overflow is undefined behaviour, including in x << 1. GNU C does define the behaviour of shifts even when they overflow (unlike for x x, for which you'd have to use __builtin_sadd_overflow), so IDK how portable you care about being.

If you're willing to use GNU C overflow detection builtins, (which can often compile to use the overflow flag set by addition, so that really is one primitive machine operation) see Signed saturated add of 64-bit ints? - for AArch64, clang emits 4 instructions, comparable to what rustc's saturating math does, although using add x9, x9, x1, lsr #63 to do INT64_MAX 0 or 1.

If you only care about 2's complement machines, you could use an unsigned type in the C source for the left shift. The whole bithack basically already assumes that, so we only need to use signed types for comparing to 0, or for arithmetic right shift.

#include <stdint.h>
// int32_t is guaranteed to be 2's complement with exactly 32 value bits, if it exists
// >> on it is *not* guaranteed to be arithmetic, but we can fairly safely assume that
int saturating_shl (int32_t x)
{
    int32_t saturated = (x>>31) ^ 0x7FFFFFFF;   // 2 operations, plus maybe a MOV
    int32_t x2 = ((uint32_t)x) << 1;            // 1 op.  avoid signed-integer overflow
    int32_t overflow = (x ^ x2);             // 1 op.  negative if x and x*2 have different signs, i.e. overflow
    return  (overflow < 0) ? saturated : x2;  // 1 op (cmov)
}

This is sub-optimal; the compiler fails to use the signed-overflow result of the ALU add or left-shift instruction, instead actually doing an XOR with the original value. But it is still only the 5 operations in the source code, e.g. GCC for x86-64 (Godbolt)

saturating_shl:
        mov     edx, edi
        lea     eax, [rdi rdi]       # x<<1
        sar     edx, 31              # x>>31
        xor     edx, 2147483647      # saturated
        xor     edi, eax             # x ^ x2 to set FLAGS
        cmovs   eax, edx             # SF is set when (x^x2) < 0
        ret

Update: I missed that the question arbitrarily disallowed the ternary operator. I'm not going to edit further as @njuffa posted an answer that avoids it, compiling to the same instructions as the ternary version with clang. (Although worse asm with GCC, which doesn't sort out the blend idiom back to a cmov / csel.)

Counting operations - CPU asm, not C operators

Real CPUs run machine code, not C operators. What matters when micro-optimizing is how efficiently your code can compile for specific machines. It's common for modern machines to have a conditional-select instruction like x86 cmov or AArch64 csel, so expressions like ((~temp) & rval) | (temp & flow); that manually blend according to a mask can hopefully compile to one machine operation, using a FLAGS condition instead of generating an integer mask from it.

Or if this is auto-vectorizing with SIMD, then a SIMD compare already produces a mask of all-0 / all-1 elements, and many ISAs have blend instructions which can apply them, like x86's SSE4.1 pblendvb. Also many ISAs have an andnot instruction like x86's SSE2 pandn which does (~mask) & rval in a single instruction, so that blend takes 3 cheap instructions, vs. one less-cheap pblendvb which is 2 uops on some CPUs; https://uops.info/.

But on the other hand, extra reg-reg mov or movdqa instructions are part of the cost you have to worry about, on machines like x86 without AVX that can only do z = x; z &= y;, not z = x&y in a single instruction.

TL:DR: counting the operators in C might be a rough proxy for cost, but it's not exact. There are other considerations than throughput, too, like latency (critical path length from inputs to output), and instruction-level parallelism.

Certain machines have instructions that can do things C doesn't have a single operator for. For example, 32-bit ARM has qadd, saturating signed addition, so a sufficiently smart compiler could optimize your function to one instruction if you write it correctly and the compiler recognizes the "idiom" you're using. Compilers can do this in practice for a few things, like rotates with (x >> (n&31)) | (x << ((-n)&31)) compiling to x86 rol

And many real CPUs set FLAGS based on signed overflow, or on the MSB, so a compare and ternary can sometimes be easier for the compiler to sort out than a right shift and using a mask.

Bitwise saturated addition in C (HW) - arbitrary x y with if(); a compiler can probably use predicated instructions for ARM.
C - Saturating Signed Integer Multiplication with Bitwise Operators
Signed saturated add of 64-bit ints? - using GNU C __builtin_saddll_overflow to get a bool overflow result from an addition, as well as the value.