The problem for signed 2's complement 32-bit integers:
satMul2
- multiplies by 2, saturating toTmin
orTmax
if overflow.
Examples:satMul2(0x30000000) = 0x60000000
satMul2(0x40000000) = 0x7FFFFFFF
(saturate toTMax
)
satMul2(0x60000000) = 0x80000000
(saturate toTMin
)
Legal ops:!
~
&
^
|
<<
>>
Max ops: 20
Rating: 3
I want to implement a function similar to
if (a) return b;
else return c;
That's my solution (10 ops):
int satMul2(int x) {
int rval = x << 1;
int sign = rval ^ x;
int minn = 1 << 31;
int temp = sign >> 31;
int flow = minn ^ (rval >> 31);
return ((~temp) & rval) | (temp & flow);
}
But professor told me there was a solution only 6 ops. Can someonne please give me some hints?
CodePudding user response:
Your solution has 2 instances of undefined behavior (one potential, one explicit) and 2 of explicit implementation defined behavior. Your professor would be well advised to teach about these instead of indulging in bogus challenges such as minimizing ops.
Here is a correct portable implementation:
#include <limits.h>
int satMul2(int x) {
return x > INT_MAX / 2 ? INT_MAX :
x < INT_MIN / 2 ? INT_MIN : x * 2;
}
As can be studied with Godbolt's Compiler Explorer, clang compiles this function to efficient branchless code:
satMul2: # @satMul2
lea eax, [rdi rdi]
cmp edi, -1073741824
mov ecx, -2147483648
cmovge ecx, eax
cmp edi, 1073741824
mov eax, 2147483647
cmovl eax, ecx
ret
CodePudding user response:
Unless any of the functionality of satMul2()
, the available operations, or the counting of operations differs from what the question states, there is no solution in six operations, best I can tell.
For a straightforward and readily understandable implementation, see chqrlie's answer. If we insist on a bit-twiddling implementation, doing so in a portable way requires the use of unsigned int
in intermediate computation. Overflow in signed integer addition can only occur if both addends have the same sign, and the sign of the result differs from the sign of both source operands. In two's complement arithmetic, which we assume here, the clamping bound is readily determined based on either the sign of the source operands or the sign of the raw addition result, as in the overflow case where the bound is needed these are opposites of each other.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
unsigned int my_satmul2 (unsigned int a)
{
const int max_shift = CHAR_BIT * sizeof (a) - 1;
const unsigned int msb = 1u << max_shift;
unsigned int r, bound;
r = a a;
bound = (0 - (r >> max_shift)) msb; // alternative: msb - 1 (a >> max_shift)
r = ((r ^ a) & msb) ? bound : r;
return r;
}
int satMul2(int x) {
return x > INT_MAX / 2 ? INT_MAX :
x < INT_MIN / 2 ? INT_MIN : x * 2;
}
int main (void)
{
unsigned int a, res, ref;
a = 0;
do {
ref = satMul2 (a);
res = my_satmul2 (a);
if (res != ref) {
printf ("a=x res=x ref=x\n", a, res, ref);
return EXIT_FAILURE;
}
a ;
} while (a);
return EXIT_SUCCESS;
}
Examining the generated code on Compiler Explorer, we find that clang 15.0.0 for x86-64 generates very efficient code:
my_satmul2:
lea ecx, [rdi rdi]
mov eax, ecx
sar eax, 31
add eax, -2147483648
xor edi, ecx
cmovns eax, ecx
ret
gcc 12.2 for x86-64 produces equally efficient but slightly different code:
my_satmul2:
lea eax, [rdi rdi]
cdq
add edx, -2147483648
xor edi, eax
cmovs eax, edx
ret
Since the question does not allow the use of the ternary operator, it must be replaced with a multiplexing construct based on a mask of all 0s or all 1s. Also, subtraction is not allowed, so a-b
has to be replaced with a ~b
.
unsigned int my_satmul2 (unsigned int a)
{
const int max_shift = CHAR_BIT * sizeof (a) - 1;
const unsigned int msb = 1u << max_shift;
unsigned int r, bound, omask;
r = a a;
omask = 1 ~((a ^ r) >> max_shift);
bound = msb ~(r >> max_shift) 1;
r = ((bound ^ r) & omask) ^ r;
return r;
}
On comparing to the asker's algorithm after creating the above code, I find that it uses essentially the same algorithm, but relies on the fact that right shifts on signed integer translate to an arithmetic right shift instruction, something which the C standard does not guarantee: "The result of E1 >> E2 is E1 right-shifted E2 bit positions ... If E1 has a signed type and a negative value, the resulting value is implementation-defined."
Interestingly, clang 15.0.0 for x86-64 generates exactly the same code for this now convoluted bit-twiddling implementation as for the simpler version using the ternary operator! gcc 12.2 for x86-64 does not fare quite as well:
my_satmul2:
lea edx, [rdi rdi]
mov eax, edx
xor edi, edx
sar eax, 31
sar edi, 31
add eax, -2147483648
xor eax, edx
and eax, edi
xor eax, edx
ret
CodePudding user response:
Signed saturation bithack tricks
Note the saturation return values (0x7FFFFFFF and 0x80000000) are bitwise NOT of each other, and that we could generate them with tmp ^ 0x7FFFFFFF
where tmp = x>>31
, all-zero or all-one according to the sign. It's also ((unsigned)x>>31) (unsigned)INT_MAX
, 0x7FFFFFFFu (0 or 1)
. (Unsigned to avoid signed-overflow UB in C.)
That's what Rust does, for x.saturating_mul(2)
or x.saturating_add(x)
. (Yes, that's right, Rust has saturating add and mul as primitive operations on all primitive integer types. If you were using Rust, would you count this as one "operation"? CPUs don't run source code, they run asm. Some languages have more limited operations than most CPUs, especially C which also lacks rotates, popcount, clz/ctz; C only added those in C 20 with #include <bit>
.)
// Rust
pub fn saturating_mul2(x: i32) -> i32 {
x.saturating_mul(2)
}
pub fn saturating_add_self(x: i32) -> i32 {
x.saturating_add(x)
}
Godbolt with asm for x86-64, AArch64, and --target armv7-unknown-linux-gnueabi
:
example::saturating_mul2: @@ ARMv7
adds r1, r0, r0 @ add, updating flags (S suffix)
mvn r2, #-2147483648 @ r2 = ~0x80000000 = 0x7FFFFFFF
eorvs r1, r2, r0, asr #31 @ if (adds overflowed) r1 = 0x7FFFFFFF ^ (x>>31);
@ eorvs is predicated on V Set = signed overflow, otherwise r1 left unmodified holding x x
mov r0, r1 @ get the result into the same reg as the input
bx lr
example::saturating_add_self:
qadd r0, r0, r0 @ yup, ARMv7 has an instruction for this!
bx lr @ the mul(2) version is a missed optimization
Or with AArch64:
example::saturating_mul2:
asr w8, w0, #31
adds w9, w0, w0 // x x and set flags
eor w8, w8, #0x7fffffff
csel w0, w8, w9, vs // select
ret
example::saturating_add_self:
adds w8, w0, w0 // x x and set flags
asr w9, w8, #31 // (x<<1)>>31 is the opposite of the sign of the correct result, if there was overflow
eor w9, w9, #0x80000000
csel w0, w9, w8, vs
ret
Note that LLVM's strategy for the add_self
version is like a general x y
saturation. It uses the opposite of the sign of the x x
result from adds
, so it has worse latency and instruction-level parallelism. (The right shift has to wait for the add
, instead of running separately on the same input). Signed overflow is when the actual result has a different sign from the mathematically correct result, so this is a good trick if you have two separate inputs that might have different signs. But actually that's not necessary:
can only overflow if both inputs have the same sign, so you can just pick either one.
x86-64 comes out pretty similar to the AArch64 versions, but with an extra mov
instruction. (Or with the .saturating_add
version, redoing the
, once with lea
to feed a sar
right shift, once with add
to generate the potential return value and set OF for a cmov
.)
There's an optional / proposed C extension with _Sat
types; see an SO answer, and an example of using it for addition: https://godbolt.org/z/5EdP1EnxT
How to get a C compiler to emit asm like this?
Generating the saturation result as (x>>31) ^ 0x7FFFFFFF
is easy. At least if you can assume >>
is arithmetic right shift. >>
on signed negative integer is implementation-defined behaviour, but all mainstream compilers choose to define it in a useful way, at least on 2's complement systems.
So it's just a matter of detecting signed overflow in a left shift somehow.
Unfortunately in ISO C, signed integer overflow is undefined behaviour, including in x << 1
. GNU C does define the behaviour of shifts even when they overflow (unlike for x x
, for which you'd have to use __builtin_sadd_overflow
), so IDK how portable you care about being.
If you're willing to use GNU C overflow detection builtins, (which can often compile to use the overflow flag set by addition, so that really is one primitive machine operation) see Signed saturated add of 64-bit ints? - for AArch64, clang emits 4 instructions, comparable to what rustc
's saturating math does, although using add x9, x9, x1, lsr #63
to do INT64_MAX
0 or 1.
If you only care about 2's complement machines, you could use an unsigned type in the C source for the left shift. The whole bithack basically already assumes that, so we only need to use signed types for comparing to 0, or for arithmetic right shift.
#include <stdint.h>
// int32_t is guaranteed to be 2's complement with exactly 32 value bits, if it exists
// >> on it is *not* guaranteed to be arithmetic, but we can fairly safely assume that
int saturating_shl (int32_t x)
{
int32_t saturated = (x>>31) ^ 0x7FFFFFFF; // 2 operations, plus maybe a MOV
int32_t x2 = ((uint32_t)x) << 1; // 1 op. avoid signed-integer overflow
int32_t overflow = (x ^ x2); // 1 op. negative if x and x*2 have different signs, i.e. overflow
return (overflow < 0) ? saturated : x2; // 1 op (cmov)
}
This is sub-optimal; the compiler fails to use the signed-overflow result of the ALU add or left-shift instruction, instead actually doing an XOR with the original value. But it is still only the 5 operations in the source code, e.g. GCC for x86-64 (Godbolt)
saturating_shl:
mov edx, edi
lea eax, [rdi rdi] # x<<1
sar edx, 31 # x>>31
xor edx, 2147483647 # saturated
xor edi, eax # x ^ x2 to set FLAGS
cmovs eax, edx # SF is set when (x^x2) < 0
ret
Update: I missed that the question arbitrarily disallowed the ternary operator. I'm not going to edit further as @njuffa posted an answer that avoids it, compiling to the same instructions as the ternary version with clang. (Although worse asm with GCC, which doesn't sort out the blend idiom back to a cmov / csel.)
Counting operations - CPU asm, not C operators
Real CPUs run machine code, not C operators. What matters when micro-optimizing is how efficiently your code can compile for specific machines. It's common for modern machines to have a conditional-select instruction like x86 cmov
or AArch64 csel
, so expressions like ((~temp) & rval) | (temp & flow);
that manually blend according to a mask can hopefully compile to one machine operation, using a FLAGS condition instead of generating an integer mask from it.
Or if this is auto-vectorizing with SIMD, then a SIMD compare already produces a mask of all-0 / all-1 elements, and many ISAs have blend instructions which can apply them, like x86's SSE4.1 pblendvb
. Also many ISAs have an andnot instruction like x86's SSE2 pandn
which does (~mask) & rval
in a single instruction, so that blend takes 3 cheap instructions, vs. one less-cheap pblendvb
which is 2 uops on some CPUs; https://uops.info/.
But on the other hand, extra reg-reg mov
or movdqa
instructions are part of the cost you have to worry about, on machines like x86 without AVX that can only do z = x;
z &= y;
, not z = x&y
in a single instruction.
TL:DR: counting the operators in C might be a rough proxy for cost, but it's not exact. There are other considerations than throughput, too, like latency (critical path length from inputs to output), and instruction-level parallelism.
Certain machines have instructions that can do things C doesn't have a single operator for. For example, 32-bit ARM has qadd
, saturating signed addition, so a sufficiently smart compiler could optimize your function to one instruction if you write it correctly and the compiler recognizes the "idiom" you're using. Compilers can do this in practice for a few things, like rotates with (x >> (n&31)) | (x << ((-n)&31))
compiling to x86 rol
And many real CPUs set FLAGS based on signed overflow, or on the MSB, so a compare and ternary can sometimes be easier for the compiler to sort out than a right shift and using a mask.
Related:
- Bitwise saturated addition in C (HW) - arbitrary
x
y
withif()
; a compiler can probably use predicated instructions for ARM. - C - Saturating Signed Integer Multiplication with Bitwise Operators
- Signed saturated add of 64-bit ints? - using GNU C
__builtin_saddll_overflow
to get abool
overflow result from an addition, as well as the value.