I want to add 2 unsigned vectors using AVX2
__m256i i1 = _mm256_loadu_si256((__m256i *) si1);
__m256i i2 = _mm256_loadu_si256((__m256i *) si2);
__m256i result = _mm256_adds_epu16(i2, i1);
however I need to have overflow instead of saturation that _mm256_adds_epu16
does to be identical with the non-vectorized code, is there any solution for that?
CodePudding user response:
Use normal binary wrapping _mm256_add_epi16
instead of saturating adds
.
Two's complement and unsigned addition/subtraction are the same binary operation, that's one of the reasons modern computers use two's complement. As the asm manual entry for vpaddw
mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn't mention signedness at all, so is less helpful at clearing up this confusion.)
Compares like _mm_cmpgt_epi32
are sensitive to signedness, but math operations (and cmpeq
) aren't.
The intrinsics names Intel chose might look like they're for signed integers specifically, but they always use epi
or si
for things that work equally on signed and unsigned elements. But no, epu
implies a specifically unsigned thing, while epi
can be specifically signed operations or can be things that work equally on signed or unsigned. Or things where signedness is irrelevant.
For example, _mm_and_si128
is pure bitwise. _mm_srli_epi32
is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that's _mm_srai_epi32
(shift right arithmetic by immediate). Shuffles like _mm_shuffle_epi32
just move data around in chunks.
Non-widening multiplication like _mm_mullo_epi16
and _mm_mullo_epi32
are also the same for signed or unsigned. Only the high-half _mm_mulhi_epu16
or widening multiplies _mm_mul_epu32
have unsigned forms as counterparts to their specifically signed epi16
/32
forms.
That's also why 386 only added a scalar integer imul ecx, esi
form, not also a mul ecx, esi
, because only the FLAGS setting would differ, not the integer result. And SIMD operations don't even have FLAGS outputs.
The intrinsics guide unhelpfully describes _mm_mullo_epi16
as sign-extending and producing a 32-bit product, then truncating to the low 32-bit. The asm manual for pmullw
also describes it as signed that way, it seems talking about it as the companion to signed pmulhw
. (And has some bugs, like describing the AVX1 VPMULLW xmm1, xmm2, xmm3/m128
form as multiplying 32-bit dword elements, probably a copy/paste error from pmulld
)
And sometimes Intel's naming scheme is limited, like _mm_maddubs_epi16
is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for pmaddubsw
to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness so if they have to pick one, side, I guess it makes sense to name it for the output, with the signed saturation that can happen with some inputs, like for pmaddwd
.