I have the following code snippet (a gist can be found here) where I am trying to add 4 negative int32_t values to 4 int16_t values (which are sign-extended to int32_t first).
extern exit
global _start
section .data
a: dd -76, -84, -84, -132
b: dw 406, 406, 406, 406
section .text
_start:
movdqa xmm0, [a]
pmovsxwd xmm2, [b]
paddq xmm0, xmm2
;Expected: 330, 322, 322, 274
;Results: 330, 323, 322, 275
call exit
However, when stepping through it in my debugger, I couldn't understand why the actual results differ from the expected ones. Any ideas?
CodePudding user response:
paddq adds 64-bit qword chunks, so there's carry across two of the 32-bit boundaries (the one in the middle of each qword), leading to an off-by-one in the high half of each qword.
paddd adds 32-bit dword chunks, matching the dword destination element size of pmovsxwd. It is a SIMD operation with 4 separate adds, independent of each other.
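To see where the extra 1 comes from, here is a minimal sketch in plain C (no intrinsics) that models the two instructions on the arrays from the question. The function names add_dwords and add_qwords are made up for illustration, and the code only mimics the lane arithmetic; it is not how the CPU implements either instruction.

```c
#include <stdint.h>

/* Hypothetical model of paddd: four independent 32-bit adds,
   each wrapping modulo 2^32. No carry leaves a lane. */
void add_dwords(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
}

/* Hypothetical model of paddq: the same 16 bytes viewed as two 64-bit
   lanes. Element i is the low half and element i+1 the high half of a
   qword (little-endian layout), so a carry out of the low dword is
   added into the high dword. */
void add_qwords(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
    for (int i = 0; i < 4; i += 2) {
        uint64_t qa = (uint64_t)(uint32_t)a[i] | ((uint64_t)(uint32_t)a[i + 1] << 32);
        uint64_t qb = (uint64_t)(uint32_t)b[i] | ((uint64_t)(uint32_t)b[i + 1] << 32);
        uint64_t sum = qa + qb;                      /* one 64-bit add per lane */
        out[i]     = (int32_t)(uint32_t)sum;         /* low dword of the result */
        out[i + 1] = (int32_t)(uint32_t)(sum >> 32); /* high dword, includes the carry */
    }
}
```

With a = {-76, -84, -84, -132} and b = {406, 406, 406, 406}, add_dwords gives 330, 322, 322, 274 and add_qwords gives 330, 323, 322, 275. Both low-half adds wrap past 2^32 (0xFFFFFFB4 + 0x196 and 0xFFFFFFAC + 0x196), so each produces a carry that lands in the neighbouring high element.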
BTW, you could have made this more efficient by folding the 16-byte aligned load into a memory operand for paddd, but yeah, for debugging it can help to see both inputs in registers with a separate load.
default rel ; use RIP-relative addressing modes when possible
_start:
pmovsxwd xmm0, [b]
paddd xmm0, [a]
Also, you'd normally put read-only arrays in section .rodata.