operand type mismatch for `lddqu' with an __int128 "=r" destination

Time:12-31

I need to move a 128-bit value from the address [rsi - 0x80] into the dst variable below using the lddqu instruction, and I get the error "operand type mismatch for `lddqu'". I know there are earlier Stack Overflow questions about smaller operand sizes, but what suffix should I use with the instruction to get the value at that address into the variable?

__int128 dst = 0, src = 0;
asm volatile ("lddqu -0x80(%%rsi), %0\n\t"
              : "=r" (dst)
              : "r" (src));

To give an overview of the whole problem: this is only one instruction from a larger graph algorithm that finds the shortest path between two vertices. The src variable is redundant and can be removed if it adds ambiguity. I am designing a hardware prefetcher (in a processor simulator) that predicts future memory addresses from the addresses currently being accessed. Once I can get an address into a variable like dst, I have a technique that automatically predicts the future address and issues the memory request for that address.

A larger version of the pattern is a sequence of loads and stores, and looks like this:

  lddqu  xmm0,[rsi-0x80]
  movdqu XMMWORD PTR [rdi-0x80],xmm0
  lddqu  xmm0,[rsi-0x70]
  movdqu XMMWORD PTR [rdi-0x70],xmm0
  lddqu  xmm0,[rsi-0x60]
  movdqu XMMWORD PTR [rdi-0x60],xmm0
  lddqu  xmm0,[rsi-0x50]
  movdqu XMMWORD PTR [rdi-0x50],xmm0
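For comparison, the same four load/store pairs can be sketched in C with SSE intrinsics instead of inline asm. This is only a sketch: the function name copy64_before is hypothetical, the two pointer parameters are assumed to stand in for rsi and rdi, and the target("sse3") attribute is a GCC/Clang extension used so the SSE3 intrinsic compiles without passing -msse3 for the whole file.

```c
#include <stddef.h>
#include <pmmintrin.h>  /* SSE3 _mm_lddqu_si128; also includes the SSE2 header */

/* Hypothetical helper: copy the 64 bytes starting at src - 0x80 to
   dst - 0x80 in 16-byte chunks, mirroring the lddqu/movdqu listing.
   dst and src play the roles of rdi and rsi (an assumption). */
__attribute__((target("sse3")))
static void copy64_before(unsigned char *dst, const unsigned char *src)
{
    for (ptrdiff_t off = -0x80; off < -0x40; off += 0x10) {
        /* roughly: lddqu xmm0, [rsi+off] */
        __m128i v = _mm_lddqu_si128((const __m128i *)(src + off));
        /* roughly: movdqu XMMWORD PTR [rdi+off], xmm0 */
        _mm_storeu_si128((__m128i *)(dst + off), v);
    }
}
```

The compiler chooses the registers and may or may not unroll the loop, so the generated code will not necessarily match the listing instruction for instruction.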

I am now trying to get the inline asm working with Intel syntax.

Answer:

lddqu can only load into a vector (XMM) register, not a general-purpose register. Use "=x" instead of "=r" as the constraint for dst.

Also, your source looks suspicious: you ignore src and just load from an arbitrary offset of a register (rsi) whose contents you know nothing about.
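A minimal sketch of a statement that addresses both points, assuming GCC-style extended asm in AT&T syntax: "=x" lets the compiler pick an XMM register for dst, and an "m" operand names the 16 bytes at base - 0x80 explicitly, so the compiler knows what memory is read instead of the asm silently relying on whatever happens to be in rsi. The names load_at_minus_0x80 and base are hypothetical.

```c
/* Hypothetical helper: load the 16 bytes at base - 0x80 with lddqu.
   "=x"  -> dst goes in an XMM register (the only destination lddqu accepts).
   "m"   -> the exact 16-byte memory input, so no hidden dependence on rsi
            and no need for a "memory" clobber. */
static inline __int128 load_at_minus_0x80(const unsigned char *base)
{
    __int128 dst;
    __asm__ ("lddqu %1, %0"
             : "=x" (dst)
             : "m" (*(const unsigned char (*)[16])(base - 0x80)));
    return dst;
}
```

volatile is left out because the statement has no side effects beyond producing dst; the compiler picks the actual base register and addressing mode for the "m" operand.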

Look at the compiler-generated asm around your asm statement (for example on https://godbolt.org/, especially with -O2 optimization enabled) to see how the compiler gets __int128 dst back into memory or integer registers after you force it into an XMM register.
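To make the "back into integer registers" part concrete: an __int128 in integer registers is just two 64-bit halves (the x86-64 SysV ABI, for instance, returns it in rdx:rax), so whatever the compiler emits after the asm statement has to produce the equivalent of the two hypothetical helpers below. Which instructions it uses for that (for example movq for the low half, and pextrq with SSE4.1 or a store and reload for the high half) depends on options like -march, so treat this as an illustration and check the real output.

```c
#include <stdint.h>

/* Hypothetical helpers: the low and high 64-bit halves of an __int128. */
static inline uint64_t int128_lo(__int128 v)
{
    return (uint64_t)v;
}

static inline uint64_t int128_hi(__int128 v)
{
    /* shift as unsigned so the result does not depend on the sign bit */
    return (uint64_t)((unsigned __int128)v >> 64);
}
```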

Using inline asm like this is probably even worse for efficiency than using SSE intrinsics such as _mm_loadu_si128; see also https://gcc.gnu.org/wiki/DontUseInlineAsm
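A sketch of the intrinsics route, with no inline asm at all, assuming an x86-64 target where SSE2 is baseline. load_u128 is a hypothetical name; memcpy from the __m128i into the __int128 is used as a well-defined way to reinterpret the 16 bytes, and the compiler is free to optimize the copy away.

```c
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 */

/* Hypothetical helper: unaligned 16-byte load into an __int128.
   The compiler sees an ordinary load, so it can schedule it, fold it
   into other instructions, or keep the result wherever it is needed. */
static inline __int128 load_u128(const void *p)
{
    __m128i x = _mm_loadu_si128((const __m128i *)p);
    __int128 v;
    memcpy(&v, &x, sizeof v);
    return v;
}
```

For the original case, load_u128(ptr - 0x80) would take the place of the asm statement. If you specifically want the lddqu instruction, pmmintrin.h (SSE3) provides _mm_lddqu_si128, which takes the same kind of pointer argument.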
