Using LDRD in GNU C inline asm? What constraints to use?

TL;DR: I'm playing around with extended asm and burned my fingers. Do my constraints make sense?

While playing around with memory, I wanted to test reading some memory manually on an ARM CPU (Cortex-A9).

(Disclaimer: this is for learning purposes. Of course I agree that relying on the optimizer is the right thing to do 99.999% of the time, but I would really like to understand why everything explodes here.)

On the hardware in question, the CPU-memory bus is 64 bits wide, so I'm trying to use the ldrd instruction to load two 32-bit words at once.
The data in memory is 128-bit aligned, so let's use ldrd twice.

My problem is that the generated assembly (or the attempt to generate it) does not make sense, regardless of:

- Compiler (tested with GCC and clang)
- Optimization level (tested with -O0, -Og, -O2, -O3)
- Cross vs. native (tested with arm-linux-gnueabihf-gcc and native gcc)

Here is a minimal example demonstrating the issue:

#include <stdint.h>


// custom structure: represent 128 bits
typedef struct __attribute__ ((packed)) u128
{
  uint32_t a;
  uint32_t b;
  uint32_t c;
  uint32_t d;
} u128;



int main(void)
{
  uint32_t *ptr = (uint32_t*) 0xdeadbeef; // For test purpose, just a random location in memory
  u128 words;

  // 1st read: 64 bits
  asm volatile inline (
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    : [high_32b] "=X" (words.a), [low_32b] "=X" (words.b)
    : [addr] "r" (ptr));

  // 2nd read: 64 bits
  asm volatile inline (
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    : [high_32b] "=X" (words.c), [low_32b] "=X" (words.d)
    : [addr] "r" (ptr));

  return 0;
}

GCC

arm-linux-gnueabihf-gcc -Wall -Wextra -O3 -g -ggdb broken_asm.c -o broken_asm
/tmp/ccIaxiTz.s: Assembler messages:
/tmp/ccIaxiTz.s:51: Warning: base register written back, and overlaps one of transfer registers

Disassembly (radare2 -A -c 's sym.main; pdf' broken_asm):

│ 0x000003da f3e80221 ldrd r2, r1, [r3], 8
| 0x000003de f3e80232 ldrd r3, r2, [r3], 8 ; broken_asm.c:27 asm volatile inline (

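So yes, the warning makes sense: ldrd r3, r2, [r3], 8 is broken, because r3 is used both as the written-back base register and as a destination register.

(Expected: the base register differs from the destination registers, for instance ldrd r3, r2, [r4], 8.)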

Clang

clang -mtune=cortex-a9 --target=arm-linux-gnueabihf -isystem /usr/arm-linux-gnueabihf/include -Wall -Wextra -O3 -g -ggdb broken_asm.c -o broken_asm

broken_asm.c:22:5: error: Rt must be even-numbered
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    ^
<inline asm>:1:11: note: instantiated into assembly here
        ldrd r1, r2, [r0], #8
        ^
broken_asm.c:28:5: error: base register needs to be different from destination registers
    "ldrd %[high_32b], %[low_32b], [%[addr]], #8"
    ^
<inline asm>:1:11: note: instantiated into assembly here
        ldrd r0, r1, [r0], #8
        ^
2 errors generated.

So, let's read the error messages:

base register needs to be different from destination registers

OK, the same issue as with GCC (and yes, it feels more like an error than a warning).

error: Rt must be even-numbered

Wait, what? ldrd r1, r2, ...: indeed, the first operand must be an even-numbered register and the second one the following odd-numbered register.

From the ARM Instructions Reference:

Rt: The first destination register. For an ARM instruction, must be even-numbered and not R14.

Rt2: The second destination register. For an ARM instruction, must be R(t+1).
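The register rules above can be written down as a small predicate, together with the writeback rule behind GCC's warning and clang's second error. This is a sketch that only models the constraints mentioned in this post for ARM (A32) state. The helper name ldrd_a32_regs_ok is made up for illustration. Thumb-2 encodings are not covered, since the quoted reference restricts these rules to "an ARM instruction".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: checks a register choice for A32 LDRD against the
   rules quoted in the question. Registers are numbers 0..15.
   Thumb-2 LDRD has different rules and is NOT modelled here. */
static bool ldrd_a32_regs_ok(unsigned rt, unsigned rt2, unsigned rn, bool writeback)
{
    assert(rt <= 15 && rt2 <= 15 && rn <= 15);
    if (rt % 2 != 0 || rt == 14)
        return false;          /* "Rt must be even-numbered and not R14" */
    if (rt2 != rt + 1)
        return false;          /* "Rt2 must be R(t+1)" */
    if (writeback && (rn == rt || rn == rt2))
        return false;          /* base written back, overlaps a destination */
    return true;
}
```

For example, clang's two rejected instructions, ldrd r1, r2, [r0], #8 and ldrd r0, r1, [r0], #8, both fail this check, while ldrd r2, r3, [r4], #8 passes.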

I am pretty sure I'm doing something wrong in the extended asm. Since those are nearly the only lines of code that actually do anything, that's not hard to guess.

Here is my understanding of the constraints so far:

Output:

The registers in which I want the output are, as far as I understand, write-only:

‘=’ identifies an operand which is only written

I started with "g" as a constraint (same effect) but switched to "X" to give the mighty compiler more freedom:

'X' Any operand whatsoever is allowed.

Input:

I'm using "r" because I want both ldrd instructions to read from the same register. I also tried "X" but got the same issue.

'r' A register operand is allowed provided that it is in a general register.

Some notes, since this post is too short :/

- Host: Linux (Debian)
- Target: Zynq 7000 (PS side: Cortex-A9)
- clang --version: Debian clang version 11.0.1-2
- Cross gcc: arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110
- Native gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110
- Manually patching the registers in the binary's opcodes seems to work as intended.

So, I genuinely have no idea what I'm doing wrong here. Any pointer welcomed.

Answer:

GCC generally supports the same inline assembly features as armclang, though unfortunately the GCC manual does not document them. In the armclang docs, you can read:

If you use a 64-bit value as an operand to an inline assembly statement in A32 or 32-bit T32 instructions, and you use the r constraint code, then an even/odd pair of general purpose registers is allocated to hold it. This register allocation is not guaranteed for the l or h constraints.

Using the r constraint code enables the use of instructions like LDREXD/STREXD, which require an even/odd register pair. You can reference the registers holding the most and least significant halves of the value with the Q and R template modifiers.

So loading two registers with ldrd could look like:

#include <stdint.h>

uint64_t get_pair(void *ptr) {
    uint64_t result;
    asm("ldrd %Q[pair], %R[pair], [%[addr]]"
        : [pair] "=r" (result)
        : [addr] "r" (ptr)
        : "memory");
    return result;
}


If you want to extract the two halves separately, you can follow the asm block with something like

uint32_t lo, hi;
lo = result; // conversion truncates
hi = result >> 32;

With optimizations enabled, you can be confident that the compiler will simply use the register holding the high half rather than actually executing a shift. This is a common idiom that compilers recognize.

There are a couple other issues with the code in your question:

- You are using the writeback (post-increment) addressing mode, which modifies your address register, but you do not inform the compiler of that. You would need to make your addr an input-output operand: list it with the outputs and use the "+r" constraint. But keep in mind that this is pointless unless you actually use the updated value later in the code; if not, just use a non-writeback addressing mode.
- By default the compiler assumes your asm statement does not read any memory, so memory writes could be reordered past it. asm volatile does not prevent this; it only prevents the compiler from omitting the asm entirely when it thinks its outputs are unused.

  A memory clobber, as in my example above, is the simplest and crudest way to fix this: it tells the compiler that the asm statement may read or write arbitrary parts of memory, so no other memory accesses may be reordered past it. Better is an "m" input operand with a variable whose type has the size that is to be read. I won't bother with it here, but see "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?" for more information.
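Applied to the question's post-increment load, both points could look like the following sketch. It makes these assumptions:

- The target is ARM, and a 64-bit "r" operand lands in a usable register pair, as the armclang documentation quoted above describes.
- load_pair_postinc is a hypothetical name.
- The #else branch is only a portable fallback so the function also builds on other hosts.

The pointer becomes a "+r" input/output operand, so the compiler knows the writeback changed it. An "m" operand covering exactly the two words read replaces the blanket "memory" clobber.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: loads 64 bits from *pp, then advances *pp by two
   32-bit words (8 bytes), like the question's post-increment ldrd. */
static uint64_t load_pair_postinc(const uint32_t **pp)
{
    const uint32_t *p = *pp;
    uint64_t v;
    assert(p != NULL);
#if defined(__arm__)
    /* "+r": p is both read and written back, so the compiler knows its
       new value and never gives v a register that overlaps p.
       The unused "m" operand tells the compiler which 8 bytes are read. */
    __asm__("ldrd %Q[v], %R[v], [%[p]], #8"
            : [v] "=r" (v), [p] "+r" (p)
            : "m" (*(const uint32_t (*)[2]) p));
#else
    /* Portable fallback: same effect without inline asm. */
    memcpy(&v, p, sizeof v);
    p += 2;
#endif
    *pp = p;
    return v;
}
```

Calling it twice on the same pointer reads the 128-bit block in two steps, which is what the two asm statements in the question were trying to do.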

Rather than having your reads in two separate asm statements and using the writeback mode to change the pointer between them, I would put them in a single asm statement and skip the writeback addressing altogether. Here's my attempt:

  uint64_t hi64, lo64; 
  asm inline("ldrd %Q[lo], %R[lo], [%[addr]] \n\t"
             "ldrd %Q[hi], %R[hi], [%[addr], #8]"
             : [lo] "=&r" (lo64), [hi] "=r" (hi64)
             : [addr] "r" (ptr)
             : "memory");
  words.a = lo64;
  words.b = lo64 >> 32;
  words.c = hi64;
  words.d = hi64 >> 32;

Note the "earlyclobber" & modifier on the lo operand, indicating that it is written before all the inputs are read. Without this, the compiler might use one of the lo registers for addr, in which case it would be overwritten by the first ldrd instruction, and the second one would break. However, we did not use & on hi; since addr is not needed after hi is written, it is okay if they use the same register.
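For completeness, here is the single-statement version above wrapped into a self-contained function. This is a sketch with these assumptions:

- read_u128 is a hypothetical name, and the u128 type mirrors the question's struct.
- The word order assumes a little-endian target, as arm-linux-gnueabihf is.
- The memcpy branch is just a fallback for non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the question's u128; four uint32_t members have no padding in
   practice, so the packed attribute is left out here. */
typedef struct { uint32_t a, b, c, d; } u128;

/* Hypothetical helper: reads 128 bits at ptr as two 64-bit loads. */
static u128 read_u128(const uint32_t *ptr)
{
    u128 words;
    uint64_t lo64, hi64;
    assert(ptr != NULL);
#if defined(__arm__)
    /* One asm statement, no writeback. "=&r" keeps lo's registers away
       from addr, which the second ldrd still needs; hi may reuse it. */
    __asm__("ldrd %Q[lo], %R[lo], [%[addr]]\n\t"
            "ldrd %Q[hi], %R[hi], [%[addr], #8]"
            : [lo] "=&r" (lo64), [hi] "=r" (hi64)
            : [addr] "r" (ptr)
            : "memory");
#else
    memcpy(&lo64, ptr, sizeof lo64);
    memcpy(&hi64, ptr + 2, sizeof hi64);
#endif
    /* Truncation and shift: the low half holds the word at the lower
       address on a little-endian target. */
    words.a = (uint32_t)lo64;
    words.b = (uint32_t)(lo64 >> 32);
    words.c = (uint32_t)hi64;
    words.d = (uint32_t)(hi64 >> 32);
    return words;
}
```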