Manipulate vector register as float32x4_t C variable in ARM

I'm using inline assembly on ARM for a scientific application. In my assembly code, I have to (see the note at the end) explicitly name the vector registers I want to use. For example, my code contains asm volatile("fadd v12.4S, v12.4S, v7.4S"), which performs a vector floating-point add of vector registers 12 and 7 and stores the result in vector register 12, among other inline assembly instructions.

After the 'critical' assembly code part, I want to retrieve the resulting values and operate on them as ARM NEON variables in C. In my case, each vector holds four 32-bit floats, so the variables will be of type float32x4_t.

So far I can do something like:

float32_t my_var[4];
asm volatile("st1  {v12.4S}, [%[addr]]\n\t" : : [addr]"r"(my_var) : "memory");
/*from here on I can operate on my_var[0], my_var[1], etc without having to write asm code*/

That is, I use a vector store instruction to write the contents of the vector register into a C array. Subsequent accesses to that array then become loads, which I want to avoid because the data already sits in a register.

I'd like to have something similar to

float32x4_t my_var;
asm volatile("some code that makes sure my_var 'binds' to vector register 12");
/* from here on I could use intrinsics such as vgetq_lane_f32(my_var, 1) to read each lane, again without writing asm code */

However, I could not find a way to do the second approach. This old question had similar concerns, but it was for an older ARM ISA (I'm targeting v8), and to load from (not store to) a single (not vector) variable.

Note: I cannot use intrinsic calls from the beginning (which would make things much easier), because I'm modeling new instructions in a simulator, and I need to write low-level assembly up to that part.

CodePudding user response：

You can use the w machine constraint to pass SIMD registers as operands to an inline assembly statement. This causes the compiler to pick a SIMD register for you.

float32x4_t add(float32x4_t a, float32x4_t b)
{
    float32x4_t c;

    asm ("fadd %0.4s, %1.4s, %2.4s" : "=w"(c) : "w"(a), "w"(b));

    return c;
}

Note that the compiler is permitted to overwrite arbitrary registers between inline assembly statements. It is not possible to prevent this.

You can however tell the compiler which SIMD register to use for an operand using local register variables. This does not guarantee that the variable will reside in the indicated register at all times, but it is at least guaranteed right before and after each inline asm statement in which that variable is listed as an operand (see the manual for details).

float32x4_t add(float32x4_t a_, float32x4_t b_)
{
    register float32x4_t c asm ("v12");
    register float32x4_t a asm ("v12") = a_;
    register float32x4_t b asm ("v4") = b_;

    asm ("fadd %0.4s, %1.4s, %2.4s" : "=w"(c) : "w"(a), "w"(b));

    return c;
}
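
Applied to the case in the question (a hard-coded fadd on v12 and v7, with the result then read through intrinsics), the same idea might look like the sketch below. It is not taken from the question's code: the helper name lane1_of_sum is made up for illustration, GCC is assumed as the compiler, and the non-AArch64 branch exists only so the snippet still compiles on other hosts.

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name chosen for illustration): adds two 4-float
   vectors with a hand-written fadd on v12/v7 and returns lane 1. */
static float lane1_of_sum(const float a[4], const float b[4])
{
#if defined(__aarch64__)
    float32x4_t a_ = vld1q_f32(a);
    float32x4_t b_ = vld1q_f32(b);

    /* Ask the compiler to have these in v12 and v7 at the asm statement. */
    register float32x4_t acc asm ("v12") = a_;
    register float32x4_t rhs asm ("v7")  = b_;

    /* "+w"(acc): v12 is both read and written; "w"(rhs): v7 is only read.
       Listing them as operands is what makes the register binding hold here. */
    asm volatile ("fadd v12.4s, v12.4s, v7.4s" : "+w"(acc) : "w"(rhs));

    /* From here on acc is an ordinary float32x4_t; no st1 to memory needed. */
    return vgetq_lane_f32(acc, 1);
#else
    /* Stand-in for non-AArch64 hosts (assumption: only used to keep the
       sketch compilable, it does not exercise the technique). */
    return a[1] + b[1];
#endif
}
```

The important part is that every register named in the asm text also appears as an operand bound to a register variable; otherwise the compiler has no idea the statement touches v12 or v7.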

Theoretically it should also be possible to use arithmetic to build the correct opcode at assembly time, but there does not seem to be a way to get gcc to print the register number it chose without any decoration (boo!). Suppose there was such a template modifier X, such code could look like this:

float32x4_t add3(float32x4_t a, float32x4_t b)
{
    float32x4_t c;

    asm (".inst 0x4e20d400 + %X0 + (%X1<<5) + (%X2<<16)" : "=w"(c) : "w"(a), "w"(b));

    return c;
}

It might be worth patching support for such a thing into your local gcc/clang build if you need this feature.
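
The opcode arithmetic itself can be checked in plain C, independent of any compiler support. The sketch below assumes the A64 encoding of fadd Vd.4s, Vn.4s, Vm.4s has the base word 0x4e20d400 with Rd in bits 0-4, Rn in bits 5-9 and Rm in bits 16-20; the helper name encode_fadd_4s is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: builds the instruction word for
   "fadd v<rd>.4s, v<rn>.4s, v<rm>.4s" from plain register numbers. */
static uint32_t encode_fadd_4s(unsigned rd, unsigned rn, unsigned rm)
{
    /* Each register field is 5 bits wide, so only 0..31 is valid. */
    assert(rd < 32 && rn < 32 && rm < 32);

    /* Base opcode plus Rd (bits 0-4), Rn (bits 5-9), Rm (bits 16-20). */
    return 0x4e20d400u + rd + (rn << 5) + (rm << 16);
}
```

For example, encode_fadd_4s(12, 12, 7) yields 0x4e27d58c, which is the word for fadd v12.4s, v12.4s, v7.4s from the question.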