I have a problem with inline assembly on AArch64 Linux; the gcc version is 7.3.0.
uint8x16_t vcopyq_laneq_u8_inner(uint8x16_t a, const int b, uint8x16_t c, const int d)
{
    uint8x16_t res;
    __asm__ __volatile__(
        "ins %[dst].B[%[dlane]], %[src].B[%[slane]] \n\t"
        :[dst] "=w"(res)
        :"0"(a), [dlane]"i"(b), [src]"w"(c), [slane]"i"(d)
        :);
    return res;
}
This function used to be an inline function that compiled and linked into an executable without problems. Now we want to build it into a dynamic library, so we removed the inline keyword. It no longer compiles, and the error output is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happens because the "i" constraint in the inline assembly needs an "immediate integer operand", but the variables 'b' and 'd' are constants, aren't they?
I have one idea for making this function compile: use if-else to test the values of 'b' and 'd', and replace dlane/slane with immediate integer operands. But in our code uint8x16_t is a structure of 16 uint8_t values, so I would need to write 16x16 == 256 if-else branches, which is inefficient.
So my questions are:
- Why can this function be compiled and linked into an executable when it is inline, but not compiled into a dynamic library when it is not inline?
- Is there an efficient way to avoid writing 256 if-else statements?
CodePudding user response:
const means you can't modify the variable, not that it's a compile-time constant. It only becomes one if the caller passes a constant and you compile with optimization enabled, so constant-propagation can get that value to the asm statement. Even C++ constexpr doesn't require a constant expression in most contexts; it only allows one, and guarantees that compile-time constant-propagation is possible.
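To make the distinction concrete, here is a minimal sketch. The names LANE_COUNT, FIXED_LANE and lane_in_range are made up for illustration and are not from the question. An enum constant is a real compile-time constant and can be used where the language demands one; a const parameter is merely read-only.

```c
#include <assert.h>

/* Enum constants are integer constant expressions: known at compile time. */
enum { LANE_COUNT = 16, FIXED_LANE = 3 };

/* Accepted: both operands are compile-time constants. */
static_assert(FIXED_LANE < LANE_COUNT, "lane index out of range");

/* `lane` is const, so the body cannot assign to it, but its value is
   whatever the caller passes at run time. */
int lane_in_range(const int lane)
{
    /* static_assert(lane < LANE_COUNT, "...");  would be rejected here,
       because `lane` is not a constant expression. */
    return lane >= 0 && lane < LANE_COUNT;
}
```

An "i" asm constraint is in the same position as the commented-out static_assert: it needs the value itself, not just a promise that the variable won't be modified.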
A stand-alone version of this function can't exist, but you didn't make it static, so the compiler has to create a non-inline definition that can get called from other compilation units, even if it inlines into every call-site in this file. But this is impossible, because const int b doesn't have a known value.
For example,
int foo(const int x){
    return x*37;
}
int bar(){
    return foo(2);
}
On Godbolt, compiled for AArch64: notice that foo can't just return a constant; it needs to work with a run-time variable argument, whatever value it happens to be. Only in bar, with optimization enabled, can it inline and not need the value of x in a register, and just return a constant (which it uses as an immediate for mov).
foo(int):
        mov     w1, 37
        mul     w0, w0, w1
        ret
bar():
        mov     w0, 74
        ret
In a shared library, your function also has to be __attribute__((visibility("hidden"))) so it can actually inline; otherwise the possibility of symbol interposition means that the compiler can't assume that foo(123) is actually going to call the int foo(int) defined in the same .c file. (Or make it static inline.)
Is there an efficient way to avoid writing 256 if-else statements?
Not sure what you're doing with your vector exactly, but if you don't have a shuffle that can work with runtime-variable counts, storing to a 16-byte array can be the least bad option. But storing one byte and then reloading the whole vector will cause a store-forwarding stall, probably similar to the cost on x86 if not worse.
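A minimal sketch of that store/modify/reload approach, assuming lane numbers in the range 0..15. The names vec16, vec_load, vec_store and copy_lane_via_memory are hypothetical helpers, not part of arm_neon.h. The non-AArch64 branch is only a stand-in so the sketch also builds on other targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef uint8x16_t vec16;
static inline vec16 vec_load(const uint8_t in[16]) { return vld1q_u8(in); }
static inline void vec_store(uint8_t out[16], vec16 v) { vst1q_u8(out, v); }
#else
/* Stand-in type for targets without AArch64 SIMD (an assumption of this sketch). */
typedef struct { uint8_t b[16]; } vec16;
static inline vec16 vec_load(const uint8_t in[16]) { vec16 v; memcpy(v.b, in, 16); return v; }
static inline void vec_store(uint8_t out[16], vec16 v) { memcpy(out, v.b, 16); }
#endif

/* Copy byte lane d of c into byte lane b of a; b and d may be run-time values. */
vec16 copy_lane_via_memory(vec16 a, int b, vec16 c, int d)
{
    uint8_t dst[16], src[16];
    assert(b >= 0 && b < 16 && d >= 0 && d < 16);
    vec_store(dst, a);
    vec_store(src, c);
    dst[b] = src[d];       /* narrow one-byte store ... */
    return vec_load(dst);  /* ... then a 16-byte reload: the store-forwarding concern */
}
```

This can live in a shared library as an ordinary function because no operand needs to be an immediate; the cost is the round trip through memory described above.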
Doing your algorithm efficiently with AArch64 SIMD instructions is a separate question, and you haven't given enough info to figure out anything about that. Ask a different question if you want help implementing some algorithm to avoid this in the first place, or an efficient runtime-variable byte insert using other shuffles.
CodePudding user response:
Constraint "i" means a number. A specific number. It means you want the compiler to emit an instruction like this:
ins v0.B[2], v1.B[3]
(pardon me if my AArch64 assembly syntax isn't quite right) where v0 is the register containing res, v1 is the register containing c, 2 is the value of b (not the number of the register containing b) and 3 is the value of d (not the number of the register containing d).
That is, if you call
vcopyq_laneq_u8_inner(something, 2, something, 3)
the instruction in the function is
ins v0.B[2], v1.B[3]
but if you call
vcopyq_laneq_u8_inner(something, 1, something, 2)
the instruction in the function is
ins v0.B[1], v1.B[2]
The compiler has to know which numbers b and d are, so it knows which instruction you want. If the function is inlined, and the arguments b and d are constant numbers, it's smart enough to do that. However, if you write this function in a way where it's not inlined, the compiler has to make an actual function that works no matter what numbers b and d are, and how can it possibly do that if you want it to use a different instruction depending on what they are?
The only way it could do that is to write all 256 possible instructions and switch between them depending on the parameters. However, the compiler won't do that automatically; you'd need to do it yourself. For one thing, the compiler doesn't know that b and d can only go from 0 up to 15.
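If you do go the 256-case route, you don't have to type the cases by hand: the preprocessor can generate them. Below is a rough sketch under that assumption. All macro and function names (COPY_LANE_CONST, LANE_CASE, LANE_ROW, LANE_TABLE, copy_lane_switch, vec16 and its helpers) are invented for illustration. In the AArch64 branch every lane number reaches the asm statement as a literal token, which is what the "i" constraint wants; the other branch is a plain-C stand-in so the sketch builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef uint8x16_t vec16;
static inline vec16 vec_load(const uint8_t in[16]) { return vld1q_u8(in); }
static inline void vec_store(uint8_t out[16], vec16 v) { vst1q_u8(out, v); }
/* bl and dl are literal tokens here, so the "i" constraints can be satisfied. */
#define COPY_LANE_CONST(r, bl, c, dl) \
    __asm__("ins %0.b[%1], %2.b[%3]" : "+w"(r) : "i"(bl), "w"(c), "i"(dl))
#else
/* Stand-in for targets without AArch64 SIMD (an assumption of this sketch). */
typedef struct { uint8_t b[16]; } vec16;
static inline vec16 vec_load(const uint8_t in[16]) { vec16 v; memcpy(v.b, in, 16); return v; }
static inline void vec_store(uint8_t out[16], vec16 v) { memcpy(out, v.b, 16); }
#define COPY_LANE_CONST(r, bl, c, dl) ((r).b[bl] = (c).b[dl])
#endif

/* One case per (destination lane, source lane) pair: 16 x 16 = 256 in total. */
#define LANE_CASE(bl, dl) case (bl) * 16 + (dl): COPY_LANE_CONST(a, bl, c, dl); break;
#define LANE_ROW(bl) \
    LANE_CASE(bl, 0)  LANE_CASE(bl, 1)  LANE_CASE(bl, 2)  LANE_CASE(bl, 3)  \
    LANE_CASE(bl, 4)  LANE_CASE(bl, 5)  LANE_CASE(bl, 6)  LANE_CASE(bl, 7)  \
    LANE_CASE(bl, 8)  LANE_CASE(bl, 9)  LANE_CASE(bl, 10) LANE_CASE(bl, 11) \
    LANE_CASE(bl, 12) LANE_CASE(bl, 13) LANE_CASE(bl, 14) LANE_CASE(bl, 15)
#define LANE_TABLE \
    LANE_ROW(0)  LANE_ROW(1)  LANE_ROW(2)  LANE_ROW(3)  \
    LANE_ROW(4)  LANE_ROW(5)  LANE_ROW(6)  LANE_ROW(7)  \
    LANE_ROW(8)  LANE_ROW(9)  LANE_ROW(10) LANE_ROW(11) \
    LANE_ROW(12) LANE_ROW(13) LANE_ROW(14) LANE_ROW(15)

/* Run-time b and d select one of the 256 generated cases. */
vec16 copy_lane_switch(vec16 a, int b, vec16 c, int d)
{
    assert(b >= 0 && b < 16 && d >= 0 && d < 16);
    switch (b * 16 + d) {
        LANE_TABLE
    }
    return a;
}
```

This keeps the source short, but the generated function still contains 256 small code paths plus an indirect or table-driven branch, so it only addresses the typing effort, not the run-time cost.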
You should consider either not making this a library function (it's one instruction; doesn't a call into a library add overhead?) or else using different instructions where the lane number can come from a register. The instruction ins copies one vector element to another. I'm not familiar with ARM vector instructions, but there should be some instructions to rearrange or select items in a vector according to a number stored in a register.
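One candidate for such a register-driven approach (my suggestion, not something the question or the answers above confirm as the intended method) is a table lookup plus a bitwise select: compare a vector of lane numbers against b to build a mask, use TBL (the vqtbl1q_u8 intrinsic) to broadcast c[d] to every lane, then pick lane b from the broadcast and everything else from a. A sketch, with copy_lane_select, vec16, vec_load and vec_store as hypothetical names; the non-AArch64 branch emulates the same three steps in scalar C so the logic can be checked on other machines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef uint8x16_t vec16;
static inline vec16 vec_load(const uint8_t in[16]) { return vld1q_u8(in); }
static inline void vec_store(uint8_t out[16], vec16 v) { vst1q_u8(out, v); }

vec16 copy_lane_select(vec16 a, int b, vec16 c, int d)
{
    static const uint8_t lane_ids[16] = {0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15};
    assert(b >= 0 && b < 16 && d >= 0 && d < 16);
    uint8x16_t ids   = vld1q_u8(lane_ids);
    uint8x16_t mask  = vceqq_u8(ids, vdupq_n_u8((uint8_t)b)); /* 0xFF in lane b only */
    uint8x16_t bcast = vqtbl1q_u8(c, vdupq_n_u8((uint8_t)d)); /* c[d] in every lane */
    return vbslq_u8(mask, bcast, a);   /* lane b from bcast, other lanes from a */
}
#else
/* Scalar stand-in that mirrors the compare, lookup and select steps. */
typedef struct { uint8_t b[16]; } vec16;
static inline vec16 vec_load(const uint8_t in[16]) { vec16 v; memcpy(v.b, in, 16); return v; }
static inline void vec_store(uint8_t out[16], vec16 v) { memcpy(out, v.b, 16); }

vec16 copy_lane_select(vec16 a, int b, vec16 c, int d)
{
    vec16 r;
    assert(b >= 0 && b < 16 && d >= 0 && d < 16);
    for (int i = 0; i < 16; i++) {
        uint8_t mask  = (i == b) ? 0xFF : 0x00;                 /* compare */
        uint8_t bcast = c.b[d];                                 /* table lookup */
        r.b[i] = (uint8_t)((mask & bcast) | (~mask & a.b[i]));  /* bitwise select */
    }
    return r;
}
#endif
```

Nothing here needs an immediate operand, so the function can be compiled out of line and exported from a shared library. Whether it beats the store/reload or the switch on a given core is something to measure rather than assume.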