Using godbolt.org with x86-64 gcc 11.2, this code...
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
    v4i v;
} int4;
int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}
...produces this assembly (when compiled with -O3 -mavx)...
mul:
vpmulld xmm0, xmm0, xmm1
ret
However, this code...
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
    v4i v;
    struct {int x,y,z,w;}; // this line is the change
    int i[4]; // this one too
} int4;
int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}
...produces this assembly (when also compiled with -O3 -mavx)...
mul:
mov QWORD PTR [rsp-40], rdi
mov QWORD PTR [rsp-32], rsi
vmovdqa xmm1, XMMWORD PTR [rsp-40]
mov QWORD PTR [rsp-24], rdx
mov QWORD PTR [rsp-16], rcx
vpmulld xmm0, xmm1, XMMWORD PTR [rsp-24]
vmovdqa XMMWORD PTR [rsp-40], xmm0
mov rax, QWORD PTR [rsp-40]
mov rdx, QWORD PTR [rsp-32]
ret
x86-64 clang 13.0.1 has similar results.
So my question is, how can I convince gcc (and/or clang) that these 2 blocks of code should produce the same output?
I've tried __attribute__ ((aligned)), removing the int i[4]; or the struct, applying __attribute__ ((packed)) to the struct, and I even gave __attribute__ ((transparent_union)) a go. Whatever magic status __attribute__ ((vector_size (16))) bestows is broken by adding anything to the union.
CodePudding user response:
I should say that I have never worked with this attribute personally and only checked the GCC documentation just now, but I saw something there that I think will be useful for your problem.
From your code, I assume that you want to use the union to access each int of the vector separately. If that is the only reason, you do not need int[4] or struct {int x,y,z,w;}; as part of the union, because vectors can be subscripted like arrays themselves:
#include <stdio.h> // needed for printf
typedef int v4i __attribute__ ((vector_size (16)));
typedef union {
    v4i v;
} int4;
int4 mul(int4 l, int4 r)
{
    int4 ret = (int4){.v=l.v * r.v};
    // each element is read by subscripting the vector directly
    printf("%i %i %i %i", ret.v[0], ret.v[1], ret.v[2], ret.v[3]);
    return ret;
}
and the code will still be optimized the way you want. In addition, if you need byte-level access, a union with another vector type works as well:
typedef int v4i __attribute__ ((vector_size (16)));
typedef unsigned char v4b __attribute__ ((vector_size (16)));
struct i4s{int x,y,z,w;};
typedef union {
    v4i v;
    v4b v2;
} int4;
int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}
It seems that the union keeps its behavior as long as its other members are vector-like types. For example, even __m128i works.
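If named access to x, y, z and w is the goal, one option is to keep the union vector-only and put the names in small helper functions instead. The following is only a sketch under that assumption: int4_make, int4_x, int4_y, int4_z and int4_w are hypothetical names that do not appear in the question, and it relies on the GCC/Clang vector extension allowing subscripts on vector types.

```c
/* Sketch: named lane access without adding members to the union.
   Assumes GCC or Clang (vector_size attribute, vector subscripting). */
typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v; /* the only member, as in the first listing of the question */
} int4;

/* Hypothetical constructor: builds an int4 from four scalars. */
static inline int4 int4_make(int x, int y, int z, int w)
{
    return (int4){ .v = { x, y, z, w } };
}

/* Hypothetical accessors: each reads one lane by subscripting the vector. */
static inline int int4_x(int4 a) { return a.v[0]; }
static inline int int4_y(int4 a) { return a.v[1]; }
static inline int int4_z(int4 a) { return a.v[2]; }
static inline int int4_w(int4 a) { return a.v[3]; }

/* Element-wise multiply, unchanged from the question. */
int4 mul(int4 l, int4 r)
{
    return (int4){ .v = l.v * r.v };
}
```

With these helpers, int4_y(mul(a, b)) reads the second lane of the product, and the type passed to mul is still the vector-only union from the first listing.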
CodePudding user response:
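Turns out they are the same. The second version only adds instructions around the vpmulld: its arguments arrive in general-purpose registers (rdi, rsi, rdx, rcx), so they are stored to the stack and loaded into the xmm registers from there, and the result goes back out through rax and rdx. But if, for example, one adds a main function...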
int main(int argc, char *argv[])
{
    // volatile keyword added so they don't get optimised out
    volatile int4 x = {.v={1,2,3,4}};
    volatile int4 y = {.v={1,2,3,4}};
    int4 z = mul(x, y);
    return z.v[0];
}
...then the function (or the single vpmulld instruction, in this case) gets inlined, and different, appropriate stack manipulation is inserted.
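To make the point about inlining concrete, here is a minimal sketch that keeps the full three-member union from the question. It assumes the extra moves in the second listing exist only at the call boundary, so a caller that can see the definition of mul does not need them. The static inline qualifier and the dot function are illustrative additions, not part of the original code, and whether the compiler actually inlines depends on the optimisation level.

```c
/* Sketch: the three-member union from the question, with mul visible
   to its caller so the compiler is free to inline it. */
typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
    struct { int x, y, z, w; }; /* anonymous struct: C11 or GNU extension */
    int i[4];
} int4;

/* static inline is an assumption of this sketch, not the original code. */
static inline int4 mul(int4 l, int4 r)
{
    return (int4){ .v = l.v * r.v };
}

/* Hypothetical caller: sums the lanes of the element-wise product,
   reading them through the named members of the union. */
int dot(int4 a, int4 b)
{
    int4 p = mul(a, b);
    return p.x + p.y + p.z + p.w;
}
```

All three views (v, the x/y/z/w struct, and i) alias the same 16 bytes, so the values are the same whichever member is read. Only the way the union crosses a non-inlined call differs from the vector-only version.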