__attribute__ ((vector_size)) magic being destroyed by union

Time:03-17

Using x86-64 gcc 11.2 on godbolt.org, this code...

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

...produces this assembly (when compiled with -O3 -mavx)...

mul:
        vpmulld xmm0, xmm0, xmm1
        ret

However this code...

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
    struct {int x,y,z,w;}; // this line is the change
    int i[4]; // this one too
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

...produces this assembly (when also compiled with -O3 -mavx)...

mul:
        mov     QWORD PTR [rsp-40], rdi
        mov     QWORD PTR [rsp-32], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-40]
        mov     QWORD PTR [rsp-24], rdx
        mov     QWORD PTR [rsp-16], rcx
        vpmulld xmm0, xmm1, XMMWORD PTR [rsp-24]
        vmovdqa XMMWORD PTR [rsp-40], xmm0
        mov     rax, QWORD PTR [rsp-40]
        mov     rdx, QWORD PTR [rsp-32]
        ret

x86-64 clang 13.0.1 gives similar results.

So my question is: how can I convince gcc (and/or clang) that these two blocks of code should produce the same output?

I've tried __attribute__ ((aligned)), removing either the int i[4]; or the struct, and applying __attribute__ ((packed)) to the struct. I even gave __attribute__ ((transparent_union)) a go. Whatever magic status __attribute__ ((vector_size (16))) bestows seems to be broken by adding anything to the union.

CodePudding user response:

I should say that I have never worked with this attribute personally and only checked the gcc documentation just now, but I found something there that I think will be useful for your problem.

From your code, I assume that you want the union so that you can access each int of the vector separately. If that is the only reason, there is no need to put int i[4] or struct {int x,y,z,w;}; in the union, because vectors can be subscripted like arrays themselves:

#include <stdio.h>

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
} int4;

int4 mul(int4 l, int4 r)
{
    int4 ret = (int4){.v=l.v * r.v};
    printf("%i %i %i %i", ret.v[0], ret.v[1], ret.v[2], ret.v[3]);
    return ret;
}

With this, the code is still optimized the way you want. In addition, if you need byte-level access, a union with another vector type works as well:

typedef int v4i __attribute__ ((vector_size (16)));
typedef unsigned char v4b __attribute__ ((vector_size (16)));

struct i4s{int x,y,z,w;};

typedef union {
    v4i v;
    v4b v2;
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

It seems that the union keeps working in this case as long as its members are primitive-like (vector) types. For example, even __m128i works.
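
If the named fields and the array were mainly a convenience, one possible sketch keeps the union vector-only and recovers both kinds of access through small helpers. The helper names (int4_x, int4_from_array, int4_to_array and so on) are made up for illustration and do not come from the question. The code relies on the GCC/Clang vector extensions, including subscripting of vector types, and uses memcpy to move data to and from a plain int array.

```c
#include <string.h>

typedef int v4i __attribute__ ((vector_size (16)));

/* Vector-only union, as in the snippets above. */
typedef union {
    v4i v;
} int4;

/* Hypothetical accessors standing in for the anonymous struct {x,y,z,w}. */
static inline int int4_x(int4 a) { return a.v[0]; }
static inline int int4_y(int4 a) { return a.v[1]; }
static inline int int4_z(int4 a) { return a.v[2]; }
static inline int int4_w(int4 a) { return a.v[3]; }

/* Hypothetical helpers standing in for the int i[4] member. */
static inline int4 int4_from_array(const int src[4])
{
    int4 r;
    memcpy(&r.v, src, sizeof r.v);  /* copy 4 ints into the vector */
    return r;
}

static inline void int4_to_array(int dst[4], int4 a)
{
    memcpy(dst, &a.v, sizeof a.v);  /* copy the vector out to 4 ints */
}

int4 mul(int4 l, int4 r)
{
    return (int4){.v = l.v * r.v};
}
```

Since the union still contains only the vector, mul has the same shape as the first version in the question. Whether the helpers themselves compile down to what you want is worth checking on godbolt.org for your compiler and flags.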

CodePudding user response:

Turns out, they are the same. For some reason the standalone version of the second one also populates the xmm registers from the stack, but if, for example, one adds a main function...

int main(int argc, char *argv[])
{
    // volatile keyword added so they don't get optimised out   
    volatile int4 x = {.v={1,2,3,4}};
    volatile int4 y = {.v={1,2,3,4}};
    int4 z = mul(x, y);
    
    return z.v[0];
}

...then the function (or single vpmulld instruction in this case) gets inlined, and different, appropriate stack manipulation gets inserted.
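
To illustrate the inlining point, here is a hedged sketch that keeps the full union from the question (vector, anonymous struct, and array) and calls mul from a loop. The name mul_all and the static inline qualifier are assumptions added here, not part of the original code. A likely explanation for the extra moves in the standalone function is the System V x86-64 calling convention: a 16-byte union that mixes int members with the vector appears to be classified for passing in general-purpose registers, which matches the rdi/rsi/rdx/rcx traffic in the listing above. Once the call is inlined there is no call boundary, so that shuffling need not appear.

```c
#include <stddef.h>

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
    struct { int x, y, z, w; };  /* anonymous struct (C11 / GNU extension) */
    int i[4];
} int4;

/* Same body as in the question; static inline is an addition here. */
static inline int4 mul(int4 l, int4 r)
{
    return (int4){.v = l.v * r.v};
}

/* Hypothetical caller: multiplies n pairs of int4 values lane by lane. */
void mul_all(int4 *dst, const int4 *a, const int4 *b, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        dst[k] = mul(a[k], b[k]);
}
```

With this layout the x/y/z/w and i[4] views remain available, and the by-value cost of mul only matters where the compiler cannot inline it. The generated code for mul_all is best confirmed on godbolt.org rather than assumed.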
