Is it a violation of strict aliasing to cast to a "super-class" and back in C?

I'm trying to figure out whether mock sub-classing in C is well-defined when the super-class struct is included wholesale as the first member of the sub-class struct, rather than the sub-class and super-class merely sharing the same prefix of members.

In the example code below, I've tried to lay out what my thinking is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

enum type {
    IS_A,
    IS_B,
};

struct header {
    enum type type;
};

struct a {
    struct header hdr;
    float x;
};

struct b {
    struct header hdr;
    int y;
};

void do_with_a(struct header *obj) {
    if (obj->type == IS_A) {
        struct a *a = (struct a *)obj;
        printf("%f\n", a->x);
    }
}

void do_with_b(struct header *obj) {
    // Oops forgot to check the type tag
    struct b *b = (struct b *)obj;
    printf("%d\n", b->y);
}

int main(void) {
    struct a *a = malloc(sizeof(*a));
    if (a == NULL)
        return 1;

    a->hdr.type = IS_A;
    a->x = 3.0;

    do_with_a(&a->hdr);
    do_with_b(&a->hdr);  // undefined behavior: *a is not a struct b

    free(a);
    return 0;
}

I'm reasonably certain that do_with_b() will always be undefined behavior if called with an "a". My primary question is whether do_with_a() is always defined behavior (assuming I've set the type tag correctly), or whether I'm setting myself up for disaster when compiler authors change their minds or improve their analysis.

As a sub-question: I believe that converting a struct a * to a struct header * via either &ap->hdr or (struct header *)ap is well-defined. Is this the case?

In the C11 standard, there seem to be two relevant passages, one in section 6.7.2.1 paragraph 15:

Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared. A pointer to a structure object, suitably converted, points to its initial member...

And one in 6.5 paragraph 7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

a type compatible with the effective type of the object,

...

an aggregate or union type that includes one of the aforementioned types among its members...

Between these two passages, it's not clear to me whether this is the intended interpretation of the standard or whether I'm being too hopeful.

I've tried the above code with both GCC and clang, and neither seems to behave differently with optimizations on vs. off. GCC, however, does emit warnings about both down-casts when set to -Wstrict-aliasing=1. The wording is somewhat vague, saying only that the cast "might" break strict aliasing, and the description of that flag indicates that false positives are quite common, so this is inconclusive:

undefined_test.c: In function ‘do_with_a’:
undefined_test.c:26:39: warning: dereferencing type-punned pointer might break strict-aliasing rules [-Wstrict-aliasing]
   26 |                 struct a *a = (struct a *)obj;
      |                                       ^
undefined_test.c: In function ‘do_with_b’:
undefined_test.c:33:31: warning: dereferencing type-punned pointer might break strict-aliasing rules [-Wstrict-aliasing]
   33 |         struct b *b = (struct b *)obj;
      |

The very end of the accepted answer just about answers the question, but does not seem satisfactory to me. Most of the comments on it seem to mainly refer to the case of the "common-prefix" rather than "nested-struct" case. It's not clear to me that the "nested-struct" case has been sufficiently defended.

Answer:

My primary question is whether do_with_a() is always defined behavior

As a sub-question: I believe that converting a struct a * to a struct header * via either &ap->hdr or (struct header *)ap is well-defined. Is this the case?

It's well-defined as per the initial member rule, C17 6.7.2.1/15:

Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared. A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.

This is also consistent with the effective type/strict aliasing rule you quote, 6.5 §6 and §7.

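A minimal sketch of the initial member rule in action, assuming the question's struct header and struct a definitions; the helper names upcast_member, upcast_cast and downcast are hypothetical. Both upcasts should yield the same pointer, and converting back (the "and vice versa" part of 6.7.2.1/15) recovers the original struct a *:

```c
#include <stddef.h>

enum type { IS_A, IS_B };

struct header { enum type type; };

struct a {
    struct header hdr;  /* must stay the first member */
    float x;
};

/* Restates the layout guarantee the down-cast relies on. */
_Static_assert(offsetof(struct a, hdr) == 0, "hdr must be the initial member");

/* Two equivalent upcasts (hypothetical helpers). */
static struct header *upcast_member(struct a *ap) { return &ap->hdr; }
static struct header *upcast_cast(struct a *ap) { return (struct header *)ap; }

/* Down-cast back to the containing struct. Only valid when hp really
   points at the hdr member of a struct a object. */
static struct a *downcast(struct header *hp) { return (struct a *)hp; }
```

The _Static_assert adds nothing beyond what the standard already promises; it just documents, at compile time, the assumption that the down-cast depends on.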
I'm reasonably certain that do_with_b() will always be undefined behavior

Yes: struct b is not compatible with struct a, so it's a strict aliasing violation and possibly also an alignment problem. Note, however, that the strict aliasing rule coexists with an oddball rule called "common initial sequence", which in this case would allow you to inspect the header part of b. C17 6.5.2.3/6:

One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the completed type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.

That is, if you add something like typedef union { struct a a_; struct b b_; } dummy; to the translation unit, you would be allowed to inspect the header part of each struct in a well-defined manner. But note that compilers may have shaky standard compliance when it comes to implementing this feature, and there were some defect reports about it to the committee (I'm unsure of its status as of C23).

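A sketch of the common initial sequence rule, assuming the question's struct definitions; union any, make_a and tag_via_b are hypothetical names modelled on the dummy union above. To stay on the safest ground given the compiler caveats, every access goes through the union object itself rather than through free-standing struct pointers:

```c
enum type { IS_A, IS_B };
struct header { enum type type; };
struct a { struct header hdr; float x; };
struct b { struct header hdr; int y; };

/* Hypothetical union; its completed type is visible below, so struct a
   and struct b may be inspected through their common initial sequence
   (the hdr member). */
union any { struct a a_; struct b b_; };

/* Builds a union that currently contains a struct a. */
static union any make_a(float x) {
    union any u;
    u.a_.hdr.type = IS_A;
    u.a_.x = x;
    return u;
}

/* Reads the tag through the other member, b_. This is the access that
   C17 6.5.2.3/6 permits; reading u->b_.y would not be covered, because
   y lies outside the common initial sequence. */
static enum type tag_via_b(const union any *u) {
    return u->b_.hdr.type;
}
```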
GCC, however, does signal warnings about both down-casts when set to -Wstrict-aliasing=1

These warning options in GCC have a status somewhere between broken and very broken. Using -fno-strict-aliasing to disable strict aliasing entirely is reliable, however.

The strict aliasing rule itself has many flaws. For example, the effective type of the memory you allocated is actually a struct header and a float, not a struct a, because you never wrote an lvalue of type struct a to the memory returned by malloc. Similarly, if we allocate a chunk of memory with malloc and then treat it as an array of type by initializing it element by element in a for loop, we don't actually get the effective type type[] but a series of individual objects. If compilers implemented the rule that literally, the whole C language would unravel.
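To make the effective-type point concrete, here is a sketch contrasting the two ways of initializing malloc'd memory, assuming the question's struct definitions; new_a_by_members and new_a_whole are hypothetical helpers. Under a literal reading of 6.5 §6, only the second one ever stores through an lvalue of type struct a:

```c
#include <stdlib.h>

enum type { IS_A, IS_B };
struct header { enum type type; };
struct a { struct header hdr; float x; };

/* Member-by-member initialization, as in the question's main(). Each
   store gives an effective type only to the member it writes; no store
   is made through an lvalue of type struct a. */
static struct a *new_a_by_members(float x) {
    struct a *ap = malloc(sizeof *ap);
    if (ap != NULL) {
        ap->hdr.type = IS_A;
        ap->x = x;
    }
    return ap;
}

/* Whole-object store: the lvalue *ap has type struct a, so the allocated
   object takes struct a as its effective type. */
static struct a *new_a_whole(float x) {
    struct a *ap = malloc(sizeof *ap);
    if (ap != NULL)
        *ap = (struct a){ .hdr = { .type = IS_A }, .x = x };
    return ap;
}
```

In practice both versions behave identically with GCC and clang; the difference only matters for how literally one reads the standard's wording.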