Is static initialization atomic across all objects?

C++11 guarantees that the initialization of a static local variable is atomic the first time the function is called. Although the standard doesn't mandate any particular implementation, the only way to handle this efficiently is double-checked locking.
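
As a rough illustration of what double-checked locking means here, this hand-written sketch does by hand what a compiler-generated guard might do for a static local object. Widget and getWidget are made-up names for this example, and real implementations (for instance the __cxa_guard_acquire family of the Itanium C++ ABI) differ in detail.

```cpp
#include <atomic>
#include <mutex>
#include <new>

// Hypothetical type standing in for any object with a non-trivial constructor.
struct Widget {
    int value;
    Widget() : value(42) {}
};

// Hand-written approximation of: static Widget w; return w;
Widget& getWidget() {
    static std::atomic<bool> initialized{false};   // flag checked on every call
    static std::mutex guard;                       // taken only on the slow path
    alignas(Widget) static unsigned char storage[sizeof(Widget)];

    // First check: once the object exists, this acquire load is the only cost.
    if (!initialized.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(guard);
        // Second check: another thread may have finished while we waited for the lock.
        if (!initialized.load(std::memory_order_relaxed)) {
            ::new (static_cast<void*>(storage)) Widget();   // if this throws, the flag stays false
            initialized.store(true, std::memory_order_release);
        }
    }
    // Unlike a real static, this sketch never runs the destructor.
    return *std::launder(reinterpret_cast<Widget*>(storage));
}
```

Every call to getWidget() returns the same object. Whether the mutex is one per object, as in this sketch, or one shared by all static objects is exactly the implementation detail in question.
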
I asked myself whether all objects are initialized under the same mutex (likely) or whether each static object's initialization uses its own mutex (unlikely). So I wrote this little C++20 program that uses some variadic and fold-expression tricks to create a number of different functions that each initialize their own static object:

#include <iostream>
#include <utility>
#include <latch>
#include <atomic>
#include <chrono>
#include <thread>

using namespace std;
using namespace chrono;

atomic_uint globalAtomic;

struct non_trivial_t
{
    non_trivial_t() { ::globalAtomic = ~::globalAtomic; }
    non_trivial_t( non_trivial_t const & ) {}
    ~non_trivial_t() { ::globalAtomic = ~::globalAtomic; }
};

int main()
{
    auto createNThreads = []<size_t ... Indices>( index_sequence<Indices ...> ) -> double
    {
        constexpr size_t N = sizeof ...(Indices);
        latch latRun( N );
        atomic_uint synch( N );
        atomic_int64_t nsSum( 0 );
        auto theThread = [&]<size_t I>( integral_constant<size_t, I> )
        {
            latRun.arrive_and_wait();
            if( synch.fetch_sub( 1, memory_order_relaxed ) > 1 )
                while( synch.load( memory_order_relaxed ) );
            auto start = high_resolution_clock::now();
            static non_trivial_t nonTrivial;
            nsSum.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        };
        (jthread( theThread, integral_constant<size_t, Indices>() ), ...);
        return (double)nsSum / N;
    };
    constexpr unsigned N_THREADS = 64;
    cout << createNThreads( make_index_sequence<N_THREADS>() ) << endl;
}

I create 64 threads with the above code since my system has up to 64 CPUs in a processor group (Ryzen Threadripper 3990X, Windows 11). The results met my expectations: each initialization is reported to take about 7,000 ns. If each initialization used its own mutex, the locks would take the uncontended fast path, there would be no kernel contention, and the times would be orders of magnitude lower. So are there any further questions?

The question I asked myself afterwards is: what happens if the constructor of the static object has its own static object? Does the standard explicitly mandate that this should work, forcing the implementation to use a recursive mutex?

CodePudding user response：

No, static initialization is not atomic across all objects. Different static objects may get initialized by different threads simultaneously.

It just so happens that GCC and Clang do in fact use a single global recursive mutex (to handle the recursive case you described, which is required to work), but other compilers use a mutex for every function-local static object (e.g. Apple's compiler). Therefore you can't rely on static initialization happening one object at a time, because whether it does depends on your compiler (and on the version of that compiler).
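
One way to check what a given toolchain does is a small probe like the following sketch. It lets two threads initialize two different static objects with deliberately slow constructors and records how many constructors ran at the same time. All names here (SlowInit, initA, initB, probeOverlap) are invented for the example, and the 100 ms sleep is an arbitrary choice.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> inConstructor{0};   // constructors currently running
std::atomic<int> maxOverlap{0};      // highest value of inConstructor observed so far

struct SlowInit {
    SlowInit() {
        int now = inConstructor.fetch_add(1) + 1;
        // Raise maxOverlap to `now` if it is lower (CAS loop).
        int seen = maxOverlap.load();
        while (seen < now && !maxOverlap.compare_exchange_weak(seen, now)) {}
        // Stay inside the constructor long enough for the other thread to arrive.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        inConstructor.fetch_sub(1);
    }
};

void initA() { static SlowInit a; }   // two distinct static objects
void initB() { static SlowInit b; }

// Returns 2 if both constructors were observed running at once, otherwise 1.
int probeOverlap() {
    std::thread t1(initA), t2(initB);
    t1.join();
    t2.join();
    return maxOverlap.load();
}
```

A result of 2 shows that the two initializations overlapped, so they cannot share one lock. A result of 1 is consistent with a single global lock, but it is not proof, since the threads may simply not have met in that run.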

Section 6.7, paragraph 4 of the standard ([stmt.dcl]):

A local object of POD type (basic.types) with static storage duration initialized with constant-expressions is initialized before its block is first entered. An implementation is permitted to perform early initialization of other local objects with static storage duration under the same conditions that an implementation is permitted to statically initialize an object with static storage duration in namespace scope (basic.start.init).
Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.

The standard only forbids recursive initialization of the same static object; it doesn't forbid the initialization of one static object from requiring another static object to be initialized. Since the standard states that every static object outside this forbidden category is initialized the first time control passes through its declaration, the case you asked about is allowed.

int getInt1();
int getInt2() { //This could be a constructor, too, and nothing would change
  static int result = getInt1();
  return result;
}
int getInt3() {
  static int result = getInt2(); //Allowed!
  return result;
}

This also applies to the case when the constructor of a function-local static object itself contains such a static object. A constructor is really just a function too, which means this case is identical to the example above.
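
A small sketch of the constructor case, using made-up names (Inner, Outer, useOuter, runNested) and a global string that records the order of events:

```cpp
#include <string>

std::string trace;   // records the order in which things happen

struct Inner {
    Inner() { trace += "Inner;"; }
};

struct Outer {
    Outer() {
        trace += "Outer-begin;";
        static Inner inner;   // a different static object, so this is allowed
        trace += "Outer-end;";
    }
};

void useOuter() {
    static Outer outer;       // initializing outer triggers the initialization of inner
}

std::string runNested() {
    useOuter();
    useOuter();   // both statics are already initialized, so nothing is appended
    return trace;
}
```

Here runNested() returns "Outer-begin;Inner;Outer-end;": the inner static is initialized while the outer one is still under construction, and neither is initialized a second time.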

CodePudding user response：

The initialization of every static local variable has to be atomic. If every single one of them has its own mutex or its own double-checked locking, then that will be true.

There could also be a single global recursive mutex that allows one thread and one thread only to be initializing static local variables at a time. That works too. But if you have many static local variables and multiple threads accessing them for the first time then that could be horribly slow.

But let's consider your case of a static local variable having a static local variable:

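int foo();   // defined elsewhere

struct A {
    A() {
        static int x = foo();   // static local inside the constructor
    }
};

void bla() {
    static A a;   // constructing a runs A::A(), which initializes x
}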

Initializing a requires initializing x. But nothing says there can't be some other thread that also has an A c; and will be initializing x at the same time. So x still needs to be protected even though, in the case of bla(), its initialization already happens inside another static initialization.

Another example (hope that compiles, haven't checked):

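int bla();   // assumed here to return int

void foo() {
    // The lambda is invoked immediately, so its body runs only while fn is being initialized.
    static auto fn = []() {
        static int x = bla();
        return x;
    }();
}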

Here x can only ever be initialized when fn is initialized. So the compiler could possibly skip protecting x. That would be an optimization permitted by the as-if rule: apart from timing, there is no observable difference whether x is protected or not. On the other hand, locking for x would always succeed immediately, and the cost of that is very small. Compilers might not optimize it simply because nobody has invested the time to detect and optimize such cases.