Memory ordering for a spin-lock "call once" implementation-CodePudding

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if

a) it could be faster in the common case (i.e. already initialized), and,

b) is the selected memory ordering strong enough / too strong?

Architectures of interest are x86_64 (primarily) and aarch64.

The intended use API is something like this

void gets_called_many_times_from_many_threads(void)
{  
    static int my_once_flag = 0;
     
    if (once_enter(&my_once_flag)) {
        // do one-time initialization here
        once_commit(&my_once_flag);
    }

    // do other things that assume the initialization has taken place
}

And here is the implementation:

int once_enter(int *b)
{
    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
        // on x86, insert a pause instruction here
    };
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}

I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.

Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier it is occurring on every function call. Is there an easy way to avoid this?

UPDATE

Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:

I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).

#ifdef __x86_64__
#define PAUSE()  __asm __volatile("pause")
#else
#define PAUSE()
#endif

int once_enter(int *b)
{
    if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;

    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
        PAUSE();
    };
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}

CodePudding user response：

a, What you need is a double-checked lock.

Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.

void gets_called_many_times_from_many_threads(void)
{  
    static int my_once_flag = 0;

    if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
        if (once_enter(&my_once_flag)) {
            // do one-time initialization here
            once_commit(&my_once_flag);
        }
    }

    // do other things that assume the initialization has taken place
}

b, I believe this is enough, your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value of 2 with an acquiring load from the same variable.