Semaphore latency faster than expected - why?


Acquiring a semaphore that isn't available blocks the thread. According to the internet and clockres, the interrupt frequency / timer interval on Windows shouldn't be below 0.5 ms. The program below measures the time between the release of a semaphore in one thread and its acquisition in another. I would not expect this to be faster than 0.5 ms, yet I reliably get results of ~0.017 ms (curiously with a high relative standard deviation of ±100%).
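
For reference, that granularity is easy to observe by timing a short sleep. A minimal sketch (separate from the measurement program below; the exact numbers depend on the machine and the current timer resolution):

#include <chrono>
#include <iostream>
#include <thread>

auto main() -> int {
   using namespace std::chrono_literals;
   // Time a 1 ms sleep repeatedly. The wake-up happens on a timer interrupt,
   // so the measured durations are quantized to the current timer interval
   // (>= ~0.5 ms; typically ~1 ms or ~15.6 ms depending on the system state).
   for (int i = 0; i < 10; ++i) {
      const auto t0 = std::chrono::high_resolution_clock::now();
      std::this_thread::sleep_for(1ms);
      const auto t1 = std::chrono::high_resolution_clock::now();
      std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";
   }
}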

Either my measurement code is wrong, or my understanding of how semaphores work is. Which is it? Here is the code, minus the boring part that calculates the mean and standard deviation:

#include <atomic>
#include <chrono>
#include <format>
#include <iostream>
#include <semaphore>
#include <thread>
#include <utility>
#include <vector>

using namespace std::chrono_literals;

// Omitted "boring" helper; returns {mean, relative standard deviation}.
auto get_mean_and_std(const std::vector<double>& values) -> std::pair<double, double>;

namespace {
   std::binary_semaphore semaphore{ 0 };
   std::atomic<std::chrono::high_resolution_clock::time_point> t1;
}

auto acquire_and_set_t1() {
   semaphore.acquire(); // this is being measured
   t1.store(std::chrono::high_resolution_clock::now());
}

auto measure_semaphore_latency() -> double {
   std::jthread j(acquire_and_set_t1);
   std::this_thread::sleep_for(5ms); // To make sure thread is running

   // Signal thread and start timing
   const auto t0 = std::chrono::high_resolution_clock::now();
   semaphore.release();

   std::this_thread::sleep_for(5ms); // To make sure thread is done writing t1

   const double ms = std::chrono::duration_cast<std::chrono::nanoseconds>(t1.load() - t0).count() / 1'000'000.0;
   return ms;
}

auto main() -> int {
   std::vector<double> runtimes;
   for (int i = 0; i < 100; ++i)
      runtimes.emplace_back(measure_semaphore_latency());

   const auto& [mean, relative_std] = get_mean_and_std(runtimes);
   std::cout << std::format("mean: {:.3f} ms, ± {:.2f}%\n", mean, 100.0 * relative_std);
}

edit: my sources for the Windows timer resolution are https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/ and ClockRes.

CodePudding user response:

Your confusion is coming from the faulty assumption that this comes into play:

According to the internet and clockres, the interrupt frequency / timer interval on Windows shouldn't be under 0.5ms.

Preemptive, timer-based scheduling does not have to be the only opportunity for an OS to assign threads to CPU cores. Explicit/manual signalling can bypass it.

You can think of std::binary_semaphore::release() as triggering an immediate, partial run of the scheduler that targets only threads currently blocked in std::binary_semaphore::acquire() on the same semaphore.

This is what's happening here. The acquire_and_set_t1() thread is woken up and potentially assigned to a CPU core immediately when the release() call is made, without waiting for the next scheduling "cycle".

It's still not guaranteed that the OS will choose to preempt anything for that just-woken-up thread. This is where the high standard deviation you are seeing comes from: the thread either gets CPU time immediately or at a later scheduling cycle; there's no in-between.
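
If you want to see that bimodality directly, report the minimum, median and maximum instead of just mean and standard deviation. A small sketch that works on the runtimes vector from your main(), with get_mean_and_std() left untouched:

#include <algorithm>
#include <format>
#include <iostream>
#include <vector>

// Reports min / median / max of the measured latencies (call it on the
// runtimes vector right after get_mean_and_std()). With the bimodal
// behaviour described above, the median stays near the tiny "immediate
// wake-up" value while the max jumps to a full scheduling interval.
auto print_min_median_max(std::vector<double> runtimes) -> void {
   std::sort(runtimes.begin(), runtimes.end());
   std::cout << std::format("min: {:.3f} ms, median: {:.3f} ms, max: {:.3f} ms\n",
      runtimes.front(), runtimes[runtimes.size() / 2], runtimes.back());
}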

As to why I can be so confident that this is what's happening in your test: with a bit of debugging and symbol loading, we get the following call stacks:

On the acquire side:

    ntdll.dll!00007fffa4510764()    Unknown
    ntdll.dll!00007fffa44d379d()    Unknown
    ntdll.dll!00007fffa44d3652()    Unknown
    ntdll.dll!00007fffa44d3363()    Unknown
    KernelBase.dll!00007fffa225ce9f()   Unknown
>   msvcp140d_atomic_wait.dll!`anonymous namespace'::__crtWaitOnAddress(volatile void * Address, void * CompareAddress, unsigned __int64 AddressSize, unsigned long dwMilliseconds) Line 174    C++
    msvcp140d_atomic_wait.dll!__std_atomic_wait_direct(const void * _Storage, void * _Comparand, unsigned __int64 _Size, unsigned long _Remaining_timeout) Line 234 C++
    ConsoleApplication2.exe!std::_Atomic_wait_direct<unsigned char,char>(const std::_Atomic_storage<unsigned char,1> * const _This, char _Expected_bytes, const std::memory_order _Order) Line 491  C++
    ConsoleApplication2.exe!std::_Atomic_storage<unsigned char,1>::wait(const unsigned char _Expected, const std::memory_order _Order) Line 829 C++
    ConsoleApplication2.exe!std::counting_semaphore<1>::acquire() Line 245  C++
    ConsoleApplication2.exe!acquire_and_set_t1() Line 17    C++
    ConsoleApplication2.exe!std::invoke<void (__cdecl*)(void)>(void(*)() && _Obj) Line 1586 C++
    ConsoleApplication2.exe!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals) Line 56  C++
    ucrtbased.dll!00007fff4c7b542c()    Unknown
    kernel32.dll!00007fffa2857034() Unknown
    ntdll.dll!00007fffa44c2651()    Unknown

On the release side:

    ntdll.dll!00007fffa44d2550()    Unknown 
>   msvcp140d_atomic_wait.dll!`anonymous namespace'::__crtWakeByAddressSingle(void * Address) Line 179  C++
    msvcp140d_atomic_wait.dll!__std_atomic_notify_one_direct(const void * _Storage) Line 251    C++
    ConsoleApplication2.exe!std::_Atomic_storage<unsigned char,1>::notify_one() Line 833    C++
    ConsoleApplication2.exe!std::counting_semaphore<1>::release(const __int64 _Update) Line 232 C++
    ConsoleApplication2.exe!measure_semaphore_latency() Line 29 C++
    ConsoleApplication2.exe!main() Line 36  C++
    ConsoleApplication2.exe!invoke_main() Line 79   C++
    ConsoleApplication2.exe!__scrt_common_main_seh() Line 288   C++
    ConsoleApplication2.exe!__scrt_common_main() Line 331   C++
    ConsoleApplication2.exe!mainCRTStartup(void * __formal) Line 17 C++
    kernel32.dll!00007fffa2857034() Unknown
    ntdll.dll!00007fffa44c2651()    Unknown

Poking around the code of __crtWakeByAddressSingle() and __crtWaitOnAddress() (available on GitHub), we find that the underlying Windows API functions being invoked are WaitOnAddress() and WakeByAddressSingle().
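
So the semaphore is a thin wrapper around C++20 atomic wait/notify. A stripped-down sketch of the same user-level pattern (not the STL's code, just the shape the call stacks above reveal):

#include <atomic>
#include <thread>

std::atomic<int> flag{ 0 };

auto waiter() -> void {
   flag.wait(0);        // blocks; on MSVC this is the __std_atomic_wait_direct -> WaitOnAddress() path
   // ... resumes as soon as the woken thread gets a core, no timer tick required
}

auto main() -> int {
   std::jthread t(waiter);
   flag.store(1);
   flag.notify_one();   // on MSVC this is the __std_atomic_notify_one_direct -> WakeByAddressSingle() path
}                       // jthread joins on destruction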

In the WaitOnAddress() documentation, we find our confirmation in the Remarks section:

WaitOnAddress does not interfere with the thread scheduler.
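
For completeness, the same hand-off can be reproduced directly against those two functions, with no standard-library machinery in between. A minimal, Windows-only sketch (link with Synchronization.lib; the flag handling follows the pattern from the WaitOnAddress() documentation example):

#include <windows.h>
#pragma comment(lib, "Synchronization.lib")

#include <chrono>
#include <iostream>
#include <thread>

auto main() -> int {
   using namespace std::chrono_literals;

   volatile LONG flag = 0;   // plain flag, as in the WaitOnAddress() documentation example
   std::chrono::high_resolution_clock::time_point t1;

   std::thread waiter([&] {
      LONG undesired = 0;
      while (flag == 0)      // loop guards against spurious wake-ups
         WaitOnAddress(&flag, &undesired, sizeof(flag), INFINITE);
      t1 = std::chrono::high_resolution_clock::now();
   });

   std::this_thread::sleep_for(5ms);  // give the waiter time to block

   const auto t0 = std::chrono::high_resolution_clock::now();
   flag = 1;
   WakeByAddressSingle((PVOID)&flag); // wakes the waiter directly, no timer tick involved

   waiter.join();                     // join() also makes the write to t1 visible
   std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";
}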
