Home > Software engineering >  Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)
Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)

Time:05-22

I'm trying to benchmark the fast inverse square root. The full code is here:

#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;
    
static void BM_FastInverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        // from wikipedia:
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;
        i  = 0x5f3759df - ( i >> 1 );
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );
        //  y  = y * ( threehalfs - ( x2 * y * y ) );
        
        float result = y;
        benchmark::DoNotOptimize(result);
    }
}


static void BM_InverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        float result = 1 / sqrt(number);
        benchmark::DoNotOptimize(result);
    } 
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);

and here is the code in quick-bench if you want to run it yourself.

Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.

Here is the relevant Assembly (taken from quick-bench).

With GCC:

               push   %rbp
               mov    %rdi,%rbp
               push   %rbx
               sub    $0x18,%rsp
               cmpb   $0x0,0x1a(%rdi)
               je     408c98 <BM_FastInverseSqrRoot(benchmark::State&) 0x28>
               callq  40a770 <benchmark::State::StartKeepRunning()>
  408c84       add    $0x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
               jmpq   40aa20 <benchmark::State::FinishKeepRunning()>
               nopw   0x0(%rax,%rax,1)
  408c98       mov    0x10(%rdi),%rbx
               callq  40a770 <benchmark::State::StartKeepRunning()>
               test   %rbx,%rbx
               je     408c84 <BM_FastInverseSqrRoot(benchmark::State&) 0x14>
               movss  0x1b386(%rip),%xmm4        # 424034 <_IO_stdin_used 0x34>
               movss  0x1b382(%rip),%xmm3        # 424038 <_IO_stdin_used 0x38>
               mov    $0x5f3759df,           
  • Related