Faulty benchmark, puzzling assembly

Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

My machine has 32GiB of RAM (bandwidth ~37GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz). I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s, i.e. one FLOP per core per cycle. However, my benchmark reports 127GFLOP/s, which is obviously a wrong measurement.
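The back-of-the-envelope estimate above can be written out as a small helper. This is only a sketch: peak_gflops and its flops_per_cycle parameter are hypothetical names, not part of the benchmark. The 24GFLOP/s figure corresponds to assuming one FLOP per core per cycle; a core that issues SIMD or fused multiply-add instructions can retire more than that, so the parameter is left explicit rather than hard-coded.

```cpp
#include <cstddef>

// Theoretical peak in GFLOP/s = cores * clock (GHz) * FLOPs per core per cycle.
// flops_per_cycle is an assumption the caller has to supply; 1.0 reproduces
// the estimate in the text.
constexpr double peak_gflops(std::size_t cores, double clock_ghz, double flops_per_cycle)
{
    return static_cast<double>(cores) * clock_ghz * flops_per_cycle;
}
```

Calling peak_gflops(6, 4.0, 1.0) reproduces the 24GFLOP/s estimate; a larger flops_per_cycle scales the bound proportionally.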

Note: to measure the FP performance, I compute the op-count as n*n*n*n*6 (n^3 for the matrix-matrix multiplication, performed on n slices of complex data points, assuming 6 FLOPs per complex-complex multiplication) and divide it by the average time taken per run.

Code snippet in main function:

// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);

Code snippet: do_timed_run:

void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    
    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    avg_dur += dur2 - dur1;
}

Notes:

At this point, I'm not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
The #pragma noinline is a debugging aid I'm using to make the output of the disassembler easier to follow.
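As an aside on the noinline fixture: #pragma noinline is compiler-specific (Intel's compilers document it), and other compilers may silently ignore it. GCC and Clang provide a function attribute with the same intent. The sketch below uses a hypothetical function, scaled_sum, purely for illustration; it is not taken from the benchmark.

```cpp
#include <cstddef>

// GCC/Clang attribute asking the compiler not to inline this function,
// so it remains a separate symbol and call site in the disassembly.
__attribute__((noinline)) double scaled_sum(const double* data, std::size_t n, double scale)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += data[i] * scale;
    }
    return sum;
}
```

The attribute goes on the function definition or declaration rather than at the call site, which is the main practical difference from the pragma form.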

And now for the disassembly of the function do_timed_run:

0000000000403a20 <_Z12do_timed_runRKmRd>:
  403a20:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403a27:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403a2e:   00 
  403a2f:   48 89 fd                mov    %rdi,%rbp
  403a32:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403a39:   00 
  403a3a:   48 89 f3                mov    %rsi,%rbx
  403a3d:   48 89 ee                mov    %rbp,%rsi
  403a40:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403a45:   48 89 ea                mov    %rbp,%rdx
  403a48:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403a4f:   00 
  403a50:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403a57:   00 
  403a58:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403a5f:   00 
  403a60:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403a67:   00 
  403a68:   e8 03 f8 ff ff          callq  403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
  403a6d:   48 89 ee                mov    %rbp,%rsi
  403a70:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403a75:   48 89 ea                mov    %rbp,%rdx
  403a78:   48 89 e9                mov    %rbp,%rcx
  403a7b:   e8 80 f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a80:   48 89 ee                mov    %rbp,%rsi
  403a83:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403a88:   48 89 ea                mov    %rbp,%rdx
  403a8b:   48 89 e9                mov    %rbp,%rcx
  403a8e:   e8 6d f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a93:   bf 88 f3 44 00          mov    $0x44f388,