Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz), I estimate the maximum performance (with pipelining and data held in registers, i.e. one FLOP per core per cycle) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I measure 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I compute the op-count as n*n*n*n*6 (n^3 complex multiplications per matrix-matrix product, performed on n slices of complex data points, assuming 6 FLOPs for 1 complex-complex multiplication) and divide it by the average time taken per run.
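For reference, this is how I turn that op-count and the measured time into a rate (a simplified sketch of my post-processing, not the exact code; avg_dur is the average per-run wall time in microseconds, as produced by the snippets below):

#include <cstddef>

// Sketch: convert the op-count and the average run time (in microseconds) to GFLOP/s.
double measured_gflops(std::size_t n, double avg_dur_us)
{
    const auto nd       = static_cast<double>(n);
    const auto op_count = 6.0 * nd * nd * nd * nd; // 6 FLOPs per complex-complex mult, n^3 mults per slice, n slices
    return op_count / avg_dur_us * 1E-3;           // FLOP per microsecond -> GFLOP/s
}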
Code snippet from the main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet: do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);

    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }

    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E6;
    avg_dur += dur2 - dur1; // accumulate; main divides by experiment_count
}
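For context: the first-touch step just makes each thread write the data it will later work on, so that the pages are committed (and placed near the thread that uses them) before the timed region. Conceptually it does something like this (simplified sketch, not my actual set_first_touch):

#include <cstddef>

// Called from inside a #pragma omp parallel region (as in do_timed_run above):
// each thread zero-initializes its static share of the buffer, "touching" the
// pages it will later compute on.
void first_touch_sketch(double* data, std::size_t size)
{
    #pragma omp for schedule(static)
    for (std::size_t i = 0; i < size; ++i)
    {
        data[i] = 0.0;
    }
}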
Notes:
- At this point, I'm not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
- The #pragma noinline is a debug fixture I'm using to be able to better understand the output of the disassembler.
And now for the disassembly of the function do_timed_run:
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,