`clock_gettime()` yields incorrect results on Apple M1 (Monterey 12.0.1)-CodePudding

I was working today on a pet project of mine involving performance and C when I was confronted with an unusual bug.

The following code populates two 4096 by 4096 matrices using arc4random() and computes matrix multiplication.

int main() {
  struct timespec start, end;
  srand((unsigned)time(0));

  static double A[M][M];
  static double B[M][M];
  static double C[M][M];

  rand_matrix(M, A);
  rand_matrix(M, B);

  struct timespec tstart = {0, 0}, tend = {0, 0};
  clock_gettime(CLOCK_REALTIME, &tstart);
  matmul(M, C, A, B);
  clock_gettime(CLOCK_REALTIME, &tend);
  printf("some_long_computation took about %.5f seconds\n",
         ((double)tend.tv_sec   1.0e-9 * tend.tv_nsec) -
             ((double)tstart.tv_sec   1.0e-9 * tstart.tv_nsec));

  print_matrix(M, C);
}

The resulting output is :

some_long_computation took about 0.01441 seconds

I tried both MONOTONIC_CLOCK and MONOTONIC_CLOCK_RAW which yielded similarly incoherent results (0.00512).

When using time I get 4.18s user 0.07s system 93% cpu 4.530 total.

I also tried computing the time difference following this example and various others.

clock_gettime(CLOCK_REALTIME, &start);
  uint64_t start_ = nanos();
  matmul(M, C, A, B);
  uint64_t end_ = nanos();
  clock_gettime(CLOCK_REALTIME, &end);

    uint64_t diff =
        BILLION * (end.tv_sec - start.tv_sec)   end.tv_nsec - start.tv_nsec;

The last one diff is equal to 0 when compiled without optimization flags and yields similar incoherent results when the optimization flag in clang is set to -O3.

I am not sure what the issue is I spent a few hours looking into how clock_gettime is implemented didn't get any clues about what the problem is ?

For reference the same code running on Intel i7 with Ubuntu 20.04 yields more coherent results (5-10 seconds depending on optimization level).

Edit: The implementation source and header file can be examined here https://gist.github.com/jmpnz/7da0d1b392a0b81c6a13912cd78ae3f5

CodePudding user response：

There is nothing wrong with the timing. Your multiplication function (omitted from the question but included in the gist) has a bug and does not actually do matrix multiplication, so it is much faster than you expect.

void matmul(size_t N, double C[static N][N], double A[static N][N],
            double B[static N][N]) {
  for (size_t i = 0; i < N; i  ) {
    for (size_t j = 0; j < N; j  ) {
      for (size_t k = 0; k < N; k  ) {
        double acc = A[i][k] * B[k][j]; // oops
        C[i][j] = acc;
      }
    }
  }
}

You forgot to actually add the terms in the innermost loop. When optimizing, the compiler then notices that all iterations of the loop on k simply overwrite the same value, so only the last one needs to be done. It can effectively replace the loop on k with the single statement C[i][j] = A[i][N-1] * B[N-1][j], and your supposedly O(N^3) algorithm is actually O(N^2). So no wonder it runs 1024 times faster than you expected.

If you're running the same code in your x86 tests, it may be that your x86 compiler doesn't do this optimization. It requires some good inlining as this optimization can only be done if the compiler can prove that C doesn't alias A or B.

When I corrected it to

void matmul(size_t N, double C[static N][N], double A[static N][N],
            double B[static N][N]) {
  for (size_t i = 0; i < N; i  ) {
    for (size_t j = 0; j < N; j  ) {
      double acc = 0;
      for (size_t k = 0; k < N; k  ) {
        acc  = A[i][k] * B[k][j];
      }
      C[i][j] = acc;
    }
  }
}

it correctly reports taking about 1.2 seconds on my MacBook Pro M1, which seems quite reasonable.

Morals:

Minimal reproducible examples are important. The code snippet you included in the question had nothing to do with the actual bug, and so people who just tried to stare at the question itself were wasting their time. The full code in the gist helped, but should really be in the question itself, as not everyone will take the time to chase links.
Correctness testing should come before benchmarking, or at least together with it. Your program outputs the result of the matrix "multiplication" but I suppose you never actually checked that it was mathematically correct.
select isn't broken.