What is the correct ARM64 (AArch64) data memory barrier usage when reading 64bit timer value from two

The Arm documentation describes a sequence for reading a 64-bit timer value from two 32-bit timer counters: https://developer.arm.com/documentation/100400/0001/multiprocessing/global-timer/global-timer-registers

What is the correct way to insert ARM64 memory barriers between the reads?

Is something like the code below correct? Can someone please explain which data memory barriers to use in this case, and how?

do {
  high1 = read(base + 4);
  asm volatile("dmb sy");
  low = read(base);
  asm volatile("dmb sy");
  high2 = read(base + 4);
  asm volatile("dmb sy");
} while (high2 != high1);

I know a question on how to read a 64-bit timer already exists, but it gives no detail on memory barrier usage, and I need this for ARM machines: How to read two 32bit counters as a 64bit integer without race condition

CodePudding user response：

There are different types of memory mapping. Each type defines how memory accesses are performed and which reorderings of reads and writes are allowed.

Reordering in this case means, for example, that the instruction sequence high1 = read(base + 4); low = read(base); is performed by the CPU as low = read(base); high1 = read(base + 4);. That is perfectly reasonable from a performance point of view. By the time the CPU executes while (high2 != high1); it generally does not matter which register was assigned first, 'low' or 'high1'. The CPU is simply not aware of the interdependence between the two words.

For this 64-bit value, we have to take extra steps to prevent the CPU from reordering the reads and breaking that dependency.

The first and 'most correct' way is to map the timer as 'Device' memory. Usually all memory-mapped hardware is mapped as 'Device' memory. A 'Device' mapping guarantees strict ordering of accesses (for the non-reordering Device types such as Device-nGnRE), so the CPU will not reorder the reads (or the writes, or both), and the order will always be high1, low, high2. Device memory is also uncacheable. That does not matter in this case, but for something that uses DMA, for instance, it saves you from having to maintain consistency between cache and memory. In conclusion, any sync barriers are redundant here when the timer is mapped as 'Device' memory.
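As a rough sketch of the Device-mapped case, the read loop needs no barrier instructions at all, only volatile accesses so that the compiler keeps the three loads and their order. The names reg_read32, timer_read64_device, and the two offset macros are made up for illustration, and the code assumes 'base' really points at a Device mapping with the low word at offset 0 and the high word at offset 4, as in the question.

```c
#include <stdint.h>

/* Hypothetical offsets: low word at +0, high word at +4,
   matching read(base) and read(base + 4) in the question. */
#define TIMER_LOW_OFFSET  0u
#define TIMER_HIGH_OFFSET 4u

/* One 32-bit read. 'volatile' stops the compiler from merging,
   dropping, or reordering these accesses relative to each other. */
static inline uint32_t reg_read32(const volatile void *base, uint32_t offset)
{
    return *(const volatile uint32_t *)((const volatile uint8_t *)base + offset);
}

/* Assumption: 'base' is a Device mapping, so the CPU performs the
   loads in program order and no dmb is needed between them. */
uint64_t timer_read64_device(const volatile void *base)
{
    uint32_t high1, low, high2;

    do {
        high1 = reg_read32(base, TIMER_HIGH_OFFSET);
        low   = reg_read32(base, TIMER_LOW_OFFSET);
        high2 = reg_read32(base, TIMER_HIGH_OFFSET);
    } while (high1 != high2);   /* retry if the high word changed */

    return ((uint64_t)high1 << 32) | low;
}
```

The retry loop itself is still required: it handles the timer rolling over between reads, which is a separate problem from CPU reordering.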

If you want to ask for trouble, the hardware can be mapped as 'generic'/'common' memory (Normal memory in Arm terminology). For 'generic' memory, reordering is allowed, so you might end up in the following situation. Say we have a counter value like 0000-9999 (decimal, 4 digits for high and 4 digits for low).

high1 = read(base + 4); low = read(base); is reordered and executed as low = read(base); high1 = read(base + 4);
low is read as 9999; after that read finishes, the timer is incremented.
Now the timer is 0001-0000.
high1 is read as 0001.
We now have 0001-9999. Reading high2 gives 0001 again, so the loop check passes with a wrong value, and life gets very interesting from this point.

So as I see it, it is necessary to prevent reordering of the high1 and low reads, as well as of the low and high2 reads, because we can get a wrong value in both cases. In the second case, high2 is read early as 0000, the timer then rolls over to 0001-0000, and low is read as 0000, so we end up with high1=0000, high2=0000 and low=0000, with the 0001 missing from the high word.
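The 0000-9999 scenario above can be replayed in plain C with a mock counter. This is only a simulation for illustration: struct mock_timer, maybe_tick, read_reordered, and read_in_order are invented names, and the "reordering" is modelled by writing the reads in the swapped order by hand, with one timer increment landing between the first two reads.

```c
#include <stdint.h>

/* Mock decimal counter: 4 digits high, 4 digits low.
   pending_ticks is how many increments are still to be injected. */
struct mock_timer {
    uint32_t high;
    uint32_t low;
    int pending_ticks;
};

/* Advance the counter by one if an increment is still pending. */
static void maybe_tick(struct mock_timer *t)
{
    if (t->pending_ticks > 0) {
        t->pending_ticks--;
        if (++t->low == 10000u) {
            t->low = 0;
            t->high++;
        }
    }
}

/* Reads as the CPU might perform them after reordering: low first. */
uint32_t read_reordered(struct mock_timer *t)
{
    uint32_t low = t->low;      /* moved ahead of the high1 read */
    maybe_tick(t);              /* timer increments in between */
    uint32_t high1 = t->high;
    uint32_t high2 = t->high;   /* same as high1, so the check passes */
    (void)high2;
    return high1 * 10000u + low;
}

/* Reads in program order: high1, low, high2, retry on mismatch. */
uint32_t read_in_order(struct mock_timer *t)
{
    uint32_t high1, low, high2;

    do {
        high1 = t->high;
        maybe_tick(t);          /* same increment, same position */
        low = t->low;
        high2 = t->high;
    } while (high1 != high2);

    return high1 * 10000u + low;
}
```

Starting both from 0000-9999 with one pending increment, the reordered version returns the impossible value 0001-9999, while the in-order version notices the changed high word, retries, and returns 0001-0000.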

So I'd say

do {
  high1 = read(base + 4);
  asm volatile("dmb sy");
  low = read(base);
  asm volatile("dmb sy");
  high2 = read(base + 4);
  // asm volatile("dmb sy"); // this one looks excessive
} while (high2 != high1);

PS: It does not look like you need a barrier as strong as sy; the weakest one that still guarantees ordering of these reads on the specific CPU should be sufficient.
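One possible shape for a weaker barrier, as a sketch only: "dmb ld" orders earlier loads against later loads and stores, which is the kind of ordering this read loop relies on. Whether it is actually sufficient for your mapping and SoC is an assumption you would need to check against the Arm architecture manual. The names read_barrier and timer_read64_barrier are invented here, the inline asm syntax assumes GCC or Clang, and the "memory" clobber is added so the compiler does not move accesses across the barrier either.

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Load barrier: earlier loads complete before later loads and stores.
   The "memory" clobber also acts as a compiler barrier. */
#define read_barrier() __asm__ volatile("dmb ld" ::: "memory")
#else
/* Fallback so the sketch builds on other hosts: compiler barrier only. */
#define read_barrier() __asm__ volatile("" ::: "memory")
#endif

/* 'base' points at the low word; base[1] is the high word at byte offset 4. */
uint64_t timer_read64_barrier(const volatile uint32_t *base)
{
    uint32_t high1, low, high2;

    do {
        high1 = base[1];
        read_barrier();         /* keep high1 before low */
        low = base[0];
        read_barrier();         /* keep low before high2 */
        high2 = base[1];
    } while (high1 != high2);   /* retry if the high word changed */

    return ((uint64_t)high1 << 32) | low;
}
```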