Why does LDR sometimes take 20 CPU cycles?

Time:04-24

I am having an issue with the LDR and STR instructions in ARM Cortex-M4 assembly. For some reason, they take much longer to read or write certain parts of memory than others.

To illustrate this, I’ve set up this simple example:

I have created a project with a main C file and a neighboring “.S” file containing the assembly code. I’ve declared the assembly functions in my C file using “extern” declarations:

//Add the asm functions to our C code
extern "C" void LoadTest(uint32_t *memory_adress);
extern "C" void LoadTestLoop(uint32_t *memory_adress);

Here is what the program does:

void perform_test()
{
  //Time
  register uint32_t register_before_time=before_time;
  register uint32_t register_after_time=after_time;

  register uint32_t* input_address=(uint32_t*)0x400E9000;
  
  register_before_time=ARM_DWT_CYCCNT; 

  //Time measurement occurs in here!
  LoadTestLoop(input_address);
  
  register_after_time=ARM_DWT_CYCCNT;

  Serial.print(" Time: ");
  Serial.println(register_after_time-register_before_time-time_error);
}

It reports how many cycles elapse between the “register_before_time=ARM_DWT_CYCCNT;” and “register_after_time=ARM_DWT_CYCCNT;” lines.
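As a minimal sketch of the subtraction printed above, the helper below computes the elapsed cycle count from two counter samples. The name elapsed_cycles and its parameters are hypothetical (they are not in the original code), and it assumes the cycle counter is a free-running 32-bit value, so unsigned arithmetic keeps the result correct across a single wraparound.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original post.
// Returns the cycles elapsed between two samples of a free-running 32-bit
// cycle counter, minus a fixed measurement overhead (time_error in the post).
// Unsigned 32-bit subtraction wraps modulo 2^32, so the result is still
// correct if the counter overflowed once between the two samples.
constexpr std::uint32_t elapsed_cycles(std::uint32_t before,
                                       std::uint32_t after,
                                       std::uint32_t overhead)
{
    return after - before - overhead;
}
```

In the test function this would replace the expression passed to Serial.println, for example elapsed_cycles(register_before_time, register_after_time, time_error).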

Here are the assembly subroutines we will be testing for their speed:

.global LoadTest
LoadTest:
    ldr r1, [r0]                        /*Load value into r1 from memory_address*/
    orr r1, #0xC0                       /*OR bits 7,6 to be on.*/
    str r1, [r0]                        /*Store the changed value back into memory_address*/
    bx lr
.global LoadTestLoop
LoadTestLoop:
    mov r2, #255                        /* Set r2 to be 255 for the loop*/
    
    TestLoop:                           /*Same code as before*/
        ldr r1, [r0]                        
        orr r1, #0xC0                   
        str r1, [r0]
        
        subs r2, r2, #1                 /*Decrement r2 and set the Z flag when it reaches zero*/
        bne TestLoop                    /*Repeat until r2==0*/
    bx lr

LoadTest – Loads a value from the address we give it, ORs the value with 0xC0, and then stores it back to the same address.

LoadTestLoop – Does the same thing, but in a loop that runs 255 times. This way we can get an average of how long one loop iteration takes, and minimize the measurement error caused by the branch instructions that enter and leave the function.

Note: To further minimize measurement error, the address to work on is placed in the input_address pointer before the timed region begins.

register uint32_t* input_address=(uint32_t*)0x400E9000;

Test results and the issue:

I ran both tests on a normal C variable:

uint32_t test_value=255;
register uint32_t* input_address=&test_value;

I also ran them on a configuration register inside the microcontroller. Note that the datasheet presents these registers as ordinary memory:

register uint32_t* input_address=(uint32_t*)0x400E9000;

On average, LoadTest took 9 cycles for a standard variable but 27 cycles for the control register. The LoadTestLoop test reinforced this: a standard variable took 1541 cycles on average (about 6 cycles per iteration), while the control register took an astounding 12227 cycles, which works out to nearly 48 cycles per iteration!
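The per-iteration figures can be checked with a short calculation. This is only a sketch: cycles_per_iteration is a hypothetical helper, and it assumes the whole measured total is spread evenly over the 255 iterations, which slightly overstates the loop body because the total also includes the call, the initial mov, and the return.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original post.
// Average cycles per loop iteration, assuming the measured total is spread
// evenly over all iterations (call and return overhead is not removed).
constexpr double cycles_per_iteration(std::uint32_t total_cycles,
                                      std::uint32_t iterations)
{
    return static_cast<double>(total_cycles) / iterations;
}
```

With the totals reported above, cycles_per_iteration(1541, 255) is a little over 6 and cycles_per_iteration(12227, 255) is just under 48.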

Why is this happening?

Why do LDR and STR sometimes take way longer to execute? Does it have something to do with the little “b” written next to the cycle count in the documentation? (Image caption: notice the little blue b next to the number 2.)

Does anybody know why this is happening? I’ve been bugged by this question for a long time and would really like to know.

Thank you for the help.

CodePudding user response:

This is completely normal.

In general, a load from memory takes as long as it takes. The timing isn't under the control of the CPU itself, so a quoted cycle count can only represent a "best case". If the CPU can't fulfill the load from its own internal structures (e.g. store buffer or L1 cache), then it just has to put the request out on the memory bus and stall until the memory subsystem responds. (Or go on executing later instructions out-of-order, if so equipped and if it can find some that don't depend on the result of the load.)

The actual time taken can be highly variable, depending for instance on whether the load hits or misses L2 or L3 cache, whether another core or external device holds a bus lock, etc. If the machine has no cache and all memory is fast SRAM, then the time could be pretty stable.

But in your case the address you're loading is actually mapped to a hardware device. So you're not really reading RAM at all, you're doing I/O. In this case, the response has to come from the device itself, and the device can essentially take as long as it needs. If you need to be able to predict the time, then you need to be looking at the documentation of that device (and any interface hardware in between), not at cycle counts in the CPU manual.
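To put a rough number on that device response time, the totals from the question can be compared directly. The sketch below is an estimate under stated assumptions, not a property of the hardware: extra_cycles_per_access is a hypothetical helper, it attributes the entire difference between the two loop totals to the peripheral accesses, and it splits that difference evenly between the one load and one store in each iteration (in practice the load and the store may not stall equally).

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original post.
// Estimates the extra stall cycles per memory access when the loop targets a
// peripheral register instead of RAM. Assumes device_total >= ram_total and
// that the whole difference is caused by the accesses themselves.
constexpr double extra_cycles_per_access(std::uint32_t ram_total,
                                         std::uint32_t device_total,
                                         std::uint32_t iterations,
                                         std::uint32_t accesses_per_iteration)
{
    return static_cast<double>(device_total - ram_total) /
           (static_cast<double>(iterations) * accesses_per_iteration);
}
```

With the figures from the question, extra_cycles_per_access(1541, 12227, 255, 2) comes out at about 21 extra cycles per access, which lines up with the roughly 20 cycles mentioned in the title.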
