ARM Thumb GCC Disassembled C. Caller-saved registers not saved and loading and storing same register

Context: STM32F469 Cortex-M4 (ARMv7-M Thumb-2), Win 10, GCC, STM32CubeIDE. I am learning and trying out inline assembly, reading disassembly, stack management and so on: writing to core registers, observing register contents, and examining RAM around the stack pointer to understand how things work.

I've noticed that when I call a function that receives an argument, the instructions generated at the beginning of the called C function do "store R3 at RAM address X", followed immediately by "read RAM address X and store it in R3". So it writes a value and reads the same value straight back; R3 is not changed. If it only wanted to save the value of R3 on the stack, why load it back?

C code, caller function (main), my code:

asm volatile("      LDR R0,=#0x00000000\n"
            "       LDR R1,=#0x11111111\n"
            "       LDR R2,=#0x22222222\n"
            "       LDR R3,=#0x33333333\n"
            "       LDR R4,=#0x44444444\n"
            "       LDR R5,=#0x55555555\n"
            "       LDR R6,=#0x66666666\n"
            "       MOV R7,R7\n" //Stack pointer value is here, used for stack data access
            "       LDR R8,=#0x88888888\n"
            "       LDR R9,=#0x99999999\n"
            "       LDR R10,=#0xAAAAAAAA\n"
            "       LDR R11,=#0xBBBBBBBB\n"
            "       LDR R12,=#0xCCCCCCCC\n"
    );
testInt = addFifteen(testInt); //testInt=0x03; returns uint8_t, argument uint8_t

Function call generates instructions to load function argument into R3, then move it to R0, then branch with link to addFifteen. So by the time I enter addFifteen, R0 and R3 have value 0x03 (testInt). So far so good. Here is what function call looks like:

testInt = addFifteen(testInt);
08000272:   ldrb    r3, [r7, #11]
08000274:   mov     r0, r3
08000276:   bl      0x80001f0 <addFifteen>

So I go into addFifteen, my C code for addFifteen:

uint8_t addFifteen(uint8_t input){
    return (input + 15U);
}

and its disassembly:

    addFifteen:
080001f0:   push    {r7}
080001f2:   sub     sp, #12
080001f4:   add     r7, sp, #0
080001f6:   mov     r3, r0
080001f8:   strb    r3, [r7, #7]
080001fa:   ldrb    r3, [r7, #7]
080001fc:   adds    r3, #15
080001fe:   uxtb    r3, r3
08000200:   mov     r0, r3
08000202:   adds    r7, #12
08000204:   mov     sp, r7
08000206:   ldr.w   r7, [sp], #4
0800020a:   bx      lr

My primary interest is in the lines at 1f8 and 1fa. The code stores R3 on the stack and then loads the freshly written value back into the register that still holds that value anyway.

Questions are:

1. What is the purpose of this "store register A into RAM X, then read the value from RAM X into register A"? The read instruction doesn't seem to serve any purpose. Is it there to make sure the RAM write is complete?
2. The push {r7} instruction leaves the stack 4-byte aligned instead of 8-byte aligned. But immediately after that instruction SP is decremented by 12 (bytes), so it becomes 8-byte aligned again. Therefore, this behavior is OK. Is this statement correct? What if an interrupt happens between these two instructions? Will alignment be fixed during ISR stacking for the duration of the ISR?
3. From what I read about caller/callee-saved registers (it is very hard to find any well-organized information on that; if you have good material, please share a link), at least R0-R3 must be placed on the stack when I call a function. However, it's easy to notice in this case that NONE of the registers were pushed on the stack. I verified it by checking memory around the stack pointer: 0x11111111 and 0x22222222 would have been easy to spot, but they aren't there, and nothing is pushing them there. The values in R0 and R3 that I had before I called the function are simply gone forever. Why weren't any registers pushed on the stack before the function call? I would expect R3 to hold 0x33333333 when addFifteen returns, because that's what it held before the function call, but that value is casually overwritten even before the branch to addFifteen. Why didn't GCC generate instructions to push R0-R3 onto the stack and only after that branch with link to addFifteen?

If you need some compiler settings, please, let me know where to find them in Eclipse (STM32CubeIDE) and what exactly you need there, I will happily provide them and add them to the question here.

CodePudding user response：

I'm not sure there is a specific purpose to it. It is just one solution that the compiler has found.

For example the code:

unsigned int f(unsigned int a)
{ 
   return sqrt(a + 1);
}

compiles with ARM GCC 9 NONE with optimisation level -O0 to:

    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    str     r0, [r7, #4]
    ldr     r3, [r7, #4]
    adds    r3, r3, #1
    mov     r0, r3
    bl      __aeabi_ui2d
    mov     r2, r0
    mov     r3, r1
    mov     r0, r2
    mov     r1, r3
    bl      sqrt
    ...

and in level -O1 to:

    push    {r3, lr}
    adds    r0, r0, #1
    bl      __aeabi_ui2d
    bl      sqrt
    ...

As you can see, the asm is much easier to understand at -O1: the parameter arrives in R0, 1 is added, and the functions are called.

The hardware supports a non-aligned stack at exception entry.
The "caller saved" registers do not necessarily need to be stored on the stack; it's up to the caller to know whether it needs to save them or not. Here you are mixing (if I understood correctly) C and assembly, so you have to do the compiler's job before switching back to C: either you keep the values in callee-saved registers (and then you know by convention that the compiler will preserve them across the function call), or you store them on the stack yourself.
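The exception-entry behaviour can be modelled in a few lines of C. This is only a sketch, assuming the ARMv7-M rule that applies when CCR.STKALIGN is set and no floating-point context is stacked (a 32-byte frame): the core forces the frame onto an 8-byte boundary and records in bit 9 of the stacked xPSR whether it had to skip a word. The struct, the function names and the SP values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ARMv7-M exception entry, basic 8-word frame only. */
struct entry_result {
    uint32_t frame_base;  /* SP value the handler starts with */
    int      padded;      /* 1 if a 4-byte pad was inserted (stacked xPSR bit 9) */
};

struct entry_result model_exception_entry(uint32_t sp, int stkalign)
{
    struct entry_result r;
    uint32_t frame = sp - 32u;                 /* R0-R3, R12, LR, PC, xPSR */

    r.padded = stkalign && ((sp & 4u) != 0u);  /* SP was only 4-byte aligned */
    r.frame_base = r.padded ? (frame & ~4u) : frame;
    return r;
}

void exception_entry_demo(void)
{
    /* Interrupt taken between "push {r7}" and "sub sp, #12":
       SP is 4-byte aligned but not 8-byte aligned. */
    struct entry_result r = model_exception_entry(0x20000FFCu, 1);

    assert(r.padded == 1);
    assert(r.frame_base % 8u == 0u);
}
```

On exception return the hardware uses the recorded bit to undo the padding, so the interrupted code sees its original SP again.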

CodePudding user response：

uint8_t addFifteen(uint8_t input){
    return (input + 15U);
}

What you are looking at here is unoptimized and at least with gnu the input and local variables get a memory location on the stack.

00000000 <addFifteen>:
   0:   b480        push    {r7}
   2:   b083        sub sp, #12
   4:   af00        add r7, sp, #0
   6:   4603        mov r3, r0
   8:   71fb        strb    r3, [r7, #7]
   a:   79fb        ldrb    r3, [r7, #7]
   c:   330f        adds    r3, #15
   e:   b2db        uxtb    r3, r3
  10:   4618        mov r0, r3
  12:   370c        adds    r7, #12
  14:   46bd        mov sp, r7
  16:   bc80        pop {r7}
  18:   4770        bx  lr

What you see with r3 is this: the input variable, input, comes in r0. Because the code is not optimized, it is copied into r3 and then saved in its memory location on the stack.

Setup the stack

00000000 <addFifteen>:
   0:   b480        push    {r7}
   2:   b083        sub sp, #12
   4:   af00        add r7, sp, #0

save input to the stack

   6:   4603        mov r3, r0
   8:   71fb        strb    r3, [r7, #7]

so now we can start implementing the code in the function, which wants to do math on the input variable, so do that math

   a:   79fb        ldrb    r3, [r7, #7]
   c:   330f        adds    r3, #15

Convert the result to an unsigned char.

   e:   b2db        uxtb    r3, r3

Now prepare the return value

  10:   4618        mov r0, r3

and clean up and return

  12:   370c        adds    r7, #12
  14:   46bd        mov sp, r7
  16:   bc80        pop {r7}
  18:   4770        bx  lr

Now if I tell it not to use a frame pointer (just a waste of a register).

00000000 <addFifteen>:
   0:   b082        sub sp, #8
   2:   4603        mov r3, r0
   4:   f88d 3007   strb.w  r3, [sp, #7]
   8:   f89d 3007   ldrb.w  r3, [sp, #7]
   c:   330f        adds    r3, #15
   e:   b2db        uxtb    r3, r3
  10:   4618        mov r0, r3
  12:   b002        add sp, #8
  14:   4770        bx  lr

And you can still see each of the fundamental steps in implementing the function. Unoptimized.

Now if you optimize

00000000 <addFifteen>:
   0:   300f        adds    r0, #15
   2:   b2c0        uxtb    r0, r0
   4:   4770        bx  lr

It removes all the excess.

number two.

Yes, I agree this looks wrong, but gnu certainly does not keep the stack aligned at all times. I have not read the details of the arm calling convention, nor have I read what gcc's interpretation of it is. Granted, they may claim a spec, but at the end of the day the compiler authors choose the calling convention for their compiler; they are under no obligation to arm or intel or others to conform to any spec. It is their choice, and like the C language itself, there are lots of places where it is implementation defined, and gnu implements the C language one way and others another way. Perhaps this is the same. The same goes for this saving of the incoming variable to the stack. We will see that llvm/clang does not.
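The arithmetic behind the alignment question can be traced by hand. A small sketch, assuming a made-up entry value for SP; the struct and function names are hypothetical. As far as I know, the AAPCS asks for 4-byte stack alignment at all times and 8-byte alignment only at public interfaces (function calls), which would make the brief 4-byte-aligned window inside the prologue legal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trace of SP through the prologue of the -O0 addFifteen. */
struct sp_trace {
    uint32_t at_entry;
    uint32_t after_push;
    uint32_t after_sub;
};

struct sp_trace trace_prologue(uint32_t sp_at_entry)
{
    struct sp_trace t;

    t.at_entry   = sp_at_entry;
    t.after_push = t.at_entry - 4u;     /* push {r7}   */
    t.after_sub  = t.after_push - 12u;  /* sub sp, #12 */
    return t;
}

int is_aligned8(uint32_t sp)
{
    return (sp % 8u) == 0u;
}

void prologue_demo(void)
{
    struct sp_trace t = trace_prologue(0x20001000u); /* assumed entry SP */

    assert(is_aligned8(t.at_entry));     /* aligned at the call          */
    assert(!is_aligned8(t.after_push));  /* only 4-byte aligned here     */
    assert(is_aligned8(t.after_sub));    /* aligned again for the body   */
}
```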

number three.

r0-r3 and another register or two may be called caller saved, but the better way to think of them is volatile. The callee is free to modify them without saving them. It is not so much a case of saving the r0 register, but instead r0 represents a variable and you are managing that variable in functionally implementing the high level code.

For example

unsigned int fun1 ( void );
unsigned int fun0 ( unsigned int x )
{
    return(fun1()+x);
}

00000000 <fun0>:
   0:   b510        push    {r4, lr}
   2:   4604        mov r4, r0
   4:   f7ff fffe   bl  0 <fun1>
   8:   4420        add r0, r4
   a:   bd10        pop {r4, pc}

x comes in in r0, and we need to preserve that value until after fun1() is called. r0 can be destroyed/modified by fun1(). So in this case they save r4, not r0, and keep x in r4.

clang does this as well

00000000 <fun0>:
   0:   b5d0        push    {r4, r6, r7, lr}
   2:   af02        add r7, sp, #8
   4:   4604        mov r4, r0
   6:   f7ff fffe   bl  0 <fun1>
   a:   1900        adds    r0, r0, r4
   c:   bdd0        pop {r4, r6, r7, pc}
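Seen from the C side, the same example can be completed into something runnable. This is a sketch only: the body of fun1 is invented here (the original only declares it), and fun_demo is a hypothetical usage function. The point is that x survives the call because the compiler keeps it somewhere fun1 must preserve, not because the caller pushed r0.

```c
#include <assert.h>

unsigned int fun1(void)
{
    /* Hypothetical body. It is free to overwrite r0-r3 and r12 without
       saving them; volatile keeps the compiler from folding it away. */
    volatile unsigned int scratch[4] = { 1u, 2u, 3u, 4u };

    return scratch[0] + scratch[1] + scratch[2] + scratch[3];
}

unsigned int fun0(unsigned int x)
{
    /* x arrives in r0 and is needed after the call, so the compiler has
       to park it (r4 in the listings) before branching to fun1. */
    return fun1() + x;
}

void fun_demo(void)
{
    assert(fun1() == 10u);
    assert(fun0(5u) == 15u);
}
```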

Back to your function.

clang, unoptimized also keeps the input variable in memory (stack).

00000000 <addFifteen>:
   0:   b081        sub sp, #4
   2:   f88d 0003   strb.w  r0, [sp, #3]
   6:   f89d 0003   ldrb.w  r0, [sp, #3]
   a:   300f        adds    r0, #15
   c:   b2c0        uxtb    r0, r0
   e:   b001        add sp, #4
  10:   4770        bx  lr

and you can see the same steps, prep the stack, store the input variable. Take the input variable do the math. Prepare the return value. Clean up, return.

Clang/llvm optimized:

00000000 <addFifteen>:
   0:   300f        adds    r0, #15
   2:   b2c0        uxtb    r0, r0
   4:   4770        bx  lr

This happens to be the same as gnu. There is no expectation that any two different compilers generate the same code, nor that any two versions of the same compiler do.

1. Unoptimized, the input and local variables (none in this case) get a home on the stack. So what you are seeing is the input variable being put in its home on the stack as part of the setup of the function. Then the function itself wants to operate on that variable so, unoptimized, it needs to fetch that value from memory to create an intermediate variable (that in this case did not get a home on the stack) and so on. You see this with volatile variables as well. They will get written to memory, then read back, then modified, then written to memory and read back, etc.
2. Yes, I agree, but I have not read the specs. At the end of the day it is gcc's calling convention, or its interpretation of some spec it chooses to use. They have been doing this (not being aligned 100% of the time) for a long time and it does not fail. For all called functions the stack is aligned when the functions are called. When an interrupt hits arm code generated by gcc, the stack is not aligned all the time. It has been this way since they adopted that spec.
3. By definition r0-r3, etc. are volatile. The callee can modify them at will. The callee only needs to save/preserve them if IT needs them. In both the unoptimized and optimized cases only r0 matters for your function: it is the input variable and it is used for the return value. You saw in the function I created that the input variable was preserved for later, even when optimized. But, by definition, the caller assumes these registers are destroyed by called functions, and called functions can destroy the contents of these registers with no need to save them.
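The comparison with volatile variables can be tried directly. A minimal sketch, using a hypothetical variant of the function: because the local is volatile, the compiler has to perform the store and the following load even with optimization enabled, so the disassembly should show the same store-then-load pair as the unoptimized addFifteen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of addFifteen with an explicit "home" in memory. */
uint8_t addFifteenVolatile(uint8_t input)
{
    volatile uint8_t home = input;  /* forced store to the stack slot  */

    return (uint8_t)(home + 15U);   /* forced load from the same slot  */
}

void volatile_demo(void)
{
    /* The result is the same as for the plain function; only the
       generated memory traffic differs. */
    assert(addFifteenVolatile(3) == 18);
    assert(addFifteenVolatile(250) == 9);
}
```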

As far as inline assembly goes, it is a different assembly language than "real" assembly language. I think you have a ways to go before being ready for that, but maybe not. After decades of constant bare metal work I have found zero real use cases for inline assembly; the cases I see are laziness, avoiding allowing real assembly into the make system, or ways to avoid writing real assembly language. I see it as a gee-whiz feature that folks use, like unions and bitfields.

Within gnu, for arm, you have at least four incompatible assembly languages: the not-unified-syntax real assembly, the unified-syntax real assembly, the assembly language that you see when you use gcc to assemble instead of as, and then inline assembly for gcc. Despite claims of compatibility, clang arm assembly language is not 100% compatible with gnu assembly language, and llvm/clang does not have a separate assembler; you feed it to the compiler. Arm's various toolchains over the years have had assembly languages completely incompatible with gnu's for arm. This is all expected and normal. Assembly language is specific to the tool, not the target.

Before you can get into inline assembly language, learn some of the real assembly language. And to be fair, perhaps you do know it, and perhaps quite well, and this question is about the discovery of how compilers generate code, and how strange it looks as you find out that it is not some one-to-one thing (all tools in all cases generating the same output from the same input).

For inline asm, while you can specify registers, depending on what you are doing, you generally want to let the compiler choose the register. Most of the work for inline assembly is not the assembly but the language that specific compiler uses to interface it, which is compiler specific: move to another compiler and the expectation is a whole new language to learn. While moving between assemblers is also a whole new language, at least the syntax of the instructions themselves tends to be the same, and the language differences are in everything else, labels and directives and such. And if you are lucky and it is a toolchain, not just an assembler, you can look at the output of the compiler to start to understand the language and compare it to any documentation you can find. GNU's documentation is pretty bad in this case, so a lot of reverse engineering is needed. At the same time you are more likely to be successful with gnu tools than with any other, not because they are better (in many cases they are not), but because of the sheer user base and the common features across targets and over decades of history.
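A hedged sketch of what "letting the compiler choose the register" looks like in GCC extended asm. The function names are hypothetical; the ARM branch is only compiled when __arm__ is defined, and a plain C fallback keeps the sketch buildable on a host. The second function shows the other half of the interface: if a register is hard-coded, listing it as a clobber tells the compiler about it, which the register-filling asm block in the question does not do.

```c
#include <assert.h>
#include <stdint.h>

uint32_t add_regs(uint32_t a, uint32_t b)
{
#if defined(__arm__)
    uint32_t out;

    /* %0, %1 and %2 are replaced by whatever registers the compiler picks. */
    __asm__ volatile ("add %0, %1, %2"
                      : "=r"(out)         /* output operand */
                      : "r"(a), "r"(b));  /* input operands */
    return out;
#else
    return a + b;  /* portable fallback so the sketch builds on a host */
#endif
}

void scratch_r4(void)
{
#if defined(__arm__)
    /* Hard-coded register: the clobber list says r4 is overwritten, so the
       compiler keeps no live value there and still preserves r4 for the
       caller of this function. */
    __asm__ volatile ("mov r4, #0x44" : : : "r4");
#endif
}

void inline_asm_demo(void)
{
    scratch_r4();
    assert(add_regs(2u, 3u) == 5u);
}
```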

I would get really good at interfacing asm with C by creating mock C functions to see which registers are used, etc. And/or even better, implement it in C, compile it, then hand modify/improve/whatever the output of the compiler (you do not need to be a guru to beat the compiler, to be as consistent, perhaps, but fairly often you can easily see improvements that can be made on the output of gcc, and gcc has been getting worse over the last several versions it is not getting better, as you can see from time to time on this site). Get strong in the asm for this toolchain and target and how the compiler works, and then perhaps learn the gnu inline assembly language.
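Mock functions for this kind of experiment can be very small. The ones below are hypothetical and exist only to be read in the disassembly. Under the AAPCS I would expect the first four word-sized arguments in r0-r3, a fifth on the stack, and a 64-bit argument in an even/odd register pair (r2:r3 for mock_wide), but the point of the exercise is to check that against what the compiler actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock functions; each argument gets a different weight so
   it is easy to follow which register feeds which term. */
uint32_t mock_four(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return a + 2u * b + 3u * c + 4u * d;
}

uint32_t mock_five(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return mock_four(a, b, c, d) + 5u * e;
}

uint64_t mock_wide(uint32_t a, uint64_t b)
{
    return a + b;
}

void mock_demo(void)
{
    assert(mock_four(1u, 1u, 1u, 1u) == 10u);
    assert(mock_five(1u, 1u, 1u, 1u, 1u) == 15u);
}
```

Compiling with something like arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O1 -S (the file name and the optimization level are up to you) produces an assembly listing that can be compared against those expectations.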