Where is the string saved when we do arm assembly?

My book talks about the "dynamic data segment" and "global data segment". In the ARM code below, where is the string "Hello World!" saved, and how is it saved? Is each letter a byte? If so, how does it know where to start and end?

.text
.global main
main:
    push {lr}    

    ldr r0, =string
    bl printf

    mov r0, $0
    pop {lr}
    bx lr

.data 
string: .asciz "Hello World!\n"

CodePudding user response：

It sounds like you should get a better book! This program is incorrect because it calls the printf function while the stack is misaligned: push {lr} moves sp by only 4 bytes. All the major ABIs used on the ARM platform require the stack to be 8-byte aligned when calling a function.

To answer your question, if you write a program in C then it is up to your compiler where it puts the string, although there are some established conventions. Because your program is written in assembly, you have to tell the assembler where to put the string. Here the .data directive puts the string in the .data section. This is probably what your dodgy book is calling the "global data segment". If I had to guess, I would think it is using the term "dynamic data segment" to refer to the heap, which isn't actually ever a segment in the output program, but is accessed at run time through functions like malloc.

CodePudding user response：

Perhaps a better book but also learn how to use google.

It is not the compiler that chooses; it is you, the programmer, who ultimately chooses where these things go, even if you choose to use a pre-built bundle like the GNU tools for your platform. For GNU, the C library, the bootstrap and the linker script are all intimately related, and what address space things land in is defined by that linker script.

You can see the .asciz directive, which means ASCII followed by a terminating zero byte; you can easily google ASCII and see how those characters are represented in binary. No point whatsoever covering that topic here; you should have done that before you asked the question.

Yes, the unaligned stack does not conform to the current ARM ABI, but this code will still assemble. And I am surprised, as others were, that $0 works instead of #0; just more proof that assembly language is specific to the tool, not the target.

I removed the call to printf to keep this example simple, as it does not matter here.

.text
.global main
main:
    push {lr}    

    ldr r0, =string
    @bl printf

    mov r0, $0
    pop {lr}
    bx lr

.data 
string: .asciz "Hello World!\n"

Assemble and disassemble:

Disassembly of section .text:

00000000 <main>:
   0:   e52de004    push    {lr}        ; (str lr, [sp, #-4]!)
   4:   e59f0008    ldr r0, [pc, #8]    ; 14 <main+0x14>
   8:   e3a00000    mov r0, #0
   c:   e49de004    pop {lr}        ; (ldr lr, [sp], #4)
  10:   e12fff1e    bx  lr
  14:   00000000    andeq   r0, r0, r0

Disassembly of section .data:

00000000 <string>:
   0:   6c6c6548    cfstr64vs   mvdx6, [ip], #-288  ; 0xfffffee0
   4:   6f57206f    svcvs   0x0057206f
   8:   21646c72    smccs   18114   ; 0x46c2
   c:   Address 0x000000000000000c is out of bounds.

I used a disassembler, so it is trying to disassemble the ASCII data as instructions; ignore the mnemonics. You can see the bytes in the second column and compare them to what you found when you used google.

This is unlinked, so the sections do not have a base address yet; they are at zero for the object. You can see that the pseudo-instruction ldr r0, =string turns into a pc-relative load of a nearby word, since the assembler does not know the address of string at assemble time. We can link it with something simple like this:

MEMORY
{
    one : ORIGIN = 0x00001000, LENGTH = 0x1000
    two : ORIGIN = 0x00002000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > one
    .data : { *(.data*) } > two
}

Giving

Disassembly of section .text:

00001000 <main>:
    1000:   e52de004    push    {lr}        ; (str lr, [sp, #-4]!)
    1004:   e59f0008    ldr r0, [pc, #8]    ; 1014 <main+0x14>
    1008:   e3a00000    mov r0, #0
    100c:   e49de004    pop {lr}        ; (ldr lr, [sp], #4)
    1010:   e12fff1e    bx  lr
    1014:   00002000    andeq   r2, r0, r0

Disassembly of section .data:

00002000 <string>:
    2000:   6c6c6548    cfstr64vs   mvdx6, [ip], #-288  ; 0xfffffee0
    2004:   6f57206f    svcvs   0x0057206f
    2008:   21646c72    smccs   18114   ; 0x46c2
    200c:   Address 0x000000000000200c is out of bounds.

So you can see that I, as the programmer, chose where these things go, and you can also see that the address of the string (the word at 0x1014, after the code) has been filled in by the linker.

Clearly this is not an executable that we can expect to run; you need bootstrap code and a number of other things.

The address space is specific to the target. So while we, the programmers, control where things go, the operating system has rules for where things can go, for whether .data and .bss are set up by the OS or whether we have to do it in the bootstrap, etc. And of course the C library, if you choose to use it, is heavily connected to the operating system, as many calls end in a system call, and system calls are very specific to both the operating system (and version) and the target (processor/architecture). So the bootstrap, the C library and the linker script are inseparable; you cannot mix and match and expect much success.

If your toolchain has a C library installed and associated with it, and you then choose a different toolchain for the same computer/operating system/processor, do not assume that each linker script will use the exact same memory locations. They are free to choose, within the rules of the operating system, the address space for an application. (Also, obviously, assembly language is not expected to port from one toolchain to another on the same system, so you may have to make modifications, or just try a trivial C program with an initialized global, such as int x = 5; int main(void) { return(0); }, to see what the linker does.)

The binary format of the string is obvious: you specified it. As for where things go, the linker links the objects together per some rules that have to conform to the target, be it an operating system or a microcontroller address space, etc.

How does it know where to start and end? We covered the start above. As for the end: google; you are calling a C function and passing it a C string, so that covers that. You also specified the termination of the string in your code (the z in .asciz), so you pretty much already know how the end is defined.