how does storing into and loading from memory work


I am working on a binary analysis project where I am building a lifter that translates assembly to LLVM. I have built a memory model, but I am a bit confused about how the ARM str and ldr instructions operate on that memory. So my question is: given a memory address such as 0000b8f0, into which I would like to store the 64-bit decimal value 20000000, does the str instruction store the entire 20000000 at address 0000b8f0, or does it split the value into bytes and store the first byte at 0000b8f0, the second byte at 0000b8f1, the third byte at 0000b8f2, and so on? The same goes for loading from an address (0000b8f0): does the ldr instruction fetch just the byte stored at 0000b8f0, or the full set of bytes from 0000b8f0-0000b8f4?

Sorry if my question is very basic, but I need to make sure I correctly implement the effects of str and ldr in my memory model.

CodePudding user response:

Logically¹, memory is an array of 8-bit bytes.

Word loads/stores access more than one byte at once, just like SIMD intrinsics in C, or like the opposite of ((char*)&my_int)[2] to load the 3rd byte of an int.

C's memory model was designed around a byte-addressable machine that supports wider accesses (like PDP-11 or ARM), so it's what you're used to if you understand how char* works in C for accessing the object-representation of other objects, e.g. why memcpy works.

(I didn't use a C example of pointing an int* at a char array because the strict-aliasing rule in C makes that undefined behaviour. Only char* is allowed to alias other types in ISO C. Asm has well-defined behaviour for accessing bytes of memory with any width, with any partial or full overlap with earlier stores, as does GNU C when compiled with -fno-strict-aliasing to disable type-based alias analysis / optimization.)
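For example, here is a minimal C sketch of peeking at the object representation through unsigned char* (nothing ARM-specific, just the byte-addressable view; the variable names are arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t x = 20000000;                  /* 0x01312D00 */
        unsigned char *p = (unsigned char *)&x; /* char* may alias any object */

        /* On a little-endian machine this prints 00 2d 31 01:
           the least-significant byte sits at the lowest address. */
        for (int i = 0; i < 4; i++)
            printf("byte +%d: %02x\n", i, p[i]);
        return 0;
    }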


str is a 32-bit word store; it writes all 4 bytes at once. If you were to do a byte load from 0000b8f1, ...2, or ...3, you'd get the 2nd, 3rd, or 4th byte, so str is equivalent to 4 separate strb instructions (with shifts to extract the right bytes), except for the obvious lack of atomicity and performance.

str always stores 4 bytes from a 32-bit register. If a register holds a value like 2, that means the upper bytes are all zero.

ARM can be big- or little-endian. I think modern ARM systems are most often little-endian, like x86, so the least-significant byte of a value is stored at the lowest address.
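For your lifter's memory model, a minimal little-endian sketch of the effect of str and ldr could look like this (mem, store32 and load32 are just illustrative names, not from any particular framework):

    #include <stdint.h>
    #include <stdio.h>

    /* toy flat memory: an array of bytes, indexed by address */
    static uint8_t mem[0x10000];

    /* effect of str: split the 32-bit register value into 4 bytes,
       least-significant byte at the lowest address (little-endian) */
    static void store32(uint32_t addr, uint32_t val) {
        mem[addr + 0] = (uint8_t)(val >> 0);
        mem[addr + 1] = (uint8_t)(val >> 8);
        mem[addr + 2] = (uint8_t)(val >> 16);
        mem[addr + 3] = (uint8_t)(val >> 24);
    }

    /* effect of ldr: reassemble 4 consecutive bytes into one 32-bit value */
    static uint32_t load32(uint32_t addr) {
        return (uint32_t)mem[addr + 0]
             | (uint32_t)mem[addr + 1] << 8
             | (uint32_t)mem[addr + 2] << 16
             | (uint32_t)mem[addr + 3] << 24;
    }

    int main(void) {
        store32(0xb8f0, 0x20000000);
        printf("%02x %02x %02x %02x\n",
               mem[0xb8f0], mem[0xb8f1], mem[0xb8f2], mem[0xb8f3]);   /* 00 00 00 20 */
        printf("0x%08x\n", (unsigned)load32(0xb8f0));                 /* 0x20000000 */
        return 0;
    }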


The byte at 0000b8f0 can't hold 20000000 on its own; a byte isn't that large, if that's what you're asking.

Note that 0000b8f4 is the low byte of the next word; it's a 4-byte-aligned address.

Also, storing an int64_t holding 20000000 would require two 32-bit stores, e.g. two str instructions, an ARMv8 stp storing a pair of 32-bit registers, or an stm store-multiple instruction with two registers. Or eight strb byte-store instructions.
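To make that concrete, here is a sketch of how the 64-bit value would be split into two 32-bit halves for two str instructions (assuming little-endian, so the low half goes to the lower address):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t v = 20000000;               /* 0x0000000001312D00 */
        uint32_t lo = (uint32_t)v;           /* 0x01312D00 -> [addr]     */
        uint32_t hi = (uint32_t)(v >> 32);   /* 0x00000000 -> [addr + 4] */

        /* two 32-bit stores, low half at the lower address */
        printf("str lo -> [addr]     : 0x%08x\n", (unsigned)lo);
        printf("str hi -> [addr + 4] : 0x%08x\n", (unsigned)hi);
        return 0;
    }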


Footnote 1: That's from a software PoV, not how memory controllers, data buses, or DRAM chips are physically organized. Or even caches; thus byte stores (and sometimes even loads) can be less efficient than whole words on ARM, even apart from moving only 1/4 or 1/8 the amount of data of an str or stp.

CodePudding user response:

If you are asking how the software works: from a high-level perspective, if you squint your eyes, yes. The address 0x0000b8f0 is the base address, and a value of 0x20000000 (reading your 20000000 as hex here for illustration) is stored as

0xb8f0 0x00
0xb8f1 0x00
0xb8f2 0x00
0xb8f3 0x20

But hardware is a completely different story. First off, you have many buses. In a core like an ARM you may have an internal L1 cache (which you may or may not have enabled), as well as an AHB/AXI/etc. bus that the chip vendor connects to. These are typically going to be 32 or 64 bits wide (external to the core, internal to the chip). So, assuming no MMU, this would be a single bus transaction at address 0xb8f0 with the data 0x20000000 or 0xXXXXXXXX20000000, where the Xes can be garbage, often stale, and not assumed to be zeros; why waste the gates? The caches, internal or external, are physically not going to be built from byte-wide components; they are likely either 32 or 64 bits wide plus parity or ECC, so 33 or 65 or 40 or 72 or whatever. Internal SRAMs being multiples of 8 bits wide is not a given; cell libraries come with hundreds of sizes and shapes (widths and depths) of SRAMs.

Reads on a bus are generally bus-width. So if you read a single byte (or think you are, from the software side), it may become a full 32- or 64-bit (or wider) read as it traverses the buses; all those bytes/bits come back, and the processor core itself isolates the byte it was interested in and does whatever the instruction wanted with it (ldrb, for example). Stores are different: if you want to store a byte, then the hardware needs to do exactly that, so some scheme is used, and for AXI/AHB etc. that is a byte mask. With a 32-bit bus there are 4 byte-mask/enable bits, one for each byte lane; with a 64-bit bus there are 8. So a store of a byte to address 0x0123 is essentially a write to 0x0120 with a byte mask of 0b1000 to indicate which byte lane is getting the data, with that byte on the proper byte lane (and the other bytes assumed garbage/stale, no expectations there).
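As a rough software model of that byte-lane scheme (the names and widths here are illustrative, not actual AXI/AHB signal definitions):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t bus_word[0x4000];   /* word-addressed storage behind the bus */

    /* 32-bit bus write with a 4-bit byte-enable mask: bit i of 'mask' set
       means byte lane i (bits 8*i..8*i+7 of 'data') is written; the other
       lanes of 'data' are ignored, so they can hold stale garbage. */
    static void bus_write32(uint32_t addr, uint32_t data, unsigned mask) {
        uint32_t *w = &bus_word[addr >> 2];   /* low 2 address bits are dropped */
        for (unsigned i = 0; i < 4; i++) {
            if (mask & (1u << i)) {
                *w &= ~(0xffu << (8 * i));
                *w |=  data & (0xffu << (8 * i));
            }
        }
    }

    int main(void) {
        /* byte store of 0xAB to 0x0123: the word at 0x0120, lane 3, mask 0b1000 */
        bus_write32(0x0123, 0xABu << 24, 0x8);
        printf("word at 0x0120: 0x%08x\n", (unsigned)bus_word[0x0120 >> 2]); /* 0xab000000 */
        return 0;
    }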

Assuming you have a cache (it does not matter which level), they are, as mentioned, ideally multiples of the width of the bus that feeds them, so for a 32-bit bus, 32 bits plus parity or ECC wide (33, 40, whatever). A store of a single byte then causes a read-modify-write, because the cache SRAM itself can only be read or written in 32- or 64-bit-wide transactions (and the lower address bits are stripped off to make the addressing word- or double-word-based). So the logic reads the whole word or double word, modifies the one byte you wanted to write, and writes that back into the SRAM. This is a performance hit; it depends on the overall architecture whether you can easily detect it and how much performance you lose (you might have so much other overhead, like an x86 does, that you never see it).
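A sketch of that read-modify-write, again purely illustrative rather than any real cache controller:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t cache_sram[0x4000];   /* SRAM readable/writable only 32 bits at a time */

    /* byte store into a word-wide SRAM: read the whole word,
       splice in the one byte, write the whole word back */
    static void cache_store_byte(uint32_t addr, uint8_t byte) {
        uint32_t word = cache_sram[addr >> 2];   /* read (low address bits stripped) */
        unsigned lane = addr & 3;                /* which byte within the word */
        word &= ~(0xffu << (8 * lane));          /* modify */
        word |=  (uint32_t)byte << (8 * lane);
        cache_sram[addr >> 2] = word;            /* write back */
    }

    int main(void) {
        cache_store_byte(0xb8f3, 0x20);
        printf("0x%08x\n", (unsigned)cache_sram[0xb8f0 >> 2]);   /* 0x20000000 */
        return 0;
    }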

A word-sized store in hardware is in no way, shape, or form equivalent to four byte-sized stores at four different addresses, and I have demonstrated this on this site at the request of another person answering this question. From a software perspective, though, yes: as shown above (assuming little-endian, or BE-8 big-endian), that is a functional equivalent. If you do a word write, you can do byte reads to access those bytes at byte-based addresses; if you do byte writes, you can do a word read and see the bytes from those individual transactions in that read.
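A quick standard-C way to see the byte-writes-then-word-read direction of that equivalence (assuming a little-endian host):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* four byte writes, as four strb instructions would do ... */
        uint8_t mem[4] = { 0x00, 0x00, 0x00, 0x20 };

        /* ... then one word read, as ldr would do, sees all of them */
        uint32_t r;
        memcpy(&r, mem, sizeof r);
        printf("0x%08x\n", (unsigned)r);   /* 0x20000000 on a little-endian host */
        return 0;
    }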

Also understand that a store multiple, stm, is not necessarily broken into individual 32-bit transactions. Be it a 32-bit or 64-bit bus, there is an overhead of a few clocks (if not more) for every transaction, where the processor side of the bus tells the chip/bus-controller side "I want to do a write", "okay, I am ready for your write", and you declare a length: the number of bus-width items you want to send. So an stmia sp!,{r0,r1,r2,r3} on a 32-bit bus would be a single transaction with 4 data cycles: a handshake, then four clocks on the data bus. For a 64-bit-wide bus that is not a given; this is why ARM now requires 64-bit alignment of the stack pointer. If the address is 64-bit aligned, then it is one transaction with a length of 2, so two clocks on the data bus. But if it is not 64-bit aligned, only 32-bit aligned, it is three transactions: one with a 32-bit value (byte-masked to cut the bus in half), then a 64-bit transaction, then another 32-bit transaction; a performance hit. If the processor supports unaligned accesses (as in not even 32-bit or 16-bit aligned), I have not seen such a core myself to know, but I would assume it is also more than one transaction.


So, purely from a software/compiler perspective: a word-sized store is a word-sized store, four bytes, a 32-bit number at a base address. It is FUNCTIONALLY equivalent to four byte-sized writes at base address +0, +1, +2, +3, but not performance-equivalent nor instruction-equivalent. Same goes for loads.

Many compiler authors will go further in their design to avoid byte-sized load/store instructions where possible. You have two choices for a byte/char-based

a = a + 1;


r0 = r0 + 1
r0 = r0 & 0xff;
or
shift left 24
arithmetic shift right 24

or

r0 = r0 + 1

With the latter, the assumption is that going into that add, the upper bits of r0 had already been sign- or zero-extended per the variable type (signed or unsigned), and despite it being a 32-bit register holding an 8-bit variable type, the value stays in its 32-bit representation. The compiler determines if/when it has to be narrowed to represent the actual width. Even with

a = a + 1;
*bptr = a;

With an 8-bit variable and a pointer to an 8-bit type, you can still do this using 32-bit registers on ARM.

 add r0,r0,#1
 strb r0,[r1]

As the store will ignore/mask the upper 24 bits of the register, no prep is needed.
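In a memory model, strb then reduces to something like this (illustrative names again, under the same little-endian toy-memory assumption as above):

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t mem[0x10000];

    /* effect of strb: only the low 8 bits of the register reach memory,
       so the upper 24 bits need no masking beforehand */
    static void store8(uint32_t addr, uint32_t reg) {
        mem[addr] = (uint8_t)reg;          /* the truncation is the mask */
    }

    int main(void) {
        store8(0xb8f0, 0xDEADBE07);        /* register whose upper bits are "junk" */
        printf("%02x\n", mem[0xb8f0]);     /* 07 */
        return 0;
    }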


Even shorter: writing 0x20000000 to 0xb8f0 with an str instruction is a single instruction and a single transaction out of the processor (or internally to the L1 cache). It is not four individual byte writes. Four byte writes are a functional equivalent if you choose to use strb four times (and also assuming no interrupt fires in between with an ISR trying to read this word-sized value), but you would have to do the four byte writes explicitly yourself.
