Is accessing elements in 16-byte aligned contiguous memory with padding between elements faster than packing them tightly?

So imagine we had a custom memory allocator that we initialize with a size of 64 bytes, and whenever we want to construct a new element inside it, the allocator aligns that element's starting address to a 16-byte boundary.

For example: if the allocator allocated memory on the heap running from address 0x00 to 0x40, and we then allocate four 32-bit integers inside it, their addresses would be 0x00, 0x10, 0x20 and 0x30.
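
A minimal C++ sketch of such an allocator could look like the following; the class name PaddedArena and its construct helper are made up for illustration:

    #include <cstddef>
    #include <new>
    #include <utility>

    // Hypothetical bump allocator over a 64-byte buffer that rounds every
    // element's starting address up to the next 16-byte boundary.
    class PaddedArena {
        alignas(16) unsigned char buf_[64];
        std::size_t offset_ = 0;
    public:
        template <typename T, typename... Args>
        T* construct(Args&&... args) {
            offset_ = (offset_ + 15) & ~std::size_t{15};  // 16-byte round-up
            if (offset_ + sizeof(T) > sizeof(buf_)) return nullptr;
            T* p = ::new (buf_ + offset_) T(std::forward<Args>(args)...);
            offset_ += sizeof(T);
            return p;
        }
    };

Constructing four 32-bit integers in it yields offsets 0x00, 0x10, 0x20 and 0x30 into the buffer, as described above.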

Now let's imagine another custom allocator that also hands out 16-byte aligned memory. For example, if we ask for a size of 15, the allocation is rounded up so that the memory spans addresses 0x00 to 0x10.

But this time, instead of aligning each element's starting address to a 16-byte boundary, this allocator will put the elements right next to each other.

For the sake of our argument, let's again assume that we are trying to construct four 32-bit integers inside this allocator's buffer. This time, the starting addresses of these integers will be 0x00, 0x04, 0x08 and 0x0C.
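
A packed variant of the same sketch (same includes as above; PackedArena is likewise a made-up name) differs only in skipping the round-up:

    // Hypothetical packed bump allocator over a 16-byte aligned buffer.
    class PackedArena {
        alignas(16) unsigned char buf_[16];  // requested 15, rounded up to 16
        std::size_t offset_ = 0;
    public:
        template <typename T, typename... Args>
        T* construct(Args&&... args) {
            // No 16-byte round-up: elements are placed back to back.
            if (offset_ + sizeof(T) > sizeof(buf_)) return nullptr;
            T* p = ::new (buf_ + offset_) T(std::forward<Args>(args)...);
            offset_ += sizeof(T);
            return p;
        }
    };

Four 32-bit integers constructed here land at offsets 0x00, 0x04, 0x08 and 0x0C.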

My question is: which of these allocators will result in fewer CPU instructions (sorry if I am wrong about this; they may result in the same instructions), or which will yield faster read times, on a 64-bit system?
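
One way to answer this for concrete hardware is simply to measure. Below is a hedged micro-benchmark sketch; the Padded struct is invented here to model the one-int-per-16-bytes layout, and the numbers will depend on compiler flags and cache sizes:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Padded { std::int32_t v; char pad[12]; };  // one int per 16 bytes

    // Sum the same number of ints from a container and report the time.
    template <typename C, typename F>
    void time_sum(const char* label, const C& c, F get) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = 0;
        for (const auto& e : c) s += get(e);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: sum=%lld, %lld us\n", label, s, (long long)us);
    }

    int main() {
        const std::size_t n = std::size_t{1} << 22;  // larger than typical caches
        std::vector<std::int32_t> packed(n, 1);
        std::vector<Padded> padded(n, Padded{1, {}});
        time_sum("packed (stride 4) ", packed, [](std::int32_t x) { return x; });
        time_sum("padded (stride 16)", padded, [](const Padded& p) { return p.v; });
    }

On a typical 64-bit desktop the instruction sequences are identical, since both layouts keep each 32-bit integer naturally aligned; what differs is that the padded layout touches four times as many cache lines for the same number of elements.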

CodePudding user response:

A 16-bit Processor

Let's say I have a processor that loves to make 16-bit fetches.

When data is aligned on a 16-bit boundary, the processor only has to make one fetch:

   --- ---   
4 | a | b |  
   --- ---   

In one fetch, the processor can get both a and b (which are 8-bit quantities).

If we have data that is not on a 16-bit boundary, the processor would have to make two fetches, then do some bit shifting and masking to make a 16-bit word.

   --- ---   
2 |   | a |  
   --- ---   
4 | b |   |  
   --- ---   

In the above example, the 16 bits of data sit at memory location 3. In order to get the quantity ab, the processor has to make a fetch at address 2 (to get the a quantity), then make a fetch at address 4 (to get the b value). That is two fetches instead of one, not to mention the extra effort to combine a and b into one 16-bit word.
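
In code, what the hardware (or a trap handler) effectively does for the misaligned case looks something like this; a sketch assuming little-endian byte order, with load16_aligned standing in for one aligned bus fetch:

    #include <cstdint>
    #include <cstring>

    // Models one aligned 16-bit bus fetch (addr must be even here).
    static std::uint16_t load16_aligned(const unsigned char* mem, std::uint32_t addr) {
        std::uint16_t w;
        std::memcpy(&w, mem + addr, sizeof w);
        return w;
    }

    std::uint16_t load16(const unsigned char* mem, std::uint32_t addr) {
        if ((addr & 1u) == 0)
            return load16_aligned(mem, addr);              // aligned: one fetch
        std::uint16_t lo = load16_aligned(mem, addr - 1);  // e.g. word at address 2
        std::uint16_t hi = load16_aligned(mem, addr + 1);  // e.g. word at address 4
        // a is the high byte of the first word, b the low byte of the second.
        return static_cast<std::uint16_t>((lo >> 8) | (hi << 8));
    }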

The net effect is that your program is slowed down. Whether the efficiency loss is negligible is another matter.

A 32-bit Processor

A 32-bit processor is geared up to be most efficient at fetching 32-bit quantities. So when the processor is told to fetch data that is not on a 32-bit boundary, it has to make at least one extra fetch, a scenario similar to the 16-bit one above.

Some 32-bit processors may also be optimized to perform 8-bit fetches, so to fetch 3 bytes (3 × 8 bits), the processor would make three 8-bit fetches. (Although it could be wired to make one 32-bit fetch and then use bit shifting and masking to place the 8-bit quantities into separate registers.)
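
That shift-and-mask variant would look roughly like this (a sketch; little-endian byte numbering assumed):

    #include <cstdint>

    // One aligned 32-bit fetch, then shift/mask to split out three 8-bit
    // quantities into separate variables (i.e. registers).
    void split3(std::uint32_t word,
                std::uint8_t& b0, std::uint8_t& b1, std::uint8_t& b2) {
        b0 = static_cast<std::uint8_t>(word & 0xFF);          // byte 0
        b1 = static_cast<std::uint8_t>((word >> 8) & 0xFF);   // byte 1
        b2 = static_cast<std::uint8_t>((word >> 16) & 0xFF);  // byte 2
    }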

Memory Alignment and Address Decoding

Let's start off with the understanding that there is one address line for every power of 2 in the address range, i.e. one line per bit of the address. When accessing memory, the fundamental technique is to convert the address to binary and set the address lines accordingly.

However, routing all of these address lines around the circuit board gets complicated (and costly). Some platforms may eliminate the first address line (A0) for 16-bit-aligned memory, and the first and second lines (A0 and A1) for 32-bit-aligned memory; draw the picture and you will see that those lines are not needed for aligned fetches.

So, with memory aligned on 32-bit boundaries, accessing items that are not aligned requires multiple fetches, because the memory may not have all the address lines decoded. This slows down your program. Whether the slowdown is negligible is another matter.
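
As a rough illustration (assuming 32-bit rows with A0 and A1 not routed), dropping the low address lines means the memory only ever sees the word number, so a misaligned access has to be split across two rows:

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t addr = 0x06;         // not on a 32-bit boundary
        std::uint32_t row  = addr >> 2;    // all the memory can decode
        bool split = (addr & 0x3u) != 0;   // the low bits never reach the bus
        std::printf("32-bit access at 0x%02X decodes row %u%s\n",
                    (unsigned)addr, (unsigned)row,
                    split ? "; a second fetch from the next row is needed" : "");
    }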
