In my quest to understand how structure padding works in C on Linux x86, I read that aligned access is faster than misaligned access. I think I understand the reasons given for that, but they all seem to presuppose that the CPU can't directly access addresses that are not a multiple of the bus width. For instance, if a CPU with a 32-bit bus were instructed to read 4 bytes of memory starting at address 2, it would first read 4 bytes starting at address 0 and mask off the first two bytes, then read another 4 bytes starting at address 4 and mask off the last two bytes, and finally combine the two results. If the data were 4-byte aligned, it could simply read all 4 bytes at once.
So, my question is this: Why is that presupposition true? Why can't the CPU read from addresses that are not a multiple of the bus width?
CodePudding user response:
Technically, nothing prevents you from building a machine that can access any address on the memory bus. It's just that everything is much simpler if every address is a multiple of the bus size.
This means that, for example, to make a 32-bit memory, you can just take four 8-bit memory chips and connect each one to a quarter of the data bus, with all of them sharing the same address bus. When you send a single address, all 4 chips read or write their corresponding 8 bits, which together form a 32-bit word. Notice that to do this, you ignore the lowest 2 address bits, since a single access already gives you 32 bits. This essentially forces the address to be 32-bit aligned.
Addr  | bus addr | CHIP0 CHIP1 CHIP2 CHIP3 | Value read
@0x00 => 000000b   0x59  0x32  0xaa  0xba    0x5932aaba
@0x04 => 000001b   0x84  0xff  0x51  0x32    0x84ff5132
To access non-aligned words in such a configuration, you would need to send two different bus addresses to the 4 chips, since two of the chips might hold your value at bus address 0 and the other two at bus address 1. That means you either need several address buses or have to make multiple accesses anyway.
Note that modern DRAM is obviously more complex, and accesses are made in units of whole cache lines, which are much bigger than the bus.
Overall, in most memory use cases, rounding the accesses to aligned addresses makes things simpler.