I use memcpy() to write data to a device. With a logic analyzer/PCIe analyzer, I can see the actual stores.
My device gets more stores than expected.
For example:
auto *data = new uint8_t[1024]();
for (int i = 0; i < 50; i++) {
    memcpy((void *)(addr), data, i);
}
For i=9, I see these stores:
- 4B from byte 0 to 3
- 4B from byte 4 to 7
- 3B from byte 5 to 7
  - only 1B-aligned, re-writing the same data -> inefficient and useless store
- 1B for byte 8
In the end, all 9 bytes are written, but memcpy creates an extra 3B store that re-writes what it has already written and nothing more.
Is this the expected behavior? The question is for C and C++. I'm interested in knowing why this happens; it seems very inefficient.
CodePudding user response:
Is it the expected behavior?
The expected behavior is that it can do anything it feels like (including writing past the end, especially in a "read 8 bytes into a register, modify the first byte in the register, then write 8 bytes" way) as long as the result works as if the rules for the C abstract machine were followed.
Using a logic analyzer/PCIe analyzer to see the actual stores is so far beyond the scope of "works as if the rules for the abstract machine were followed" that it's unreasonable to have any expectations.
Specifically: you can't assume the writes will happen in any specific order, can't assume anything about the size of any individual write, can't assume writes won't overlap, can't assume there won't be writes past the end of the area, can't assume writes will actually occur at all (without volatile), and can't even assume that CHAR_BIT isn't larger than 8 (or that memcpy(dest, source, 10); isn't asking to write 20 octets/"8 bit bytes").
If you need guarantees about writes, then you need to enforce those guarantees yourself (e.g. maybe create a structure of volatile fields to force the compiler to ensure writes happen in a specific order, maybe use inline assembly with explicit fences/barriers, etc).
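As a minimal sketch of the "enforce it yourself" idea, assuming the device tolerates byte-wide stores: the helper name copy_to_device is hypothetical and not from the question. Each store goes through a volatile lvalue, so the compiler has to treat it as an observable access and, in practice, emits one byte store per iteration, in program order. Whether the CPU or the bus later combines those stores depends on the memory type of the mapping, which the language does not control.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the original question):
// copies n bytes using one volatile byte store per byte, in increasing
// address order. The compiler may not drop, merge, or reorder these
// volatile accesses relative to one another.
void copy_to_device(volatile std::uint8_t *dst,
                    const std::uint8_t *src,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];  // one volatile store per byte
    }
}
```

A wider variant (for example, 4-byte volatile stores plus a byte-wise tail) follows the same pattern if the device prefers larger accesses.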
CodePudding user response:
memcpy issues these instructions (described as pseudocode):
- Load four bytes from source 0 and store four bytes to destination 0.
- Load four bytes from source 4 and store four bytes to destination 4.
- Load four bytes from source 5 and store four bytes to destination 5.
The processor implements the store instructions:
- Since destination 0 is aligned, store 4 bytes to destination 0.
- Since destination 4 is aligned, store 4 bytes to destination 4.
- Since destination 5 is not aligned, store 3 bytes to destination 5 and store 1 byte to destination 8.
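The split seen on the analyzer can be modelled with a little arithmetic. This is only a sketch: split_store is a hypothetical name, and the 4-byte boundary is an assumption taken from the trace in the question, not a documented property of any particular CPU or bus.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: split a store of `size` bytes at byte `offset`
// into pieces that never cross a 4-byte boundary (assumed granularity).
// Each piece is returned as (offset, length).
std::vector<std::pair<std::size_t, std::size_t>>
split_store(std::size_t offset, std::size_t size) {
    std::vector<std::pair<std::size_t, std::size_t>> pieces;
    while (size > 0) {
        std::size_t room = 4 - (offset % 4);         // bytes left before the next boundary
        std::size_t len = size < room ? size : room; // take what fits
        pieces.emplace_back(offset, len);
        offset += len;
        size -= len;
    }
    return pieces;
}
```

With this model, a 4-byte store at offset 5 becomes a 3-byte piece at offset 5 followed by a 1-byte piece at offset 8, which matches the last two stores reported for i=9.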
This is an easy and efficient way to write memcpy:
- If length is less than four bytes, jump to separate code for that.
- Loop copying four bytes until fewer than four bytes are left.
- If length is not a multiple of four, copy four bytes from source length−4 to destination length−4.
That single step to copy the last few bytes may be more efficient than branching to three different cases, one for each possible number of leftover bytes.
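The three steps above can be sketched as follows. This is an illustration of the strategy only, not the code of any real C library; the name copy_overlapping_tail is hypothetical, and the fixed-size std::memcpy calls stand in for single 4-byte load and store instructions, which is how compilers typically lower them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a memcpy that finishes with one overlapping
// 4-byte copy instead of branching on the number of leftover bytes.
void *copy_overlapping_tail(void *dst, const void *src, std::size_t n) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);

    if (n < 4) {                          // short lengths: separate path
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
        return dst;
    }

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {          // whole 4-byte blocks
        std::uint32_t w;
        std::memcpy(&w, s + i, 4);        // 4-byte load
        std::memcpy(d + i, &w, 4);        // 4-byte store
    }

    if (i != n) {                         // 1 to 3 bytes remain
        std::uint32_t w;
        std::memcpy(&w, s + n - 4, 4);    // reload the last 4 bytes
        std::memcpy(d + n - 4, &w, 4);    // overlaps bytes already written
    }
    return dst;
}
```

For n = 9 this produces 4-byte copies at offsets 0, 4, and 5, the same sequence listed in the pseudocode above; bytes 5 to 7 are written twice with identical data.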