I don't use strcpy; I implemented strlen, memset, memcpy and memmove with AVX2 so the linker would be happy. I compiled the libc and nostdlib builds the same way, except that I add -nostdlib -static
for the nostdlib build. What could make the nostdlib version ~6% slower? I also decorated the declarations with attributes such as __attribute__((nonnull)) and
__attribute__((access(write_only, 1)))
and so on.
The only thing I can think of is that maybe I messed up the AVX implementation, but my CPU has the same latency and throughput for aligned and unaligned loads/stores. I'm out of ideas. The profiler shows both builds look almost the same, just a bit of noise.
I'm not sure how I can create a minimal reproducible example.
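For reference, roughly what I mean looks like the sketch below (compiled with -mavx2). This is a simplified illustration, not my actual code: the name my_memcpy, the unaligned-only main loop and the byte tail are placeholders.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* In the real nostdlib build this symbol would be named memcpy so the linker
   picks it up; my_memcpy here just keeps the sketch self-contained. */
__attribute__((nonnull(1, 2), access(write_only, 1, 3), access(read_only, 2, 3)))
void *my_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* 32-byte unaligned loads/stores; on recent cores these cost the same as
       aligned ones as long as they don't split a cache line. */
    while (n >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        d += 32; s += 32; n -= 32;
    }

    /* Byte tail for the remaining 0..31 bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}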
CodePudding user response:
Even assuming your memcpy implementation is perfect, depending on your platform it's very unlikely to outperform the standard library's memcpy.
We once had an optimised memcpy function within our console build, and it would happily outperform the std version (given that we had control over the memory, we could ensure that all allocations were aligned to cache line boundaries, and we never had to worry about any remaining unaligned bytes). When we ported the code to PC, however, memcpy would always outperform our own implementation (even though it was doing more work).
I tried using the exact same source code as the stdlib version, with all compile flags set to max, and it was still slower than the std version (I even went so far as to generate a version that had identical machine code - same deal - memcpy was faster).
After investigating literally every avenue, it turned out that it was the L1 instruction cache causing the difference. On console, we effectively had one process running - our game. On PC, however, due to the frequency with which memcpy was being called in every device driver and app running on the machine, memcpy was almost always resident in the L1 instruction cache, on pretty much every core.
Our version of memcpy wasn't being called frequently enough within our game to ensure it lived in the L1 cache, and that was the difference.
Knowing that, we did manage to construct some artificial benchmarks where we could show that our implementation could match the performance of memcpy (memory bandwidth limitations meant any performance advantage of our implementation was negligible); however, in the larger context of our game, calling memcpy was always the best option...
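The kind of artificial benchmark I mean boils down to something like the sketch below (not our actual harness; the 64 MiB buffer size and repetition count are arbitrary). Note that a tight loop like this keeps whichever copy routine you call hot in the L1 instruction cache, so it deliberately hides the effect described above and mostly measures memory bandwidth.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (64u * 1024u * 1024u)  /* 64 MiB: large enough to be bandwidth-bound */
#define REPS 20

int main(void)
{
    char *src = malloc(SIZE), *dst = malloc(SIZE);
    if (!src || !dst) return 1;
    memset(src, 0xA5, SIZE);  /* touch the source so pages are actually mapped */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < REPS; i++)
        memcpy(dst, src, SIZE);          /* swap in the custom copy routine here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Reports bytes copied per second (actual bus traffic is roughly double). */
    printf("%.2f GB/s\n", (double)SIZE * REPS / secs / 1e9);

    free(src);
    free(dst);
    return 0;
}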