I'm using GCC 12.2 on Linux. I build with -nostdlib, and the compiler complained about missing memcpy and memmove. So I implemented a crude memcpy in assembly and made memmove call abort, since I always want to use memcpy.
I was wondering whether I could stop the compiler from asking for memcpy (and memmove) by implementing my own copy in C. The optimizer seems to recognize what the loop really is and emits a call to the library function anyway. However, since I had implemented it (using #define memcpy mymemcpy) and ran it, I saw my app abort: it called my memmove implementation instead of the assembly memcpy. Why is GCC calling memmove instead of memcpy?
Clang calls memcpy, but GCC optimizes my code better, so I use it for optimized builds.
__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
const unsigned char *s = (const unsigned char*)src;
unsigned char *d = (unsigned char*)dest;
while(size--) *d++ = *s++;
}
Reproducible
//dummy.cpp
extern "C" {
void*malloc() { return 0; }
int read() { return 0; }
int write() { return 0; }
int memcpy() { return 0; }
int memmove() { return 0; }
}
//main.cpp
#include <unistd.h>
#include <cstdlib>
struct MyVector {
void*p;
long long position, length;
};
__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
const unsigned char *s = (const unsigned char*)src;
unsigned char *d = (unsigned char*)dest;
while(size--) *d++ = *s++;
}
//__attribute__ ((noinline))
int func(const char*file_from_disk, MyVector*v)
{
if (v->position + 5 <= v->length) {
mymemcpy(v->p, file_from_disk, 5);
}
return 0;
}
char buf[4096];
extern "C"
int _start() {
MyVector v{malloc(1024),0,1024};
v.position = read(0, v.p, 1024-5);
int len = read(0, buf, 4096);
func(buf, &v);
write(1, v.p, v.position);
}
g++ -march=native -nostdlib -static -fno-exceptions -fno-rtti -O2 main.cpp dummy.cpp
Check using objdump -D a.out | grep call
401040: e8 db 00 00 00 call 401120 <memmove>
40108d: e8 4e 00 00 00 call 4010e0 <malloc>
4010a3: e8 48 00 00 00 call 4010f0 <read>
4010ba: e8 31 00 00 00 call 4010f0 <read>
4010c5: e8 56 ff ff ff call 401020 <_Z4funcPKcP8MyVector>
4010d5: e8 26 00 00 00 call 401100 <write>
402023: ff 11 call *(%rcx)
CodePudding user response:
An exact answer requires diving into the code transformations that GCC performs and looking at how your code is transformed by GCC. That's beyond what I can do in a reasonable amount of time, but I can show you what's going on in more general terms, without diving into GCC internals.
Here's the crazy part: if you remove inline, you get memcpy. With inline, you get memmove. I'll show the results on Godbolt and then talk about how compilers work to explain it.
The Code
Here's some test code I put on Godbolt.
__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
extern inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
const unsigned char *s = (const unsigned char*)src;
unsigned char *d = (unsigned char*)dest;
while(size--) *d++ = *s++;
}
void test(void *dest, const void *src, int size)
{
mymemcpy(dest, src, size);
}
Here's the resulting assembly
mymemcpy:
test edx, edx
je .L1
mov edx, edx
jmp memcpy
.L1:
ret
test:
test edx, edx
je .L4
mov edx, edx
jmp memmove
.L4:
ret
As you can see, the same function is converted to either memcpy or memmove. This is not two similar pieces of code; it is a single function that gets transformed differently depending on whether or not it is inlined. Why?
How Optimization Passes Work
You might think of a C compiler as doing something like this:
Preprocess and tokenize source files,
Parse to create AST,
Type check,
Optimize,
Emit code.
In reality, that "optimization" item is many different passes through the code, and each of those passes modifies the code in a different way. These passes happen at different times during compilation, and some optimization passes may run multiple times.
The order in which specific optimization passes occur affects the results. If you perform optimization X and then optimization Y, you get a different result from doing Y and then X. Maybe one transformation propagates information from one part of the program to another, and then a different transformation acts on that information.
Why is this relevant here?
You can see here that src and dest are restrict pointers. Since these pointers are restrict, GCC "should" be able to know that memcpy is acceptable and memmove is not necessary.
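For context on why the distinction matters: memmove is specified to work when the source and destination regions overlap, while memcpy requires that they do not. A minimal sketch of a case that needs memmove (the helper name shift_right_by_one is made up for illustration):

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: shift the first n bytes of a string one position to
// the right, in place. Source and destination overlap, so memmove is
// required; calling memcpy here would be undefined behavior.
std::string shift_right_by_one(std::string text, std::size_t n)
{
    assert(n + 1 <= text.size());          // need room for the shifted bytes
    std::memmove(&text[1], &text[0], n);   // overlapping regions are allowed
    return text;
}
```

With restrict on both parameters, the programmer promises that this kind of overlap never happens, which is what would let a compiler pick memcpy.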
However, that means that the information that src and dest are restrict pointers must be propagated to the loop which is ultimately transformed into memmove or memcpy, and that information must be propagated before the transformation takes place. You could easily first transform the loop into memmove and then, later, figure out that the arguments are restrict, but by then it's too late!
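The "too late" scenario can be sketched with a toy model. Everything here (CopyLoop, propagate_restrict, lower_copy_loop, run_passes) is hypothetical and only illustrates order sensitivity; it is not how GCC represents code or schedules its passes.

```cpp
#include <cassert>
#include <string>

// Toy model only: not GCC's real internal representation or pass names.
struct CopyLoop {
    bool params_are_restrict;  // what the source code declared
    bool known_no_overlap;     // what the optimizer has learned so far
    std::string lowered_to;    // empty until the loop is replaced by a call
};

// Pass 1: carry the restrict annotation over to the loop.
void propagate_restrict(CopyLoop &loop)
{
    if (loop.params_are_restrict) loop.known_no_overlap = true;
}

// Pass 2: replace the loop with a library call, using only the facts
// known at this moment. A lowered loop is never revisited.
void lower_copy_loop(CopyLoop &loop)
{
    if (!loop.lowered_to.empty()) return;
    loop.lowered_to = loop.known_no_overlap ? "memcpy" : "memmove";
}

// Run the same two passes in either order on the same input.
std::string run_passes(bool propagate_first)
{
    CopyLoop loop{true, false, ""};
    if (propagate_first) {
        propagate_restrict(loop);
        lower_copy_loop(loop);
    } else {
        lower_copy_loop(loop);
        propagate_restrict(loop);  // fact arrives after the decision
    }
    assert(!loop.lowered_to.empty());
    return loop.lowered_to;
}
```

In this model, run_passes(true) yields "memcpy" and run_passes(false) yields "memmove", even though the input and the set of passes are identical.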
It looks like, somehow, the information that src and dest are restrict is getting lost when the function is inlined. This gives us a few different theories for why this might happen:
Maybe the propagation of restrict is somehow broken after inlining, due to a bug.
Maybe GCC infers restrict from the calling function after inlining, under the assumption that the calling function has more context than the function being inlined.
Maybe the optimization passes don't happen in the right order here for the restrict to propagate to the loop. Maybe that information propagates, and then inlining is performed afterwards, and then the loop optimization happens after that.
Optimization passes (code transformation passes) are sensitive to reordering, after all. This is an extremely complicated area of compiler design.
Disabling The Optimization
Use -fno-tree-loop-distribute-patterns, or use a pragma:
#pragma GCC optimize ("no-tree-loop-distribute-patterns")
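If you would rather scope this to a single function, GCC also accepts the same option through its optimize function attribute. The sketch below is an assumption-laden illustration: copy_bytes and check_copy_bytes are made-up names, the attribute is GCC-specific (other compilers may ignore it with a warning), and whether the loop stays a loop after inlining into a caller is worth confirming with objdump on your own build.

```cpp
#include <cassert>
#include <cstddef>

// GCC-specific: ask the optimizer not to turn loops in this function into
// memcpy/memmove/memset calls. The name copy_bytes is hypothetical.
__attribute__((optimize("no-tree-loop-distribute-patterns")))
void copy_bytes(void *__restrict__ dest, const void *__restrict__ src,
                std::size_t size)
{
    const unsigned char *s = static_cast<const unsigned char *>(src);
    unsigned char *d = static_cast<unsigned char *>(dest);
    while (size--) *d++ = *s++;  // plain forward byte copy
}

// Hosted-only sanity check; not part of a -nostdlib build.
void check_copy_bytes()
{
    const char src[5] = {'h', 'e', 'l', 'l', 'o'};
    char dst[5] = {};
    copy_bytes(dst, src, sizeof src);
    for (std::size_t i = 0; i < sizeof src; ++i) assert(dst[i] == src[i]);
}
```

GCC's documentation describes the optimize attribute as intended mainly for debugging, so the command-line flag is probably the safer choice for a whole translation unit.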
CodePudding user response:
Simply use the -fno-builtin command-line option.