Why does my C code call memmove (instead of memcpy)

I'm using GCC 12.2 on Linux. I build with -nostdlib, and the compiler complained that memcpy and memmove were missing. So I implemented a bad memcpy in assembly and made memmove call abort, since I always want to use memcpy.

I was wondering whether I could stop the compiler from asking for memcpy (and memmove) by implementing my own in C. The optimizer seems to recognize what the loop really is and emits a call to the C function anyway. When I ran the result (with #define memcpy mymemcpy in place), my app aborted: it called my memmove implementation instead of the assembly memcpy. Why is GCC calling move instead of copy?

Clang calls memcpy, but GCC optimizes my code better, so I use it for optimized builds.

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

Reproducible

//dummy.cpp

extern "C" {
void*malloc() { return 0; }
int read() { return 0; }
int write() { return 0; }
int memcpy() { return 0; }
int memmove() { return 0; }
}

//main.cpp
#include <unistd.h>
#include <cstdlib>
struct MyVector {
    void*p;
    long long position, length;
};

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

//__attribute__ ((noinline))
int func(const char*file_from_disk, MyVector*v)
{
    if (v->position + 5 <= v->length) {
        mymemcpy(v->p, file_from_disk, 5);
    }
    return 0;
}

char buf[4096];
extern "C"
int _start() {
    MyVector v{malloc(1024),0,1024};
    v.position += read(0, v.p, 1024-5);
    int len = read(0, buf, 4096);
    func(buf, &v);
    write(1, v.p, v.position);
}

g++ -march=native -nostdlib -static -fno-exceptions -fno-rtti -O2 main.cpp dummy.cpp

Check using objdump -D a.out | grep call

401040: e8 db 00 00 00          call   401120 <memmove>
40108d: e8 4e 00 00 00          call   4010e0 <malloc>
4010a3: e8 48 00 00 00          call   4010f0 <read>
4010ba: e8 31 00 00 00          call   4010f0 <read>
4010c5: e8 56 ff ff ff          call   401020 <_Z4funcPKcP8MyVector>
4010d5: e8 26 00 00 00          call   401100 <write>
402023: ff 11                   call   *(%rcx)

CodePudding user response：

An exact answer requires diving into the code transformations that GCC performs and looking at how your code is transformed by GCC. That's beyond what I can do in a reasonable amount of time, but I can show you what's going on in more general terms, without diving into GCC internals.

Here's the crazy part: If you remove inline, you will get memcpy. With inline, you get memmove. I'll show the results on Godbolt and then talk about how compilers work to explain it.

The Code

Here's some test code I put on Godbolt.

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
extern inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

void test(void *dest, const void *src, int size)
{
    mymemcpy(dest, src, size);
}

Here's the resulting assembly:

mymemcpy:
        test    edx, edx
        je      .L1
        mov     edx, edx
        jmp     memcpy
.L1:
        ret
test:
        test    edx, edx
        je      .L4
        mov     edx, edx
        jmp     memmove
.L4:
        ret

Yes, you can see that one function is getting converted to memcpy or memmove. This isn't two pieces of similar code; it's a single function that is transformed differently depending on whether or not it is inlined. Why?

How Optimization Passes Work

You might think of a C compiler as doing something like this:

Preprocess and tokenize source files,
Parse to create AST,
Type check,
Optimize,
Emit code.

In reality, that "optimization" item is many different passes through the code, and each of those passes modifies the code in different ways. These passes happen at different times during compilation, and some optimization passes may happen multiple times.

The order in which specific optimization passes occur affects the results. If you perform optimization X and then optimization Y, you get a different result from doing Y and then X. Maybe one transformation propagates information from one part of the program to another, and then a different transformation acts on that information.

Why is this relevant here?

You can see here that src and dest are both restrict pointers. Since they are restrict, GCC "should" be able to know that memcpy is acceptable and that memmove is not necessary.

However, that means that the information that src and dest are restrict pointers must be propagated to the loop which is ultimately transformed into memmove or memcpy, and that information must be propagated before the transformation takes place. You could easily first transform the loop into memmove and then, later, figure out that the arguments are restrict, but it's too late!
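For reference, the choice between the two calls only matters when the source and destination overlap, which is exactly what restrict promises will not happen. The following is a minimal sketch of that difference; forward_copy, shift_with_memmove and shift_with_loop are hypothetical names used only for illustration.

```c
#include <string.h>

/* Hypothetical helper: a plain forward byte loop like the question's
   mymemcpy, but without restrict, so overlapping arguments are permitted. */
void forward_copy(unsigned char *d, const unsigned char *s, int n)
{
    while (n--) *d++ = *s++;
}

/* Copy the first 4 bytes of buf to offset 2 (buf must hold at least
   6 bytes). The source and destination ranges overlap by 2 bytes. */
void shift_with_memmove(char *buf)
{
    /* memmove is defined for overlap: it behaves as if it copied
       through a temporary buffer. */
    memmove(buf + 2, buf, 4);
}

void shift_with_loop(char *buf)
{
    /* The forward loop rereads bytes it has already overwritten. */
    forward_copy((unsigned char *)buf + 2, (const unsigned char *)buf, 4);
}
```

Starting from "abcdef", shift_with_memmove leaves "ababcd" in the buffer, while shift_with_loop leaves "ababab". Calling memcpy with these overlapping arguments would be undefined behavior, which is why a compiler that cannot prove the regions are disjoint has to fall back to memmove.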

It looks like, somehow, the information that src and dest are restrict is getting lost when the function is inlined. This gives us a few possible theories for why this might happen:

Maybe the propagation of restrict is somehow broken after inlining, due to a bug.
Maybe GCC infers restrict from the calling function after inlining, under the assumption that the calling function has more context than the function being inlined.
Maybe the optimization passes don't happen in the right order here for the restrict to propagate to the loop. Maybe that information propagates, and then inlining is performed afterwards, and then the loop optimization happens after that.

Optimization passes (code transformation passes) are sensitive to reordering, after all. This is an extremely complicated area of compiler design.
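The ordering problem can be sketched with a toy model. This is not GCC's real intermediate representation or pass pipeline; every name below (toy_func, toy_inline, toy_recognize_copy and so on) is made up for illustration, and the assumption that inlining drops the no-alias fact is just one of the theories above.

```c
#include <stdbool.h>

/* Toy model only: not GCC's actual IR or passes. */
enum loop_form { BYTE_LOOP, CALL_MEMCPY, CALL_MEMMOVE };

struct toy_func {
    bool knows_no_alias;   /* "restrict" information is still attached */
    enum loop_form body;
};

/* Toy "inlining": assume the no-alias fact is dropped when the body is
   pasted into a caller whose own pointers are not restrict. */
void toy_inline(struct toy_func *f)
{
    f->knows_no_alias = false;
}

/* Toy "idiom recognition": replace the byte loop with a library call,
   choosing memcpy only if no-alias information is available right now. */
void toy_recognize_copy(struct toy_func *f)
{
    if (f->body == BYTE_LOOP)
        f->body = f->knows_no_alias ? CALL_MEMCPY : CALL_MEMMOVE;
}

/* Same two passes, two different orders. */
enum loop_form recognize_then_inline(void)
{
    struct toy_func f = { true, BYTE_LOOP };
    toy_recognize_copy(&f);
    toy_inline(&f);
    return f.body;
}

enum loop_form inline_then_recognize(void)
{
    struct toy_func f = { true, BYTE_LOOP };
    toy_inline(&f);
    toy_recognize_copy(&f);
    return f.body;
}
```

In this model, recognize_then_inline returns CALL_MEMCPY and inline_then_recognize returns CALL_MEMMOVE, even though both run the same two passes on the same input.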

Disabling The Optimization

Use -fno-tree-loop-distribute-patterns, or use a pragma:

#pragma GCC optimize ("no-tree-loop-distribute-patterns")
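If changing the flags for the whole translation unit is too broad, GCC's optimize function attribute can apply the same option to a single function. The following is a minimal sketch, assuming GCC; mymemcpy_noidiom is a hypothetical name. The GCC manual describes the optimize attribute as intended for debugging rather than production code, so check the generated assembly before relying on it.

```c
#include <stddef.h>

/* Sketch, GCC-specific: the optimize attribute applies
   -fno-tree-loop-distribute-patterns to this one function, so the byte
   loop should stay a loop instead of becoming a memcpy/memmove call.
   mymemcpy_noidiom is a hypothetical name. */
__attribute__((optimize("no-tree-loop-distribute-patterns")))
void mymemcpy_noidiom(void *__restrict__ dest, const void *__restrict__ src,
                      size_t size)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dest;
    while (size--) *d++ = *s++;
}
```

Whether the loop survives after this function is inlined into a caller compiled with different options is something to verify with objdump, as in the question.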

CodePudding user response：

Simply use the -fno-builtin command-line option.

https://godbolt.org/z/3Ys1s9jPr