Why is GCC emitting larger output for a bytewise copy vs memcpy?

Time:12-20

The following C11 program extracts the bit representation of a float into a uint32_t in two different ways.

#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

uint32_t f2i_memcpy(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

The output assembly, compiled with armgcc 10.2.1 (none eabi) is very different, even with the -Os or -O3 optimizations applied:

I'm compiling with: -mcpu=cortex-m4 -std=c11 -mfpu=fpv4-sp-d16 -mfloat-abi=hard

f2i_char:
  sub sp, sp, #16
  vstr.32 s0, [sp, #4]
  ldr r3, [sp, #4]
  strb r3, [sp, #12]
  ubfx r2, r3, #8, #8
  strb r2, [sp, #13]
  ubfx r2, r3, #16, #8
  ubfx r3, r3, #24, #8
  strb r2, [sp, #14]
  strb r3, [sp, #15]
  ldr r0, [sp, #12]
  add sp, sp, #16
  bx lr
f2i_memcpy:
  sub sp, sp, #8
  vstr.32 s0, [sp, #4]
  ldr r0, [sp, #4]
  add sp, sp, #8
  bx lr

Why isn't gcc generating the same assembly for both functions?

Godbolt example

CodePudding user response:

Avoid copying the data manually. Use memcpy. GCC knows this function very well and will not emit a call to it when none is needed. Pointer punning can also break the strict-aliasing rules.

In the none-eabi build linked below (software floating-point linkage), the memcpy version emits no copy code at all: the float argument arrives in r0 and the result is returned in r0, so no action is needed.
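As a minimal sketch of the memcpy idiom recommended above: the names float_bits and check_float_bits are hypothetical and not part of the original code, and the expected bit patterns assume that float uses the IEEE 754 binary32 format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h> /* memcpy */

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: returns the object representation of f. */
static uint32_t float_bits(float f) {
  uint32_t x;
  /* Fixed-size copy; optimizing compilers usually turn this into a plain move. */
  memcpy(&x, &f, sizeof x);
  return x;
}

/* Hypothetical self-check, assuming IEEE 754 binary32 floats. */
static void check_float_bits(void) {
  assert(float_bits(0.0f) == 0x00000000u);
  assert(float_bits(1.0f) == 0x3F800000u);
  assert(float_bits(-2.0f) == 0xC0000000u);
}
```

Because the size is a compile-time constant, the compiler is free to treat the call as a 4-byte load and store, which is what allows it to vanish in the listing that follows.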

https://godbolt.org/z/q8v39d737

#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

uint32_t f2i1(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

f2i_char:
        sub     sp, sp, #8
        ubfx    r1, r0, #8, #8
        ubfx    r2, r0, #16, #8
        ubfx    r3, r0, #24, #8
        strb    r0, [sp, #4]
        strb    r1, [sp, #5]
        strb    r2, [sp, #6]
        strb    r3, [sp, #7]
        ldr     r0, [sp, #4]
        add     sp, sp, #8
        bx      lr
f2i1:
        bx      lr

EDIT:

You use -mfloat-abi=hard, which passes float arguments in FPU registers (here s0), so FPU instructions are needed for any float-related operation, even a non-mathematical one. I usually use softfp, which uses hardware floating-point instructions with software floating-point linkage.

https://gcc.godbolt.org/z/z39qnvY1c

The output assembly, compiled with armgcc 10.2.1 (none eabi) is very different, even with the -Os or -O3 optimizations applied:

You copy byte by byte, and the compiler follows your code literally here. When you use memcpy, the compiler understands your intention and does not copy byte by byte. The additional floating-point instructions are needed because you use the hard-float ABI: the argument arrives in the FPU register s0, and GCC moves it to r0 through memory. With software floating-point linkage, both the float argument and the integer result are passed in r0.
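A small host-side sketch can confirm that the two versions differ only in the code generated for them, not in their result. The helper same_bits is hypothetical, and the literal bit pattern in the usage note assumes IEEE 754 binary32 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h> /* memcpy */

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Bytewise copy, as in the question. */
static uint32_t f2i_char(float f) {
  uint32_t x;
  char const *src = (char const *)&f;
  char *dst = (char *)&x;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return x;
}

/* memcpy version, as in the question. */
static uint32_t f2i_memcpy(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof(x));
  return x;
}

/* Hypothetical helper: returns 1 when both versions yield the same bits. */
static int same_bits(float f) {
  return f2i_char(f) == f2i_memcpy(f);
}
```

Calling same_bits(1.0f) returns 1, and both functions return 0x3F800000 for 1.0f, so any difference in the listings is purely a matter of what the optimizer recognizes.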

CodePudding user response:

Why is GCC emitting larger output with -Os than -O3 for this function on Cortex-M4?

Why not? Each option enables or disables specific optimization passes inside the compiler. There will always be cases where those decisions make -O3 produce smaller code than -Os.

Is there anything specific about the C11 standard or the Armv7E-M that's inhibiting gcc from emitting the smaller assembly at -Os?

No.

Is this gcc missing an optimization opportunity?

Yes, you could say that. But it may be on purpose: the optimization that would produce such code may cost too much compilation time, so it is disabled.
