... or rather, why does not static_cast-ing slow down my function?
Consider the function below, which performs integer division:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = static_cast<long>(y) << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
This performs reasonably well, as one might expect.
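For a quick sanity check (a sketch; positive inputs only, per the disclaimers below), it behaves like integer division:

#include <cassert>

int Divide(int x, int y);  // as defined above

int main() {
    assert(Divide(100, 3) == 33);  // 100 / 3
    assert(Divide(7, 7) == 1);
    assert(Divide(5, 9) == 0);     // x < y gives 0
}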
However, if we remove the static_cast on line 3, like so:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = y << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but it shouldn't be far off) for pathological inputs where x is large and y is small. I was curious and wanted to look into why, so I tried digging into the assembly code. However, apart from the casting differences on line 3, I get the exact same output. Here's the line 3 output for reference (source):
With static_cast:
movsxd rax, dword ptr [rbp - 8]   ; load y, sign-extending it to 64 bits
mov ecx, dword ptr [rbp - 16]     ; load the shift count i
shl rax, cl                       ; 64-bit shift
mov qword ptr [rbp - 24], rax     ; store j
Without static_cast:
mov eax, dword ptr [rbp - 8]      ; load y (32-bit)
mov ecx, dword ptr [rbp - 16]     ; load the shift count i
shl eax, cl                       ; 32-bit shift
cdqe                              ; sign-extend eax into rax
mov qword ptr [rbp - 24], rax     ; store j
The rest is identical.
I'm really curious where the overhead is occurring.
EDIT: I've tested a bit further, and it looks like the while loop is where most of the time is spent, not when j is initialized. The additional cdqe instruction doesn't seem significant enough to account for the total increase in wall time.
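For concreteness, this is the sort of harness I mean; a hypothetical sketch, not my actual benchmark, with a deliberately pathological input (large x, small y):

#include <chrono>
#include <cstdio>

int Divide(int x, int y);  // either version above

int main() {
    auto t0 = std::chrono::steady_clock::now();
    int r = Divide(2000000000, 1);  // large x, small y
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("result=%d time=%lldus\n", r, static_cast<long long>(us));
}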
Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:
- I'm aware that shifting a 32-bit int by 32 or more bits is UB.
- I'm assuming only positive inputs.
- long is 8 bytes on my platform, so the shift doesn't overflow.
I'd like to know what might be causing the increased runtime, which the comments criticizing the above don't actually address.
CodePudding user response:
Widening after the shift reduces your loop to naive repeated subtraction
It's not the run-time of cdqe or movsxd vs. mov that's relevant; it's the different starting values for your loop, resulting in a different iteration count, especially for pathological cases.
Clang without optimization compiled your source exactly the way it was written, doing the shift on an int and then sign-extending the result to long. The shift-count UB is invisible to the compiler with optimization disabled because, for consistent debugging, it assumes variable values can change between statements, so the behaviour depends on what the target machine does with a shift count equal to the operand size.
When compiling for x86-64, that results in long j = (long)(y << 0);, i.e. long j = y;, rather than having those bits at the top of a 64-bit value.
x86 scalar shifts like shl eax, cl mask the count with &31 (except with 64-bit operand size, which masks with &63), so the shift used a count of 32 % 32 == 0. AArch64 would I think saturate the shift count, i.e. let you shift out all the bits.
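To illustrate the masking in well-defined C++: this sketch models what the hardware does with the count for a 32-bit shl (shl32 is a made-up name, not anything clang emits; the model is the hardware's behaviour, not the C++ abstract machine's):

#include <cassert>

// A 32-bit x86 shl masks the count to 5 bits,
// so a count of 32 behaves like a count of 0.
unsigned shl32(unsigned value, unsigned count) {
    return value << (count & 31);
}

int main() {
    assert(shl32(5, 32) == 5);   // 32 & 31 == 0: no shift at all
    assert(shl32(5, 1) == 10);   // ordinary counts are unaffected
}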
Notice that it does a 32-bit operand-size shl eax, cl and then sign-extends the result with cdqe, instead of doing a sign-extending reload of y and then a 64-bit operand-size shl rax, cl.
Your loop has a data-dependent iteration count
If you single-step with a debugger, you could see the local variable values accurately. (That's the main benefit of an un-optimized debug build, which is not what you should be benchmarking.) And you can count iterations.
while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
}
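Or instrument the loop instead of single-stepping; a sketch of your function with hypothetical outer/inner counters added:

#include <cstdio>

int DivideCounted(int x, int y) {
    int ret = 0, i = 32;
    long long j = static_cast<long long>(y) << i;  // or the UB version: y << i
    long long outer = 0, inner = 0;
    while (x >= y) {
        ++outer;
        while (x < j) ++inner, --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    // With the cast: outer + inner is at most a few dozen (one step per bit).
    // Without it (j == y): inner == 0 and outer == x / y.
    std::printf("outer=%lld inner=%lld\n", outer, inner);
    return ret;
}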
With j = y, if we enter the outer loop at all, then the inner loop condition is always false. So it never runs, j stays constant the whole time, and i stays constant at 32.
1<<32 again compiles to a variable-count shift with 32-bit operand-size, because 1 has type int. (1LL has type long long, and can safely be left-shifted by 32.) On x86-64, this is just a slow way to do ret += 1;.
x -= j; is of course just x -= y;, so we're counting how many subtractions to make x < y.
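Put together, the no-cast version behaves like this sketch (with the x86 masking of 1 << 32 written out explicitly; the function name is made up):

int DivideByRepeatedSubtraction(int x, int y) {
    int ret = 0;
    while (x >= y) {
        ret += 1;  // 1 << 32 executes as 1 << (32 & 31) == 1
        x -= y;    // j == y the whole time
    }
    return ret;    // correct quotient, reached one subtraction at a time
}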
It's well-known that division by repeated subtraction is extremely slow for large quotients, since the run time scales linearly with the quotient.
You do happen to get the right result, though. Yay.
BTW, long is only 32-bit on some targets like Windows x64 and 32-bit platforms; use long long or int64_t if you want a type twice the width of int. And maybe static_assert to make sure int isn't that wide.
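For example, a sketch of that guard plus the widened shift:

#include <cstdint>

// Guard against a platform where int is too wide for this trick:
static_assert(sizeof(std::int64_t) >= 2 * sizeof(int),
              "need a type twice the width of int");

std::int64_t shifted(int y, int i) {
    return static_cast<std::int64_t>(y) << i;  // i up to 32 is now in range
}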
With optimization enabled, I think the same things would still hold true: clang looks like it's compiling to similar asm, just without the store/reload. So it's effectively (de facto) defining the behaviour of 1<<32 to just compile to an x86 shift instruction.
But I didn't test; that's just from a quick look at the asm https://godbolt.org/z/M33vqGj5P and noting things like mov r8d, 1; shl r8d, cl (32-bit operand-size); add eax, r8d.
CodePudding user response:
I'm keeping this answer up for now as the comments are useful.
CodePudding user response:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = y << i;
On most systems, the size of int is 32 bits or less. Left-shifting a signed integer by a number of bits equal to or greater than its width results in undefined behaviour. Don't do this. Since the program is broken, it's irrelevant whether it's slower or faster.
Sidenote: left-shifting a signed 32-bit integer by 31 or fewer bits may also be undefined, if the shift changes the leftmost (sign) bit due to arithmetic overflow.
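To make both rules concrete, a sketch of which shifts are in range (the commented-out lines are the problematic ones):

#include <cstdint>

int main() {
    long long ok = 1LL << 32;                             // fine: 64-bit shift, count in range
    std::int64_t j = static_cast<std::int64_t>(5) << 32;  // widen first, like the question's cast
    // int ub  = 5 << 32;  // UB: count >= the 32-bit width of int
    // int ub2 = 1 << 31;  // risky pre-C++20: shifts into the sign bit
    (void)ok; (void)j;
}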