... or rather, why does not static_cast-ing slow down my function?
Consider the function below, which performs integer division:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = static_cast<long>(y) << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
This performs reasonably well, as one might expect.
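For a quick sanity check (a sketch; positive inputs only, per the disclaimers below), it behaves like integer division:

#include <cassert>

int Divide(int x, int y);  // as defined above

int main() {
    assert(Divide(100, 3) == 33);  // 100 / 3
    assert(Divide(7, 7) == 1);
    assert(Divide(5, 9) == 0);     // x < y gives 0
}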
However, if we remove the static_cast on line 3, like so:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = y << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but it shouldn't be far off) for pathological inputs where x is large and y is small. I was curious and wanted to look into why, so I tried digging into the assembly code. However, apart from the casting differences on line 3, I get the exact same output. Here's the line 3 output for reference (source):
With static_cast:
movsxd rax, dword ptr [rbp - 8]   ; load y, sign-extending it to 64 bits
mov ecx, dword ptr [rbp - 16]     ; load the shift count i
shl rax, cl                       ; 64-bit shift
mov qword ptr [rbp - 24], rax     ; store j
Without static_cast:
mov eax, dword ptr [rbp - 8]      ; load y (32-bit)
mov ecx, dword ptr [rbp - 16]     ; load the shift count i
shl eax, cl                       ; 32-bit shift
cdqe                              ; sign-extend eax into rax
mov qword ptr [rbp - 24], rax     ; store j
The rest is identical.
I'm really curious where the overhead is occurring.
EDIT: I've tested a bit further, and it looks like the while loop is where most of the time is spent, not when j is initialized. The additional cdqe instruction doesn't seem significant enough to account for the total increase in wall time.
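For concreteness, this is the sort of harness I mean; a hypothetical sketch, not my actual benchmark, with a deliberately pathological input (large x, small y):

#include <chrono>
#include <cstdio>

int Divide(int x, int y);  // either version above

int main() {
    auto t0 = std::chrono::steady_clock::now();
    int r = Divide(2000000000, 1);  // large x, small y
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("result=%d time=%lldus\n", r, static_cast<long long>(us));
}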
Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:
- I'm aware that shifting a 32-bit int by 32 or more bits is UB.
- I'm assuming only positive inputs.
- long is 8 bytes on my platform, so the shift doesn't overflow.
I'd like to know what might be causing the increased runtime, which the comments criticizing the above don't actually address.
CodePudding user response:
Widening after the shift reduces your loop to naive repeated subtraction
It's not the run-time of cdqe or movsxd vs. mov that's relevant; it's the different starting values for your loop, resulting in a different iteration count, especially for pathological cases.
Clang without optimization compiled your source exactly the way it was written, doing the shift on an int and then sign-extending the result to long. The shift-count UB is invisible to the compiler with optimization disabled because, for consistent debugging, it assumes variable values can change between statements, so the behaviour depends on what the target machine does with a shift count equal to the operand size.
When compiling for x86-64, that results in long j = (long)(y << 0);, i.e. long j = y;, rather than having those bits at the top of a 64-bit value.
x86 scalar shifts like shl eax, cl mask the count with &31 (except with 64-bit operand size, which masks with &63), so the shift used a count of 32 % 32 == 0. AArch64 would I think saturate the shift count, i.e. let you shift out all the bits.
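To illustrate the masking in well-defined C++: this sketch models what the hardware does with the count for a 32-bit shl (shl32 is a made-up name, not anything clang emits; the model is the hardware's behaviour, not the C++ abstract machine's):

#include <cassert>

// A 32-bit x86 shl masks the count to 5 bits,
// so a count of 32 behaves like a count of 0.
unsigned shl32(unsigned value, unsigned count) {
    return value << (count & 31);
}

int main() {
    assert(shl32(5, 32) == 5);   // 32 & 31 == 0: no shift at all
    assert(shl32(5, 1) == 10);   // ordinary counts are unaffected
}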
Notice that it does a 32-bit operand-size shl eax, cl and then sign-extends the result with cdqe, instead of doing a sign-extending reload of y and then a 64-bit operand-size shl rax, cl.
Your loop has a data-dependent iteration count
If you single-step with a debugger, you could see the local variable values accurately. (That's the main benefit of an un-optimized debug build, which is not what you should be benchmarking.) And you can count iterations.
while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
}
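Or instrument the loop instead of single-stepping; a sketch of your function with hypothetical outer/inner counters added:

#include <cstdio>

int DivideCounted(int x, int y) {
    int ret = 0, i = 32;
    long long j = static_cast<long long>(y) << i;  // or the UB version: y << i
    long long outer = 0, inner = 0;
    while (x >= y) {
        ++outer;
        while (x < j) ++inner, --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    // With the cast: outer + inner is at most a few dozen (one step per bit).
    // Without it (j == y): inner == 0 and outer == x / y.
    std::printf("outer=%lld inner=%lld\n", outer, inner);
    return ret;
}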
With j = y, if we enter the outer loop at all, then the inner loop condition is always false. So it never runs, j stays constant the whole time, and i stays constant at 32.
1<<32 again compiles to a variable-count shift with 32-bit operand-size, because 1 has type int. (1LL has type long long, and can safely be left-shifted by 32.) On x86-64, this is just a slow way to do ret += 1;.
x -= j; is of course just x -= y;, so we're counting how many subtractions to make x < y.
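Put together, the no-cast version behaves like this sketch (with the x86 masking of 1 << 32 written out explicitly; the function name is made up):

int DivideByRepeatedSubtraction(int x, int y) {
    int ret = 0;
    while (x >= y) {
        ret += 1;  // 1 << 32 executes as 1 << (32 & 31) == 1
        x -= y;    // j == y the whole time
    }
    return ret;    // correct quotient, reached one subtraction at a time
}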
It's well-known that division by repeated subtraction is extremely slow for large quotients, since the run time scales linearly with the quotient.
You do happen to get the right result, though. Yay.
BTW, long is only 32-bit on some targets like Windows x64 and 32-bit platforms; use long long or int64_t if you want a type twice the width of int. And maybe static_assert to make sure int isn't that wide.
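For example, a sketch of that guard plus the widened shift:

#include <cstdint>

// Guard against a platform where int is too wide for this trick:
static_assert(sizeof(std::int64_t) >= 2 * sizeof(int),
              "need a type twice the width of int");

std::int64_t shifted(int y, int i) {
    return static_cast<std::int64_t>(y) << i;  // i up to 32 is now in range
}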
With optimization enabled, I think the same things would still hold true: clang looks like it's compiling to similar asm, just without the store/reload. So it's effectively (de facto) defining the behaviour of 1<<32 to just compile to an x86 shift instruction.
But I didn't test; that's just from a quick look at the asm https://godbolt.org/z/M33vqGj5P and noting things like mov r8d, 1; shl r8d, cl (32-bit operand-size); add eax, r8d.
CodePudding user response:
I'm keeping this answer up for now as the comments are useful.
CodePudding user response:
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = y << i;
On most systems, the size of int is 32 bits or less. Left-shifting a signed integer by a number of bits equal to or greater than its width results in undefined behaviour. Don't do this. Since the program is broken, it's irrelevant whether it's slower or faster.
Sidenote: left-shifting a signed 32-bit integer by 31 or fewer bits may also be undefined, if the shift changes the leftmost (sign) bit due to arithmetic overflow.
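To make both rules concrete, a sketch of which shifts are in range (the commented-out lines are the problematic ones):

#include <cstdint>

int main() {
    long long ok = 1LL << 32;                             // fine: 64-bit shift, count in range
    std::int64_t j = static_cast<std::int64_t>(5) << 32;  // widen first, like the question's cast
    // int ub  = 5 << 32;  // UB: count >= the 32-bit width of int
    // int ub2 = 1 << 31;  // risky pre-C++20: shifts into the sign bit
    (void)ok; (void)j;
}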