I don't understand why peephole optimization is needed? Because the compiler is smart enough to optimise the code? Can you please give me some examples where peephole optimization is needed?
CodePudding user response:
Peepholes are often target-specific.
The may only make sense in terms of target registers (RTL), not IR.
For example e.g. x86 xor eax, eax
instead of mov eax,0
. (What is the best way to set a register to zero in x86 assembly: xor, mov or and?). There'd be no reason to do this in IR and doing it any earlier than the last moment (final code-generation) would obfuscate the fact that the value is zero for other optimizations. Doing that for any machine except x86 would be an anti-optimization (creating a false dependency). OTOH you don't want to leave it too late, or else you might not be able to reorder it ahead of something that sets FLAGS, e.g.
xor eax,eax
cmp ecx, edx
sete al ; boolean 0 or 1 zero-extended to 64-bit RAX
Instead of
cmp ecx, edx
sete al ; false dependency on old RAX
movzx eax, al ; no mov-elimination, extra critical path latency
or
cmp ecx, edx
mov eax, 0 ; less efficient instruction
sete al ; later reads of RAX will have partial-register stalls on P6-family
Or as another example, x86 can multiply by 3, 5, or 9 using LEA to take advantage of the 2-bit shift and add in 2-register addressing-modes. It might be useful for an optimizer to know that this is an efficient building-block, and aim to re-factor things into a multiply by 9, but actually converting a multiply by 10 into (x * 5) * 2
is not how you'd want to do it for targets where (x<<3) (x<<1)
is more efficient (x*10 = x*8 x*2
).
See
- Using LEA on values that aren't addresses / pointers?
- How to multiply a register by 37 using only 2 consecutive leal instructions in x86? - shows how some compiles sometimes miss a peephole optimization, and discusses the tradeoff of
imul
vs. 2xlea
and how modern CPUs with fastimul
make it only worth it to spend at most 2 instructions replacing a multiply, or only 1 if the bottleneck is throughput not latency. Unless you can fold an addition into it like LEA can...