Recently came across the legacy 0x66 operand size override.
Could it be used to align instructions without explicitly writing a single/multi-byte NOP instruction?
For example, adding an align 16 directive
int 3
mov rax,1
align 16
add rcx,rax
generates this disassembly
...1000 cc int 3
...1001 48c7c001000000 mov rax,1
...1008 0f1f840000000000 nop dword ptr [rax+rax] ; <--- multi-byte NOP instruction
...1010 4803c8 add rcx,rax ; <--- 16-byte aligned
Removing align 16 and instead prepending mov rax,1 with repeated (ignored) 0x66 prefix bytes
int 3
db 8 DUP (066h)
mov rax,1
add rcx,rax
generates this disassembly
...1000 cc int 3
...1001 666666666666666648c7c001000000 mov rax,1
...1010 4803c8 add rcx,rax ; <--- 16-byte aligned
Is the 0x66 alignment technique valid and faster than using align 16?
UPDATED
As suggested, it also works using the CS segment-override prefix 0x2e. Tested with NASM
nop
CSbuf: times 8 db 02eh
mov rax,strict dword 1
add rcx,rax
and the add rcx,rax was 16-byte aligned
00007ff7`f78d1010 4801c1 add rcx,rax
Built using these commands
nasm -fwin64 test.asm
link.exe /subsystem:console /machine:x64 /defaultlib:kernel32.lib /defaultlib:user32.lib /defaultlib:libcmt.lib /entry:main test.obj
Answer:
The basic idea is a good one: padding earlier instructions using prefixes is a cheaper way to align than using even one multi-byte NOP. (A NOP takes a slot in the decoders, in the uop cache, and in the issue/rename stage, the narrowest part of the pipeline, plus a ROB entry to track it until retirement.)
That's why Intel recommends lengthening other instructions in inner loops when working around the performance pothole introduced by their microcode update for the JCC erratum.
Assemblers always should have been doing this for align directives (or with some other mechanism to specify which instructions to pad), but nobody's been motivated enough until Intel's JCC erratum came with an official recommendation to do it this way. (Because unlike aligning tops of loops, the padding may have to be inside an inner loop, where it would cost front-end bandwidth every iteration if it was a NOP instead of part of other instructions.) Unfortunately this JCC-erratum mitigation has mostly been added as a separate feature by assemblers (e.g. GAS's -mbranches-within-32B-boundaries) or left to compilers, not as a general way to avoid wasting instructions on NOPs for alignment of other points.
Your specific choices aren't ideal
See What methods can be used to efficiently extend instruction length on modern x86? - more than 3 prefixes on one instruction can create big slowdowns for some CPUs, including Silvermont-family like the E-cores in Alder Lake. So spread this out over multiple instructions in the basic block before the position you want aligned. Prefer padding later instructions so the front-end can get more instructions decoded earlier, unless you have a lot of very-short instructions (like 1 or 2 bytes each) that are short enough that a 16-byte fetch block could still include more than 5 or 6. (In anticipation of future CPUs with wide legacy decode, if there aren't any already.)
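The "spread it out" idea can be sketched in a few lines of Python. This is a hypothetical helper, not part of any assembler: given a count of padding bytes and the number of instructions before the alignment point, it caps each instruction at 3 extra prefixes and prefers padding later instructions.

```python
# Hypothetical sketch: distribute `pad` prefix bytes across the
# instructions before an alignment point, at most `cap` extra
# prefixes per instruction (more than 3 prefixes on one instruction
# can be slow on e.g. Silvermont-family cores).
# Returns a list of per-instruction prefix counts, or None if the
# block can't absorb that much padding.
def distribute_padding(num_insns, pad, cap=3):
    if pad > num_insns * cap:
        return None  # not enough instructions to hide the padding
    counts = [0] * num_insns
    # Prefer padding later instructions, so earlier fetch blocks stay
    # dense and the front-end gets more instructions decoded sooner.
    i = num_insns - 1
    while pad > 0:
        take = min(cap, pad)
        counts[i] = take
        pad -= take
        i -= 1
    return counts

print(distribute_padding(3, 8))  # -> [2, 3, 3]
```

For example, eight bytes of padding before an aligned add would become 2+3+3 prefixes spread over the three preceding instructions, instead of eight prefixes on one.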
Using 7-byte mov rax, 1 in an assembler that doesn't optimize it to 5-byte mov eax, 1 is already one way to fill some bytes; 10-byte mov rax, strict qword 1 (NASM syntax; IDK if MASM can force an imm64) is another way to use more bytes. On Sandybridge-family, a 64-bit immediate fits efficiently in the uop cache (1 entry, without needing an extra cycle to read it) when the 64-bit value isn't huge, i.e. when it's just the sign-extension of a 32-bit value. (https://agner.org/optimize/microarchitecture.pdf - Sandybridge chapter)
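For reference, the three mov encodings discussed give these byte counts. A quick Python check, with the machine code written out by hand:

```python
# Hand-written x86-64 encodings for the three forms discussed:
mov_rax_imm32 = bytes.fromhex("48c7c001000000")        # mov rax,1 (REX.W + C7 /0, sign-extended imm32)
mov_eax_imm32 = bytes.fromhex("b801000000")            # mov eax,1 (B8+rd imm32; zero-extends into rax)
mov_rax_imm64 = bytes.fromhex("48b80100000000000000")  # mov rax, strict qword 1 (REX.W + B8+rd, imm64)

for name, enc in [("mov rax,1", mov_rax_imm32),
                  ("mov eax,1", mov_eax_imm32),
                  ("mov rax,imm64", mov_rax_imm64)]:
    print(name, len(enc), "bytes")   # 7, 5, and 10 bytes respectively
```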
ds or cs prefixes are a good choice, as they have a meaning for most opcodes, so it's unlikely that cs mov eax, 1 would be repurposed as the encoding for some different instruction (e.g. the way rep bsr is the encoding for lzcnt, which does something different). It's not impossible, especially for the no-modrm mov-to-register opcodes (mov r32,imm32 or mov r64,imm64), unlike the mov r/m64, sign_extended_imm32 you're using (https://www.felixcloutier.com/x86/mov).
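Concretely, a cs prefix just prepends 0x2e to the otherwise unchanged encoding; in 64-bit mode the segment override has no effect on this instruction, so it only lengthens it. A small byte-level check (encodings written by hand):

```python
# 2E is the cs segment-override prefix; in 64-bit mode it is ignored
# here, so it only lengthens the instruction by one byte.
plain  = bytes.fromhex("48c7c001000000")  # mov rax,1       (7 bytes)
padded = b"\x2e" + plain                  # cs mov rax,1    (8 bytes)
print(len(plain), len(padded))            # 7 8
```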
I wouldn't recommend using prefixes like 66h that could potentially cause LCP pre-decode stalls on Intel CPUs, and that even change the meaning of many instructions. (Without REX.W setting the operand-size to 64-bit, it would change mov eax, 0x00000001 to mov ax, 0x0001 with a 00 00 left over, which decodes as add [rax], al.)
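A byte-level sketch of that failure mode (encodings written by hand): prepending 66h to mov eax,1 without a REX.W shrinks the immediate to 16 bits, so the decoder re-splits the byte stream.

```python
stream = bytes.fromhex("66b801000000")  # intended: 66h + mov eax,1

# With the 66h operand-size prefix and no REX.W, opcode B8 takes an
# imm16, so the first instruction is only 4 bytes long:
first    = stream[:4]   # 66 B8 01 00 -> mov ax,0x0001
leftover = stream[4:]   # 00 00       -> add [rax], al
print(first.hex(), leftover.hex())      # 66b80100 0000
```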
I'm not 100% sure it's well defined on paper what's supposed to happen with both a 66h and a REX.W prefix (@fuz questioned this in comments). A 67h address-size prefix is generally fine in 64-bit mode, and a segment-override prefix is also good in 64-bit mode.
In practice on Skylake, the REX.W prefix wins and the 66h is ignored, and doesn't even cause false LCP stalls. But I wouldn't count on that on P6-family, and if it's not documented on paper what should happen with both 66h and REX.W, I'd worry about other vendors, or especially emulators and dynamic-translation software for that corner case.
Fun fact: Sandybridge-family doesn't LCP-stall on mov in general, but does on other instructions when a 66h prefix changes an opcode from having an imm32 to an imm16.
I just tried this, with NASM times 6 db 0x66 / mov rax, strict dword 1 (to match your encoding; NASM normally optimizes it to the architecturally equivalent mov eax,1). I put that inside a %rep 8000 block (to defeat the uop cache). It ran at 1.2 IPC on Skylake, with no counts for ild_stall.lcp in perf stat.
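The test body can be regenerated with a few lines of Python (a sketch of the source layout described above; the IPC and perf-counter numbers come from running the assembled result on Skylake, not from this snippet):

```python
# Emit the NASM source for the repeated test block described above.
# %rep 8000 makes the block too big for the uop cache, so it runs
# from the legacy decoders, where an LCP stall would show up.
body = "times 6 db 0x66\nmov rax, strict dword 1\n"
src = "%rep 8000\n" + body + "%endrep\n"
print(src)
```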
Even with add rax, strict qword 1 (to force the add rax, sign_extended_imm32 no-modrm encoding), Skylake doesn't LCP-stall, running at 1.0 IPC (bottlenecked on latency). Same for add rcx, strict qword 1, the imm32 encoding with a ModRM.
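The two add encodings compared there differ by one byte (encodings written out by hand):

```python
# add with rax as the destination has a special no-ModRM imm32 form
# (REX.W + 05 id); other registers need the ModRM form (REX.W + 81 /0 id).
add_rax = bytes.fromhex("480501000000")    # add rax, imm32 -> 6 bytes, no ModRM
add_rcx = bytes.fromhex("4881c101000000")  # add rcx, imm32 -> 7 bytes, with ModRM
print(len(add_rax), len(add_rcx))          # 6 7
```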
I wouldn't recommend a 66h prefix, but it happens not to break correctness (with a REX.W) or performance on Skylake. I didn't test on any other CPUs or emulators, and I'm not claiming this use of 66h is safe anywhere else. (Although it probably is on earlier Intel CPUs, at least for correctness if not performance.)