As part of a project that generates x86-64 machine code at runtime, I very often need to copy a bit from one register to another register at a different bit position.
I came up with this code (example that copies bit 23 of a source register to bit 3 of a destination):
bt eax, 23 ; test bit 23 of eax (source)
setc bl ; copy CF into bl (ebx is guaranteed to be zero beforehand)
shl ebx, 3 ; shift up to become our target bit 3
and ecx, 0xFFFFFFF7 ; remove bit 3 from ecx (target)
or ecx, ebx ; set new bit value
Given that I need five instructions just to copy one bit of one register to another bit of another register, I wondered: is there something that uses fewer instructions on x86?
I've read a bit on BMI instructions but unfortunately they do not offer bit extraction using immediates.
CodePudding user response:
Alternative:
rcr ecx, 3+1 ; rotate so that bit 3 of ECX ends up in CF
bt  eax, 23  ; replace CF with bit 23 of EAX
rcl ecx, 3+1 ; rotate back, inserting the new bit at position 3
CodePudding user response:
The number of instructions is not a good performance metric. Code-size in bytes (x86 machine instructions are variable-length), the number of front-end uops (after decoding), and/or latency and throughput on existing mainstream CPUs are the relevant optimization goals.
rcr / rcl are unfortunately very slow, especially with a count other than 1.
Four single-uop instructions are much faster, including a BMI2 rorx to copy-and-rotate the bit you want to insert into the right position in a tmp register, then and to isolate it. Or a normal shift/and if you don't need to preserve the input. That's more efficient than anything I've thought of that bounces it through CF.
This is more efficient than xor-zero / bt / setc / shl. It also avoids partial-register false dependencies or stalls: setc ebx doesn't exist, only setc bl (or setc bh). It also means that if you can destroy your input register instead of using a temporary, you don't need anything inefficient like setc bl / movzx ebx, bl, which would put the zero-extension on the critical path for latency and defeat mov-elimination.
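For example, a minimal sketch of the destroy-the-input variant (plain shr/and, no BMI2), assuming EAX is dead afterwards:

shr eax, 23-3    ; move bit 23 down to bit 3; EAX is clobbered
and eax, 1<<3    ; isolate it
and ecx, ~(1<<3) ; clear bit 3 in the destination
or  ecx, eax     ; merge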
I switched to EDX as a temporary because it's call-clobbered in normal calling conventions.
; input in EAX, output in ECX
; no pre-conditions necessary
; unlike the original, which doesn't count the cost of zeroing EBX
rorx edx, eax, 23-3 ; shift bit 23 to bit 3. Use SHR if you don't need the EAX value later
and edx, 1<<3 ; and isolate
and ecx, 0xFFFFFFF7 ; clear it in the destination
or ecx, edx ; and merge
; total size: 14 bytes of machine code for imm8 masks, 20 for imm32 masks
; 4 uops.
The or could instead be an add or lea if you want, because we know there are no overlapping 1 bits; lea is useful as a copy-and-merge in case you want the result in a different register. But if you want that, you'd just or into the temp reg instead of ECX, and you can choose any temp reg you want, including EAX (in which case you could optimize the rorx to shr). add would be useful in case you want to branch on FLAGS from it, since it can macro-fuse with some forms of jcc on Sandybridge-family. xor would also work, but has no advantages and isn't idiomatic for human readers.
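For instance, a sketch of the lea copy-and-merge (assumption: we want the merged result in EDX; ECX still has its bit cleared by the and):

rorx edx, eax, 23-3   ; bit 23 of EAX -> bit 3 of EDX
and  edx, 1<<3        ; isolate it
and  ecx, ~(1<<3)     ; clear bit 3 in ECX
lea  edx, [ecx + edx] ; merge into EDX; safe because no 1 bits overlap

Note that the and still modifies ECX; the lea only saves a separate mov for the final merge step.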
These are all single-uop instructions on Intel and AMD, and your destination bit-position was low enough that both AND masks fit in a sign-extended imm8, so the AND instructions are 3 bytes each (instead of 6 for and r/m32, imm32). rorx is 6 bytes, though, with a VEX prefix and imm8. The total size is 14 bytes, or 20 bytes if the destination bit is outside the low 7. (Or the low 8 if you use byte operand-size like and dl, 0x80 / ... / or cl, dl, which causes partial-register problems on P6-family but is fine elsewhere.)
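A sketch of that byte operand-size idea, with bit 7 as a hypothetical destination so both masks stay imm8:

rorx edx, eax, 23-7 ; bit 23 of EAX -> bit 7 of EDX
and  dl, 0x80       ; isolate bit 7
and  cl, 0x7F       ; clear bit 7 in ECX's low byte
or   cl, dl         ; merge; partial-register problems on P6-family, fine elsewhere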
(The instructions used in the question are also single-uop, including bt. On AMD CPUs, bts and so on are 2 uops, but bt is just 1.)
With a higher destination bit-number, you could save size using btr ecx, 30 (4 bytes, still 1 uop on Intel) instead of and ecx, ~(1<<30) (6 bytes, or 5 bytes into EAX). But that costs an extra uop on AMD.
Of course, if you care about code-size, you'd mov edx, eax / ror edx, 23-3 (5 bytes total) instead of using rorx (6 bytes). So that's a total of 17 bytes with a high destination bitpos, or 15 if we can destroy EAX.
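Putting those size savings together for a hypothetical high destination bit (bit 30 here), a sketch without BMI2:

mov edx, eax          ; 2 bytes; drop this if EAX can be destroyed
ror edx, (23-30) & 31 ; 3 bytes; rotate right by 25 = left by 7: bit 23 -> bit 30
and edx, 1<<30        ; 6 bytes; imm32 mask to isolate the bit
btr ecx, 30           ; 4 bytes; clear the destination bit (1 uop on Intel, 2 on AMD)
or  ecx, edx          ; 2 bytes; merge: 17 bytes total, 15 without the mov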
If the bit positions were runtime variables, this would be less efficient, needing a variable-count shift (and some subtraction or something to generate shift counts). A different strategy might be better there.
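To illustrate, one possible sketch for runtime-variable positions (assumptions: BMI1/BMI2 available; source bit index in ESI and destination bit index in EDI are hypothetical register assignments; EBX and EDX are scratch):

shrx edx, eax, esi ; source bit down to bit 0 of EDX (variable-count shift)
and  edx, 1        ; isolate it
shlx edx, edx, edi ; move it up to the destination position
mov  ebx, 1
shlx ebx, ebx, edi ; build the 1<<EDI mask
andn ecx, ebx, ecx ; BMI1: ECX = ~EBX & ECX, clearing the destination bit
or   ecx, edx      ; merge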
Another way to exchange bits between registers is with masked XORs (see the sketch after the related links below), but that's not more efficient here, where we don't want to swap, just go one way. And we can use the inverted mask as an immediate. (Or, if it's in a register, with BMI1 andn.) Related:
- Merge bit sequences a and b according to a mask
- How to use bitwise operations to assign specific bits of a byte according to two other bytes? (bit blend according to a mask)
- Swapping bits at a given point between two bytes
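For reference, a sketch of the masked-XOR bit-blend in its one-way form (assumption: the source bit has already been rotated to the destination position in EDX, e.g. by the rorx above):

mov ebx, edx  ; copy so we don't clobber EDX
xor ebx, ecx  ; EBX = ECX ^ EDX: set where the two registers differ
and ebx, 1<<3 ; keep only the difference at the selected bit
xor ecx, ebx  ; flip ECX's bit 3 iff it differed, i.e. copy in EDX's bit 3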
The main problem is that x86 lacks a bitfield insert. Extract with shift/and is easy enough, although that's still 2 instructions unless you have AMD TBM for the immediate form of bextr, using the XOP encoding (Bulldozer-family only). Or use pext if you already have a constant in a register. It would be neat if there was an inverse of bt that deposited CF at the specified position, but unfortunately there isn't.
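For completeness, a minimal sketch of the pext route (BMI2), with the mask loaded into a scratch register first:

mov  edx, 1<<23    ; single-bit mask selecting the source bit
pext ebx, eax, edx ; compress the selected bit of EAX down to bit 0 of EBX

That handles the extract side only; inserting still needs the shift/and/or steps above.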
If there was a bitfield insert instruction, either from CF or from the low bit of another register, you wouldn't need to mask.
Avoiding a temp reg without rcr/rcl - only slightly slower, still 4 uops
@rcgldr shows an interesting trick using rcr/rcl, which is good for code size but unfortunately slow on modern CPUs, where rcl and rcr r32, imm are many uops: e.g. 7 uops on Zen 3 with 3c throughput, and 8 uops on Sandybridge-family (including Alder Lake) with 6c throughput. (https://uops.info/ / https://agner.org/optimize/)
That's 10 bytes of machine code and doesn't need a temp register. We can duplicate the functionality with only 12 bytes of machine code, but still only 4 single-uop instructions, assuming a modern-enough x86. The above version will be faster on Intel Haswell and earlier.
This has a longer critical-path latency from ECX -> result (3 cycles), but the same for EAX -> result (3 cycles), assuming single-uop adc and rotates. Also, more of the uops compete for the shift units, so they can't run on as wide a choice of back-end ports. Whether this matters depends on the surrounding code.
It's the same code size even when the destination bit-position is > 8, and for 64-bit mode avoids needing any 8-byte masks.
If you don't have a spare register you can clobber (including the input), this is very likely worth it. Or just for code-size, if this isn't a really hot part of your code.
;; 12 bytes total. More latency through ECX, and some uops have fewer ports to choose from
ror ecx, 3+1 ; 1 uop on Intel HSW and later, and AMD
; the bit to be replaced is now at the top of ECX, and in CF
bt eax, 23 ; 1 uop
adc ecx, ecx ; 1 uop on Broadwell and later, and AMD.
; Shift in the new bit, shifting out the old bit (into CF in case you care)
rol ecx, 3 ; 1 uop on HSW and later, and AMD
; restore ECX's bits to the right positions, overwriting CF
The initial rotate-right can be rcr or ror; we don't care whether the bit we want to replace is shifted into the top bit temporarily, or only into CF. ror is much faster.
We basically emulate rcl ecx, 3+1 with rcl ecx, 1 and then rol ecx, 3. It differs in FLAGS output, I think, but matches in the ECX result and in how it reads FLAGS.
And then we replace rcl r32, 1 with the equivalent but faster adc same,same; they differ only in the FLAGS output. adc doesn't have the weird partial-flags writing (leaving most of SPAZO unaffected) that makes rotates more expensive on Intel. adc was 2 uops on Intel before Broadwell, but has been 1 uop on AMD for a long time.
This goes through FLAGS using bt, so it can easily support a runtime-variable source bit position. For a variable destination bit-position, you'd have to calculate shift counts, and ror reg, cl is slower (3 uops on Intel). Unfortunately there's no variable-count rorx, only shlx / shrx.
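For instance, a sketch of the variable-source-bit version of the sequence above (assumption: source bit index in EDX, destination still fixed at bit 3):

ror ecx, 3+1 ; rotate the destination bit to the top of ECX (and into CF)
bt  eax, edx ; CF = bit (EDX & 31) of EAX: bt accepts a register bit index
adc ecx, ecx ; shift the new bit in at the bottom
rol ecx, 3   ; rotate everything back into place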