I am looking for the fastest / most space efficient way of reducing a 64 bit register to a 32 bit register only retaining the zero / non-zero status of the 64 bit register.
For example:
// rax is either zero or non-zero
popcntq %rax, %rax
// eax will be zero if rax was zero, otherwise it will be non-zero
NOTE: It will not work to just use the 32 bit eax directly: if rax was, say, 2^61, the zero / non-zero status of eax is not the same as that of rax.
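To make the failure case concrete, here is a minimal C sketch of the two behaviours. The function names truncate_low32 and nonzero_status are made up for illustration; the second one only states the requirement (any 32-bit value that is zero exactly when the input is zero), not a particular instruction sequence.

```c
#include <stdint.h>

/* Hypothetical helper: models reading eax directly,
   i.e. keeping only the low 32 bits of rax. */
uint32_t truncate_low32(uint64_t rax) {
    return (uint32_t)rax;
}

/* Hypothetical helper: the property the question asks for.
   Returns some 32-bit value that is zero iff rax is zero. */
uint32_t nonzero_status(uint64_t rax) {
    return rax != 0;
}
```

With rax = 2^61 (1ULL << 61), truncate_low32 returns 0 even though the input is non-zero, while nonzero_status returns a non-zero value.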
At the moment I'm stuck on finding something better than popcntq (1c tput, 3c latency, 5 byte code size) that will work for all values.
Is there some better clever method?
Edit: Removed edits as they didn't add to the question.
CodePudding user response:
One option is
neg rax ; 48 F7 D8
sbb eax, eax ; 19 C0
Remember that neg sets flags like a subtract from zero, so it sets the carry flag iff rax is nonzero. And sbb of a register from itself yields 0 or -1 according to whether the carry was clear or set (thanks @prl for suggesting this in a comment).
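The flag reasoning above can be modelled in C as a rough sketch. The name neg_sbb_model is hypothetical, and the code only mirrors the arithmetic of the two instructions rather than executing them.

```c
#include <stdint.h>

/* Hypothetical model of `neg rax` followed by `sbb eax, eax`. */
uint32_t neg_sbb_model(uint64_t rax) {
    /* neg computes 0 - rax and sets CF = 1 iff rax was non-zero. */
    uint32_t carry = (rax != 0);
    /* sbb eax, eax computes eax - eax - CF, so the old value of eax
       cancels out and the result is -CF modulo 2^32. */
    return 0u - carry;
}
```

Under this model the result is 0 for rax = 0 and 0xFFFFFFFF (that is, -1) for every non-zero rax, including values such as 2^61 whose low 32 bits are all zero.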
It's still 5 bytes, and 2 uops instead of 1. But if my math is right, on Skylake you get 2 cycles latency instead of 3, and throughput of 2 per cycle instead of 1.
CodePudding user response:
Fewest uops (front-end bandwidth):
1 uop, latency 3c (Intel) or 1c (Zen).
Also smallest code-size, 5 bytes.
popcnt %rax, %rax # 5 bytes, 1 uop for one port
On most CPUs that have it at all, it's 3c latency, 1c throughput (1 uop for one port). Or 1c latency on Zen1/2/3 with 0.25c throughput. (https://agner.org/optimize/)
On Bulldozer-family before Excavator, popcnt r64 is 4c latency, 4c throughput. (32-bit operand-size has 2c throughput but still 4c latency.) Bobcat has quite slow microcoded popcnt.
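For readers wondering why popcnt preserves the zero / non-zero status at all, here is a small portable C sketch of what the instruction computes. The name popcount64_model is made up for illustration, and the loop is a software stand-in, not how the hardware does it.

```c
#include <stdint.h>

/* Hypothetical software model of a 64-bit popcnt. */
uint32_t popcount64_model(uint64_t x) {
    uint32_t count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

The count is 0 only when no bit is set, and it is at most 64, so it always fits in the low 32 bits with the zero / non-zero status intact.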
Lowest latency (assuming Haswell or newer so there's no partial-register effect when writing AL and then reading EAX, or a uarch with no P6 ancestry that doesn't rename partial regs):
2 cycle latency, 2 uops, 6 bytes. Also the smallest code-size if popcnt (5B) isn't available.
Nate's neg/sbb is about the same as this on Broadwell and later, but 1 byte shorter.
test %rax, %rax # 3B, 1 uop any ALU port
setnz %al # 3B, 1 uop p06 (Haswell .. Ice Lake)
AL is the low byte of EAX, so AL=1 definitely makes EAX non-zero for any non-zero RAX.
This will cost a merging uop when reading EAX on Sandybridge/Ivy Bridge. Core2 / Nehalem will stall for a couple cycles to insert that merging uop. Earlier P6-family like Pentium-M will fully stall until the setcc retires if a later instruction reads EAX. (Why doesn't GCC use partial registers?)
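As a sketch of the architectural result of the test / setnz pair (ignoring the partial-register timing effects just described), the value later read from EAX can be modelled in C. The name test_setnz_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of `test %rax, %rax` + `setnz %al`,
   returning what a later read of eax would see. */
uint32_t test_setnz_model(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;     /* low half of rax before setnz */
    uint32_t al  = (rax != 0);        /* setnz writes 0 or 1 into al */
    /* Writing al replaces bits 0..7 and leaves bits 8..31 unchanged. */
    return (eax & 0xFFFFFF00u) | al;
}
```

In this model the result is not normalized to 0 or 1: bits 8..31 keep whatever was in rax, but bit 0 is set for every non-zero input, and the whole value is 0 when rax is 0.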
If you want the upper 32 zeroed, BMI2 RORX to copy-and-shift:
2 uops, 2c latency, 8 bytes
rorx $32, %rax, %rdx # 6 bytes, 1 uop, 1c latency
or