x86_64 best way to reduce 64 bit register to 32 bit retaining zero or non-zero status

Time:09-22

I am looking for the fastest / most space efficient way of reducing a 64 bit register to a 32 bit register only retaining the zero / non-zero status of the 64 bit register.

For example:

// rax is either zero or non-zero
popcntq %rax, %rax
// eax will be zero if rax was zero, otherwise it will be non-zero

NOTE: Just using the 32-bit eax directly will not work. If rax were, say, 2^61, eax would be zero even though rax is non-zero.
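To make the pitfall concrete, here is a small C sketch (not from the original question) that models plain truncation against the popcnt approach. The function names are made up for illustration, and the popcount is written as a portable loop rather than the actual instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: what you get by just reading eax (low 32 bits of rax). */
static uint32_t truncate_low32(uint64_t rax) {
    return (uint32_t)rax;
}

/* Hypothetical helper: portable model of popcnt, counting the set bits. */
static uint32_t popcount64(uint64_t rax) {
    uint32_t count = 0;
    while (rax != 0) {
        rax &= rax - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Self-check: truncation loses the non-zero status of 2^61, popcount keeps it. */
static void check_truncation_vs_popcount(void) {
    uint64_t x = (uint64_t)1 << 61;
    assert(truncate_low32(x) == 0);   /* wrong answer: rax was non-zero */
    assert(popcount64(x) == 1);       /* still non-zero */
    assert(popcount64(0) == 0);       /* zero stays zero */
}
```

Calling check_truncation_vs_popcount() from a main function runs the comparison. A 64-bit popcount is at most 64, so it always fits in 32 bits and is zero only when the input is zero.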

At the moment I'm stuck on finding something better than popcntq (1c throughput, 3c latency, 5 bytes of code) that will work for all values.

Is there some better clever method?

Edit: Removed edits as they didn't add to the question.

CodePudding user response:

One option is

neg rax         ; 48 F7 D8
sbb eax, eax    ; 19 C0

Remember that neg sets flags like a subtract from zero, so it sets the carry flag iff rax is nonzero. And sbb of a register from itself yields 0 or -1 according to whether the carry was clear or set (thanks @prl for suggesting this in a comment).
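The flag reasoning above can be sketched in portable C. This is only a model of the two instructions under the semantics just described, not the answer's own code; the name neg_sbb is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `neg rax` followed by `sbb eax, eax`. */
static uint32_t neg_sbb(uint64_t rax) {
    /* neg computes 0 - rax; the subtraction borrows (CF = 1) iff rax != 0. */
    uint32_t carry = (rax != 0);
    rax = 0 - rax;
    /* sbb eax, eax computes eax - eax - CF, which is 0 or 0xFFFFFFFF
       regardless of what eax held before. */
    uint32_t eax = (uint32_t)rax;
    eax = eax - eax - carry;
    return eax;
}

/* Self-check over zero, a small value, and a value with only a high bit set. */
static void check_neg_sbb(void) {
    assert(neg_sbb(0) == 0);
    assert(neg_sbb(1) == 0xFFFFFFFFu);
    assert(neg_sbb((uint64_t)1 << 61) == 0xFFFFFFFFu);
}
```

Note that the result depends only on the carry, so the low 32 bits of the negated value never matter.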

It's still 5 bytes, and 2 uops instead of 1. But if my math is right, on Skylake you get 2 cycles latency instead of 3, and throughput of 2 per cycle instead of 1.

CodePudding user response:

Fewest uops (front-end bandwidth):
1 uop, latency 3c (Intel) or 1c (Zen).
Also smallest code-size, 5 bytes.

popcnt  %rax, %rax         # 5 bytes, 1 uop for one port

On most CPUs that have it at all, it's 3c latency, 1c throughput (1 uop for one port). On Zen1/2/3 it's 1c latency with 0.25c throughput. (https://agner.org/optimize/)

On Bulldozer-family before Excavator, popcnt r64 is 4c latency, 4c throughput. (32-bit operand-size has 2c throughput but still 4c latency.) Bobcat has quite slow microcoded popcnt.


Lowest latency (assuming Haswell or newer so there's no partial-register effect when writing AL and then reading EAX, or a uarch with no P6 ancestry that doesn't rename partial regs):
2 cycle latency, 2 uops, 6 bytes. Also the smallest code-size if popcnt (5B) isn't available.

(Nate's neg/sbb is about the same as this on Broadwell and later, but 1 byte shorter.)
  test  %rax, %rax     # 3B, 1 uop any ALU port
  setnz %al            # 3B, 1 uop p06 (Haswell .. Ice Lake)

AL is the low byte of EAX, so AL=1 definitely makes EAX non-zero for any non-zero RAX.
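As a rough illustration of why writing only AL is enough, here is a C model of the test/setnz pair. It is a sketch under the semantics described above, with a hypothetical function name, and it says nothing about the partial-register costs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `test %rax, %rax` followed by `setnz %al`. */
static uint32_t test_setnz(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;       /* low 32 bits; test does not modify rax */
    uint32_t al = (rax != 0);           /* setnz writes 0 or 1 to AL only */
    return (eax & 0xFFFFFF00u) | al;    /* bits 8..31 of EAX are preserved */
}

/* Self-check: the result is zero exactly when rax is zero. */
static void check_test_setnz(void) {
    assert(test_setnz(0) == 0);
    assert(test_setnz((uint64_t)1 << 61) == 1);
    assert(test_setnz(0x100) == 0x101);
    assert(test_setnz(0xFF) == 1);
}
```

The result is not a clean 0/1 boolean, since bits 8..31 keep whatever EAX held, but it is zero exactly when RAX was zero, which is all the question asks for.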

This will cost a merging uop when reading EAX on Sandybridge/Ivy Bridge. Core2 / Nehalem will stall for a couple cycles to insert that merging uop. Earlier P6-family like Pentium-M will fully stall until the setcc retires if a later instruction reads EAX. (Why doesn't GCC use partial registers?)


If you want the upper 32 zeroed, BMI2 RORX to copy-and-shift:
2 uops, 2c latency, 8 bytes

rorx  $32, %rax, %rdx      # 6 bytes, 1 uop, 1c latency
or               