Fastest way to take the average of two signed integers in x86 assembly?

Suppose we have two register-length² signed¹ integers, say a and b. We want to compute the value (a + b) / 2, either rounded up, down, towards zero, or away from zero, whichever way is easier (i.e. we do not care about the rounding direction).

The result is another register-length signed integer (it is clear that the average must be within the range of a register-length signed integer).

What is the fastest way to perform this computation?

You may choose which registers the two integers will initially be in, and which register the average ends up being in.

Footnote 1: For unsigned integers, we can do it in two instructions. This is perhaps the fastest way, although rotate-through-carry is more than 1 uop on Intel CPUs (only a few uops when the count is 1). An answer on a Q&A about unsigned mean discusses the efficiency.

add rdi, rsi
rcr rdi, 1

The two numbers start in rdi and rsi, and the average ends up in rdi. But for signed numbers, -1 + 3 would set CF and rotate a 1 into the sign bit, which does not give the correct answer of 1.
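To make the failure concrete, here is a minimal C sketch that models the add/rcr pair. The function name avg_add_rcr is made up for illustration, and the carry flag is recomputed by hand since C has no direct access to it.

```c
#include <stdint.h>

/* Models `add rdi, rsi` / `rcr rdi, 1` on 64-bit registers.
   Hypothetical helper name; the carry flag is emulated by hand. */
uint64_t avg_add_rcr(uint64_t a, uint64_t b)
{
    uint64_t sum = a + b;               /* add: wraps modulo 2^64 */
    uint64_t carry = sum < a;           /* CF: 1 if the add wrapped */
    return (sum >> 1) | (carry << 63);  /* rcr by 1: CF becomes the top bit */
}
```

For unsigned inputs this returns the average rounded down. Feeding it -1 and 3 (reinterpreted as unsigned) yields 0x8000000000000001, which reads as a large negative number when interpreted as signed, not 1.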

Footnote 2: I specified register-length signed integers so that we can't simply sign extend the integers with a movsxd or cdqe instruction.

The closest I've got towards a solution uses four instructions, one of them an rcr that's 3 uops on Intel, 1 on AMD Zen (https://uops.info/):

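add rdi, rsi    ; sets SF and OF from the wrapped sum
setge al        ; al = 1 if SF == OF (the true sum is non-negative)
sub al, 1       ; CF = 1 exactly when al was 0, i.e. SF != OF
rcr rdi, 1      ; rotate CF in as the new sign bit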

I think a shorter solution probably lies in combining the middle two instructions in some way, i.e. performing CF ← SF ≠ OF.
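For reference, the intent of the four-instruction sequence can be sketched in C as follows. This is only a model under assumptions: the name avg_signed_rcr is hypothetical, SF and OF are recomputed by hand, and the final cast relies on the usual two's-complement conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* Models add / setge / sub / rcr: the bit rotated in is SF != OF,
   which is the sign of the exact (65-bit) sum. Hypothetical helper name. */
int64_t avg_signed_rcr(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;                         /* add: wrapped sum */
    uint64_t sf = sum >> 63;                        /* SF: sign of the wrapped sum */
    uint64_t of = ((ua ^ sum) & (ub ^ sum)) >> 63;  /* OF: signed overflow occurred */
    uint64_t top = sf ^ of;                         /* CF <- SF != OF */
    /* rcr by 1; the cast assumes two's-complement conversion */
    return (int64_t)((sum >> 1) | (top << 63));
}
```

In this model the result is rounded toward minus infinity, so avg_signed_rcr(-1, 3) is 1 and avg_signed_rcr(-3, 0) is -2.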

I've seen this question, but that's not x86-specific and none of the answers seem to compile to something as good as my solution.

CodePudding user response：

Depending on how we interpret your lax rounding requirements, the following may be acceptable:

sar rdi, 1
sar rsi, 1
adc rdi, rsi

This effectively divides both inputs by 2, adds the results, and adds 1 more if rsi was odd. (Remember that sar sets the carry flag according to the last bit shifted out.)

Since sar rounds to minus infinity, the result of this algorithm is:

exactly correct if rdi, rsi are both even or both odd
rounded down (toward minus infinity) if rdi is odd and rsi is even
rounded up (toward plus infinity) if rdi is even and rsi is odd

As a bonus, for random inputs, the average rounding error is zero.

It should be 3 uops on a typical CPU, with a latency of 2 cycles since the two sar are independent.
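The rounding cases listed above can be checked with a small C model of the sar/sar/adc sequence. This is a sketch, not the answer's own code: the name avg_sar_adc is invented here, and it assumes that >> on a negative int64_t is an arithmetic shift, which holds on mainstream compilers but is implementation-defined in ISO C.

```c
#include <stdint.h>

/* Models sar rdi,1 / sar rsi,1 / adc rdi,rsi. Hypothetical helper name.
   Assumes >> on a negative int64_t is an arithmetic shift. */
int64_t avg_sar_adc(int64_t a, int64_t b)
{
    int64_t carry = b & 1;               /* CF left by the second sar: low bit of b */
    return (a >> 1) + (b >> 1) + carry;  /* adc: sum of the halves plus CF */
}
```

Because each input is halved before the addition, the intermediate sum cannot overflow. In this model avg_sar_adc(3, 4) gives 3 (rounded down) while avg_sar_adc(4, 3) gives 4 (rounded up), matching the cases above.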

CodePudding user response：

As an outside answer, consider the pavg family of instructions.

I say "outside", since this is likely not acceptable to you. It assumes the value is unsigned 8-bit or 16-bit and in an SSE register, which of course also requires SSE. I mention it mainly since it is x86's anointed equivalent to averaging instructions in other ISAs.

In its defense, SSE is ubiquitous by now, even guaranteed on x86-64. Also, this instruction is 1 cycle, and actually can do 4 at once if you like. Best of all, unlike your original solutions, it also correctly handles overflow issues.

Note that it's possible to use an unsigned routine to implement a signed routine, though in general correctly accounting for overflow issues is a nightmare. Your current solution appears to already be broken in that regard, though.
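As a rough illustration of building a signed routine on an unsigned one, here is a scalar C model of a single pavgb lane, which computes (a + b + 1) >> 1 in a wider intermediate, plus one possible signed wrapper that flips the sign bit before and after. The wrapper is an illustrative bias trick and not something taken from this answer; both function names are made up, and the final cast assumes the usual two's-complement conversion.

```c
#include <stdint.h>

/* Scalar model of one pavgb lane: (a + b + 1) >> 1 with no overflow. */
uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Signed 8-bit average via the unsigned one. Flipping the sign bit adds
   a bias of 128 to both inputs; the average carries the same bias, so
   flipping it again removes it. Hypothetical helper, rounds toward +inf. */
int8_t avg_i8_via_pavgb(int8_t a, int8_t b)
{
    uint8_t ua = (uint8_t)((uint8_t)a ^ 0x80u);
    uint8_t ub = (uint8_t)((uint8_t)b ^ 0x80u);
    uint8_t avg = pavgb_lane(ua, ub);
    /* the cast assumes two's-complement conversion */
    return (int8_t)(uint8_t)(avg ^ 0x80u);
}
```

Under these assumptions avg_i8_via_pavgb(-1, 3) is 1 and avg_i8_via_pavgb(-3, 0) is -1. This only covers the 8-bit element size, so it does not address the register-length case the question asks about.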