Which set of instructions has better performance?-CodePudding

I had an assignment from my professor and one part of it sparked a conversation on branchless programming. The goal was to convert this C code to MIPS assembly (assuming a and b were in registers $s0 and $s1, respectively) using the slt instruction.

if (a <= b)
  a = 0;
else
  b = 0;

The expected response was:

        slt $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
        bne $t0,$0,else    ; Branch on not-equal (jump to `else` if $t0 != $0)
        add $s0,$0,$0      ; $s0 = $0   $0, aka $s0 = $0 * 2
        j   done           ; Unconditional jump to `done`
else:   sub $s1,$0,$0      ; $s1 = $0 - $0, aka `$s1 = 0`
done:

However, I submitted the following branchless solution (from what I understand, branchless programming is preferred in terms of performance):

        slt  $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
        mul  $s0,$s0,$t0    ; $s0 = $s0 * $t0 (truncated to 32 bits)
        xori $t0,$t0,0x1    ; $t0 = XOR( $t0, 1 )
        mul  $s1,$s1,$t0    ; $s1 = $s1 * $t0 (truncated to 32 bits)

I understand this is a small case wherein any performance gain would be negligible, but I am wondering if the line of thinking I used was on the right track.

My professor pointed out that the mul instruction is complex (read: hardware intensive), thus (if implemented in hardware) any performance gains made by using less instructions would ultimately be lost due to that complexity. Is this true?

CodePudding user response：

Its impossible to say as you haven't specified a MIPS implementation, which there have been a lot of over the years.

However, multiplication can be more expensive, especially on earlier MIPS processors, where each could cost two (or three) cycles.

An alternative along the lines of unconditional control flow is to subtract one from the slt and use and instead of mul, later the xor with -1. That will cost one extra instruction over the mul version, so unclear as to which will be faster, but does involve "simpler" instructions.

    slt  $t0,$s1,$s0    ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
    subi  $t0, $t0, 1   ; 1/true->0, 0/false->-1
    and  $s1,$s1,$t0    ; $s1 &= $t0-1        mask is either 0 or -1
    xori $t0,$t0,-1     ; $t0 = ~ $t0
    and  $s0,$s0,$t0    ; $s1 &= ~ (t0-1)     mask is either -1 or 0

Whether this is faster than the conditional logic depends on the processor, and the data set, assuming this is in a loop (if not in a loop at some level, then the question is of lessor importance).

The data set will determine whether the branch prediction is effective or not. Each time the branch prediction is wrong, it will cost a couple of cycles (of course depending on the processor implementation). If the branch prediction is 100% effective, then the conditional logic would probably be best. But if the data set defies branch prediction then several cycles of overhead per execution might be incurred, so the unconditional logic might be best.

It would be nice if processors had a few extra opcodes so that the subtraction of 1 was unnecessary, then also the inversion for the else part. In such hypothetical case, we would have:

slt $t0, $s1, $s0
sameIfTrueOrZero  $s0, $s0, $t0   # $s0 = ($t0 ? $s0 : 0)
sameIfFalseOrZero $s1, $s1, $t0   # $s1 = ($t0 ? 0 : $s1)

These would be simple instructions for hardware to implement — they would not tax the ALU, or instruction and operand encodings.

Some more advanced processors may be able to fuse operations, or dual issue, on some of the above so that the execution cost might come down.