unshifted register required - Assembler throws error on the TST instruction-CodePudding

I am currently rewriting an algorithm from C to arm assembly (ARM Cortex M4 CPU).

What does my code do?

This algorithm takes an 8-bit number as input and starting from the right tells us what is the first bit that’s 0. Here are a few examples:

Input: B01111111 Output:7

Input: B01110111 Output:3

Input: B11111110 Output:0

Here is the original C code that accomplished this:

uint8_t find_empty(uint32_t input_word)
{
  for (uint8_t searches=7; searches>=0; searches--)
  {
    if ((input_word&1)==0)
    {
      return 7-searches;
    }
    
    input_word=input_word>>1;
  }
  return 255;
}

And here is my beginner attempt at rewriting this in ARM (Cortex M4) assembly.

.global findEmpty
findEmpty:
    mov r1, r0 //Move input_word to r1
    
    //Config
    mov r0, #7 //search through 8 (7 1) bits. <-searches

    FindLoop:
      tst r1, #1 //ANDs input_word with 1, sets the Z flag accordingly.
      
      beq NotFoundYet //didn't get a 0, jump forward
        rsb r0, r0, #7 //searches=7-searches <- which bit is 0? 
        bx lr //Return found bit number
      
      NotFoundYet:
      lsr r1, r1, #1 //input_word=input_word>>1

      sub r0, r0, #1 //Decrement searches
      cmp r0, #0
      bpl FindLoop //If searches>=0, do the loop again. 
    mov r0, #255 //We didn't find anything. Return 255 to signal that
    bx lr

Quick note: I used r1 as a variable here, which I heard you are not supposed to do as the compiler (I am linking my assembly “.S” file to a C file with gcc) uses r0-r3 to pass data to and receive data from functions. However, because of that it doesn’t use these registers for important things, so I don’t have to deal with pushing stuff to the stack, which saves cycles.

What’s the problem?

When I try to compile my project gcc gives me an assembler error on the TST line:

Assembler messages: Error: unshifted register required -- `tst r1, #1’

This is very confusing to me, as I’ve looked at the keil site for both the TST instruction and the LSR instruction which I am using later to shift r1 by 1. Yet none of them say anything about not being able to work together. I’ve looked online for other discussions on this topic. I came across this discussion where people were saying to tell the compiler to compile in ARM mode, but my code already is running in ARM mode, not Thumb. I confirmed this by making another .global subroutine and trying to add an immediate over 7 to a number, and sure enough it doesn’t work, like it shouldn’t if the CPU is in ARM mode.

.global illegal_add
illegal_add:
    add r0, r0, #20
    bx lr

I know very little and am out of ideas how to try and tackle this issue. If anybody has any ideas with things to try, please let me know. Thank You for the help.

CodePudding user response：

It is not 100% clear to me what the problem is. Most likely, you forgot to set up the assembly correctly. To fix this, issue these directives at the beginning of the file:

.syntax unified
.cpu cortex-m4
.thumb

If I place these in front of your code, it assembles just fine on my machine.

A few general hints:

read up on which instructions are 16 bit encodable and try to pick instructions from these. 16 bit instructions execute faster and consume less memory. For example, you can use lsrs r1, r1, #1 instead of lsr r1, r1, #1 to get a 16 bit instruction.
be more clever with flag manipulations. Many instructions already set flags for you and if you are clever, you can likely avoid all tst and cmp instructions. For example, if you use subs r0, r0, #1 instead of sub r0, r0, #1 you save a byte (16 bit instruction) and already set the Z flag according to r0, saving you the subsequent cmp instruction.

CodePudding user response：

There is a 32-bit Thumb2 form of tst reg,imm which works if you use the right assembler options and directives. But it's not useful for finding the position of the lowest 0 or 1 bit.

You only need three T32 instructions that execute only once each (no iteration/no CPSR):

mvn     r0, r0 // binary negate the value
rbit    r0, r0 // reverse bit
clz     r0, r0 // count leading zeroes

It will return 8 instead of 255 if no zero is found. You'd be better off writing an inline inline-assembly function since this function is so small and fast that the function call overhead will be bigger than the function itself.

As someone mentioned, you just have to be clever which this particular someone doesn't seem to be.

Or you could use builtins and write the whole function in C:

static inline uint32_t find_empty(uint32_t input_word)
{
    input_word = ~input_word;
    input_word = __rbit(input_word);
    input_word = __clz(input_word);
    return input_word;
}

Your toolchain most probably supports both rbit and clz, you just have to find the right syntax.

You can even easily convert this to x86 (even better since x86 has "count trailing zeroes" directly, although AMD CPUs decode it to 2 uops, probably internally bit-reversing to feed lzcnt like we have to do manually on ARM):

#include <immintrin.h>
static inline uint32_t find_empty(uint32_t input)
{
    input = ~input;
    return _tzcnt_u32(input);
}

If your compiler supports GNU extensions, __builtin_ctz(~input) counts trailing zeros portably, using rbit/clz on ARM, tzcnt or bsf on x86. (But beware that the result is an undefined int value if ~input is all-zero, because of the possibility of using x86 bsf. One way to work around this is (1<<8)|(~input) so there's definitely a bit for it to find.)

Anyway, this will bring a huge performance boost:

no CPSR dependency/corruption
no external dependency
no additional register required
no branch

The compiler will be able to schedule those two-three instrucions fantastically for the reasons above.