I am trying to make my microcontroller library more efficient, and I am focusing my efforts on the most frequently called function in the library (it can be called 1 million times a second).
This function contains some bit-shift operations, like this one, where type (a bool) is converted to 16 bits and shifted left by 10 places:
value = ((uint16_t) type) << 10;
How many clock cycles / assembly instructions does a shift by 10 use? Is it 10 instructions, or is it done in a single instruction?
As an alternative, I am considering doing something like this instead:
if (type)
    value = 0x0400;
else
    value = 0x0000;
Now, if << 10 does take 10 operations, I would assume that the latter would be slightly faster by a few operations (the flash space trade-off is negligible).
The library is used from both C and C++, and the target microcontroller architectures are ARM and RISC.
Bonus points: are there tools for answering questions like this, where I want to see how many assembly instructions a whole function compiles into, for further optimization? Ideally without digging through the whole compiled hex file.
CodePudding user response:
It depends on the cpu and the compiler. You can experiment with the Compiler Explorer. On x86_64, it's a single instruction.
sal eax, 10
https://gcc.godbolt.org/z/bbx45od8W
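To run the same experiment for your own target, a small file like the one below can be pasted into Compiler Explorer with the compiler for your architecture selected. This is only a sketch: the function names shift_variant and branch_variant are made up for illustration, and only the two bodies come from the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Variant from the question: widen the bool, then shift left by 10. */
uint16_t shift_variant(bool type)
{
    return (uint16_t)((uint16_t)type << 10);
}

/* Alternative from the question: select one of two constants. */
uint16_t branch_variant(bool type)
{
    if (type)
        return 0x0400;
    else
        return 0x0000;
}
```

Comparing the assembly of the two functions side by side answers the question directly for that compiler and optimization level. Locally, compiling with gcc -O2 -S (or your cross compiler's equivalent) writes the assembly to a .s file, and objdump -d disassembles an already built object file.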
CodePudding user response:
I hope this function is a static inline one, or even a macro. Otherwise, the function call overhead will needlessly devastate the performance beyond rescue.
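As a rough sketch of what that could look like, assuming the shift from the question is the whole operation (the names type_to_flag and TYPE_TO_FLAG are hypothetical, not taken from the asker's library):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: "static inline" in a header lets the compiler
   fold the shift into each caller instead of emitting a function call. */
static inline uint16_t type_to_flag(bool type)
{
    return (uint16_t)((uint16_t)type << 10);
}

/* Macro form of the same operation (hypothetical name). The argument is
   parenthesized so that expressions such as a || b expand correctly. */
#define TYPE_TO_FLAG(type) ((uint16_t)((uint16_t)(type) << 10))
```

Both forms are valid in C and C++. Note that inline is only a hint, so it is worth checking the generated assembly to confirm that no call instruction remains.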
And let's focus on ARM, since you added the arm tag:
Remove the typecast if you are 100% sure type fits into 16 bits: typecasting is usually 'free of charge', but depending on the context and toolchain, the compiler may insert masking instructions such as uxth before the bit operations, which eat up cycles for nothing.
Frankly, you shouldn't write a separate function for simple barrel-shifter operations in the first place: simply embed them in your code instead.
Barrel shifters are DEFINITELY not a concern on ARM, and they are never slower than any other ALU instructions.
You should worry about something many people aren't even aware of: the bit width of local variables.
I fear the variable value is of type uint16_t as well.
Does it HAVE TO BE a 16-bit variable?
Basically, ALL local variables SHOULD BE 32-bit ones, preferably unsigned.
ARM ALWAYS does its ALU operations in full 32 bits: there is ZERO gain in using narrower local variables (unlike the Motorola 680x0). Quite the contrary, narrower local variables force the compiler to insert masking operations for nothing, because ARM can't do it any other way.
You SHOULD change all local variables to 32-bit ones at all costs, unless you need the modulo effect (255 + 1 = 0, not 256). It WILL definitely make a clear difference.
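A small sketch of the difference, using two made-up functions (sum_wide and sum_narrow are illustrative names, not part of any library). The narrow version has to wrap at 256, which is the kind of requirement that typically makes the compiler add an extra masking or extend instruction on 32-bit ARM:

```c
#include <stdint.h>

/* 32-bit accumulator: matches the register width, no narrowing needed. */
uint32_t sum_wide(const uint8_t *data, uint32_t count)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++)
        sum += data[i];
    return sum;
}

/* 8-bit accumulator: the value must wrap modulo 256, so the result
   has to be narrowed back to 8 bits after the addition. */
uint8_t sum_narrow(const uint8_t *data, uint32_t count)
{
    uint8_t sum = 0;
    for (uint32_t i = 0; i < count; i++)
        sum = (uint8_t)(sum + data[i]);
    return sum;
}
```

For the input {255, 1}, sum_wide returns 256 while sum_narrow returns 0, so the narrow type is only the right choice when that wrap-around is actually wanted. How many extra instructions the narrow version costs depends on the compiler and flags, so compare the assembly for your own toolchain.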
The only exception is int16_t for multiplications. ARMv5E or later has DSP-like instructions such as smulxy and smlaxy that do signed 16-bit multiplies faster than the 32-bit mul and mla instructions.
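In C this usually means keeping the operands as int16_t and widening only the product. The sketch below uses hypothetical function names (mul16, mac16); whether the compiler really emits a 16-bit multiply such as smulbb or smlabb depends on the target CPU and optimization flags, so treat that as something to verify in the assembly rather than a guarantee.

```c
#include <stdint.h>

/* Signed 16 x 16 -> 32-bit multiply. The int16_t operand types tell the
   compiler that only the low halfwords matter. */
int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Signed 16 x 16 + 32-bit multiply-accumulate. */
int32_t mac16(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * b;
}
```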