I'm trying to patch an old 8-bit assembler routine (it happens to be M6800 but this isn't really machine specific) to take a 16-bit value and round down to the nearest 0x50 (dec 80) multiple. The current code truncates to nearest 32 by just doing one AND 0xE0
to the low byte which of course neatly zeroes out the low 5 bits and gets the correct result without touching the high byte.
This is doing some screen math and so the input values will only be in the range of 0xE000
to 0xE7CF
. Since 80 is obviously not a power of two, I can't do it trivially, but given that this is a patch I'm trying to keep the number of instructions to a minimum ideally without adding generic division or lookup tables, etc.
I'm stumped and suspect there is no especially clever way to accomplish this. 80 is not a power of two but it is a multiple of 16... doesn't that help me at all?? Any insights, pointers, or etc are appreciated. Thanks.
CodePudding user response:
First of all, since 80 = 16 * 5
, rounding down to a multiple of 80
means rounding down to a multiple of both 16
and 5
. The first one is easy with a right shift, so now we're left with the mod 5
part:
def mod5(x):
return x % 5
def round80(x):
x >>= 4
x -= mod5(x)
x <<= 4
return x
mod5
isn't that easy to do, but there is a clever construction for Mersenne moduli that first calculates mod15
, and then brings the value down to be modulo 5. It's not that intuitive at first, but it only involves additions and shifts, which should be easy enough to implement. Here's the python version:
def mod15(x):
x = (x >> 8) (x & 0xFF)
x = (x >> 4) (x & 0xF)
if x >= 15: x -= 15
if x >= 15: x -= 15 # (see note)
return x
def mod5(x):
x = mod15(x)
if x >= 5: x -= 5
if x >= 5: x -= 5
return x
def round80(x):
x >>= 4
x -= mod5(x)
x <<= 4
return x
To be safe, I verified this code for all possible values:
for i in range(0x10000):
trivial = i - (i % 80)
assert trivial == round80(i)
One additional note: the second if
inside of mod15
can actually be omitted for your input range, commenting it out made no difference. If you need the whole [0,0xffff]
range, you cannot remove it.
I'm no expert in M6800 assembly so I'm not going to attempt writing the final code, but it should be relatively simple, especially given that the only 16 bit operations are the two shifts and the first addition in mod15
.
No division, no multiplication and no lookup tables - I hope this is short enough for your needs!
CodePudding user response:
This isn't clever math or bit-twiddling, but you might consider a simple loop if you're optimizing for code space. With your input range there are only 26 possible output values so the worst-case runtime isn't terrible.
#include <stdio.h>
#include <stdint.h>
#include <assert.h>
/* Truncate to nearest multiple of 80, with input range 0xE000 to 0xE7D0. */
uint16_t truncate_to_80(uint16_t n) {
uint16_t ret = 0xDFC0; /* Lowest possible output. */
while (ret 80 <= n) {
ret = ret 80;
}
return ret;
}
int main(void) {
for (int i = 0xE000; i <= 0xE7D0; i ) {
uint16_t truncated = truncate_to_80(i);
assert(truncated == (i - (i % 80)));
printf("%d -> %d\n", i, truncated);
}
}
CodePudding user response:
0x50 isn't a power of 2, so it has odd prime factors. That makes it fundamentally harder on a binary computer. Quotient and remainder both depend on all the higher bits of the whole integer.
Dario's mod5 idea taking advantage of the 2^n 1 special case is quite useful, avoiding the full general case of a multiplicative inverse or a shift/add iterative division.
It's still somewhat painful, especially since 6800 (unlike AVR) can only shift by 1. AVR's swap
instruction (exchanging nibbles = rotate by 4) is useful here, with compilers making good use of it to shift by 4, for a C version of Dario's code (https://godbolt.org/z/c7qhKx6aY).
AVR is another 8-bit microcontroller, so it's somewhat interesting to see how compilers do things there. It has 32 registers, vs. 6800's two (plus a 16-bit IX, but operations on that are quite limited: http://www.8bit-era.cz/6800.html lists the instruction set). It only has shifts by 1, nothing like AVR's swap
that I can see, so shifting by 4 is more expensive. (Shifting by 8 is still free, 16-bit numbers are still stored in 8-bit halves.)
I introduced new uint8_t
variables where values were narrow enough for that, helping compilers avoid wasting instructions. (And identifying those spots for a hand-written asm version.) I also rewrote things to potentially do less shifting, especially less full 16-bit shifting, but I think there's more room for such optimizations by hand, e.g. maybe working with shifted values to return a mod5 << 4
directly, instead of making the caller do that.
e.g. (x >> 4) (x & 0xF)
could potentially be (x & 0xf0) ((uint8_t)x<<4)
, but that wouldn't make it fit within a byte; there's still be a bit that might extend into the next byte. Perhaps only right shift one or two bit positions, so there's less distance to shift back after using 8-bit subtraction to do the m -= 15
and m -= 5
steps.
Otherwise at least the first of those steps would have to deal with the top of the value extending into another byte. Borrow only propagates from low to high, but you do need to compare correctly so you can't just truncate and discard it; that would be mod 16 not mod 15.
#include <stdint.h>
inline
uint8_t mod15(unsigned short x){
x = (x >> 8) (x & 0xFF); // carry-out can produce a 9-bit result
uint8_t m = (x >> 4) (x & 0xF);
#if 0
do {
m -= 15;
} while((int8_t)m >= 0);
m = 15;
#else
if(m >= 15) m -= 15;
//if(m >= 15) m -= 15; // not needed for partial range
#endif
// or slower, just let a mod5 loop run potentially more iterations
return m;
}
inline
uint8_t mod5(unsigned short x){
uint8_t m = mod15(x);
//while(!__builtin_sub_overflow(m, 5, &m) ){}
//m = 5;
//uint8_t m1 = m - 5;
//if (m < 5) return m1;
#if 1
do { // GCC makes a small loop, clang calls a modulo function :/
m -= 5;
} while((int8_t)m >= 0);
m = 5;
#else
if(m >= 5) m -= 5;
if(m >= 5) m -= 5;
#endif
return m;
}
unsigned short round80(unsigned short x){
//x >>= 4;
uint8_t m5 = mod5(x>>4);
x &= -16;
x -= m5 << 4;
//x <<= 4;
return x;
}
I haven't yet actually tried to write a 6800 version of it; it would very likely need some scratch space, either on the stack with push / pull, or some cheaper to access space somewhere. With only 2 8-bit registers A and B, that's the whole value.
16-bit right shift by 1 can be done with LSR B
/ ROR A
. That does need to happen at some point I think, but perhaps can be avoided as much as possible if it's going to be un-done later.