I want to create a schematic that divides any 8-bit number by 3, on a Xilinx device in case that matters.
For example, hardware takes two inputs (111101) and (11) and returns the division of two numbers which is 010100.
I don't need to worry about remainder- just need quotient
CodePudding user response:
A good compromise might be to use a bunch of look up tables. Pseudo code looks like this:
# look up table for the high part
# returns 7 bit quotient, 2 bit remainder
# in case you have 8 bit luts, you can save
# one LUT by replacing the high bit with x[2]&x[3]
lut_hi(x[4]) =
0000 => 0000000:00
0001 => 0000101:01
0010 => 0001010:10
0011 => 0010000:00
0100 => 0010101:01
0101 => 0011010:10
0110 => 0100000:00
0111 => 0100101:01
1000 => 0101010:10
1001 => 0110000:00
1010 => 0110101:01
1011 => 0111010:10
1100 => 1000000:00
1101 => 1000101:01
1110 => 1001010:10
1111 => 1010000:00
# look up table for the low part
# returns 3 bit quotient, 2 bit remainder
# you can once again save a bit by replacing
# the high bit with x[2]&x[3]
lut_lo(x[4]) =
0000 => 000:00
0001 => 000:01
0010 => 000:10
0011 => 001:00
0100 => 001:01
0101 => 001:10
0110 => 010:00
0111 => 010:01
1000 => 010:10
1001 => 011:00
1010 => 011:01
1011 => 011:10
1100 => 100:00
1101 => 100:01
1110 => 100:10
1111 => 101:00
# combine two remainders into
remainders(a[2], b[2]) =
return a[1]&b[1] | (a[0]|b[0])&(a[1]|b[1])
# divide by 3: look up the two halves, sum them, apply carry
div3(x[8]) =
q_lo:r_lo = lut_lo(x[0:3])
q_hi:r_hi = lut_hi(x[7:4])
return q_lo q_hi remainders(r_lo, r_hi)
This code works by dividing the number into two nibbles, dividing each nibble by 3 and then summing the quotients.
CodePudding user response:
At least two approaches are suitable for HW implementation;
- reciprocal multiplication
The reciprocal multiplication is computed from (x*K)>>N
, where in this case K
could be ceil(512/3) = 171 0b10101011 = 9 * 19, N = 9
. The factorisation helps, since a number is easily multiplied both by 9 = 0b1001 and 19 = 0b10011. Multiplying by 9 is done by one addition and a free shift, multiplying that by 19 is done by 2 free shifts (wiring) and 2 additions. Total cost == 3 additions.
- (non)-restoring division
Just as learned in elementary school, a long division can be easily converted to HW circuit.
00101001 = 41
11 -> trial "subtraction" fails, result bit = 0
00101001
11 -> trial subtraction fails, result bit = 0
00101001
11 -> trial subtraction fails = 0
00101001 -> pass = 1
(011)
------
10001 -> pass
11
-----
101 -> fail
11
-----
101 -> pass
11
Result = b0001101 = 13
One does not need to compute the actual trial subtractions, since the divisor is a constant. There's instead a quick expression Passed = top|(mid & bottom)
. And likewise one can form logical expressions for the three outputs at each stage.
Starting with top = 0
, mid = input:7
, bot = input:6
, one needs to iterate over 7 positions, computing R=Result,Top,Mid from top
, mid
, bot
.
The logical expressions for each stage are then
top mid bot R T M B
--------------------------
0 0 0 0 0 0 R = t mb
0 0 1 0 0 1 T = ~bm tb
0 1 0 0 1 0 M = ~bt ~t~mb
0 1 1 1 0 0 B = next bit from input
1 0 0 1 0 1
1 0 1 1 1 0
1 1 0 x x x
1 1 1 x x x x = don't care as impossible
- 256x7 LUT done with 16-bit LUTS
One can also implement 1x256 bit LUT using a hierarchy of 16-bit LUTs. 32-bit LUT = multiplex(bits[4], LUT0(bits[3:0]), LUT1(bits[3:0])); A single 256-entry LUT would require 16 16-bit LUTs and 15 multiplexers per bit, or 1680 LogicElements without advanced building blocks. This assumes that each LE can implement an arbitrary 16-bit LUT (4 inputs, 1 output), as it was customary already in late 90s. A multiplexer fits these contraints, as it's a simply a custom 3-input logic function, where as a 16-bit LUT is a custom 4-input logic function.
- 256x7 bit LUT using memory
Some FPGAs do have dedicated LUT sections, and in those cases a 256x7 bit LUT is probably a good solution. The minimum gate count would be 7 registers ( memory), but I would expect the memory access to bring in plenty of elements as a driver.
- Split LUT by Fuz
Area-wise this is on par with the long division. It has smaller latency, but larger fan out.
Comparison
The area estimates using typical FPGA cells would be something like 9 7 13 cells using the addition method and maybe 7x3 using long division.
--------------- -----------------
| Method | Area |
--------------- -----------------
| 1. reciprocal | ~30 |
| 2. division | 21 |
| 3. 16-bit LUT | 1680 |
| 4. 256x8-LUT | 7-100 Memory |
| 5. Fuz 2xLUT | 22 |
--------------- -----------------
Disclaimer: I would expect some advancement happening on logic synthesizers during the past 20 years -- one can probably generate a good divider using System-C. There might be even a possibility to lay out the logic cells and the connections manually -- with Altera I had no such luck, the logic, the placement and the routing was synthesised automatically from Verilog or VHDL, often with embarrassing results.