Home > other >  Implementing hardware that divides an 8 bit number by 3 (11) in binary
Implementing hardware that divides an 8 bit number by 3 (11) in binary

Time:05-11

I want to create a schematic that divides any 8-bit number by 3, on a Xilinx device in case that matters.

For example, hardware takes two inputs (111101) and (11) and returns the division of two numbers which is 010100.

I don't need to worry about remainder- just need quotient

CodePudding user response:

A good compromise might be to use a bunch of look up tables. Pseudo code looks like this:

# look up table for the high part
# returns 7 bit quotient, 2 bit remainder
# in case you have 8 bit luts, you can save
# one LUT by replacing the high bit with x[2]&x[3]
lut_hi(x[4]) =
    0000 => 0000000:00
    0001 => 0000101:01
    0010 => 0001010:10
    0011 => 0010000:00
    0100 => 0010101:01
    0101 => 0011010:10
    0110 => 0100000:00
    0111 => 0100101:01
    1000 => 0101010:10
    1001 => 0110000:00
    1010 => 0110101:01
    1011 => 0111010:10
    1100 => 1000000:00
    1101 => 1000101:01
    1110 => 1001010:10
    1111 => 1010000:00

# look up table for the low part
# returns 3 bit quotient, 2 bit remainder
# you can once again save a bit by replacing
# the high bit with x[2]&x[3]
lut_lo(x[4]) =
    0000 => 000:00
    0001 => 000:01
    0010 => 000:10
    0011 => 001:00
    0100 => 001:01
    0101 => 001:10
    0110 => 010:00
    0111 => 010:01
    1000 => 010:10
    1001 => 011:00
    1010 => 011:01
    1011 => 011:10
    1100 => 100:00
    1101 => 100:01
    1110 => 100:10
    1111 => 101:00

# combine two remainders into 
remainders(a[2], b[2]) =
    return a[1]&b[1] | (a[0]|b[0])&(a[1]|b[1])

# divide by 3: look up the two halves, sum them, apply carry
div3(x[8]) =
    q_lo:r_lo = lut_lo(x[0:3])
    q_hi:r_hi = lut_hi(x[7:4])
    return q_lo   q_hi   remainders(r_lo, r_hi)

This code works by dividing the number into two nibbles, dividing each nibble by 3 and then summing the quotients.

CodePudding user response:

At least two approaches are suitable for HW implementation;

  1. reciprocal multiplication

The reciprocal multiplication is computed from (x*K)>>N, where in this case K could be ceil(512/3) = 171 0b10101011 = 9 * 19, N = 9. The factorisation helps, since a number is easily multiplied both by 9 = 0b1001 and 19 = 0b10011. Multiplying by 9 is done by one addition and a free shift, multiplying that by 19 is done by 2 free shifts (wiring) and 2 additions. Total cost == 3 additions.

  1. (non)-restoring division

Just as learned in elementary school, a long division can be easily converted to HW circuit.

00101001  = 41
11        -> trial "subtraction" fails, result bit = 0

00101001
 11       -> trial subtraction fails, result bit = 0

00101001
  11      -> trial subtraction fails = 0

00101001  -> pass = 1 
 (011)
------
   10001  -> pass
    11
   -----
     101  -> fail
     11
   -----
     101  -> pass
      11

Result = b0001101 = 13

One does not need to compute the actual trial subtractions, since the divisor is a constant. There's instead a quick expression Passed = top|(mid & bottom). And likewise one can form logical expressions for the three outputs at each stage.

Starting with top = 0, mid = input:7, bot = input:6, one needs to iterate over 7 positions, computing R=Result,Top,Mid from top, mid, bot.

The logical expressions for each stage are then

  top mid bot    R  T  M  B
 --------------------------
  0   0   0      0  0  0      R = t mb
  0   0   1      0  0  1      T = ~bm tb
  0   1   0      0  1  0      M = ~bt ~t~mb
  0   1   1      1  0  0      B = next bit from input
  1   0   0      1  0  1
  1   0   1      1  1  0
  1   1   0      x  x  x
  1   1   1      x  x  x      x = don't care as impossible

enter image description here

  1. 256x7 LUT done with 16-bit LUTS

One can also implement 1x256 bit LUT using a hierarchy of 16-bit LUTs. 32-bit LUT = multiplex(bits[4], LUT0(bits[3:0]), LUT1(bits[3:0])); A single 256-entry LUT would require 16 16-bit LUTs and 15 multiplexers per bit, or 1680 LogicElements without advanced building blocks. This assumes that each LE can implement an arbitrary 16-bit LUT (4 inputs, 1 output), as it was customary already in late 90s. A multiplexer fits these contraints, as it's a simply a custom 3-input logic function, where as a 16-bit LUT is a custom 4-input logic function.

  1. 256x7 bit LUT using memory

Some FPGAs do have dedicated LUT sections, and in those cases a 256x7 bit LUT is probably a good solution. The minimum gate count would be 7 registers ( memory), but I would expect the memory access to bring in plenty of elements as a driver.

  1. Split LUT by Fuz

Area-wise this is on par with the long division. It has smaller latency, but larger fan out.

enter image description here

Comparison

The area estimates using typical FPGA cells would be something like 9 7 13 cells using the addition method and maybe 7x3 using long division.

  --------------- ----------------- 
 | Method        |    Area         | 
  --------------- ----------------- 
 | 1. reciprocal |     ~30         |
 | 2. division   |      21         |
 | 3. 16-bit LUT |    1680         |
 | 4. 256x8-LUT  | 7-100   Memory  |
 | 5. Fuz 2xLUT  |      22         |
  --------------- ----------------- 

Disclaimer: I would expect some advancement happening on logic synthesizers during the past 20 years -- one can probably generate a good divider using System-C. There might be even a possibility to lay out the logic cells and the connections manually -- with Altera I had no such luck, the logic, the placement and the routing was synthesised automatically from Verilog or VHDL, often with embarrassing results.

  • Related