From time to time, we'll post details here of some interesting design solutions. Feel free to use them or ask for reprint permission.
#1) FAST Base-2 Log
Michael Dunn, Cantares
A recent FPGA-based project involved displaying medical imaging data at a rate of 5 Mpixels/s, with future requirements ranging up to 50 Mpixels/s. That is not a particularly high rate, but every pixel's 24-bit value required logarithmic scaling to 8-bit greyscale (essentially a conversion from a linear to a dB scale).
After some online and literature searching, we had found no suitable algorithm, so we developed the following one-cycle FPGA solution. Speeds of 200 Msamples/s or more should be achievable.
In a nutshell, the problem is broken into two main steps. We effectively derive the "integer" and "fractional" parts of the result separately, though one would normally interpret the result as a straight binary integer.
Step 1: Compute the integer part using a priority encoder to determine the most significant '1' bit of the input. In this example, we only encode the top 16 input bits, giving us a 4-bit value that becomes the upper nibble of the result.
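As a software reference for Step 1, the priority encoder can be modelled in a few lines of Python. This is only a behavioural sketch, not HDL. The function name priority_encode_top16 is hypothetical, and returning 0 when none of the top 16 bits is set is an assumption made for illustration.

```python
def priority_encode_top16(x: int) -> int:
    """Model of the Step 1 priority encoder for a 24-bit input.

    Returns the position (0-15) of the most significant '1' among
    bits 23..8. Assumption: if none of those bits is set, return 0.
    """
    top16 = (x >> 8) & 0xFFFF
    for position in range(15, -1, -1):  # scan from bit 23 downwards
        if top16 & (1 << position):
            return position
    return 0


for sample in (0x800000, 0x012345, 0x000100):
    print(f"{sample:#08x} -> integer part {priority_encode_top16(sample)}")
```

In hardware this loop becomes a purely combinational encoder; the loop here only mirrors its priority order.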
Step 2: Use this 4-bit value (inverted) as the shift control of a barrel shifter whose input is the full 24-bit input value. This normalizes or subscales the input; i.e., the most significant '1' bit of the input becomes the msb of the barrel-shifter output (unless that '1' bit is in the lowest 8 bits of the input, in which case the input is only subscaled). We can now resort to the quick-and-dirty expedient of using a lookup table, addressed by the bits immediately below the msb, to find the log fraction. The table's 4-bit output becomes the lower nibble of the result.
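The two steps can be put together in a small behavioural model. This is a sketch under stated assumptions, not the original implementation: the names LOG_LUT and log2_8bit are hypothetical, the LUT is the 32x4 table with entries rounded down, the LUT address is assumed to be the five bits just below the msb of the shifter output, and inputs below 256 simply follow the datapath as described (the encoder is assumed to output 0 for them).

```python
import math

# 32x4 LUT: log2 fraction of a normalized value, rounded down to 4 bits.
LOG_LUT = [int(math.log2(1 + a / 32) * 16) for a in range(32)]


def log2_8bit(x: int) -> int:
    """Behavioural model: 24-bit linear input -> 8-bit log2 code."""
    if not 0 <= x < (1 << 24):
        raise ValueError("input must fit in 24 bits")
    top16 = x >> 8
    # Step 1: priority encoder over bits 23..8 (assumed 0 if none is set).
    integer_part = top16.bit_length() - 1 if top16 else 0
    # Step 2: barrel shifter; the shift amount is the inverted encoder output.
    shifted = (x << (15 - integer_part)) & 0xFFFFFF
    # Assumed LUT address: the five bits just below the msb (bits 22..18).
    fraction = LOG_LUT[(shifted >> 18) & 0x1F]
    return (integer_part << 4) | fraction


for sample in (0x000100, 0x000180, 0x800000, 0xFFFFFF):
    print(f"{sample:#08x} -> {log2_8bit(sample):#04x}")
```

A model like this is handy as a golden reference when simulating the HDL version.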
Keep in mind that certain input widths will result in many missed codes at the low end of the range. For example, as a 16-bit input changes from 1 to 2, the output jumps from 0x00 to 0x10. This example design needs an input of at least 21 bits for full accuracy: the 16 encoded bits plus the 5 bits that address the LUT.
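The effect of the input width can be checked with a quick model that counts the distinct output codes in each of the lowest octaves. The sketch below assumes a 32x4 LUT with entries rounded down, addressed by the five bits below the msb; log2_code and codes_in_octave are hypothetical helper names.

```python
import math

LOG_LUT = [int(math.log2(1 + a / 32) * 16) for a in range(32)]  # rounded down


def log2_code(x: int, width: int) -> int:
    """Model with the top 16 bits of a `width`-bit input encoded (16 <= width <= 24)."""
    top16 = x >> (width - 16)
    integer_part = top16.bit_length() - 1 if top16 else 0
    shifted = (x << (15 - integer_part)) & ((1 << width) - 1)
    # Five bits just below the msb; with narrow inputs some are always zero.
    address = (shifted >> (width - 6)) & 0x1F
    return (integer_part << 4) | LOG_LUT[address]


def codes_in_octave(width: int, integer_part: int) -> set:
    """Distinct output codes for inputs whose msb sits at the given encoder position."""
    lowest = 1 << (integer_part + width - 16)
    return {log2_code(x, width) for x in range(lowest, 2 * lowest)}


for width in (16, 21):
    counts = [len(codes_in_octave(width, e)) for e in range(6)]
    print(f"{width}-bit input, codes per octave (of 16 possible): {counts}")
```

With a 16-bit input the lowest octaves produce only a handful of codes each, while with 21 bits every octave has enough bits below the msb to exercise the whole LUT.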
To achieve maximum speed (at the expense of latency), sprinkle pipeline registers as required.
The LUT should have at least one more address bit than its data width to eliminate missing codes. Larger sizes will improve the accuracy of the output transition points.
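To see why the extra address bit matters, one can generate tables of different depths and look for output codes that never appear. This is a sketch that assumes entries are rounded down; make_log_lut is a hypothetical helper.

```python
import math


def make_log_lut(address_bits: int, data_bits: int) -> list:
    """Log-fraction table with 2**address_bits entries of data_bits each, rounded down."""
    entries = 1 << address_bits
    scale = 1 << data_bits
    return [int(math.log2(1 + a / entries) * scale) for a in range(entries)]


for address_bits in (4, 5, 6):
    lut = make_log_lut(address_bits, 4)
    missing = sorted(set(range(16)) - set(lut))
    print(f"{1 << address_bits}x4 LUT: missing output codes {missing}")
```

Under these assumptions the 16x4 table skips a code, while the 32x4 and 64x4 tables produce all sixteen.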
If you have free multipliers in your FPGA, it may require fewer resources overall to use one in place of the barrel shifter. But then you can't use the encoder output directly. You'll need to decode it to generate a 16-bit (power-of-two) value to multiply by.
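The equivalence between the barrel shifter and a multiplier can be sketched as follows. The one-hot decoding shown (bit 15 minus the encoder value) is our reading of "decode it to generate a 16-bit value", and both function names are hypothetical.

```python
def one_hot_from_encoder(integer_part: int) -> int:
    """Decode the 4-bit encoder output into a 16-bit power-of-two operand."""
    return 1 << (15 - integer_part)


def normalize_by_multiply(x: int, integer_part: int) -> int:
    """Multiply instead of shifting, then keep the low 24 bits of the product."""
    product = x * one_hot_from_encoder(integer_part)  # wider than 24 bits
    return product & 0xFFFFFF


x, e = 0x012345, 8  # msb of x is bit 16, i.e. encoder position 8
assert normalize_by_multiply(x, e) == (x << (15 - e)) & 0xFFFFFF
print(hex(normalize_by_multiply(x, e)))
```

Multiplying by a one-hot value is the same as a left shift, so the rest of the datapath is unchanged.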
For best accuracy, use an adder to combine the LUT and encoder outputs instead of simply concatenating them. Why? When pre-computing the LUT entries, you'll find that one or more at the top end have a computed value that should ideally be rounded up, to a value that no longer fits in the LUT's data width. You can ignore this, round down, and forget the adder. Or, you can add one bit to the LUT's data width, round correctly, and use the adder to combine the (in this example) 5-bit LUT output and the 4-bit encoder output. This also means that the output will carry into a ninth bit at some point, so you must decide whether to use that bit or simply clip the output when this happens. Or perhaps the input is known never to go high enough to cause this.
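A behavioural sketch of the adder variant follows. It assumes a 32x5 LUT rounded to nearest, addressed by the five bits below the msb, and offers clipping as an option; ROUNDED_LUT and log2_with_adder are hypothetical names.

```python
import math

# 32x5 LUT: same formula as the 4-bit table, but rounded to nearest,
# so the top entry reaches 16 and needs a fifth data bit.
ROUNDED_LUT = [int(math.log2(1 + a / 32) * 16 + 0.5) for a in range(32)]


def log2_with_adder(x: int, clip: bool = True) -> int:
    """24-bit input -> log2 code, combining encoder and LUT outputs with an adder."""
    top16 = (x >> 8) & 0xFFFF
    integer_part = top16.bit_length() - 1 if top16 else 0
    shifted = (x << (15 - integer_part)) & 0xFFFFFF
    fraction = ROUNDED_LUT[(shifted >> 18) & 0x1F]
    result = (integer_part << 4) + fraction  # adder; may carry into bit 8
    return min(result, 0xFF) if clip else result


print(hex(log2_with_adder(0xFFFFFF, clip=False)))  # carries into the ninth bit
print(hex(log2_with_adder(0xFFFFFF)))              # clipped to 8 bits
```

In lower octaves the carry just bumps the integer nibble, which is the desired behaviour; only the top octave can overflow into the ninth bit.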
LUT Formula (e.g., for a 32x4 LUT):
f(x) = log2(1 + x/32) * 16 ; x = 0-31
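The table contents can be pre-computed with a short script, for example to paste into a ROM initializer. This sketch rounds each entry down so that it fits in 4 bits, which is one of the options discussed above; lut_entry is a hypothetical name.

```python
import math


def lut_entry(x: int) -> int:
    """f(x) = log2(1 + x/32) * 16, rounded down to a 4-bit value."""
    return int(math.log2(1 + x / 32) * 16)


table = [lut_entry(x) for x in range(32)]
print(" ".join(f"{value:X}" for value in table))
```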
Oh, and finally, log(0) is just,
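Update: Several versions of the Log code have been posted to the well-respected Opencores site, as well as a newly developed Antilog design. Combine the two, halving the log value in between (since sqrt(x) = 2^(log2(x)/2), and halving is a one-bit right shift), and you'll have a single-cycle square-root!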